NanoBEIREvaluator¶
Evaluate the performance of a PyLate model on the NanoBEIR collection.
This is a direct extension of the NanoBEIREvaluator from the sentence-transformers library, leveraging the PyLateInformationRetrievalEvaluator class. The collection is a set of datasets based on the BEIR collection but with a significantly smaller size, so it can be used to quickly evaluate the retrieval performance of a model before committing to a full evaluation. The evaluator returns the same metrics as the InformationRetrievalEvaluator (e.g., MRR, nDCG, Recall@k), for each dataset and on average.
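Because metrics are reported both per dataset and averaged, a quick pre-flight check can run a few of the Nano datasets and read only the mean score. A minimal sketch (the MaxSim metric key assumes a ColBERT-style model, as in the example further down; dataset names follow the NanoBEIR naming):

    from pylate import evaluation, models

    # Sketch: evaluate on two small NanoBEIR datasets before a full BEIR run.
    model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")
    evaluator = evaluation.NanoBEIREvaluator(dataset_names=["SciFact", "NFCorpus"])
    results = evaluator(model)

    # Per-dataset and averaged keys are both present; the "mean" prefix comes
    # from the default aggregate_key (see Parameters below).
    print(results["NanoBEIR_mean_MaxSim_ndcg@10"])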
Parameters¶
- dataset_names ('list[DatasetNameType] | None') – defaults to None
- mrr_at_k ('list[int]') – defaults to [10]
- ndcg_at_k ('list[int]') – defaults to [10]
- accuracy_at_k ('list[int]') – defaults to [1, 3, 5, 10]
- precision_recall_at_k ('list[int]') – defaults to [1, 3, 5, 10]
- map_at_k ('list[int]') – defaults to [100]
- show_progress_bar ('bool') – defaults to False
- batch_size ('int') – defaults to 32
- write_csv ('bool') – defaults to True
- truncate_dim ('int | None') – defaults to None
- score_functions ('dict[str, Callable[[Tensor, Tensor], Tensor]] | None') – defaults to None
- main_score_function ('str | SimilarityFunction | None') – defaults to None
- aggregate_fn ('Callable[[list[float]], float]') – defaults to np.mean
- aggregate_key ('str') – defaults to mean
- query_prompts ('str | dict[str, str] | None') – defaults to None
- corpus_prompts ('str | dict[str, str] | None') – defaults to None
- write_predictions ('bool') – defaults to False
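As a hedged illustration of how these parameters compose (the values below are arbitrary choices for a quick local check, not recommendations):

    from pylate import evaluation

    # All parameter names mirror the list above; the values are illustrative.
    evaluator = evaluation.NanoBEIREvaluator(
        dataset_names=["SciFact"],
        ndcg_at_k=[10, 100],      # report nDCG at two cutoffs
        accuracy_at_k=[1, 5, 10],
        batch_size=64,            # larger batches if memory allows
        show_progress_bar=True,
        write_csv=False,          # skip the CSV artifact for throwaway runs
    )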
Attributes¶
- description
Returns a human-readable description of the evaluator derived from the class name, e.g., BinaryClassificationEvaluator -> Binary Classification. The name is produced by: 1. replacing the "CE" prefix with "CrossEncoder", 2. removing "Evaluator" from the class name, and 3. adding a space before every capital letter.
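The three rules above can be sketched as a standalone helper (illustrative only, not part of the PyLate API):

    import re

    # Illustrative reimplementation of the rules listed above.
    def human_readable_description(class_name: str) -> str:
        if class_name.startswith("CE"):                   # 1. "CE" -> "CrossEncoder"
            class_name = "CrossEncoder" + class_name[2:]
        class_name = class_name.replace("Evaluator", "")  # 2. drop "Evaluator"
        # 3. insert a space before a capital that follows a lowercase letter
        return re.sub(r"([a-z])([A-Z])", r"\1 \2", class_name)

    print(human_readable_description("BinaryClassificationEvaluator"))
    # Binary Classification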
Examples¶
>>> from pylate import models, evaluation
>>> model = models.ColBERT(
... model_name_or_path="lightonai/colbertv2.0"
... )
>>> datasets = ["SciFact"]
>>> evaluator = evaluation.NanoBEIREvaluator(
... dataset_names=datasets
... )
>>> evaluator(model)
{'NanoSciFact_MaxSim_accuracy@1': 0.62,
 'NanoSciFact_MaxSim_accuracy@3': 0.74,
 'NanoSciFact_MaxSim_accuracy@5': 0.8,
 'NanoSciFact_MaxSim_accuracy@10': 0.86,
 'NanoSciFact_MaxSim_precision@1': 0.62,
 'NanoSciFact_MaxSim_precision@3': 0.26666666666666666,
 'NanoSciFact_MaxSim_precision@5': 0.18,
 'NanoSciFact_MaxSim_precision@10': 0.096,
 'NanoSciFact_MaxSim_recall@1': 0.595,
 'NanoSciFact_MaxSim_recall@3': 0.715,
 'NanoSciFact_MaxSim_recall@5': 0.79,
 'NanoSciFact_MaxSim_recall@10': 0.85,
 'NanoSciFact_MaxSim_ndcg@10': 0.7279903941189909,
 'NanoSciFact_MaxSim_mrr@10': 0.6912222222222222,
 'NanoSciFact_MaxSim_map@100': 0.6903374780806633,
 'NanoBEIR_mean_MaxSim_accuracy@1': 0.62,
 'NanoBEIR_mean_MaxSim_accuracy@3': 0.74,
 'NanoBEIR_mean_MaxSim_accuracy@5': 0.8,
 'NanoBEIR_mean_MaxSim_accuracy@10': 0.86,
 'NanoBEIR_mean_MaxSim_precision@1': 0.62,
 'NanoBEIR_mean_MaxSim_precision@3': 0.26666666666666666,
 'NanoBEIR_mean_MaxSim_precision@5': 0.18,
 'NanoBEIR_mean_MaxSim_precision@10': 0.096,
 'NanoBEIR_mean_MaxSim_recall@1': 0.595,
 'NanoBEIR_mean_MaxSim_recall@3': 0.715,
 'NanoBEIR_mean_MaxSim_recall@5': 0.79,
 'NanoBEIR_mean_MaxSim_recall@10': 0.85,
 'NanoBEIR_mean_MaxSim_ndcg@10': 0.7279903941189909,
 'NanoBEIR_mean_MaxSim_mrr@10': 0.6912222222222222,
 'NanoBEIR_mean_MaxSim_map@100': 0.6903374780806633}
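As with the InformationRetrievalEvaluator example further down, the evaluator exposes a primary_metric that indexes the returned dictionary. A hedged sketch (the exact key depends on the score function, MaxSim here):

    # Continues the doctest above; key names are as produced in that run.
    results = evaluator(model)
    print(evaluator.primary_metric)           # e.g. "NanoBEIR_mean_MaxSim_ndcg@10"
    print(results[evaluator.primary_metric])  # e.g. 0.7279903941189909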
Methods¶
__call__
This is called during training to evaluate the model. It returns a score for the evaluation with a higher score indicating a better result.
Args:
- model: the model to evaluate
- output_path: path where predictions and metrics are written to
- epoch: the epoch where the evaluation takes place. This is used for the file prefixes. If this is -1, then we assume evaluation on test data.
- steps: the steps in the current epoch at the time of the evaluation. This is used for the file prefixes. If this is -1, then we assume evaluation at the end of the epoch.

Returns:
Either a score for the evaluation with a higher score indicating a better result, or a dictionary with scores. If the latter is chosen, then evaluator.primary_metric must be defined.
Parameters
- model ('SentenceTransformer')
- output_path ('str | None') – defaults to None
- epoch ('int') – defaults to -1
- steps ('int') – defaults to -1
- args
- kwargs
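A hedged sketch of invoking __call__ the way a training loop would (the path and step values are placeholders; during training the trainer supplies them itself):

    # Placeholder values for illustration only.
    score = evaluator(
        model,
        output_path="eval_results/",  # CSV metrics land here when write_csv=True
        epoch=0,
        steps=500,
    )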
embed_inputs
Calls the encode method of the model being evaluated.

Args:
- model (SentenceTransformer): the model we are evaluating
- sentences (str | list[str] | np.ndarray): the text that we are embedding

Returns:
list[Tensor] | np.ndarray | Tensor | dict[str, Tensor] | list[dict[str, Tensor]]: the associated embeddings
Parameters
- model ('SentenceTransformer')
- sentences ('str | list[str] | np.ndarray')
- kwargs
get_config_dict
Return a dictionary with all meaningful configuration values of the evaluator to store in the model card.
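For instance (sketch; the exact keys mirror the constructor arguments above and may differ between versions):

    config = evaluator.get_config_dict()
    print(config)  # e.g. the configured k values, dataset names, aggregate_key, ...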
information_retrieval_class
This class evaluates an Information Retrieval (IR) setting.
Given a set of queries and a large corpus, it will retrieve the top-k most similar documents for each query. It measures Mean Reciprocal Rank (MRR), Recall@k, and Normalized Discounted Cumulative Gain (NDCG).

Args:
- queries (Dict[str, str]): A dictionary mapping query IDs to queries.
- corpus (Dict[str, str]): A dictionary mapping document IDs to documents.
- relevant_docs (Dict[str, Set[str]]): A dictionary mapping query IDs to a set of relevant document IDs.
- corpus_chunk_size (int): The size of each chunk of the corpus. Defaults to 50000.
- mrr_at_k (List[int]): A list of integers representing the values of k for MRR calculation. Defaults to [10].
- ndcg_at_k (List[int]): A list of integers representing the values of k for NDCG calculation. Defaults to [10].
- accuracy_at_k (List[int]): A list of integers representing the values of k for accuracy calculation. Defaults to [1, 3, 5, 10].
- precision_recall_at_k (List[int]): A list of integers representing the values of k for precision and recall calculation. Defaults to [1, 3, 5, 10].
- map_at_k (List[int]): A list of integers representing the values of k for MAP calculation. Defaults to [100].
- show_progress_bar (bool): Whether to show a progress bar during evaluation. Defaults to False.
- batch_size (int): The batch size for evaluation. Defaults to 32.
- name (str): A name for the evaluation. Defaults to "".
- write_csv (bool): Whether to write the evaluation results to a CSV file. Defaults to True.
- truncate_dim (int, optional): The dimension to truncate the embeddings to. Defaults to None.
- score_functions (Dict[str, Callable[[Tensor, Tensor], Tensor]]): A dictionary mapping score function names to score functions. Defaults to the similarity function from the model.
- main_score_function (Union[str, SimilarityFunction], optional): The main score function to use for evaluation. Defaults to None.
- query_prompt (str, optional): The prompt to be used when encoding the queries. Defaults to None.
- query_prompt_name (str, optional): The name of the prompt to be used when encoding the queries. Defaults to None.
- corpus_prompt (str, optional): The prompt to be used when encoding the corpus. Defaults to None.
- corpus_prompt_name (str, optional): The name of the prompt to be used when encoding the corpus. Defaults to None.
- write_predictions (bool): Whether to write the predictions to a JSONL file. Defaults to False. This can be useful for downstream evaluation, as the file can be used as input to the :class:`~sentence_transformers.sparse_encoder.evaluation.ReciprocalRankFusionEvaluator`, which accepts precomputed predictions.
Example:

    import random

    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.evaluation import InformationRetrievalEvaluator

    # Load a model
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Load the Touche-2020 IR dataset (https://huggingface.co/datasets/BeIR/webis-touche2020, https://huggingface.co/datasets/BeIR/webis-touche2020-qrels)
    corpus = load_dataset("BeIR/webis-touche2020", "corpus", split="corpus")
    queries = load_dataset("BeIR/webis-touche2020", "queries", split="queries")
    relevant_docs_data = load_dataset("BeIR/webis-touche2020-qrels", split="test")

    # For this dataset, we want to concatenate the title and texts for the corpus
    corpus = corpus.map(lambda x: {"text": x["title"] + " " + x["text"]}, remove_columns=["title"])

    # Shrink the corpus size heavily to only the relevant documents + 30,000 random documents
    required_corpus_ids = set(map(str, relevant_docs_data["corpus-id"]))
    required_corpus_ids |= set(random.sample(corpus["_id"], k=30_000))
    corpus = corpus.filter(lambda x: x["_id"] in required_corpus_ids)

    # Convert the datasets to dictionaries
    corpus = dict(zip(corpus["_id"], corpus["text"]))     # Our corpus (cid => document)
    queries = dict(zip(queries["_id"], queries["text"]))  # Our queries (qid => question)
    relevant_docs = {}  # Query ID to relevant documents (qid => set([relevant_cids]))
    for qid, corpus_ids in zip(relevant_docs_data["query-id"], relevant_docs_data["corpus-id"]):
        qid = str(qid)
        corpus_ids = str(corpus_ids)
        if qid not in relevant_docs:
            relevant_docs[qid] = set()
        relevant_docs[qid].add(corpus_ids)

    # Given queries, a corpus and a mapping with relevant documents, the InformationRetrievalEvaluator computes different IR metrics.
    ir_evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name="BeIR-touche2020-subset-test",
    )
    results = ir_evaluator(model)
    '''
    Information Retrieval Evaluation of the model on the BeIR-touche2020-test dataset:
    Queries: 49
    Corpus: 31923

    Score-Function: cosine
    Accuracy@1: 77.55%
    Accuracy@3: 93.88%
    Accuracy@5: 97.96%
    Accuracy@10: 100.00%
    Precision@1: 77.55%
    Precision@3: 72.11%
    Precision@5: 71.43%
    Precision@10: 62.65%
    Recall@1: 1.72%
    Recall@3: 4.78%
    Recall@5: 7.90%
    Recall@10: 13.86%
    MRR@10: 0.8580
    NDCG@10: 0.6606
    MAP@100: 0.2934
    '''
    print(ir_evaluator.primary_metric)
    # => "BeIR-touche2020-test_cosine_map@100"
    print(results[ir_evaluator.primary_metric])
    # => 0.29335196224364596
Parameters
- queries ('dict[str, str]')
- corpus ('dict[str, str]')
- relevant_docs ('dict[str, set[str]]')
- corpus_chunk_size ('int') – defaults to 50000
- mrr_at_k ('list[int]') – defaults to [10]
- ndcg_at_k ('list[int]') – defaults to [10]
- accuracy_at_k ('list[int]') – defaults to [1, 3, 5, 10]
- precision_recall_at_k ('list[int]') – defaults to [1, 3, 5, 10]
- map_at_k ('list[int]') – defaults to [100]
- show_progress_bar ('bool') – defaults to False
- batch_size ('int') – defaults to 32
- name ('str') – defaults to ""
- write_csv ('bool') – defaults to True
- truncate_dim ('int | None') – defaults to None
- score_functions ('dict[str, Callable[[Tensor, Tensor], Tensor]] | None') – defaults to None
- main_score_function ('str | SimilarityFunction | None') – defaults to None
- query_prompt ('str | None') – defaults to None
- query_prompt_name ('str | None') – defaults to None
- corpus_prompt ('str | None') – defaults to None
- corpus_prompt_name ('str | None') – defaults to None
- write_predictions ('bool') – defaults to False
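For custom query/corpus/qrels data, the underlying class can be used directly with a PyLate model. A minimal sketch, assuming it is exposed as evaluation.PyLateInformationRetrievalEvaluator (the class named in the introduction) and using toy inline data:

    from pylate import evaluation, models

    model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")

    # Toy data; real evaluations use full query/corpus/qrels dictionaries.
    queries = {"q1": "what is late interaction retrieval"}
    corpus = {
        "d1": "ColBERT scores queries against documents with MaxSim late interaction.",
        "d2": "An unrelated document about cooking pasta.",
    }
    relevant_docs = {"q1": {"d1"}}

    ir_evaluator = evaluation.PyLateInformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name="toy-ir",
    )
    results = ir_evaluator(model)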