
Retrieval evaluation

This guide demonstrates an end-to-end pipeline to evaluate the performance of the ColBERT model on retrieval tasks. The pipeline involves three key steps: indexing documents, retrieving top-k documents for a given set of queries, and evaluating the retrieval results using standard metrics.

BEIR Retrieval Evaluation Pipeline

from pylate import evaluation, indexes, models, retrieve

# Step 1: Choose the dataset and initialize the ColBERT model
dataset = "scifact"  # Name of the BEIR dataset to evaluate
model = models.ColBERT(
    model_name_or_path="lightonai/colbertv2.0",
    device="cuda",  # "cpu", "cuda", or "mps"
)

# Step 2: Create a Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name=dataset,
    override=True,  # Overwrite any existing index
)

# Step 3: Load the documents, queries, and relevance judgments (qrels)
documents, queries, qrels = evaluation.load_beir(
    dataset,  # Specify the dataset (e.g., "scifact")
    split="test",  # Specify the split (e.g., "test")
)

# Step 4: Encode the documents
documents_embeddings = model.encode(
    [document["text"] for document in documents],
    batch_size=32,
    is_query=False,  # Indicate that these are documents
    show_progress_bar=True,
)

# Step 5: Add document embeddings to the index
index.add_documents(
    documents_ids=[document["id"] for document in documents],
    documents_embeddings=documents_embeddings,
)

# Step 6: Encode the queries
queries_embeddings = model.encode(
    queries,
    batch_size=32,
    is_query=True,  # Indicate that these are queries
    show_progress_bar=True,
)

# Step 7: Retrieve top-k documents
retriever = retrieve.ColBERT(index=index)
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=100,  # Retrieve the top 100 matches for each query
)

# Step 8: Evaluate the retrieval results
results = evaluation.evaluate(
    scores=scores,
    qrels=qrels,
    queries=queries,
    metrics=[f"ndcg@{k}" for k in [1, 3, 5, 10, 100]] # NDCG for different k values
    + [f"hits@{k}" for k in [1, 3, 5, 10, 100]]       # Hits at different k values
    + ["map"]                                         # Mean Average Precision (MAP)
    + ["recall@10", "recall@100"]                     # Recall at k
    + ["precision@10", "precision@100"],              # Precision at k
)

print(results)

The output is a dictionary containing various evaluation metrics. Here’s a sample output:

{
    "ndcg@1": 0.47333333333333333,
    "ndcg@3": 0.543862513095773,
    "ndcg@5": 0.5623210323686343,
    "ndcg@10": 0.5891793972249917,
    "ndcg@100": 0.5891793972249917,
    "hits@1": 0.47333333333333333,
    "hits@3": 0.64,
    "hits@5": 0.7033333333333334,
    "hits@10": 0.8,
    "hits@100": 0.8,
    "map": 0.5442202380952381,
    "recall@10": 0.7160555555555556,
    "recall@100": 0.7160555555555556,
    "precision@10": 0.08,
    "precision@100": 0.008000000000000002,
}
Info
  1. is_query flag: Always set is_query=True when encoding queries and is_query=False when encoding documents, so that the model applies the correct prefix to each input type.

  2. Evaluation metrics: The pipeline supports a wide range of evaluation metrics, including NDCG, hits, MAP, recall, and precision, with different cutoff points.

  3. Relevance judgments (qrels): The qrels provide the ground truth used to measure how well the retrieved documents match the relevant ones; a toy example is sketched below.
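
For intuition, here is a minimal, self-contained sketch of how qrels relate retrieved documents to a metric such as precision@k. The query and document identifiers are made up for illustration, and this is not how PyLate computes metrics internally (it delegates to Ranx):

# Hypothetical qrels: relevant documents per query (1 = relevant)
qrels = {"q1": {"doc3": 1, "doc7": 1}}

# Hypothetical ranked document ids returned by a retriever for query "q1"
retrieved = ["doc3", "doc9", "doc7"]

k = 3
relevant_in_top_k = sum(1 for doc_id in retrieved[:k] if qrels["q1"].get(doc_id, 0) > 0)
print(f"precision@{k} = {relevant_in_top_k / k:.3f}")  # 2 of the top 3 are relevant -> 0.667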

BEIR datasets

The following table lists the datasets available in the BEIR benchmark along with their BEIR names, available splits, number of queries, and corpus size. Source: BEIR Datasets

| Dataset | BEIR-Name | Splits | Queries | Corpus |
| --- | --- | --- | --- | --- |
| MSMARCO | msmarco | train, dev, test | 6,980 | 8,840,000 |
| TREC-COVID | trec-covid | test | 50 | 171,000 |
| NFCorpus | nfcorpus | train, dev, test | 323 | 3,600 |
| BioASQ | bioasq | train, test | 500 | 14,910,000 |
| NQ | nq | train, test | 3,452 | 2,680,000 |
| HotpotQA | hotpotqa | train, dev, test | 7,405 | 5,230,000 |
| FiQA-2018 | fiqa | train, dev, test | 648 | 57,000 |
| Signal-1M(RT) | signal1m | test | 97 | 2,860,000 |
| TREC-NEWS | trec-news | test | 57 | 595,000 |
| Robust04 | robust04 | test | 249 | 528,000 |
| ArguAna | arguana | test | 1,406 | 8,670 |
| Touche-2020 | webis-touche2020 | test | 49 | 382,000 |
| CQADupstack | cqadupstack | test | 13,145 | 457,000 |
| Quora | quora | dev, test | 10,000 | 523,000 |
| DBPedia | dbpedia-entity | dev, test | 400 | 4,630,000 |
| SCIDOCS | scidocs | test | 1,000 | 25,000 |
| FEVER | fever | train, dev, test | 6,666 | 5,420,000 |
| Climate-FEVER | climate-fever | test | 1,535 | 5,420,000 |
| SciFact | scifact | train, test | 300 | 5,000 |
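
To run the same pipeline on another BEIR dataset, set dataset to the corresponding BEIR-Name from the table in Step 1 and rerun the remaining steps; for example (nfcorpus is just one possible choice):

dataset = "nfcorpus"  # any BEIR-Name from the table above
documents, queries, qrels = evaluation.load_beir(dataset, split="test")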

Custom datasets

You can also run the evaluation on a custom dataset using the following file structure (a minimal example of creating these files is sketched after the list):

  • corpus.jsonl: each row contains a json element with two properties: ['_id', 'text']
    • _id refers to the document identifier.
    • text contains the text of the document.
    • (an additional title field can also be added if necessary)
  • queries.jsonl: each row contains a json element with two properties: ['_id', 'text']
    • _id refers to the query identifier.
    • text contains the text of the query.
  • qrels folder contains tsv files with three columns: ['query-id', 'doc-id', 'score']
    • query-id refers to the query identifier.
    • doc-id refers to the document identifier.
    • score contains the relevance of the document to the query (1 if relevant, else 0). The name of each TSV file corresponds to the split (e.g., "dev").
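
As an illustration, here is a minimal, hypothetical sketch that writes such a dataset with plain Python; the folder name custom_dataset, the toy documents and query, and the header row in the qrels file are assumptions for the example, not requirements stated by PyLate:

import json
import pathlib

root = pathlib.Path("custom_dataset")  # hypothetical dataset folder
(root / "qrels").mkdir(parents=True, exist_ok=True)

# corpus.jsonl: one JSON object per line with "_id" and "text" (and optionally "title")
with open(root / "corpus.jsonl", "w") as f:
    f.write(json.dumps({"_id": "doc1", "text": "PyLate indexes multi-vector embeddings."}) + "\n")
    f.write(json.dumps({"_id": "doc2", "text": "BEIR is a heterogeneous retrieval benchmark."}) + "\n")

# queries.jsonl: one JSON object per line with "_id" and "text"
with open(root / "queries.jsonl", "w") as f:
    f.write(json.dumps({"_id": "q1", "text": "What does PyLate index?"}) + "\n")

# qrels/dev.tsv: tab-separated query-id, doc-id, score; the file name matches the split
with open(root / "qrels" / "dev.tsv", "w") as f:
    f.write("query-id\tdoc-id\tscore\n")  # header row; drop it if the loader does not expect one
    f.write("q1\tdoc1\t1\n")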

You can then use the same pipeline as with BEIR datasets by replacing the data loading in Step 3:

documents, queries, qrels = evaluation.load_custom_dataset(
    "custom_dataset", split="dev"
)

Metrics

PyLate evaluation relies on the Ranx Python library to compute standard information retrieval metrics. The following metrics are supported:

| Metric | Alias | @k |
| --- | --- | --- |
| Hits | hits | Yes |
| Hit Rate / Success | hit_rate | Yes |
| Precision | precision | Yes |
| Recall | recall | Yes |
| F1 | f1 | Yes |
| R-Precision | r_precision | No |
| Bpref | bpref | No |
| Rank-biased Precision | rbp | No |
| Mean Reciprocal Rank | mrr | Yes |
| Mean Average Precision | map | Yes |
| DCG | dcg | Yes |
| DCG Burges | dcg_burges | Yes |
| NDCG | ndcg | Yes |
| NDCG Burges | ndcg_burges | Yes |

For more details about these metrics, please refer to the Ranx documentation.

Sample code to evaluate the retrieval results using specific metrics:

results = evaluation.evaluate(
    scores=scores,
    qrels=qrels,
    queries=queries,
    metrics=[f"ndcg@{k}" for k in [1, 3, 5, 10, 100]] # NDCG for different k values
    + [f"hits@{k}" for k in [1, 3, 5, 10, 100]]       # Hits at different k values
    + ["map"]                                         # Mean Average Precision (MAP)
    + ["recall@10", "recall@100"]                     # Recall at k
    + ["precision@10", "precision@100"],              # Precision at k
)