Voyager

The Voyager index is a fast and efficient index for approximate nearest neighbor search. It is backed by Spotify's Voyager library, which implements an HNSW graph.

Parameters

index_folder ('str') – defaults to indexes
index_name ('str') – defaults to colbert
override ('bool') – defaults to False

Whether to overwrite the index if it already exists.
embedding_size ('int') – defaults to 128

The number of dimensions of the embeddings.
M ('int') – defaults to 64

The maximum number of connections per node in the index graph. Larger values improve recall at the cost of memory.
ef_construction ('int') – defaults to 200

The number of candidates to evaluate during the construction of the index.
ef_search ('int') – defaults to 200

The number of candidates to evaluate during the search.
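To make the roles of M, ef_construction and ef_search concrete, here is a minimal pure-Python sketch of a single-layer proximity graph. It is an illustration under simplifying assumptions, not Voyager's implementation: real HNSW uses several layers and prunes neighbor lists, and the helpers `distance`, `search` and `build` are hypothetical names introduced only for this example.

```python
import heapq
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(vectors, graph, query, entry, ef):
    """Best-first graph search that keeps at most `ef` candidates."""
    d0 = distance(query, vectors[entry])
    visited = {entry}
    frontier = [(d0, entry)]  # min-heap: closest unexplored node first
    best = [(-d0, entry)]     # max-heap (negated): the `ef` closest nodes so far
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) >= ef and d > -best[0][0]:
            break  # remaining candidates are farther than the worst kept result
        for neighbor in graph[node]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            dn = distance(query, vectors[neighbor])
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(frontier, (dn, neighbor))
                heapq.heappush(best, (-dn, neighbor))
                if len(best) > ef:
                    heapq.heappop(best)
    # Return (distance, node) pairs, closest first.
    return sorted((-d, node) for d, node in best)


def build(vectors, M, ef_construction):
    """Insert points one by one, linking each to its M closest candidates."""
    graph = {0: set()}
    for node in range(1, len(vectors)):
        candidates = search(vectors, graph, vectors[node], entry=0, ef=ef_construction)
        graph[node] = set()
        for _, neighbor in candidates[:M]:
            # Links are bidirectional; unlike real HNSW, lists are not pruned to M.
            graph[node].add(neighbor)
            graph[neighbor].add(node)
    return graph


random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(200)]
graph = build(vectors, M=8, ef_construction=32)
query = [random.random() for _ in range(8)]

exact = sorted((distance(query, v), i) for i, v in enumerate(vectors))[:5]
approx = search(vectors, graph, query, entry=0, ef=32)[:5]
full = search(vectors, graph, query, entry=0, ef=len(vectors))[:5]
```

In this sketch, a small `ef` explores fewer nodes and may miss true neighbors, while setting `ef` to the number of points visits the whole connected graph and recovers the exact result. The same trade-off between speed and recall is what ef_construction and ef_search control.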

Examples

>>> from pylate import indexes, models

>>> index = indexes.Voyager(
...     index_folder="test_indexes",
...     index_name="colbert",
...     override=True,
...     embedding_size=128,
... )

>>> model = models.ColBERT(
...     model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
... )

>>> documents_embeddings = model.encode(
...     ["fruits are healthy.", "fruits are good for health.", "fruits are bad for health."],
...     is_query=False,
... )

>>> index = index.add_documents(
...     documents_ids=["1", "2", "3"],
...     documents_embeddings=documents_embeddings
... )

>>> queries_embeddings = model.encode(
...     ["fruits are healthy.", "fruits are good for health and fun."],
...     is_query=True,
... )

>>> matchs = index(queries_embeddings, k=30)

>>> assert matchs["distances"].shape[0] == 2
>>> assert isinstance(matchs, dict)
>>> assert "documents_ids" in matchs
>>> assert "distances" in matchs

>>> queries_embeddings = model.encode(
...     "fruits are healthy.",
...     is_query=True,
... )

>>> matchs = index(queries_embeddings, k=30)

>>> assert matchs["distances"].shape[0] == 1
>>> assert isinstance(matchs, dict)
>>> assert "documents_ids" in matchs
>>> assert "distances" in matchs

>>> # Reopen the existing index instead of recreating it.
>>> index = indexes.Voyager(
...     index_folder="test_indexes",
...     index_name="colbert",
...     override=False,
... )

>>> matchs = index(queries_embeddings, k=30)

>>> assert isinstance(matchs, dict)
>>> assert "documents_ids" in matchs
>>> assert "distances" in matchs

>>> # Remove a document, then query again.
>>> index = index.remove_documents(
...     documents_ids=["1"],
... )

>>> matchs = index(queries_embeddings, k=30)

>>> assert isinstance(matchs, dict)
>>> assert "documents_ids" in matchs
>>> assert "distances" in matchs

>>> # Add the removed document back; embeddings are passed as a list, one entry per document.
>>> index = index.add_documents(
...     documents_ids=["1"],
...     documents_embeddings=[documents_embeddings[0]],
... )

>>> matchs = index(queries_embeddings, k=30)

>>> assert isinstance(matchs, dict)
>>> assert "documents_ids" in matchs
>>> assert "distances" in matchs

Methods

__call__

Query the index for the k nearest neighbors of the query embeddings.

Parameters

queries_embeddings ('np.ndarray | torch.Tensor')
k ('int') – defaults to 10
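Because a ColBERT query is a set of token embeddings, a token-level index yields neighbors per query token rather than per query. The sketch below shows one way to collapse such results into a candidate set for a single query. It assumes the result holds, for each query token, a list of document ids; that layout is an assumption, not something this page specifies, and `candidate_documents` is a hypothetical helper.

```python
def candidate_documents(token_matches):
    """Union of the document ids hit by any query token, in first-seen order."""
    seen = {}
    for neighbors in token_matches:
        for doc_id in neighbors:
            seen.setdefault(doc_id, True)  # dict keeps insertion order
    return list(seen)


# Hypothetical per-token neighbor lists for one query with three tokens.
token_matches = [["2", "1"], ["2", "3"], ["1", "2"]]
candidates = candidate_documents(token_matches)
```

The resulting candidates are typically re-scored afterwards, since token-level distances alone do not rank whole documents.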

add_documents

Add documents to the index.

Parameters

documents_ids ('str | list[str]')
documents_embeddings ('list[np.ndarray | torch.Tensor]')
batch_size ('int') – defaults to 2000
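The batch_size parameter suggests that documents are inserted in chunks rather than all at once. How the batching is done internally is not described here; the following is a minimal sketch of the general idea, with `iter_batches` as a hypothetical helper name.

```python
def iter_batches(ids, embeddings, batch_size=2000):
    """Yield aligned (ids, embeddings) chunks of at most `batch_size` documents."""
    if len(ids) != len(embeddings):
        raise ValueError("ids and embeddings must have the same length")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield ids[start:end], embeddings[start:end]


# Five documents, each with a placeholder one-token embedding.
ids = ["a", "b", "c", "d", "e"]
embeddings = [[[0.0]] for _ in ids]
batches = list(iter_batches(ids, embeddings, batch_size=2))
```

Smaller batches lower peak memory during insertion, at the price of more calls into the index.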

get_documents_embeddings

Retrieve document embeddings for re-ranking from Voyager.

Parameters

document_ids ('list[list[str]]')
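The retrieved embeddings are meant for re-ranking, which in ColBERT is done with the MaxSim late-interaction score: for each query token, take the best similarity with any document token, then sum over query tokens. The sketch below illustrates that score in pure Python, assuming dot-product similarity; `dot`, `maxsim` and `rerank` are hypothetical helpers, not part of this class.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens, document_tokens):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in document_tokens) for q in query_tokens)


def rerank(query_tokens, candidates):
    """Sort candidate ids by descending MaxSim score.

    `candidates` maps a document id to its list of token embeddings.
    """
    scores = {doc_id: maxsim(query_tokens, tokens) for doc_id, tokens in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy two-dimensional token embeddings.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "1": [[1.0, 0.0], [0.0, 1.0]],
    "2": [[1.0, 0.0]],
    "3": [[0.6, 0.6]],
}
ranking = rerank(query_tokens, candidates)
```

The nested document_ids argument fits this pattern if each inner list holds the candidate ids for one query, though that reading is an assumption based on the type alone.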

remove_documents

Remove documents from the index.

Parameters

documents_ids ('list[str]')