ScaNN

The ScaNN index provides fast and efficient approximate nearest neighbor search.

Important Notes:

- ScaNN is an approximate nearest neighbor search (not exact), designed for large-scale datasets
- For ColBERT retrieval, PLAID is typically faster and more accurate as it's optimized for ColBERT scoring
- ScaNN is CPU-only (no GPU acceleration)
- Parameters are auto-tuned based on dataset size if not specified

To use this index, you need to install the scann extra:

pip install "pylate[scann]"

or install scann directly:

pip install scann
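
Once the package is installed, the index can be configured with the parameters documented on this page. The following is a minimal sketch rather than an authoritative recipe: the import path `from pylate import indexes` and the folder name `pylate-index` are assumptions, while the keyword names and default values are taken from the Parameters section.

```python
# Keyword arguments taken from the Parameters section of this page.
# The folder name is a hypothetical value chosen for illustration.
scann_kwargs = {
    "index_folder": "pylate-index",  # hypothetical path; None means nothing is persisted
    "index_name": "ScaNN_index",
    "embedding_size": 128,
    "num_neighbors": 10,
    "num_leaves": None,  # None lets the index auto-tune from dataset size
    "num_leaves_to_search": None,  # None lets the index auto-tune from dataset size
    "store_embeddings": True,  # required for get_documents_embeddings
    "override": False,  # False loads an existing index if one is found
}


def build_index(**kwargs):
    """Create the index; pylate is imported lazily so the config stays importable."""
    from pylate import indexes  # assumed import path for this class

    return indexes.ScaNN(**kwargs)


# index = build_index(**scann_kwargs)  # needs the scann extra to be installed
```

Keeping the configuration in a plain dictionary makes it easy to log or compare settings between runs before the index is actually built.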

Parameters

index_name ('str | None') – defaults to ScaNN_index

The name of the index collection.
embedding_size ('int') – defaults to 128

The number of dimensions of the embeddings.
num_neighbors ('int | None') – defaults to 10

The number of neighbors to use for the ScaNN searcher.
num_leaves ('int | None') – defaults to None

The number of leaves in the ScaNN tree. If None, auto-tuned based on dataset size. For small datasets (<100K vectors), fewer leaves are used for speed.
num_leaves_to_search ('int | None') – defaults to None

The number of leaves to search during query time. If None, auto-tuned based on dataset size. Higher values improve recall but slow down search.
dimensions_per_block ('int | None') – defaults to 2

The number of dimensions to use for each block. If None, auto-tuned based on dataset size.
anisotropic_quantization_threshold ('float | None') – defaults to 0.2

The threshold for anisotropic quantization. If None, auto-tuned based on dataset size.
training_sample_size ('int | None') – defaults to None

The number of samples to use for training the ScaNN index.
verbose ('bool | str') – defaults to none

Verbosity configuration:

- False or "none": disable logs
- True or "init" or "all": log build/load/indexing
use_autopilot ('bool') – defaults to False

Whether to use ScaNN's autopilot() method for automatic parameter tuning. If True, overrides num_leaves, num_leaves_to_search, and training_sample_size.
store_embeddings ('bool') – defaults to True

Whether to store the embeddings in the index. This is required to use the get_documents_embeddings method.
index_folder ('str | None') – defaults to None

The folder where the index will be saved/loaded. If None, indices are not persisted to disk.
override ('bool') – defaults to False

Whether to override the index if it already exists. If False and the index exists, it will be loaded.
verbose_level ('str | None') – defaults to None

Backward-compatible alias for verbosity scope ("none", "init", "all"). If set, it overrides verbose.
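
The trade-off behind the num_leaves and num_leaves_to_search parameters above can be illustrated with a toy partitioned search in plain Python. This is only a sketch of the general idea, not ScaNN's actual algorithm (which also trains its partitions and quantizes vectors): the functions `build_leaves` and `search` are hypothetical names, centroids are picked at random, and squared Euclidean distance is used for simplicity.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_leaves(vectors, num_leaves, seed=0):
    """Partition vectors around randomly chosen centroids (no training step)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, num_leaves)
    leaves = [[] for _ in range(num_leaves)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(num_leaves), key=lambda c: sq_dist(vec, centroids[c]))
        leaves[nearest].append(idx)
    return centroids, leaves


def search(query, vectors, centroids, leaves, num_leaves_to_search, k):
    """Scan only the leaves whose centroids are closest to the query."""
    ranked = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in ranked[:num_leaves_to_search] for i in leaves[c]]
    top = sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]
    return top, len(candidates)


rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
query = [rng.gauss(0, 1) for _ in range(8)]
centroids, leaves = build_leaves(vectors, num_leaves=20)

# Exact answer by brute force, for comparison.
exact = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:10]
approx, scanned_few = search(query, vectors, centroids, leaves, num_leaves_to_search=3, k=10)
full, scanned_all = search(query, vectors, centroids, leaves, num_leaves_to_search=20, k=10)

recall = len(set(approx) & set(exact)) / len(exact)
print(f"scanned {scanned_few}/{scanned_all} vectors, recall@10 = {recall:.2f}")
```

Searching every leaf reproduces the brute-force result, while searching fewer leaves scans fewer vectors and may miss some true neighbors. That is the recall versus speed trade-off described for num_leaves_to_search.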

Methods

__call__

Query the index for the nearest neighbors of the query embeddings.

Parameters

queries_embeddings ('list[list[int | float]]')
k ('int') – defaults to 10

add_documents

Add documents to the index.

Note: This method only supports adding all documents at once. Subsequent calls will raise an error. batch_size is kept for API compatibility but not used.

Parameters

documents_ids ('list[str]')
documents_embeddings ('list[torch.Tensor | np.ndarray]')
batch_size ('int') – defaults to 128
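
Because add_documents accepts all documents in a single call, embeddings produced in batches have to be merged first. The sketch below shows one way to do that. `collect_batches` and `RecordingIndex` are hypothetical names: the latter is a stand-in that only mimics the documented one-shot behaviour, and the exact error type raised by the real index is an assumption.

```python
def collect_batches(batches):
    """Merge (ids, embeddings) batches into the two flat lists add_documents expects."""
    all_ids, all_embeddings = [], []
    for ids, embeddings in batches:
        if len(ids) != len(embeddings):
            raise ValueError("each batch needs one embedding entry per document ID")
        all_ids.extend(ids)
        all_embeddings.extend(embeddings)
    return all_ids, all_embeddings


class RecordingIndex:
    """Hypothetical stand-in that mimics the documented one-shot behaviour."""

    def __init__(self):
        self.documents_ids = None
        self.documents_embeddings = None

    def add_documents(self, documents_ids, documents_embeddings, batch_size=128):
        # batch_size is accepted but ignored, as the note above describes.
        if self.documents_ids is not None:
            raise RuntimeError("documents were already added")  # assumed error type
        self.documents_ids = list(documents_ids)
        self.documents_embeddings = list(documents_embeddings)


# Two batches of token-level embeddings. Plain nested lists are used here so the
# sketch runs without extra packages; the real method takes torch.Tensor or
# np.ndarray entries.
batches = [
    (["doc-1", "doc-2"], [[[0.1, 0.2]], [[0.3, 0.4], [0.5, 0.6]]]),
    (["doc-3"], [[[0.7, 0.8]]]),
]

ids, embeddings = collect_batches(batches)
index = RecordingIndex()
index.add_documents(documents_ids=ids, documents_embeddings=embeddings)
```

With the real index, the same two merged lists would be passed to a single add_documents call.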

get_documents_embeddings

Get document embeddings by their IDs.

Parameters

documents_ids ('list[list[str]]')

Returns

list[list[np.ndarray]]: The embeddings of the requested documents.

remove_documents

Remove documents from the index.

Not supported for ScaNN index.

Parameters

documents_ids ('list[str]')
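
Since removal is not supported, one possible workaround is to rebuild the index from the documents that should remain. This is an assumption about usage rather than something this page prescribes, and `keep_documents` is a hypothetical helper. The sketch only filters the IDs and embeddings; the result would then be passed to add_documents on a fresh index.

```python
def keep_documents(documents_ids, documents_embeddings, ids_to_remove):
    """Return the IDs and embeddings that survive after dropping ids_to_remove."""
    drop = set(ids_to_remove)
    kept = [
        (doc_id, emb)
        for doc_id, emb in zip(documents_ids, documents_embeddings)
        if doc_id not in drop
    ]
    kept_ids = [doc_id for doc_id, _ in kept]
    kept_embeddings = [emb for _, emb in kept]
    return kept_ids, kept_embeddings


# Plain nested lists stand in for the per-document embedding arrays.
ids = ["doc-1", "doc-2", "doc-3"]
embeddings = [[[0.1, 0.2]], [[0.3, 0.4]], [[0.5, 0.6]]]
kept_ids, kept_embeddings = keep_documents(ids, embeddings, ids_to_remove=["doc-2"])
```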

save

Save the index to disk.