ScaNN
The ScaNN index is a fast and efficient index for approximate nearest neighbor search.
Important Notes:
- ScaNN is an approximate nearest neighbor search (not exact), designed for large-scale datasets
- For ColBERT retrieval, PLAID is typically faster and more accurate as it's optimized for ColBERT scoring
- ScaNN is CPU-only (no GPU acceleration)
- Parameters are auto-tuned based on dataset size if not specified
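To make the "approximate, not exact" trade-off concrete, here is a minimal pure-Python sketch of partition-based search. It is not ScaNN's implementation: the centroids are sampled rather than trained, there is no quantization, and the square-root choice for the number of leaves is only a common rule of thumb, not necessarily the rule this index applies when auto-tuning. All function names are hypothetical.

```python
import math
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_search(vectors, query, k):
    """Brute force: score every vector and keep the k closest."""
    ranked = sorted(range(len(vectors)), key=lambda i: squared_distance(vectors[i], query))
    return ranked[:k]


def build_leaves(vectors, num_leaves, seed=0):
    """Assign each vector to its nearest centroid (centroids are sampled, not trained)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, num_leaves)
    leaves = [[] for _ in centroids]
    for i, vector in enumerate(vectors):
        nearest = min(range(num_leaves), key=lambda c: squared_distance(centroids[c], vector))
        leaves[nearest].append(i)
    return centroids, leaves


def approximate_search(vectors, centroids, leaves, query, k, num_leaves_to_search):
    """Score only the vectors stored in the leaves closest to the query."""
    by_distance = sorted(range(len(centroids)), key=lambda c: squared_distance(centroids[c], query))
    candidates = [i for leaf in by_distance[:num_leaves_to_search] for i in leaves[leaf]]
    candidates.sort(key=lambda i: squared_distance(vectors[i], query))
    return candidates[:k]


rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(400)]
query = [rng.gauss(0, 1) for _ in range(8)]

# Assumption: a square-root rule of thumb for the number of leaves.
num_leaves = round(math.sqrt(len(vectors)))
centroids, leaves = build_leaves(vectors, num_leaves)
exact = exact_search(vectors, query, k=10)

for num_leaves_to_search in (1, 5, num_leaves):
    approx = approximate_search(vectors, centroids, leaves, query, 10, num_leaves_to_search)
    recall = len(set(approx) & set(exact)) / len(exact)
    print(f"leaves searched: {num_leaves_to_search:2d}  recall@10: {recall:.2f}")
```

Searching every leaf reproduces the exact result, while searching fewer leaves scores fewer vectors and may miss some true neighbors. This is the recall versus speed trade-off that num_leaves_to_search controls.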
To use this index, you need to install the scann extra:
pip install "pylate[scann]"
or install scann directly:
pip install scann
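Before constructing the index, it can be handy to confirm that the optional dependency is importable. The following is a small sketch using only the standard library; the helper name scann_available is hypothetical and not part of the library.

```python
import importlib.util


def scann_available() -> bool:
    """Return True if the `scann` package can be found in this environment."""
    return importlib.util.find_spec("scann") is not None


if not scann_available():
    print('scann is not installed; run: pip install "pylate[scann]"')
```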
Parameters
- index_name ('str | None') – defaults to ScaNN_index
  The name of the index collection.
- embedding_size ('int') – defaults to 128
  The number of dimensions of the embeddings.
- num_neighbors ('int | None') – defaults to 10
  The number of neighbors to use for the ScaNN searcher.
- num_leaves ('int | None') – defaults to None
  The number of leaves in the ScaNN tree. If None, auto-tuned based on dataset size. For small datasets (<100K vectors), fewer leaves are used for speed.
- num_leaves_to_search ('int | None') – defaults to None
  The number of leaves to search during query time. If None, auto-tuned based on dataset size. Higher values improve recall but slow down search.
- dimensions_per_block ('int | None') – defaults to 2
  The number of dimensions to use for each block. If None, auto-tuned based on dataset size.
- anisotropic_quantization_threshold ('float | None') – defaults to 0.2
  The threshold for anisotropic quantization. If None, auto-tuned based on dataset size.
- training_sample_size ('int | None') – defaults to None
  The number of samples to use for training the ScaNN index.
- verbose ('bool | str') – defaults to none
  Verbosity configuration:
  - False or "none": disable logs
  - True or "init" or "all": log build/load/indexing
- use_autopilot ('bool') – defaults to False
  Whether to use ScaNN's autopilot() method for automatic parameter tuning. If True, overrides num_leaves, num_leaves_to_search, and training_sample_size.
- store_embeddings ('bool') – defaults to True
  Whether to store the embeddings in the index. This is required to use the get_documents_embeddings method.
- index_folder ('str | None') – defaults to None
  The folder where the index will be saved/loaded. If None, indices are not persisted to disk.
- override ('bool') – defaults to False
  Whether to override the index if it already exists. If False and the index exists, it will be loaded.
- verbose_level ('str | None') – defaults to None
  Backward-compatible alias for verbosity scope ("none", "init", "all"). If set, it overrides verbose.
Methods
__call__
Query the index for the nearest neighbors of the query embeddings.
Parameters
- queries_embeddings ('list[list[int | float]]')
- k ('int') – defaults to 10
add_documents
Add documents to the index.
Note: This method only supports adding all documents at once. Subsequent calls will raise an error. batch_size is kept for API compatibility but not used.
Parameters
- documents_ids ('list[str]')
- documents_embeddings ('list[torch.Tensor | np.ndarray]')
- batch_size ('int') – defaults to 128
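The add-once behavior can be pictured with a toy stand-in. The class below is hypothetical and is not the library's implementation. It assumes, for illustration only, that each document's token embeddings are flattened into one list of rows with a parallel list mapping every row back to its document ID, and that a second call raises a RuntimeError (the exception type used by the real index is not stated here).

```python
class OneShotIndex:
    """Toy stand-in (hypothetical) that accepts documents exactly once."""

    def __init__(self):
        self.vectors = []             # one row per token embedding
        self.row_to_document_id = []  # parallel list: row -> owning document ID
        self.built = False

    def add_documents(self, documents_ids, documents_embeddings, batch_size=128):
        # batch_size is accepted for signature compatibility and ignored.
        if self.built:
            raise RuntimeError("documents were already added; build a new index instead")
        if len(documents_ids) != len(documents_embeddings):
            raise ValueError("documents_ids and documents_embeddings must have the same length")
        for document_id, token_embeddings in zip(documents_ids, documents_embeddings):
            for token_embedding in token_embeddings:
                self.vectors.append(list(token_embedding))
                self.row_to_document_id.append(document_id)
        self.built = True
        return self


index = OneShotIndex().add_documents(
    documents_ids=["a", "b"],
    documents_embeddings=[[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]],
)
print(index.row_to_document_id)  # ['a', 'a', 'b']
```

In practice this means collecting every document ID and embedding first and passing them in a single call.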
get_documents_embeddings
Get document embeddings by their IDs.
Parameters
- documents_ids ('list[list[str]]')
Returns
list[list[np.ndarray]]: The embeddings of the requested documents.
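The nested types suggest that the result mirrors the shape of the input: one inner list per inner list of IDs. The sketch below illustrates that reading with a plain dictionary as a hypothetical embedding store and nested lists standing in for np.ndarray values. It is an interpretation of the signature, not the library's code.

```python
def lookup_embeddings(store, documents_ids):
    """Return embeddings for nested batches of IDs, preserving the nesting."""
    return [[store[document_id] for document_id in batch] for batch in documents_ids]


# Hypothetical store: document ID -> token embeddings for that document.
store = {
    "a": [[0.1, 0.2], [0.3, 0.4]],
    "b": [[0.5, 0.6]],
    "c": [[0.7, 0.8]],
}

result = lookup_embeddings(store, [["a", "b"], ["c"]])
print([len(batch) for batch in result])  # [2, 1]
```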
remove_documents
Remove documents from the index.
Not supported for the ScaNN index.
Parameters
- documents_ids ('list[str]')
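Since removal is not supported and documents can only be added once, one possible workaround is to filter the corpus yourself and build a fresh index from what remains. The helper below is a hypothetical sketch (drop_documents is not part of the library); it only prepares the inputs for a new add_documents call.

```python
def drop_documents(documents_ids, documents_embeddings, ids_to_remove):
    """Return the IDs and embeddings that remain after excluding ids_to_remove."""
    excluded = set(ids_to_remove)
    kept = [
        (document_id, embeddings)
        for document_id, embeddings in zip(documents_ids, documents_embeddings)
        if document_id not in excluded
    ]
    kept_ids = [document_id for document_id, _ in kept]
    kept_embeddings = [embeddings for _, embeddings in kept]
    return kept_ids, kept_embeddings


kept_ids, kept_embeddings = drop_documents(
    ["a", "b", "c"], [[[0.1]], [[0.2]], [[0.3]]], ids_to_remove=["b"]
)
print(kept_ids)  # ['a', 'c']
```

The filtered lists would then be passed to add_documents on a newly created index. If the index is persisted, the override parameter described above is presumably needed so that the old index is replaced rather than loaded.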
save
Save the index to disk.
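How saving interacts with index_folder and override can be summarized as a small decision rule. This sketch follows the parameter descriptions only; the on-disk layout (index_folder/index_name) and the helper name should_load_existing are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def should_load_existing(index_folder, index_name, override):
    """Decide whether an existing on-disk index would be reused."""
    if index_folder is None:
        return False  # nothing is persisted, so there is nothing to load
    # Assumption: the index lives at <index_folder>/<index_name>.
    index_path = Path(index_folder) / index_name
    return index_path.exists() and not override


with tempfile.TemporaryDirectory() as folder:
    before_save = should_load_existing(folder, "ScaNN_index", override=False)
    (Path(folder) / "ScaNN_index").mkdir()  # stand-in for a saved index
    after_save = should_load_existing(folder, "ScaNN_index", override=False)
    with_override = should_load_existing(folder, "ScaNN_index", override=True)

print(before_save, after_save, with_override)  # False True False
```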