
ScaNN

The ScaNN index is a fast, efficient index for approximate nearest neighbor search.

Important Notes:

  • ScaNN is an approximate nearest neighbor search (not exact), designed for large-scale datasets.
  • For ColBERT retrieval, PLAID is typically faster and more accurate because it is optimized for ColBERT scoring.
  • ScaNN is CPU-only (no GPU acceleration).
  • Parameters are auto-tuned based on dataset size if not specified.

To use this index, you need to install the scann extra:

    pip install "pylate[scann]"

or install scann directly:

    pip install scann
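
Because scann is an optional dependency, it can help to check for it before building an index. The following is a minimal sketch using only the standard library; the helper name `scann_available` is hypothetical and not part of PyLate.

```python
import importlib.util


def scann_available() -> bool:
    """Return True if the optional `scann` package can be found on this system."""
    # find_spec looks the package up without importing it.
    return importlib.util.find_spec("scann") is not None


if __name__ == "__main__":
    if scann_available():
        print("scann is installed; the ScaNN index can be used.")
    else:
        print('scann is missing; try: pip install "pylate[scann]"')
```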

Parameters

  • index_name ('str | None') – defaults to ScaNN_index

    The name of the index collection.

  • embedding_size ('int') – defaults to 128

    The number of dimensions of the embeddings.

  • num_neighbors ('int | None') – defaults to 10

    The number of neighbors to use for the ScaNN searcher.

  • num_leaves ('int | None') – defaults to None

    The number of leaves in the ScaNN tree. If None, auto-tuned based on dataset size. For small datasets (<100K vectors), fewer leaves are used for speed.

  • num_leaves_to_search ('int | None') – defaults to None

    The number of leaves to search during query time. If None, auto-tuned based on dataset size. Higher values improve recall but slow down search.

  • dimensions_per_block ('int | None') – defaults to 2

    The number of dimensions to use for each block. If None, auto-tuned based on dataset size.

  • anisotropic_quantization_threshold ('float | None') – defaults to 0.2

    The threshold for anisotropic quantization. If None, auto-tuned based on dataset size.

  • training_sample_size ('int | None') – defaults to None

    The number of samples to use for training the ScaNN index.

  • verbose ('bool | str') – defaults to none

    Verbosity configuration:

      • False or "none": disable logs
      • True, "init", or "all": log build/load/indexing

  • use_autopilot ('bool') – defaults to False

    Whether to use ScaNN's autopilot() method for automatic parameter tuning. If True, it overrides num_leaves, num_leaves_to_search, and training_sample_size.

  • store_embeddings ('bool') – defaults to True

    Whether to store the embeddings in the index. Storing them is required to use the get_documents_embeddings method.

  • index_folder ('str | None') – defaults to None

    The folder where the index is saved to and loaded from. If None, the index is not persisted to disk.

  • override ('bool') – defaults to False

    Whether to override the index if it already exists. If False and the index exists, it is loaded instead.

  • verbose_level ('str | None') – defaults to None

    Backward-compatible alias for verbosity scope ("none", "init", "all"). If set, it overrides verbose.
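
The parameters above can be set explicitly instead of relying on auto-tuning. The sketch below builds keyword arguments from the documented parameter names. Two things in it are assumptions: the leaf counts follow a common ScaNN rule of thumb (roughly the square root of the number of vectors), which is not necessarily what PyLate's auto-tuning does, and the class is assumed to be importable as `pylate.indexes.ScaNN`. The helper names `manual_scann_kwargs` and `build_index` are hypothetical.

```python
import math


def manual_scann_kwargs(num_vectors: int, search_fraction: float = 0.1) -> dict:
    """Build explicit keyword arguments for the ScaNN index.

    Assumption: num_leaves is about sqrt(num_vectors) and a fixed fraction
    of the leaves is searched. This is a rule of thumb, not PyLate's auto-tuning.
    """
    num_leaves = max(1, round(math.sqrt(num_vectors)))
    # Search at least one leaf and never more than the number of leaves.
    num_leaves_to_search = max(1, min(num_leaves, round(num_leaves * search_fraction)))
    return {
        "index_name": "ScaNN_index",
        "embedding_size": 128,
        "num_leaves": num_leaves,
        "num_leaves_to_search": num_leaves_to_search,
        "store_embeddings": True,  # needed for get_documents_embeddings
        "index_folder": None,      # None means the index is not persisted
        "override": False,
        "verbose": False,
    }


def build_index(num_vectors: int):
    """Create the index; requires pylate with the scann extra installed."""
    from pylate import indexes  # assumed import path for the ScaNN class

    return indexes.ScaNN(**manual_scann_kwargs(num_vectors))
```

Raising `search_fraction` searches more leaves per query, which, as noted for num_leaves_to_search, improves recall at the cost of speed.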

Methods

__call__

Query the index for the nearest neighbors of the query embeddings.

Parameters

  • queries_embeddings ('list[list[int | float]]')
  • k ('int') – defaults to 10
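
Querying goes through the index's call interface. The wrapper below is a small sketch that only relies on the documented parameters (`queries_embeddings` and `k`); the name `search` is hypothetical, and the structure of the returned value is not specified here, so it is passed through unchanged.

```python
def search(index, queries_embeddings, k: int = 10):
    """Query an index through its call interface and return its result as-is."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Equivalent to index.__call__(queries_embeddings, k=k).
    return index(queries_embeddings, k=k)
```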

add_documents

Add documents to the index.

Note: This method only supports adding all documents at once. Subsequent calls will raise an error. batch_size is kept for API compatibility but not used.

Parameters

  • documents_ids ('list[str]')
  • documents_embeddings ('list[torch.Tensor | np.ndarray]')
  • batch_size ('int') – defaults to 128
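
Since only a single add_documents call is supported, documents produced in batches need to be collected first. The following sketch shows one way to do that; `add_all_at_once` is a hypothetical helper, and `batches` is assumed to be an iterable of `(ids, embeddings)` pairs.

```python
def add_all_at_once(index, batches) -> int:
    """Collect (ids, embeddings) batches and add them with one add_documents call."""
    all_ids, all_embeddings = [], []
    for ids, embeddings in batches:
        if len(ids) != len(embeddings):
            raise ValueError("each batch needs one embedding per document ID")
        all_ids.extend(ids)
        all_embeddings.extend(embeddings)
    # Single call: subsequent calls on the same index would raise an error.
    index.add_documents(documents_ids=all_ids, documents_embeddings=all_embeddings)
    return len(all_ids)
```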

get_documents_embeddings

Get document embeddings by their IDs.

Parameters

  • documents_ids ('list[list[str]]')

Returns

list[list[np.ndarray]]: The embeddings of the requested documents.
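
As an illustration of the nested input and output, the sketch below pairs each requested ID with its embeddings. It assumes that the returned lists follow the same order and nesting as `documents_ids`, which is not stated above, and that the index was created with store_embeddings set to True. The name `embeddings_by_id` is hypothetical.

```python
def embeddings_by_id(index, documents_ids):
    """Return one {document_id: embeddings} dict per inner list of IDs.

    Assumption: the result mirrors the order and nesting of documents_ids.
    """
    nested = index.get_documents_embeddings(documents_ids)
    return [dict(zip(ids, embs)) for ids, embs in zip(documents_ids, nested)]
```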

remove_documents

Remove documents from the index.

Not supported for ScaNN index.

Parameters

  • documents_ids ('list[str]')
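
Because removal is not supported, one possible workaround is to rebuild the index from the documents that should remain. This is only a sketch under assumptions: `rebuild_without` and the `make_index` factory are hypothetical, and the caller is assumed to still hold the original IDs and embeddings.

```python
def rebuild_without(make_index, documents_ids, documents_embeddings, ids_to_remove):
    """Build a fresh index that contains every document except ids_to_remove."""
    drop = set(ids_to_remove)
    kept = [
        (doc_id, emb)
        for doc_id, emb in zip(documents_ids, documents_embeddings)
        if doc_id not in drop
    ]
    if not kept:
        raise ValueError("no documents left to index")
    index = make_index()  # e.g. a function that creates a new, empty index
    index.add_documents(
        documents_ids=[doc_id for doc_id, _ in kept],
        documents_embeddings=[emb for _, emb in kept],
    )
    return index
```

When the index is persisted through index_folder, the factory would likely need override set to True so that the existing index is replaced rather than loaded.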

save

Save the index to disk.