TachiomIndex

TACHIOM index for late-interaction multi-vector retrieval.

Wraps the Rust-backed TACHIOM library (TAC + PQ + HNSW) as a drop-in PyLate index. Token-Aware Clustering groups token embeddings by vocabulary ID before k-means, which improves clustering speed and retrieval quality over standard k-means.

Encode documents with output_value=None to enable Token-Aware Clustering. The returned dict (keys: "token_embeddings", "input_ids", "masks", "attention_mask") carries vocabulary token IDs that add_documents extracts automatically:

    embeddings = model.encode(docs, is_query=False, output_value=None)
    index.add_documents(doc_ids, embeddings)

If the embeddings carry no token IDs and documents_token_ids is omitted, every token is assigned ID 0, so TAC degrades to a single global k-means and a UserWarning is issued.
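The sketch below illustrates why missing token IDs collapse TAC into one global clustering problem. It covers only the grouping step, in plain Python, and is not the library's implementation: `group_by_token_id` is a hypothetical helper, the vocabulary IDs are made up, and the per-group k-means that would follow is omitted.

```python
from collections import defaultdict


def group_by_token_id(token_embeddings, token_ids):
    """Bucket token vectors by vocabulary ID (hypothetical helper, illustration only)."""
    if len(token_embeddings) != len(token_ids):
        raise ValueError("need exactly one token ID per embedding")
    groups = defaultdict(list)
    for vector, token_id in zip(token_embeddings, token_ids):
        groups[token_id].append(vector)
    return dict(groups)


# Toy data: four 2-d token vectors with made-up vocabulary IDs.
vectors = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.7, 0.3]]

# With real IDs, each vocabulary entry forms its own group to cluster.
with_ids = group_by_token_id(vectors, [101, 101, 2054, 2003])

# Without IDs every token gets ID 0, leaving a single global group.
without_ids = group_by_token_id(vectors, [0] * len(vectors))
```

With real IDs the toy input splits into three groups; with the all-zero fallback it forms one group holding every vector, which is the single global k-means case.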

Parameters

index_folder ('str') – defaults to indexes

Directory that will contain the index sub-folder.
index_name ('str') – defaults to tachiom

Name of the index sub-folder inside index_folder.
override ('bool') – defaults to False

Delete and recreate the index directory if it already exists.
center_dataset ('bool') – defaults to True

Subtract the global mean token vector from all document vectors before building the index. Default: True. May improve HNSW quality.
total_centroids ('int | None') – defaults to None

TAC coarse-centroid budget. None (default) auto-computes as max(2^round(log2(n_tokens/128)), ceil(min_tac_budget * 1.1)), ensuring TAC is used rather than falling back to global k-means.
tac_n_iter ('int') – defaults to 10

K-means iterations for Token-Aware Clustering. Default: 10. Reduce for fast experimentation; raise for maximum quality. 10 is usually enough.
tac_micro_threshold ('int | None') – defaults to None

Token groups with fewer vectors than this receive 1 centroid each. None (default) auto-derives as 2^round(log2(n_tokens^0.25)) clamped to [32, 128].
tac_small_threshold ('int | None') – defaults to None

Token groups in [micro, small) receive 2 centroids each. None (default) auto-derives as 2 * tac_micro_threshold.
pq_sample_size ('int') – defaults to 10000000

Maximum number of vectors sampled for PQ codebook training. Default: 10,000,000. Safe to lower on small corpora. May be increased on very large datasets (> 1B tokens).
pq_n_iter ('int') – defaults to 10

K-means iterations for PQ codebook training. Default: 10. Same trade-off as tac_n_iter.
normalize ('bool') – defaults to True

L2-normalise residuals before PQ encoding. Default: True. Leave as True unless you have a specific reason to disable.
pq_seed ('int') – defaults to 42

Random seed for PQ codebook training. Default: 42. Fix for reproducibility; change to get a different codebook.
hnsw_m ('int') – defaults to 32

HNSW graph degree (edges per node). Default: 32. Higher = better recall and more memory. Typical range: 16–64.
ef_construction ('int') – defaults to 1500

HNSW build-time search width. Default: 1500. Deliberately high to maximise recall: up to 1–2M centroids, the HNSW build time is negligible compared to TAC and PQ. For very large centroid counts (> 2M), reduce it to keep build times reasonable. Values above 1500 rarely bring further benefit.
k_centroids ('int') – defaults to 20

Coarse centroids probed per query token at search time (n_probe in most IVF-based algorithms). Default: 20. Higher = more candidates, better recall, slower search.
k_docs_to_score ('int') – defaults to 500

Candidate pool size for full late-interaction MaxSim scoring. Default: 500. Must be ≥ k. Alpha pruning may further reduce this pool before MaxSim. Increase for higher recall at the cost of latency.
ef_search ('int | None') – defaults to None

HNSW search-time exploration width. None (default) resolves to round(1.5 × k_centroids) at search time, keeping the two coupled automatically. Set explicitly only to deviate from the 1.5× rule.
alpha ('float | None') – defaults to 0.45

Coarse-score pruning threshold: candidates whose coarse score falls more than alpha × score[k] below the k-th best are dropped before MaxSim. Default: 0.45. Range [0, 1]; usually effective in [0, 0.5]. Smaller = more aggressive pruning = faster search but worse recall. None disables pruning (all k_docs_to_score candidates scored).
beta ('int | None') – defaults to None

Early-termination patience: stop MaxSim scoring after this many consecutive non-improving documents. None = disabled (score all).
lambda_ ('float | None') – defaults to None

HNSW early-exit parameter. Makes search faster; tune together with ef_search. None = disabled.
num_threads ('int') – defaults to 0

Worker threads for batch_search. 0 = rayon default (all cores), 1 = single-threaded, n = custom pool of size n.
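Several parameters above default to None and are derived from the corpus size or from another parameter. The following sketch transcribes the formulas stated for total_centroids, tac_micro_threshold, tac_small_threshold and ef_search into plain Python. The function names are hypothetical, `min_tac_budget` is treated as a plain input because this page does not define it, and tie-breaking in `round` may differ from the Rust implementation.

```python
import math


def auto_total_centroids(n_tokens, min_tac_budget):
    """max(2^round(log2(n_tokens / 128)), ceil(min_tac_budget * 1.1)).

    Assumes n_tokens >= 128 so the power of two is at least 1.
    """
    power_of_two = 2 ** round(math.log2(n_tokens / 128))
    return max(power_of_two, math.ceil(min_tac_budget * 1.1))


def auto_micro_threshold(n_tokens):
    """2^round(log2(n_tokens^0.25)), clamped to [32, 128]."""
    raw = 2 ** round(math.log2(n_tokens ** 0.25))
    return min(max(raw, 32), 128)


def auto_small_threshold(micro_threshold):
    """Twice the micro threshold."""
    return 2 * micro_threshold


def auto_ef_search(k_centroids):
    """round(1.5 * k_centroids); Python rounds exact .5 ties to the even integer."""
    return round(1.5 * k_centroids)


# Example: a corpus of one million token vectors with the default k_centroids.
n_tokens = 1_000_000
micro = auto_micro_threshold(n_tokens)
small = auto_small_threshold(micro)
centroids = auto_total_centroids(n_tokens, min_tac_budget=1000)
ef_search = auto_ef_search(20)
```

Under these formulas, token groups smaller than `micro` would receive 1 centroid and groups in `[micro, small)` would receive 2. How the remaining budget is split among larger groups is not described here.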

Methods

__call__

Search the index for the nearest documents to each query.

Parameters

queries_embeddings ('np.ndarray | torch.Tensor | list[np.ndarray] | list[torch.Tensor]')
k ('int') – defaults to 10
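To make the search-time parameters concrete, here is a minimal pure-Python sketch of how alpha pruning, MaxSim scoring and beta patience could fit together. It is an illustration under stated assumptions, not the library's code: all function names are hypothetical, higher scores are assumed to be better and positive, score[k] is read as the k-th best coarse score, and a document counts as "improving" when it enters the running top-k.

```python
import heapq


def alpha_prune(coarse_scores, k, alpha):
    """Indices of candidates that survive alpha pruning (alpha=None keeps all)."""
    if alpha is None or len(coarse_scores) < k:
        return list(range(len(coarse_scores)))
    kth_best = sorted(coarse_scores, reverse=True)[k - 1]
    cutoff = kth_best - alpha * kth_best  # drop anything further below than this
    return [i for i, score in enumerate(coarse_scores) if score >= cutoff]


def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: for each query token take its best dot product
    over the document tokens, then sum over query tokens."""
    return sum(
        max(sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc_tokens)
        for q_vec in query_tokens
    )


def rerank_with_patience(query_tokens, candidates, k, beta=None):
    """Score (doc_id, doc_tokens) pairs in order and keep the top k.

    Stops after beta consecutive documents that fail to enter the top k
    (assumed meaning of "non-improving"); beta=None scores every candidate.
    """
    top = []  # min-heap of (score, doc_id)
    misses = 0
    for doc_id, doc_tokens in candidates:
        score = maxsim(query_tokens, doc_tokens)
        if len(top) < k:
            heapq.heappush(top, (score, doc_id))
            misses = 0
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, doc_id))
            misses = 0
        else:
            misses += 1
            if beta is not None and misses >= beta:
                break
    return sorted(top, reverse=True)


# Toy usage: one query token, six single-token candidates in coarse order.
query = [[1.0, 0.0]]
pool = [("a", [[5.0, 0.0]]), ("b", [[4.0, 0.0]]), ("c", [[3.0, 0.0]]),
        ("d", [[1.0, 0.0]]), ("e", [[1.0, 0.0]]), ("f", [[10.0, 0.0]])]
survivors = alpha_prune([10, 9, 8, 5, 4, 1], k=2, alpha=0.45)
early = rerank_with_patience(query, pool, k=2, beta=3)
full = rerank_with_patience(query, pool, k=2, beta=None)
```

In the toy run, patience stops before document "f" is scored, so `early` misses the best match that `full` finds. This is the recall-for-latency trade-off that beta controls.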

add_documents

Index a set of documents.

Parameters

documents_ids ('list[str]')
documents_embeddings ('list[np.ndarray | torch.Tensor]')
documents_token_ids ('list[np.ndarray] | None') – defaults to None
kwargs

get_documents_embeddings

Return approximate token embeddings for the requested documents.

Embeddings are reconstructed from stored PQ codes via approx = coarse_centroid + norm * PQ_residual and are therefore approximate (PQ lossy compression). When the index was built with center_dataset=True (the default), the dataset mean is added back so that the returned embeddings are in the original embedding space.
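The reconstruction formula above can be sketched for a single token as follows. This is a plain-Python illustration with a hypothetical function name and made-up values; in particular, it assumes the stored norm is a scalar that rescales the decoded PQ residual.

```python
def reconstruct_token(coarse_centroid, norm, pq_residual, dataset_mean=None):
    """approx = coarse_centroid + norm * PQ_residual, plus the mean if centred.

    dataset_mean stands in for the global mean subtracted when
    center_dataset=True; pass None for an index built without centring.
    """
    approx = [c + norm * r for c, r in zip(coarse_centroid, pq_residual)]
    if dataset_mean is not None:
        approx = [a + m for a, m in zip(approx, dataset_mean)]
    return approx


# Toy 2-d example with made-up values.
centred = reconstruct_token([1.0, 0.0], 2.0, [0.0, 0.5], dataset_mean=[0.1, 0.1])
uncentred = reconstruct_token([1.0, 0.0], 2.0, [0.0, 0.5])
```

Because the PQ residual is a quantised approximation, the result is close to the original token embedding but not identical to it.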

Parameters

documents_ids ('list[list[str]]')

Returns

list[list[np.ndarray]]: approximate token embeddings for the requested documents.

remove_documents

References

If you use TACHIOM in your research, please cite:

@misc{martinico2026efficientmultivectorretrievaltokenaware,
      title={Efficient Multivector Retrieval with Token-Aware Clustering and Hierarchical Indexing},
      author={Silvio Martinico and Franco Maria Nardini and Cosimo Rulli and Rossano Venturini},
      year={2026},
      eprint={2604.28142},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2604.28142},
}