NextPlaid
A Rust library for multi-vector search using the PLAID algorithm.
Memory-mapped, quantized, incremental. Powers ColGREP and the NextPlaid API.
Quick Start · Architecture · API Reference · Filtering · CUDA
Quick Start
Add to Cargo.toml:
[dependencies]
next-plaid = "1.0"
Create an index, add documents, search:
use next_plaid::{MmapIndex, IndexConfig, UpdateConfig, SearchParameters};
use ndarray::Array2;
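// Each document is a 2D array [num_tokens, embedding_dim].
// encode_documents (and encode_query below) stand in for your own embedding
// model; they are not imported from next_plaid.
let embeddings: Vec<Array2<f32>> = encode_documents(&docs);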
// Create index (or update if it already exists)
let index_config = IndexConfig { nbits: 4, ..Default::default() };
let update_config = UpdateConfig::default();
let (index, doc_ids) = MmapIndex::update_or_create(
&embeddings, "/path/to/index", &index_config, &update_config,
)?;
// Search
let query: Array2<f32> = encode_query("search text");
let params = SearchParameters { top_k: 10, ..Default::default() };
let results = index.search(&query, &params, None)?;
for (id, score) in results.passage_ids.iter().zip(results.scores.iter()) {
println!("Doc {}: {:.4}", id, score);
}
Why Multi-Vector?
Standard vector search collapses a document into one embedding. That's lossy. Multi-vector search (ColBERT) keeps one embedding per token (~300 vectors per document, dim 128). At query time, each query token finds its best match across all document tokens (MaxSim). This preserves fine-grained information that single-vector models lose.
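As a rough illustration of the MaxSim rule described above, here is a minimal, dependency-free sketch. It uses plain `Vec<f32>` rows instead of `ndarray`, and the names `dot` and `maxsim` are hypothetical helpers for this example, not part of the next-plaid API. It also assumes dot-product similarity, which is the usual choice for normalized ColBERT embeddings.

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// MaxSim: for each query token, take the best dot product over all
/// document tokens, then sum those maxima across query tokens.
/// (Illustrative only; assumes `doc` is non-empty.)
fn maxsim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| dot(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```

Because every query token is matched independently, a document can score well when different tokens of the query hit different parts of the document, which a single pooled vector cannot express.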
The trade-off is storage. NextPlaid solves this with product quantization (2-bit or 4-bit) and memory-mapped indices, making million-document collections practical on a single machine.
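To make the storage trade-off concrete, the following back-of-the-envelope sketch estimates collection size from the figures quoted above (about 300 tokens per document, dimension 128). The helper names are hypothetical, and the 4-byte centroid ID per token is an assumption made for this estimate; centroids, IVF lists, and metadata are ignored.

```rust
/// Bytes for one token stored as raw f32 values.
fn raw_bytes_per_token(dim: u64) -> u64 {
    dim * std::mem::size_of::<f32>() as u64
}

/// Bytes for one token after quantization: `nbits` per dimension for the
/// residual, plus `code_bytes` for the centroid ID (assumed size).
fn quantized_bytes_per_token(dim: u64, nbits: u64, code_bytes: u64) -> u64 {
    dim * nbits / 8 + code_bytes
}

/// Total bytes for a collection of `num_docs` documents.
fn collection_bytes(num_docs: u64, tokens_per_doc: u64, bytes_per_token: u64) -> u64 {
    num_docs * tokens_per_doc * bytes_per_token
}
```

Under these assumptions, one million documents take roughly 153.6 GB as raw `f32`, about 20.4 GB at 4 bits, and about 10.8 GB at 2 bits.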
Architecture
flowchart TD
A["Document Embeddings\n[num_tokens, dim] per doc"] --> B["K-Means Clustering"]
B --> C["Centroid Assignment\n+ Residual Computation"]
C --> D["Product Quantization\n2-bit or 4-bit"]
D --> E["Memory-Mapped Index\nIVF + Codes + Residuals"]
Q["Query Embedding"] --> F["IVF Probing\nTop centroids per token"]
F --> G["Candidate Retrieval"]
G --> H["Approximate Scoring\nCentroid MaxSim"]
H --> I["Exact Re-ranking\nDecompress + Full MaxSim"]
I --> J["Top-K Results"]
E --> G
style A fill:#4a90d9,stroke:#357abd,color:#fff
style B fill:#50b86c,stroke:#3d9956,color:#fff
style C fill:#50b86c,stroke:#3d9956,color:#fff
style D fill:#50b86c,stroke:#3d9956,color:#fff
style E fill:#e8913a,stroke:#d07a2e,color:#fff
style Q fill:#4a90d9,stroke:#357abd,color:#fff
style F fill:#9b59b6,stroke:#8445a0,color:#fff
style G fill:#9b59b6,stroke:#8445a0,color:#fff
style H fill:#9b59b6,stroke:#8445a0,color:#fff
style I fill:#9b59b6,stroke:#8445a0,color:#fff
style J fill:#9b59b6,stroke:#8445a0,color:#fff
Indexing Pipeline
- K-Means clustering on all token embeddings to find centroids (IVF codebook)
- Assign each token to its nearest centroid, compute residual (difference)
- Quantize residuals with product quantization (2-bit: 4 buckets, 4-bit: 16 buckets)
- Write IVF posting lists, codes, and residuals as memory-mapped NPY files
- Optionally store document metadata in a co-located SQLite database
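The assignment, residual, and quantization steps above can be sketched for a single token as follows. This is a simplified illustration, not the crate's implementation: the function names are hypothetical, centroid assignment is assumed to use the highest dot product, and the cutoffs are passed in directly rather than learned from data.

```rust
/// Index of the centroid with the highest dot product against `token`.
fn nearest_centroid(token: &[f32], centroids: &[Vec<f32>]) -> usize {
    let mut best = 0;
    let mut best_score = f32::NEG_INFINITY;
    for (i, c) in centroids.iter().enumerate() {
        let score: f32 = token.iter().zip(c).map(|(t, x)| t * x).sum();
        if score > best_score {
            best = i;
            best_score = score;
        }
    }
    best
}

/// Residual: the token minus its assigned centroid, per dimension.
fn residual(token: &[f32], centroid: &[f32]) -> Vec<f32> {
    token.iter().zip(centroid).map(|(t, c)| t - c).collect()
}

/// Map each residual value to a bucket index by counting how many cutoffs
/// it exceeds. With `2^nbits - 1` sorted cutoffs this yields `2^nbits`
/// buckets (3 cutoffs for 2-bit, 15 for 4-bit).
fn quantize(residual: &[f32], cutoffs: &[f32]) -> Vec<u8> {
    residual
        .iter()
        .map(|&r| cutoffs.iter().filter(|&&c| r > c).count() as u8)
        .collect()
}
```

Only the centroid index and the small bucket indices need to be stored per token, which is where the compression comes from.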
Search Pipeline
- IVF probing: Score query tokens against centroids, select top n_ivf_probe centroids per token
- Candidate retrieval: Collect document IDs from selected posting lists
- Approximate scoring: MaxSim using centroid vectors (fast, coarse)
- Re-ranking: Decompress top n_full_scores candidates, compute exact ColBERT MaxSim
- Return top top_k results with scores
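The first two steps of the search pipeline (probing and candidate retrieval) can be sketched as below. This is an illustrative, standard-library-only version under simplifying assumptions: posting lists are held as one `Vec<i64>` per centroid, similarity is the dot product, and `top_centroids` and `candidates` are hypothetical names rather than crate functions.

```rust
use std::collections::BTreeSet;

/// Indices of the `n_probe` centroids with the highest dot product
/// against one query token, best first.
fn top_centroids(token: &[f32], centroids: &[Vec<f32>], n_probe: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = centroids
        .iter()
        .enumerate()
        .map(|(i, c)| (i, token.iter().zip(c).map(|(t, x)| t * x).sum::<f32>()))
        .collect();
    // Sort by score, highest first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(n_probe).map(|(i, _)| i).collect()
}

/// Union of document IDs from the posting lists of every probed centroid,
/// returned in ascending order without duplicates.
fn candidates(
    query: &[Vec<f32>],
    centroids: &[Vec<f32>],
    posting_lists: &[Vec<i64>],
    n_probe: usize,
) -> Vec<i64> {
    let mut ids = BTreeSet::new();
    for token in query {
        for c in top_centroids(token, centroids, n_probe) {
            ids.extend(posting_lists[c].iter().copied());
        }
    }
    ids.into_iter().collect()
}
```

Raising `n_probe` widens the candidate set (better recall) at the cost of more documents to score in the later stages.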
MaxSim scoring is SIMD-accelerated (AVX2 on x86_64, NEON on ARM) and optionally BLAS-accelerated via Apple Accelerate or OpenBLAS.
Update Modes
Incremental updates use one of three strategies, depending on the current index size and the number of new documents:
| Mode | Condition | Behavior |
|---|---|---|
| Rebuild | num_docs <= start_from_scratch (default: 999) | Load existing embeddings + new, full K-means rebuild |
| Buffer | new_docs < buffer_size (default: 100) | Assign to existing centroids, buffer for later |
| Expand | new_docs >= buffer_size | Find outlier embeddings, expand centroids via K-means, re-index buffer + new |
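The table can be read as a simple decision function. The sketch below is one plausible reading, not the crate's code: it assumes `num_docs` is the number of documents already in the index, that the rebuild check takes priority over the other two, and that `UpdateMode` and `choose_mode` are hypothetical names.

```rust
/// The three update strategies from the table above.
#[derive(Debug, PartialEq)]
enum UpdateMode {
    Rebuild,
    Buffer,
    Expand,
}

/// Pick an update strategy. Assumes the rebuild condition is checked first.
fn choose_mode(
    num_docs: usize,
    new_docs: usize,
    start_from_scratch: usize,
    buffer_size: usize,
) -> UpdateMode {
    if num_docs <= start_from_scratch {
        // Small index: a full K-means rebuild is cheap.
        UpdateMode::Rebuild
    } else if new_docs < buffer_size {
        // Few new documents: reuse existing centroids and buffer them.
        UpdateMode::Buffer
    } else {
        // Enough new documents to justify expanding the centroid set.
        UpdateMode::Expand
    }
}
```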
Installation
Feature Flags
| Feature | Platform | Description |
|---|---|---|
| (default) | All | Pure Rust, no external BLAS |
| accelerate | macOS | Apple Accelerate for BLAS (recommended on M-series) |
| openblas | Linux | OpenBLAS for BLAS |
| cuda | Linux/Windows | CUDA acceleration for K-means and MaxSim scoring |
# macOS (recommended)
[dependencies]
next-plaid = { version = "1.0", features = ["accelerate"] }
# Linux with OpenBLAS
[dependencies]
next-plaid = { version = "1.0", features = ["openblas"] }
# Linux with CUDA + OpenBLAS
[dependencies]
next-plaid = { version = "1.0", features = ["cuda", "openblas"] }
OpenBLAS system setup
API Reference
MmapIndex
The primary interface. Loads index files as memory-mapped arrays for low RAM usage.
// Load existing index
let index = MmapIndex::load("/path/to/index")?;
// Create or update (creates if missing, appends if exists)
let (index, doc_ids) = MmapIndex::update_or_create(
&embeddings, "/path/to/index", &index_config, &update_config,
)?;
// Search (single query)
let results = index.search(&query, &params, None)?;
// Search (batch)
let results = index.search_batch(&queries, &params, true, None)?;
// Search within a subset of documents
let results = index.search(&query, &params, Some(&subset_ids))?;
// Add documents to existing index
let new_ids = index.update(&new_embeddings, &update_config)?;
// Add documents with metadata
let new_ids = index.update_with_metadata(
&new_embeddings, &update_config, Some(&metadata),
)?;
// Delete documents
let deleted = index.delete(&[5, 10, 15])?;
// Reconstruct embeddings from compressed storage
let embeddings = index.reconstruct(&[0, 1, 2])?;
let single = index.reconstruct_single(0)?;
// Accessors
index.num_documents();
index.num_embeddings();
index.num_partitions();
index.avg_doclen();
index.embedding_dim();
IndexConfig
Controls index creation.
IndexConfig {
nbits: 4, // Quantization bits (2 or 4)
batch_size: 50_000, // Documents per indexing chunk
seed: Some(42), // Random seed for K-means
kmeans_niters: 4, // K-means iterations
max_points_per_centroid: 256,// K-means parameter
n_samples_kmeans: None, // Auto: min(1 + 16*sqrt(120*N), N)
start_from_scratch: 999, // Rebuild threshold
force_cpu: false, // Force CPU K-means (skip CUDA)
}
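The automatic sample count noted next to `n_samples_kmeans` can be worked through with a small helper. This sketch assumes `N` is the number of documents and that the result is truncated to an integer; `default_kmeans_samples` is a hypothetical name, not a crate function.

```rust
/// Heuristic from the comment above: min(1 + 16 * sqrt(120 * N), N).
/// Truncation toward zero is an assumption of this sketch.
fn default_kmeans_samples(n: usize) -> usize {
    let heuristic = 1.0 + 16.0 * (120.0 * n as f64).sqrt();
    (heuristic as usize).min(n)
}
```

For small collections the heuristic exceeds `N`, so everything is sampled; the square-root growth only starts to limit the sample for larger collections.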
SearchParameters
Controls search behavior.
SearchParameters {
top_k: 10, // Results to return
n_full_scores: 4096, // Candidates for exact re-ranking
n_ivf_probe: 8, // IVF cells to probe per query token
centroid_batch_size: 100_000, // Batch size for centroid scoring
centroid_score_threshold: Some(0.4), // Prune low-scoring centroids
batch_size: 2000, // Query batch size
}
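How `n_full_scores` and `top_k` interact can be pictured as a two-stage funnel: shortlist by the cheap approximate score, then rank the shortlist by the exact score. The sketch below is a simplified model with hypothetical names (`top_n`, `funnel`); the exact scorer is passed in as a closure rather than computed from decompressed embeddings.

```rust
/// Keep the `n` highest-scoring (doc_id, score) pairs, best first.
fn top_n(mut scored: Vec<(i64, f32)>, n: usize) -> Vec<(i64, f32)> {
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(n);
    scored
}

/// Shortlist `n_full_scores` candidates by approximate score, re-score
/// them exactly, and return the best `top_k`.
fn funnel(
    approx: Vec<(i64, f32)>,
    exact_score: impl Fn(i64) -> f32,
    n_full_scores: usize,
    top_k: usize,
) -> Vec<(i64, f32)> {
    let shortlist = top_n(approx, n_full_scores);
    let rescored = shortlist
        .into_iter()
        .map(|(id, _)| (id, exact_score(id)))
        .collect();
    top_n(rescored, top_k)
}
```

A document that misses the shortlist is never re-scored, so a larger `n_full_scores` trades latency for recall.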
UpdateConfig
Controls incremental updates.
UpdateConfig {
batch_size: 50_000, // Documents per chunk
kmeans_niters: 4, // K-means iterations for centroid expansion
max_points_per_centroid: 256,
n_samples_kmeans: None,
seed: 42,
start_from_scratch: 999, // Below this, rebuild from scratch
buffer_size: 100, // Buffer before centroid expansion
force_cpu: false,
}
QueryResult
Search results container.
pub struct QueryResult {
pub query_id: usize,
pub passage_ids: Vec<i64>, // Document IDs (ranked by score)
pub scores: Vec<f32>, // Corresponding scores
}
pub type SearchResult = QueryResult;
Metadata
Index metadata, persisted as metadata.json.
pub struct Metadata {
pub num_chunks: usize,
pub nbits: usize,
pub num_partitions: usize,
pub num_embeddings: usize,
pub avg_doclen: f64,
pub num_documents: usize,
pub embedding_dim: usize,
pub next_plaid_compatible: bool,
}
ResidualCodec
Quantization codec for compression and decompression.
// Load codec from index directory
let codec = ResidualCodec::load_from_dir(&index_path)?;
let codec = ResidualCodec::load_mmap_from_dir(&index_path)?; // memory-mapped centroids
// Compress embeddings to codes (nearest centroid assignments)
let codes = codec.compress_into_codes(&embeddings);
// Quantize residuals
let quantized = codec.quantize_residuals(&residuals)?;
// Decompress back to approximate embeddings
let reconstructed = codec.decompress(&packed, &codes)?;
// Accessors
codec.embedding_dim();
codec.num_centroids();
codec.centroids_view();
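To show what decompression amounts to conceptually, here is a standalone sketch that rebuilds one approximate embedding from a centroid, packed bucket indices, and per-bucket reconstruction values. It is not the `ResidualCodec` implementation: the function names and signatures are hypothetical, the bit order (most significant bits first) is an assumption, and only `nbits` of 2 or 4 is supported.

```rust
/// Unpack `nbits`-wide bucket indices from bytes, most significant bits
/// first (assumed order). Supports nbits = 2 or 4.
fn unpack(packed: &[u8], nbits: u32) -> Vec<u8> {
    let per_byte = 8 / nbits;
    let mask = (1u8 << nbits) - 1;
    let mut out = Vec::with_capacity(packed.len() * per_byte as usize);
    for &byte in packed {
        for i in 0..per_byte {
            let shift = 8 - nbits * (i + 1);
            out.push((byte >> shift) & mask);
        }
    }
    out
}

/// Approximate embedding: centroid value plus the reconstruction value
/// of the bucket stored for each dimension.
fn decompress(centroid: &[f32], packed: &[u8], bucket_weights: &[f32], nbits: u32) -> Vec<f32> {
    unpack(packed, nbits)
        .iter()
        .zip(centroid)
        .map(|(&b, &c)| c + bucket_weights[b as usize])
        .collect()
}
```

The reconstruction is lossy: every residual value in a bucket collapses to that bucket's single weight.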
Filtering
SQLite-based metadata filtering via the filtering module. Store arbitrary JSON metadata alongside your index, query with SQL WHERE clauses.
use next_plaid::filtering;
use serde_json::json;
// Create metadata database alongside index
let metadata = vec![
json!({"title": "Doc A", "category": "tech", "score": 95}),
json!({"title": "Doc B", "category": "science", "score": 87}),
];
filtering::create("/path/to/index", &metadata, &[0, 1])?;
// Append more metadata
filtering::update("/path/to/index", &new_metadata, &new_doc_ids)?;
// Query by SQL condition (parameterized, injection-safe)
let ids = filtering::where_condition(
"/path/to/index",
"category = ? AND score > ?",
&[json!("tech"), json!(90)],
)?;
// Query with REGEXP support
let ids = filtering::where_condition_regexp(
"/path/to/index",
"title REGEXP ?",
&[json!("^Doc")],
)?;
// Get full metadata rows
let rows = filtering::get("/path/to/index", None, &[], Some(&[0, 1]))?;
// Count documents
let count = filtering::count("/path/to/index")?;
// Delete and re-index
filtering::delete("/path/to/index", &[5, 10])?;
// Use in search: filter first, then search within subset
let subset = filtering::where_condition("/path/to/index", "category = ?", &[json!("tech")])?;
let results = index.search(&query, &params, Some(&subset))?;
SQL conditions are validated with a recursive descent parser that whitelists safe operators and prevents injection.
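The "filter first, then search within subset" pattern boils down to intersecting candidates with the IDs returned by the metadata query. A minimal sketch of that intersection, using a hypothetical helper name and no crate APIs, might look like this:

```rust
use std::collections::HashSet;

/// Keep only the candidates whose ID appears in `subset`, preserving
/// the original candidate order.
fn restrict_to_subset(candidates: &[i64], subset: &[i64]) -> Vec<i64> {
    let allowed: HashSet<i64> = subset.iter().copied().collect();
    candidates
        .iter()
        .copied()
        .filter(|id| allowed.contains(id))
        .collect()
}
```

Filtering before scoring means documents outside the subset never consume re-ranking budget.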
CUDA Acceleration
When built with the cuda feature, NextPlaid automatically uses GPU acceleration for:
- K-means clustering during index creation and centroid expansion
- MaxSim scoring during search (for large enough matrices)
No code changes needed. CUDA falls back gracefully to CPU on failure.
[dependencies]
next-plaid = { version = "1.0", features = ["cuda"] }
The CUDA module uses cuBLAS for matrix multiplication and custom PTX kernels for argmax operations. A global CudaContext is lazily initialized on first use.
> Tip: First CUDA context creation can take 10-30s. Enable GPU persistence mode to reduce this: sudo nvidia-smi -pm 1
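The lazy global context with CPU fallback can be pictured with `std::sync::OnceLock`. The sketch below is purely illustrative: `GpuContext`, `try_init_gpu`, and `backend` are hypothetical stand-ins, the initializer always reports that no GPU is available, and nothing here reflects how the crate's `CudaContext` is actually built.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a GPU context; not the library's CudaContext.
struct GpuContext;

/// Pretend initialization. A real build would talk to the driver here;
/// this sketch always reports that no GPU is available.
fn try_init_gpu() -> Option<GpuContext> {
    None
}

/// Global context, initialized at most once on first access.
static CONTEXT: OnceLock<Option<GpuContext>> = OnceLock::new();

/// Returns the shared context, running initialization on the first call only.
fn gpu_context() -> Option<&'static GpuContext> {
    CONTEXT.get_or_init(try_init_gpu).as_ref()
}

/// Pick the backend: GPU when a context exists, otherwise CPU.
fn backend() -> &'static str {
    match gpu_context() {
        Some(_) => "gpu",
        None => "cpu",
    }
}
```

Storing the `Option` inside the `OnceLock` means a failed initialization is also remembered, so later calls go straight to the CPU path instead of retrying.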
Index File Structure
index_directory/
metadata.json # Index metadata (num_docs, nbits, partitions, etc.)
centroids.npy # Centroid embeddings [K, dim]
avg_residual.npy # Average residual per dimension
bucket_cutoffs.npy # Quantization boundaries
bucket_weights.npy # Reconstruction values
cluster_threshold.npy # Outlier detection threshold
ivf.npy # Inverted file (doc IDs per centroid)
ivf_lengths.npy # Length of each IVF posting list
plan.json # Indexing plan
merged_codes.npy # Memory-mapped centroid codes (auto-merged)
merged_residuals.npy # Memory-mapped quantized residuals (auto-merged)
metadata.db # SQLite metadata database (optional)
# Per-chunk files (merged into merged_*.npy on load):
0.codes.npy # Centroid assignments for chunk 0
0.residuals.npy # Quantized residuals for chunk 0
0.metadata.json # Chunk metadata
doclens.0.json # Document lengths for chunk 0
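The pairing of ivf.npy with ivf_lengths.npy suggests a flat array of document IDs sliced by per-centroid lengths. Assuming that layout (and `i64` IDs to match `passage_ids`), looking up one posting list could be sketched as follows; the function names are hypothetical.

```rust
/// Start offset of each posting list: an exclusive prefix sum of lengths.
fn offsets(lengths: &[usize]) -> Vec<usize> {
    let mut start = 0;
    lengths
        .iter()
        .map(|&len| {
            let s = start;
            start += len;
            s
        })
        .collect()
}

/// Slice of the flat IVF array holding the document IDs for `centroid`.
fn posting_list<'a>(ivf: &'a [i64], lengths: &[usize], centroid: usize) -> &'a [i64] {
    let start = offsets(lengths)[centroid];
    &ivf[start..start + lengths[centroid]]
}
```

A flat array plus lengths memory-maps cleanly, since each posting list is a contiguous range with no per-list allocation.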
Modules
| Module | Lines | Description |
|---|---|---|
| filtering | 1,896 | SQLite metadata storage, SQL condition validation, REGEXP support |
| mmap | 1,779 | Memory-mapped NPY/raw arrays, merge-on-load, file locking |
| index | 1,389 | Index creation, MmapIndex, IndexConfig, Metadata |
| update | 977 | Incremental updates, buffer/expand strategies |
| cuda | 769 | CUDA context, cuBLAS MatMul, PTX argmax kernel |
| codec | 701 | Residual quantization, compress/decompress, lookup tables |
| search | 714 | IVF probing, candidate retrieval, approximate + exact scoring |
| delete | 540 | Document deletion, IVF rebuild |
| kmeans | 475 | K-means clustering, centroid computation, partition estimation |
| maxsim | 443 | SIMD MaxSim (AVX2/NEON), BLAS matrix multiply |
| utils | 237 | Quantile computation, array utilities |
| embeddings | 137 | Embedding reconstruction from compressed storage |
| error | 66 | Error types |
Dependencies
| Crate | Purpose |
|---|---|
| ndarray | N-dimensional arrays |
| rayon | Parallelism |
| memmap2 | Memory-mapped files |
| ndarray-npy | NPY file I/O |
| fastkmeans-rs | K-means clustering |
| rusqlite | SQLite (bundled) |
| half | Float16 support |
| regex | REGEXP filtering |
| cudarc | CUDA bindings (optional) |
| serde / serde_json | Serialization |
| thiserror | Error handling |
License
Apache-2.0