Quick Start
Run with Docker (recommended):
```bash
# CPU with built-in model
docker run -p 8080:8080 -v ~/.local/share/next-plaid:/data/indices \
  ghcr.io/lightonai/next-plaid:cpu-1.0.6 \
  --host 0.0.0.0 --port 8080 --index-dir /data/indices \
  --model lightonai/answerai-colbert-small-v1-onnx --int8

# GPU with CUDA
docker run --gpus all -p 8080:8080 -v ~/.local/share/next-plaid:/data/indices \
  ghcr.io/lightonai/next-plaid:cuda-1.0.6 \
  --host 0.0.0.0 --port 8080 --index-dir /data/indices \
  --model lightonai/GTE-ModernColBERT-v1 --cuda
```
Use from Python:
```bash
pip install next-plaid-client
```
```python
from next_plaid_client import NextPlaidClient, IndexConfig

client = NextPlaidClient("http://localhost:8080")

# Create index and add documents
client.create_index("docs", IndexConfig(nbits=4))
client.add(
    "docs",
    documents=["NextPlaid is a multi-vector database", "ColGREP searches code semantically"],
    metadata=[{"id": "doc_1"}, {"id": "doc_2"}],
)

# Search
results = client.search("docs", ["vector database"])

# Search with metadata filtering
results = client.search(
    "docs",
    ["coding tool"],
    filter_condition="id = ?",
    filter_parameters=["doc_1"],
)

# Delete by predicate
client.delete("docs", "id = ?", ["doc_1"])
```
Or call the API directly:
```bash
# Create index
curl -X POST http://localhost:8080/indices \
  -H 'Content-Type: application/json' \
  -d '{"name": "docs", "config": {"nbits": 4}}'

# Add documents (text encoded server-side)
curl -X POST http://localhost:8080/indices/docs/update_with_encoding \
  -H 'Content-Type: application/json' \
  -d '{"documents": ["hello world"], "metadata": [{"title": "test"}]}'

# Search
curl -X POST http://localhost:8080/indices/docs/search_with_encoding \
  -H 'Content-Type: application/json' \
  -d '{"queries": ["hello"], "params": {"top_k": 5}}'
```
Interactive docs at http://localhost:8080/swagger-ui.
Two Modes
NextPlaid API runs in two modes, depending on whether you pass `--model`:

|  | With `--model` | Without `--model` |
|---|---|---|
| Encoding | Pass text, get results. The server encodes via ONNX Runtime. | You encode externally and pass embedding arrays. |
| Endpoints | All endpoints available, including `*_with_encoding` | Core endpoints only. Encoding endpoints return 400. |
| Use case | Production deployments, Python SDK | Custom models, external encoding pipelines |
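In embeddings-only mode you call the same endpoints with raw arrays. Below is a minimal sketch over plain HTTP; the `/update` payload matches the embeddings example later in this document, and it assumes `/search` mirrors the `search_with_encoding` body with per-token embedding arrays in place of text:

```python
import requests

BASE = "http://localhost:8080"

# Declare the index (config only, no data).
requests.post(f"{BASE}/indices", json={"name": "raw", "config": {"nbits": 4}})

# Embeddings computed by your own encoder (here: toy 3-dim vectors).
requests.post(
    f"{BASE}/indices/raw/update",
    json={
        "documents": [{"embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]}],
        "metadata": [{"id": "doc_1"}],
    },
)

# One query = one list of per-token embedding vectors.
resp = requests.post(
    f"{BASE}/indices/raw/search",
    json={"queries": [[[0.1, 0.2, 0.3]]], "params": {"top_k": 5}},
)
print(resp.json())
```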
API Reference
Health & Documentation
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check with system info, model config, all index summaries |
| GET | / | Alias for /health |
| GET | /swagger-ui | Interactive Swagger UI |
| GET | /api-docs/openapi.json | OpenAPI 3.0 specification |
Index Management
| Method | Path | Description |
|---|---|---|
| GET | /indices | List all indices |
| POST | /indices | Declare a new index (config only, no data) |
| GET | /indices/{name} | Get index info (docs, partitions, dimension) |
| DELETE | /indices/{name} | Delete an index and all its data |
| PUT | /indices/{name}/config | Update config (e.g. max_documents) |
Documents
| Method | Path | Returns | Description |
|---|---|---|---|
| POST | /indices/{name}/update | 202 | Add documents with pre-computed embeddings |
| POST | /indices/{name}/update_with_encoding | 202 | Add documents as text (server encodes) |
| POST | /indices/{name}/documents | 202 | Add to an existing index (legacy) |
| DELETE | /indices/{name}/documents | 202 | Delete by SQL WHERE condition |
All document mutations return 202 Accepted and process asynchronously. Concurrent requests to the same index are batched automatically.
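Since mutations are acknowledged before they are applied, a client that needs read-your-writes semantics can poll until the document count catches up. A minimal sketch, assuming the index info response exposes the `num_documents` field shown in the `/health` summary:

```python
import time

import requests

def wait_for_count(base: str, index: str, expected: int, timeout_s: float = 30.0):
    """Poll GET /indices/{name} until the index holds `expected` documents."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        info = requests.get(f"{base}/indices/{index}").json()
        if info.get("num_documents", 0) >= expected:
            return info
        time.sleep(0.2)  # updates are batched in ~100 ms windows, so poll gently
    raise TimeoutError(f"{index} did not reach {expected} documents in time")

wait_for_count("http://localhost:8080", "docs", expected=2)
```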
Search
| Method | Path | Description |
|---|---|---|
| POST | /indices/{name}/search | Search with embedding arrays |
| POST | /indices/{name}/search/filtered | Search + SQL metadata filter |
| POST | /indices/{name}/search_with_encoding | Search with text queries |
| POST | /indices/{name}/search/filtered_with_encoding | Text search + metadata filter |
Metadata
| Method | Path | Description |
|---|---|---|
| GET | /indices/{name}/metadata | Get all metadata entries |
| GET | /indices/{name}/metadata/count | Count metadata entries |
| POST | /indices/{name}/metadata/check | Check which doc IDs have metadata |
| POST | /indices/{name}/metadata/query | Get doc IDs matching a SQL condition |
| POST | /indices/{name}/metadata/get | Get metadata by IDs or SQL condition |
| POST | /indices/{name}/metadata/update | Update metadata rows matching a condition |
Encoding & Reranking
| Method | Path | Description |
|---|---|---|
| POST | /encode | Encode texts to ColBERT embeddings |
| POST | /rerank | Rerank with pre-computed embeddings (MaxSim) |
| POST | /rerank_with_encoding | Rerank with text (server encodes + MaxSim) |
Request & Response Examples
Create Index
POST /indices

```json
{
  "name": "my_index",
  "config": {
    "nbits": 4,
    "batch_size": 50000,
    "seed": 42,
    "start_from_scratch": 999,
    "max_documents": 10000
  }
}
```
| Field | Default | Description |
|---|---|---|
| nbits | 4 | Quantization bits (2 or 4) |
| batch_size | 50000 | Documents per indexing chunk |
| seed | null | Random seed for K-means |
| start_from_scratch | 999 | Below this document count, updates trigger a full rebuild |
| max_documents | null | Evict the oldest documents when exceeded (null = unlimited) |
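For example, a rolling index that keeps only the newest 10,000 documents could be declared like this via the Python SDK (a sketch assuming `IndexConfig` accepts the fields above as keyword arguments):

```python
from next_plaid_client import NextPlaidClient, IndexConfig

client = NextPlaidClient("http://localhost:8080")
client.create_index(
    "recent_docs",
    IndexConfig(
        nbits=4,                 # 4-bit quantization (2 is smaller but less accurate)
        max_documents=10_000,    # oldest documents are evicted past this count
        start_from_scratch=999,  # full rebuild on update while the index is small
    ),
)
```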
Add Documents (text)
POST /indices/my_index/update_with_encoding

```json
{
  "documents": [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany."
  ],
  "metadata": [{ "country": "France" }, { "country": "Germany" }],
  "pool_factor": 2
}
```
Returns 202 Accepted. `pool_factor` reduces the number of stored embeddings per document via hierarchical clustering (e.g. 2 ≈ 50% fewer embeddings per document).
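A quick way to see pooling in action is to encode the same text with and without `pool_factor` and compare embedding counts. A sketch; the exact counts depend on the clustering:

```python
import requests

BASE = "http://localhost:8080"

def n_embeddings(pool_factor):
    """Encode one document and return its number of token embeddings."""
    resp = requests.post(f"{BASE}/encode", json={
        "texts": ["Paris is the capital of France."],
        "input_type": "document",
        "pool_factor": pool_factor,  # None serializes to null (no pooling)
    })
    return len(resp.json()["embeddings"][0])

print(n_embeddings(None), "->", n_embeddings(2))  # roughly halved with 2
```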
Add Documents (embeddings)
POST /indices/my_index/update

```json
{
  "documents": [
    {
      "embeddings": [
        [0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6]
      ]
    },
    {
      "embeddings": [
        [0.7, 0.8, 0.9],
        [0.1, 0.2, 0.3]
      ]
    }
  ],
  "metadata": [{ "title": "Doc A" }, { "title": "Doc B" }]
}
```
Search (text)
POST /indices/my_index/search_with_encoding

```json
{
  "queries": ["What is the capital of France?"],
  "params": { "top_k": 10 }
}
```
Response:
```json
{
  "results": [
    {
      "query_id": 0,
      "document_ids": [0, 1],
      "scores": [18.42, 12.67],
      "metadata": [{ "country": "France" }, { "country": "Germany" }]
    }
  ],
  "num_queries": 1
}
```
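Within each result, `document_ids`, `scores`, and `metadata` are aligned positionally. A sketch of consuming the response, assuming the SDK returns this JSON as a plain dict:

```python
from next_plaid_client import NextPlaidClient

client = NextPlaidClient("http://localhost:8080")
results = client.search("my_index", ["What is the capital of France?"])
for r in results["results"]:
    # The three lists are parallel: position i describes the same hit.
    for doc_id, score, meta in zip(r["document_ids"], r["scores"], r["metadata"]):
        print(f"doc {doc_id}: score={score:.2f} metadata={meta}")
```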
Search with Filter
POST /indices/my_index/search/filtered_with_encoding

```json
{
  "queries": ["capital city"],
  "params": { "top_k": 5 },
  "filter_condition": "country = ?",
  "filter_parameters": ["France"]
}
```
Search Parameters
| Parameter | Default | Description |
|---|---|---|
| top_k | 10 | Results to return per query |
| n_ivf_probe | 8 | IVF cells to probe per query token |
| n_full_scores | 4096 | Candidates for exact re-ranking |
| centroid_score_threshold | null | Prune low-scoring centroids (e.g. 0.4) |
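A sketch of tuning these knobs from the Python SDK, assuming `SearchParams` accepts the table's fields as keyword arguments:

```python
from next_plaid_client import NextPlaidClient, SearchParams

client = NextPlaidClient("http://localhost:8080")
results = client.search(
    "my_index",
    ["capital city"],
    params=SearchParams(
        top_k=5,
        n_ivf_probe=16,                # probe more IVF cells: better recall, slower
        n_full_scores=8192,            # rescore more candidates exactly
        centroid_score_threshold=0.4,  # prune clearly irrelevant centroids
    ),
)
```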
Delete Documents
DELETE /indices/my_index/documents

```json
{
  "condition": "country = ? AND year < ?",
  "parameters": ["outdated", 2020]
}
```
Returns 202 Accepted. Deletes are batched: multiple delete requests within a short window are processed together.
Encode
POST /encode

```json
{
  "texts": ["Paris is the capital of France."],
  "input_type": "document",
  "pool_factor": 2
}
```

Response:

```json
{
  "embeddings": [[[0.1, 0.2, ...], [0.3, 0.4, ...]]],
  "num_texts": 1
}
```
`input_type` is `"query"` or `"document"`. Queries use MASK-token expansion; documents have their padding tokens filtered out.
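A sketch of both input types through the SDK (assuming it mirrors the `/encode` JSON): with query expansion enabled, a short query is padded up to `query_length` (48 by default), while a document keeps one embedding per real token:

```python
from next_plaid_client import NextPlaidClient

client = NextPlaidClient("http://localhost:8080")
q = client.encode(["capital of France?"], input_type="query")
d = client.encode(["Paris is the capital of France."], input_type="document")
print(len(q["embeddings"][0]))  # ~48: short queries are MASK-expanded
print(len(d["embeddings"][0]))  # one embedding per document token
```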
Rerank
POST /rerank_with_encoding

```json
{
  "query": "What is the capital of France?",
  "documents": [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany."
  ],
  "pool_factor": null
}
```

Response:

```json
{
  "results": [
    { "index": 0, "score": 15.23 },
    { "index": 1, "score": 8.12 }
  ],
  "num_documents": 2
}
```
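The same rerank through the Python SDK, as a sketch that assumes the response JSON above is returned as-is:

```python
from next_plaid_client import NextPlaidClient

client = NextPlaidClient("http://localhost:8080")
ranked = client.rerank(
    query="What is the capital of France?",
    documents=[
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
    ],
)
for r in ranked["results"]:
    print(r["index"], r["score"])  # input position and MaxSim score
```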
Health
GET /health

```json
{
  "status": "healthy",
  "version": "1.0.1",
  "loaded_indices": 1,
  "index_dir": "/data/indices",
  "memory_usage_bytes": 104857600,
  "indices": [
    {
      "name": "my_index",
      "num_documents": 1000,
      "num_embeddings": 50000,
      "num_partitions": 512,
      "dimension": 128,
      "nbits": 4,
      "avg_doclen": 50.0,
      "has_metadata": true
    }
  ],
  "model": {
    "name": "GTE-ModernColBERT-v1",
    "path": "/models/GTE-ModernColBERT-v1",
    "quantized": false,
    "embedding_dim": 128,
    "batch_size": 128,
    "num_sessions": 1,
    "query_prefix": "[Q] ",
    "document_prefix": "[D] ",
    "query_length": 48,
    "document_length": 300,
    "do_query_expansion": true
  }
}
```
Error Codes
All errors return JSON:
```json
{
  "code": "ERROR_CODE",
  "message": "Human-readable description",
  "details": null
}
```
| Code | HTTP | When |
|---|---|---|
| INDEX_NOT_FOUND | 404 | Index does not exist |
| INDEX_ALREADY_EXISTS | 409 | Index name already taken |
| INDEX_NOT_DECLARED | 404 | Must POST /indices before updating |
| BAD_REQUEST | 400 | Invalid parameters |
| DIMENSION_MISMATCH | 400 | Embedding dimension doesn't match the index |
| METADATA_NOT_FOUND | 404 | No metadata database for this index |
| MODEL_NOT_LOADED | 400 | Encoding endpoint needs --model |
| MODEL_ERROR | 500 | ONNX inference failed |
| SERVICE_UNAVAILABLE | 503 | Queue full; retry later |
| RATE_LIMITED | 429 | Too many requests (only when RATE_LIMIT_ENABLED; retry after 2 s) |
| INTERNAL_ERROR | 500 | Unexpected server error |
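Clients can branch on the `code` field rather than the HTTP status alone. A sketch over plain HTTP that avoids assuming any particular SDK exception classes:

```python
import requests

resp = requests.post(
    "http://localhost:8080/indices/missing/search_with_encoding",
    json={"queries": ["hello"], "params": {"top_k": 5}},
)
if resp.status_code != 200:
    err = resp.json()  # {"code": ..., "message": ..., "details": ...}
    if err["code"] == "INDEX_NOT_FOUND":
        ...  # create the index, then retry
    elif err["code"] in ("SERVICE_UNAVAILABLE", "RATE_LIMITED"):
        ...  # back off and retry
    else:
        raise RuntimeError(f'{err["code"]}: {err["message"]}')
```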
Python SDK
```bash
pip install next-plaid-client
```
Both sync and async clients:
```python
from next_plaid_client import NextPlaidClient, AsyncNextPlaidClient
from next_plaid_client import IndexConfig, SearchParams

# Sync
client = NextPlaidClient("http://localhost:8080")

# Async
client = AsyncNextPlaidClient("http://localhost:8080")
await client.search("docs", ["query"])
```
SDK Methods
| Method | Description |
|---|---|
| `client.health()` | Health check |
| `client.create_index(name, config)` | Create index |
| `client.delete_index(name)` | Delete index |
| `client.get_index(name)` | Get index info |
| `client.list_indices()` | List all indices |
| `client.add(name, documents, metadata)` | Add documents (text or embeddings) |
| `client.search(name, queries, params, filter_condition, filter_parameters)` | Search |
| `client.delete(name, condition, parameters)` | Delete by filter |
| `client.encode(texts, input_type, pool_factor)` | Encode texts |
| `client.rerank(query, documents)` | Rerank documents |
Docker
Images
```bash
# CPU (amd64 + arm64)
docker pull ghcr.io/lightonai/next-plaid:cpu-1.0.6

# CUDA (amd64, requires NVIDIA GPU)
docker pull ghcr.io/lightonai/next-plaid:cuda-1.0.6
```
The Docker entrypoint auto-downloads HuggingFace models: pass an `org/model` ID as `--model` and it handles the rest. Set `HF_TOKEN` for private models.
Docker Compose (CPU)
```yaml
services:
  next-plaid-api:
    image: ghcr.io/lightonai/next-plaid:cpu-1.0.6
    ports:
      - "8080:8080"
    volumes:
      - ${NEXT_PLAID_DATA:-~/.local/share/next-plaid}:/data/indices
      - ${NEXT_PLAID_MODELS:-~/.cache/huggingface/next-plaid}:/models
    environment:
      - RUST_LOG=info
    command:
      - --host
      - "0.0.0.0"
      - --port
      - "8080"
      - --index-dir
      - /data/indices
      - --model
      - lightonai/answerai-colbert-small-v1-onnx
      - --int8
      - --parallel
      - "16"
      - --batch-size
      - "4"
    healthcheck:
      test:
        ["CMD", "curl", "-f", "--max-time", "5", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 2
      start_period: 120s
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 16G
```
Docker Compose (CUDA)
```yaml
services:
  next-plaid-api:
    image: ghcr.io/lightonai/next-plaid:cuda-1.0.6
    ports:
      - "8080:8080"
    volumes:
      - ${NEXT_PLAID_DATA:-~/.local/share/next-plaid}:/data/indices
      - ${NEXT_PLAID_MODELS:-~/.cache/huggingface/next-plaid}:/models
    environment:
      - RUST_LOG=info
      - NVIDIA_VISIBLE_DEVICES=all
    command:
      - --host
      - "0.0.0.0"
      - --port
      - "8080"
      - --index-dir
      - /data/indices
      - --model
      - lightonai/GTE-ModernColBERT-v1
      - --cuda
      - --batch-size
      - "128"
    healthcheck:
      test:
        ["CMD", "curl", "-f", "--max-time", "5", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 2
      start_period: 120s
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 16G
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
Volume Mounts
| Host Path | Container Path | Purpose |
|---|---|---|
| ~/.local/share/next-plaid | /data/indices | Persistent index storage |
| ~/.cache/huggingface/next-plaid | /models | HuggingFace model cache |
CLI Reference
```bash
next-plaid-api [OPTIONS]
```
| Flag | Default | Description |
|---|---|---|
| -h, --host | 0.0.0.0 | Bind address |
| -p, --port | 8080 | Bind port |
| -d, --index-dir | ./indices | Index storage directory |
| -m, --model | _(none)_ | ONNX model path or HuggingFace ID |
| --cuda | off | CUDA for model inference |
| --int8 | off | INT8-quantized model (~2x faster on CPU) |
| --parallel | 1 | Parallel ONNX sessions (recommended: 8-25 for throughput) |
| --batch-size | auto | Batch size per session (32 CPU, 64 GPU, 2 when parallel) |
| --threads | auto | Threads per ONNX session (auto: 1 when parallel) |
| --query-length | 48 | Max query length in tokens |
| --document-length | 300 | Max document length in tokens |
| --model-pool-size | 1 | Number of model worker instances for concurrent encoding |
```bash
# Embeddings-only (no model)
next-plaid-api -p 3000 -d /data/indices

# CPU with model
next-plaid-api --model lightonai/answerai-colbert-small-v1-onnx --int8 --parallel 16

# GPU
next-plaid-api --model lightonai/GTE-ModernColBERT-v1 --cuda --batch-size 128

# Debug logging
RUST_LOG=debug next-plaid-api --model ./models/colbert
```
Architecture
```mermaid
flowchart TD
    subgraph API["REST API (Axum)"]
        H["/health"]
        I["/indices/*"]
        S["/search"]
        E["/encode"]
        R["/rerank"]
    end

    subgraph Middleware
        RL["Rate Limiter<br/>token bucket · optional"]
        CL["Concurrency Limiter"]
        TR["Tracing<br/>X-Request-ID"]
        TO["Timeout<br/>30s health · 300s ops"]
    end

    subgraph Workers["Background Workers"]
        UQ["Update Batch Queue<br/>per index"]
        DQ["Delete Batch Queue<br/>per index"]
        EQ["Encode Batch Queue<br/>global"]
    end

    subgraph Core["Core (next-plaid)"]
        NP["MmapIndex<br/>IVF + PQ + MaxSim"]
        SQ["SQLite<br/>Metadata Filtering"]
    end

    subgraph Model["Model (next-plaid-onnx)"]
        OX["ONNX Runtime<br/>ColBERT Encoder"]
    end

    API --> Middleware
    I --> UQ
    I --> DQ
    E --> EQ
    UQ --> NP
    DQ --> NP
    UQ --> SQ
    DQ --> SQ
    S --> NP
    S --> SQ
    EQ --> OX
    R --> OX

    style API fill:#4a90d9,stroke:#357abd,color:#fff
    style Middleware fill:#50b86c,stroke:#3d9956,color:#fff
    style Workers fill:#e8913a,stroke:#d07a2e,color:#fff
    style Core fill:#9b59b6,stroke:#8445a0,color:#fff
    style Model fill:#e74c3c,stroke:#c0392b,color:#fff
```
Concurrency Design
The API uses lock-free reads and batched writes for high throughput:
- Reads (search, metadata queries): lock-free via `ArcSwap`. Readers never block, even during writes.
- Index updates: a per-index batch queue collects requests and processes up to 300 documents (or a 100 ms timeout) in a single atomic operation (sketched after the diagram below).
- Deletes: a per-index delete queue batches conditions and resolves IDs inside the lock to handle ID shifting correctly.
- Encoding: a global worker pool with N model instances. Requests are grouped by `input_type` and `pool_factor`, then encoded in a single batch.
- Auto-repair: before every update/delete, the API checks whether the vector index and SQLite metadata are in sync. If not, it repairs automatically.
```mermaid
flowchart LR
    R1["Request 1"] --> BQ["Batch Queue"]
    R2["Request 2"] --> BQ
    R3["Request 3"] --> BQ
    BQ -->|"collect until<br/>300 docs or 100ms"| BW["Batch Worker"]
    BW -->|"acquire lock"| IDX["Index Update"]
    IDX --> META["Metadata Update"]
    META --> EVICT["Eviction Check"]
    EVICT --> RELOAD["Atomic Reload<br/>(ArcSwap)"]

    style BQ fill:#e8913a,stroke:#d07a2e,color:#fff
    style BW fill:#e8913a,stroke:#d07a2e,color:#fff
    style IDX fill:#9b59b6,stroke:#8445a0,color:#fff
    style META fill:#9b59b6,stroke:#8445a0,color:#fff
    style EVICT fill:#9b59b6,stroke:#8445a0,color:#fff
    style RELOAD fill:#50b86c,stroke:#3d9956,color:#fff
```
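The batching behavior means many small concurrent writes are cheap. A sketch with the async client, assuming its `add()` mirrors the sync signature:

```python
import asyncio

from next_plaid_client import AsyncNextPlaidClient

async def main() -> None:
    client = AsyncNextPlaidClient("http://localhost:8080")
    # 50 concurrent single-document adds: the per-index queue coalesces
    # them into batches of up to 300 docs (or a 100 ms window) per lock
    # acquisition, rather than 50 separate index updates.
    await asyncio.gather(*(
        client.add("docs", documents=[f"document {i}"], metadata=[{"id": f"d{i}"}])
        for i in range(50)
    ))

asyncio.run(main())
```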
Rate Limiting
Rate limiting is optional and disabled by default. Enable it by setting `RATE_LIMIT_ENABLED=true`. When enabled, the API applies a token-bucket algorithm to a subset of routes (a retry sketch follows the table):

| Scope | Rate limited? | Why |
|---|---|---|
| /health, / | No | Monitoring must always work |
| GET /indices, GET /indices/{name} | No | Clients poll during async operations |
| POST /indices/{name}/update* | No | Has per-index semaphore protection |
| DELETE /indices/{name}, DELETE /indices/{name}/documents | No | Has internal batching |
| /encode, /rerank* | No | Has internal backpressure via queue |
| Everything else | Yes | Standard rate limiting |
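When rate limiting is enabled, clients should back off on 429. A minimal retry sketch; the 2-second wait comes from the RATE_LIMITED row in the error table:

```python
import time

import requests

def post_with_retry(url: str, payload: dict, attempts: int = 5) -> requests.Response:
    """POST, retrying on 429 with the documented 2 s back-off."""
    for _ in range(attempts):
        resp = requests.post(url, json=payload)
        if resp.status_code != 429:
            return resp
        time.sleep(2.0)  # documented retry window for RATE_LIMITED
    raise RuntimeError("still rate limited after retries")
```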
Environment Variables
Rate Limiting & Concurrency
| Variable | Default | Description |
|---|---|---|
| RATE_LIMIT_ENABLED | false | Enable rate limiting (`true`, `1`, or `yes` to enable) |
| RATE_LIMIT_PER_SECOND | 50 | Sustained requests/second (when enabled) |
| RATE_LIMIT_BURST_SIZE | 100 | Max burst size (when enabled) |
| CONCURRENCY_LIMIT | 100 | Max concurrent in-flight requests |
Document Batching
| Variable | Default | Description |
|---|---|---|
| MAX_QUEUED_TASKS_PER_INDEX | 10 | Max pending updates per index (503 when full) |
| MAX_BATCH_DOCUMENTS | 300 | Documents per batch before processing |
| BATCH_CHANNEL_SIZE | 100 | Buffer for the document batch queue |
Encode Batching
| Variable | Default | Description |
|---|---|---|
| MAX_BATCH_TEXTS | 64 | Texts per encoding batch |
| ENCODE_BATCH_CHANNEL_SIZE | 256 | Buffer for the encode batch queue |
Delete Batching
| Variable | Default | Description |
|---|---|---|
| DELETE_BATCH_MIN_WAIT | 500 | Min wait (ms) after the first delete before processing |
| DELETE_BATCH_MAX_WAIT | 2000 | Max wait (ms) for accumulating deletes |
| MAX_DELETE_BATCH_CONDITIONS | 200 | Max conditions per delete batch |
Logging
| Variable | Default | Description |
|---|---|---|
| RUST_LOG | info | Log level (debug, info, warn, error) |
| HF_TOKEN | _(none)_ | HuggingFace token for private model downloads |
Feature Flags
| Feature | Description |
|---|---|
| _(default)_ | Core API, no BLAS, no model support |
| openblas | OpenBLAS for matrix operations (Linux) |
| accelerate | Apple Accelerate (macOS) |
| model | ONNX model encoding (/encode, `*_with_encoding`) |
| cuda | CUDA acceleration (implies model) |
```bash
# Embeddings-only API
cargo build --release -p next-plaid-api

# With model support (CPU, Linux)
cargo build --release -p next-plaid-api --features "openblas,model"

# With CUDA
cargo build --release -p next-plaid-api --features "cuda"
```
Modules
| Module | Lines | Description |
|---|---|---|
| `handlers/documents` | 1,638 | Index CRUD, update batching, delete batching, eviction, auto-repair |
| `models` | 759 | All request/response JSON schemas with OpenAPI annotations |
| `handlers/encode` | 549 | Encode worker pool, batch grouping by input type, ONNX inference |
| `state` | 488 | AppState, IndexSlot (ArcSwap), model pool, config caching |
| `handlers/search` | 449 | Search + filtered search, metadata enrichment, text-to-search pipeline |
| `handlers/metadata` | 484 | Metadata CRUD: check, query, get, count, update |
| `handlers/rerank` | 292 | ColBERT MaxSim scoring, text and embedding reranking |
| `error` | 138 | Error types with HTTP status code mapping |
| `tracing_middleware` | 115 | Request tracing via X-Request-ID header |
| `main` | 887 | CLI argument parsing, router construction, Swagger UI, server startup |
| `lib` | 44 | PrettyJson response type, module re-exports |
Dependencies
| Crate | Purpose |
|---|---|
| `next-plaid` | Core PLAID index (IVF + PQ + MaxSim) |
| `next-plaid-onnx` | ColBERT ONNX encoding (optional) |
| `axum` 0.8 | Web framework |
| `tokio` | Async runtime |
| `tower` / `tower-http` | Middleware (CORS, tracing, timeout, concurrency) |
| `tower_governor` | Rate limiting (token bucket) |
| `utoipa` / `utoipa-swagger-ui` | OpenAPI generation + Swagger UI |
| `arc-swap` | Lock-free index swapping |
| `parking_lot` | Fast read-write locks |
| `sysinfo` | Process memory usage for /health |
| `uuid` | Request trace IDs |
| `ndarray` | N-dimensional arrays |
| `serde` / `serde_json` | Serialization |
License
Apache-2.0