NextPlaid API

A REST API for multi-vector search with built-in text encoding.
Async batching, metadata filtering, optional rate limiting, Swagger UI. Powers the NextPlaid ecosystem.

Quick Start · API Reference · Python SDK · Docker · Architecture


Quick Start

Run with Docker (recommended):

# CPU with built-in model
docker run -p 8080:8080 -v ~/.local/share/next-plaid:/data/indices \
  ghcr.io/lightonai/next-plaid:cpu-1.0.6 \
  --host 0.0.0.0 --port 8080 --index-dir /data/indices \
  --model lightonai/answerai-colbert-small-v1-onnx --int8

# GPU with CUDA
docker run --gpus all -p 8080:8080 -v ~/.local/share/next-plaid:/data/indices \
  ghcr.io/lightonai/next-plaid:cuda-1.0.6 \
  --host 0.0.0.0 --port 8080 --index-dir /data/indices \
  --model lightonai/GTE-ModernColBERT-v1 --cuda

Use from Python:

pip install next-plaid-client
from next_plaid_client import NextPlaidClient, IndexConfig

client = NextPlaidClient("http://localhost:8080")

# Create index and add documents
client.create_index("docs", IndexConfig(nbits=4))
client.add("docs",
    documents=["NextPlaid is a multi-vector database", "ColGREP searches code semantically"],
    metadata=[{"id": "doc_1"}, {"id": "doc_2"}],
)

# Search
results = client.search("docs", ["vector database"])

# Search with metadata filtering
results = client.search("docs", ["coding tool"],
    filter_condition="id = ?", filter_parameters=["doc_1"],
)

# Delete by predicate
client.delete("docs", "id = ?", ["doc_1"])

Or call the API directly:

# Create index
curl -X POST http://localhost:8080/indices \
  -H 'Content-Type: application/json' \
  -d '{"name": "docs", "config": {"nbits": 4}}'

# Add documents (text encoded server-side)
curl -X POST http://localhost:8080/indices/docs/update_with_encoding \
  -H 'Content-Type: application/json' \
  -d '{"documents": ["hello world"], "metadata": [{"title": "test"}]}'

# Search
curl -X POST http://localhost:8080/indices/docs/search_with_encoding \
  -H 'Content-Type: application/json' \
  -d '{"queries": ["hello"], "params": {"top_k": 5}}'

Interactive docs at http://localhost:8080/swagger-ui.


Two Modes

NextPlaid API runs in two modes depending on whether you pass --model:

|  | With --model | Without --model |
|---|---|---|
| Encoding | Pass text, get results. Server encodes via ONNX Runtime. | You encode externally and pass embedding arrays. |
| Endpoints | All endpoints available, including *_with_encoding | Core endpoints only. Encoding endpoints return 400. |
| Use case | Production deployments, Python SDK | Custom models, external encoding pipelines |
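
In embeddings-only mode you bring your own encoder and call the core update endpoint directly. A minimal sketch of that workflow, using the documented /indices/{name}/update payload; encode_fn is a hypothetical stand-in for your own ColBERT-style encoder:

import requests

BASE = "http://localhost:8080"

def encode_fn(text):
    # Hypothetical external encoder. Stand-in output: 2 tokens x 128 dims.
    return [[0.0] * 128 for _ in range(2)]

docs = ["NextPlaid is a multi-vector database"]
payload = {
    "documents": [{"embeddings": encode_fn(d)} for d in docs],
    "metadata": [{"id": "doc_1"}],
}
# Core endpoint: accepts pre-computed embeddings, works without --model.
requests.post(f"{BASE}/indices/docs/update", json=payload).raise_for_status()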

API Reference

Health & Documentation

| Method | Path | Description |
|---|---|---|
| GET | /health | Health check with system info, model config, all index summaries |
| GET | / | Alias for /health |
| GET | /swagger-ui | Interactive Swagger UI |
| GET | /api-docs/openapi.json | OpenAPI 3.0 specification |

Index Management

| Method | Path | Description |
|---|---|---|
| GET | /indices | List all indices |
| POST | /indices | Declare a new index (config only, no data) |
| GET | /indices/{name} | Get index info (docs, partitions, dimension) |
| DELETE | /indices/{name} | Delete an index and all its data |
| PUT | /indices/{name}/config | Update config (e.g. max_documents) |

Documents

| Method | Path | Returns | Description |
|---|---|---|---|
| POST | /indices/{name}/update | 202 | Add documents with pre-computed embeddings |
| POST | /indices/{name}/update_with_encoding | 202 | Add documents as text (server encodes) |
| POST | /indices/{name}/documents | 202 | Add to existing index (legacy) |
| DELETE | /indices/{name}/documents | 202 | Delete by SQL WHERE condition |

All document mutations return 202 Accepted and are processed asynchronously. Concurrent requests to the same index are batched automatically.
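
Because mutations are asynchronous, a 202 only means the update was queued. A simple way to wait for it is to poll index info until the count catches up; a minimal sketch, assuming GET /indices/{name} reports a num_documents field as the /health payload does:

import time
import requests

def wait_for_count(base, index, expected, timeout=30.0):
    """Poll index info until the document count reaches `expected`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        info = requests.get(f"{base}/indices/{index}").json()
        if info.get("num_documents", 0) >= expected:
            return info
        time.sleep(0.2)  # GET /indices/* is exempt from rate limiting
    raise TimeoutError(f"{index} did not reach {expected} documents")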

Search

| Method | Path | Description |
|---|---|---|
| POST | /indices/{name}/search | Search with embedding arrays |
| POST | /indices/{name}/search/filtered | Search + SQL metadata filter |
| POST | /indices/{name}/search_with_encoding | Search with text queries |
| POST | /indices/{name}/search/filtered_with_encoding | Text search + metadata filter |

Metadata

| Method | Path | Description |
|---|---|---|
| GET | /indices/{name}/metadata | Get all metadata entries |
| GET | /indices/{name}/metadata/count | Count metadata entries |
| POST | /indices/{name}/metadata/check | Check which doc IDs have metadata |
| POST | /indices/{name}/metadata/query | Get doc IDs matching SQL condition |
| POST | /indices/{name}/metadata/get | Get metadata by IDs or SQL condition |
| POST | /indices/{name}/metadata/update | Update metadata rows matching condition |
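
The request bodies for the condition-based metadata endpoints are not shown in this document; the sketch below assumes they take the same {condition, parameters} shape as DELETE /indices/{name}/documents, which is an assumption, not a documented contract:

import requests

# Assumed payload shape, mirroring the documented delete endpoint.
body = {"condition": "country = ?", "parameters": ["France"]}
ids = requests.post(
    "http://localhost:8080/indices/docs/metadata/query", json=body
).json()
print(ids)  # doc IDs whose metadata matches the SQL condition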

Encoding & Reranking

| Method | Path | Description |
|---|---|---|
| POST | /encode | Encode texts to ColBERT embeddings |
| POST | /rerank | Rerank with pre-computed embeddings (MaxSim) |
| POST | /rerank_with_encoding | Rerank with text (server encodes + MaxSim) |

Request & Response Examples

Create Index

POST /indices
{
  "name": "my_index",
  "config": {
    "nbits": 4,
    "batch_size": 50000,
    "seed": 42,
    "start_from_scratch": 999,
    "max_documents": 10000
  }
}

| Field | Default | Description |
|---|---|---|
| nbits | 4 | Quantization bits (2 or 4) |
| batch_size | 50000 | Documents per indexing chunk |
| seed | null | Random seed for K-means |
| start_from_scratch | 999 | Below this doc count, full rebuild on update |
| max_documents | null | Evict oldest when exceeded (null = unlimited) |
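
With the Python SDK the same config is expressed through IndexConfig. A sketch, assuming its keyword arguments mirror the JSON fields one-to-one (only nbits is confirmed by the Quick Start):

from next_plaid_client import NextPlaidClient, IndexConfig

client = NextPlaidClient("http://localhost:8080")
client.create_index(
    "my_index",
    IndexConfig(
        nbits=4,              # 2 or 4 quantization bits
        batch_size=50000,     # documents per indexing chunk
        seed=42,              # reproducible K-means
        max_documents=10000,  # evict oldest beyond this count
    ),
)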

Add Documents (text)

POST /indices/my_index/update_with_encoding
{
  "documents": [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany."
  ],
  "metadata": [{ "country": "France" }, { "country": "Germany" }],
  "pool_factor": 2
}

Returns 202 Accepted. The pool_factor reduces the number of embeddings stored per document via hierarchical clustering (e.g. a pool_factor of 2 keeps roughly 50% of the token embeddings).

Add Documents (embeddings)

POST /indices/my_index/update
{
  "documents": [
    {
      "embeddings": [
        [0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6]
      ]
    },
    {
      "embeddings": [
        [0.7, 0.8, 0.9],
        [0.1, 0.2, 0.3]
      ]
    }
  ],
  "metadata": [{ "title": "Doc A" }, { "title": "Doc B" }]
}

Search (text)

POST /indices/my_index/search_with_encoding
{
  "queries": ["What is the capital of France?"],
  "params": { "top_k": 10 }
}

Response:

{
  "results": [
    {
      "query_id": 0,
      "document_ids": [0, 1],
      "scores": [18.42, 12.67],
      "metadata": [{ "country": "France" }, { "country": "Germany" }]
    }
  ],
  "num_queries": 1
}
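
A short sketch reading that response shape, pairing each document ID with its score and metadata:

import requests

resp = requests.post(
    "http://localhost:8080/indices/my_index/search_with_encoding",
    json={"queries": ["What is the capital of France?"], "params": {"top_k": 10}},
).json()

for result in resp["results"]:
    for doc_id, score, meta in zip(
        result["document_ids"], result["scores"], result["metadata"]
    ):
        print(f"query {result['query_id']}: doc {doc_id} score={score:.2f} {meta}")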

Search with Filter

POST /indices/my_index/search/filtered_with_encoding
{
  "queries": ["capital city"],
  "params": { "top_k": 5 },
  "filter_condition": "country = ?",
  "filter_parameters": ["France"]
}

Search Parameters

| Parameter | Default | Description |
|---|---|---|
| top_k | 10 | Results to return per query |
| n_ivf_probe | 8 | IVF cells to probe per query token |
| n_full_scores | 4096 | Candidates for exact re-ranking |
| centroid_score_threshold | null | Prune low-scoring centroids (e.g. 0.4) |
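
Through the SDK these map onto SearchParams; a minimal sketch, assuming its field names match the API parameters above:

from next_plaid_client import NextPlaidClient, SearchParams

client = NextPlaidClient("http://localhost:8080")
results = client.search(
    "my_index",
    ["capital city"],
    params=SearchParams(
        top_k=5,
        n_ivf_probe=16,                # probe more IVF cells for higher recall
        n_full_scores=8192,            # exactly re-rank more candidates
        centroid_score_threshold=0.4,  # prune low-scoring centroids
    ),
)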

Delete Documents

DELETE /indices/my_index/documents
{
  "condition": "country = ? AND year < ?",
  "parameters": ["outdated", 2020]
}

Returns 202 Accepted. Deletes are batched: multiple delete requests within a short window are processed together.

Encode

POST /encode
{
  "texts": ["Paris is the capital of France."],
  "input_type": "document",
  "pool_factor": 2
}

Response:

{
  "embeddings": [[[0.1, 0.2, ...], [0.3, 0.4, ...]]],
  "num_texts": 1
}

input_type is "query" or "document". Queries are expanded with MASK tokens up to the query length; documents have padding tokens filtered out.
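
This asymmetry matters when encoding externally. A sketch using the SDK's encode method, assuming it returns one [num_tokens, dim] matrix per input text:

from next_plaid_client import NextPlaidClient

client = NextPlaidClient("http://localhost:8080")

# Queries are MASK-expanded to a fixed length (48 tokens by default).
q = client.encode(["capital of France"], input_type="query")
# Documents keep only real tokens; pool_factor=2 roughly halves the count.
d = client.encode(
    ["Paris is the capital of France."], input_type="document", pool_factor=2
)
# Assuming the return shape above, the two matrices differ in length.
print(len(q[0]), len(d[0]))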

Rerank

POST /rerank_with_encoding
{
  "query": "What is the capital of France?",
  "documents": [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany."
  ],
  "pool_factor": null
}

Response:

{
  "results": [
    { "index": 0, "score": 15.23 },
    { "index": 1, "score": 8.12 }
  ],
  "num_documents": 2
}
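
The same call through the SDK, sorting candidates by MaxSim score; a sketch assuming client.rerank returns the parsed response shown above:

from next_plaid_client import NextPlaidClient

client = NextPlaidClient("http://localhost:8080")
candidates = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]
ranked = client.rerank("What is the capital of France?", candidates)
# Each result carries the index of the input document and its MaxSim score.
for r in sorted(ranked["results"], key=lambda r: r["score"], reverse=True):
    print(candidates[r["index"]], r["score"])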

Health

GET /health
{
  "status": "healthy",
  "version": "1.0.1",
  "loaded_indices": 1,
  "index_dir": "/data/indices",
  "memory_usage_bytes": 104857600,
  "indices": [
    {
      "name": "my_index",
      "num_documents": 1000,
      "num_embeddings": 50000,
      "num_partitions": 512,
      "dimension": 128,
      "nbits": 4,
      "avg_doclen": 50.0,
      "has_metadata": true
    }
  ],
  "model": {
    "name": "GTE-ModernColBERT-v1",
    "path": "/models/GTE-ModernColBERT-v1",
    "quantized": false,
    "embedding_dim": 128,
    "batch_size": 128,
    "num_sessions": 1,
    "query_prefix": "[Q] ",
    "document_prefix": "[D] ",
    "query_length": 48,
    "document_length": 300,
    "do_query_expansion": true
  }
}

Error Codes

All errors return JSON:

{
  "code": "ERROR_CODE",
  "message": "Human-readable description",
  "details": null
}

| Code | HTTP | When |
|---|---|---|
| INDEX_NOT_FOUND | 404 | Index does not exist |
| INDEX_ALREADY_EXISTS | 409 | Index name already taken |
| INDEX_NOT_DECLARED | 404 | Must POST /indices before updating |
| BAD_REQUEST | 400 | Invalid parameters |
| DIMENSION_MISMATCH | 400 | Embedding dimension doesn't match index |
| METADATA_NOT_FOUND | 404 | No metadata database for this index |
| MODEL_NOT_LOADED | 400 | Encoding endpoint called without --model |
| MODEL_ERROR | 500 | ONNX inference failed |
| SERVICE_UNAVAILABLE | 503 | Queue full, retry later |
| RATE_LIMITED | 429 | Too many requests (only when RATE_LIMIT_ENABLED; retry after ~2s) |
| INTERNAL_ERROR | 500 | Unexpected server error |
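
Since 503 and 429 are the explicitly retryable codes, a client can back off and retry just those; a minimal sketch with requests:

import time
import requests

def post_with_retry(url, json, attempts=5):
    """Retry only the retryable codes: 503 (queue full) and 429 (rate limited)."""
    for i in range(attempts):
        resp = requests.post(url, json=json)
        if resp.status_code not in (429, 503):
            resp.raise_for_status()
            return resp.json()
        time.sleep(2 * (i + 1))  # simple backoff; the docs suggest ~2s for 429
    raise RuntimeError(f"gave up after {attempts} attempts: {resp.status_code}")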

Python SDK

pip install next-plaid-client

Both sync and async clients:

from next_plaid_client import NextPlaidClient, AsyncNextPlaidClient
from next_plaid_client import IndexConfig, SearchParams

# Sync
client = NextPlaidClient("http://localhost:8080")

# Async
client = AsyncNextPlaidClient("http://localhost:8080")
await client.search("docs", ["query"])
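
Because the server batches concurrent work per index, firing requests concurrently is the intended usage pattern. A sketch with asyncio.gather, assuming the async client's methods mirror the sync ones:

import asyncio
from next_plaid_client import AsyncNextPlaidClient

async def main():
    client = AsyncNextPlaidClient("http://localhost:8080")
    queries = ["vector database", "semantic code search", "reranking"]
    # Issue the searches concurrently; each resolves independently.
    results = await asyncio.gather(*(client.search("docs", [q]) for q in queries))
    for q, r in zip(queries, results):
        print(q, r)

asyncio.run(main())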

SDK Methods

| Method | Description |
|---|---|
| client.health() | Health check |
| client.create_index(name, config) | Create index |
| client.delete_index(name) | Delete index |
| client.get_index(name) | Get index info |
| client.list_indices() | List all indices |
| client.add(name, documents, metadata) | Add documents (text or embeddings) |
| client.search(name, queries, params, filter_condition, filter_parameters) | Search |
| client.delete(name, condition, parameters) | Delete by filter |
| client.encode(texts, input_type, pool_factor) | Encode texts |
| client.rerank(query, documents) | Rerank documents |

Docker

Images

# CPU (amd64 + arm64)
docker pull ghcr.io/lightonai/next-plaid:cpu-1.0.6

# CUDA (amd64, requires NVIDIA GPU)
docker pull ghcr.io/lightonai/next-plaid:cuda-1.0.6

The Docker entrypoint auto-downloads HuggingFace models. Pass org/model as --model and it handles the rest. Set HF_TOKEN for private models.

Docker Compose (CPU)

services:
  next-plaid-api:
    image: ghcr.io/lightonai/next-plaid:cpu-1.0.6
    ports:
      - "8080:8080"
    volumes:
      - ${NEXT_PLAID_DATA:-~/.local/share/next-plaid}:/data/indices
      - ${NEXT_PLAID_MODELS:-~/.cache/huggingface/next-plaid}:/models
    environment:
      - RUST_LOG=info
    command:
      - --host
      - "0.0.0.0"
      - --port
      - "8080"
      - --index-dir
      - /data/indices
      - --model
      - lightonai/answerai-colbert-small-v1-onnx
      - --int8
      - --parallel
      - "16"
      - --batch-size
      - "4"
    healthcheck:
      test:
        ["CMD", "curl", "-f", "--max-time", "5", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 2
      start_period: 120s
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 16G

Docker Compose (CUDA)

services:
  next-plaid-api:
    image: ghcr.io/lightonai/next-plaid:cuda-1.0.6
    ports:
      - "8080:8080"
    volumes:
      - ${NEXT_PLAID_DATA:-~/.local/share/next-plaid}:/data/indices
      - ${NEXT_PLAID_MODELS:-~/.cache/huggingface/next-plaid}:/models
    environment:
      - RUST_LOG=info
      - NVIDIA_VISIBLE_DEVICES=all
    command:
      - --host
      - "0.0.0.0"
      - --port
      - "8080"
      - --index-dir
      - /data/indices
      - --model
      - lightonai/GTE-ModernColBERT-v1
      - --cuda
      - --batch-size
      - "128"
    healthcheck:
      test:
        ["CMD", "curl", "-f", "--max-time", "5", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 2
      start_period: 120s
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 16G
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Volume Mounts

| Host Path | Container Path | Purpose |
|---|---|---|
| ~/.local/share/next-plaid | /data/indices | Persistent index storage |
| ~/.cache/huggingface/next-plaid | /models | HuggingFace model cache |

CLI Reference

next-plaid-api [OPTIONS]

| Flag | Default | Description |
|---|---|---|
| -h, --host | 0.0.0.0 | Bind address |
| -p, --port | 8080 | Bind port |
| -d, --index-dir | ./indices | Index storage directory |
| -m, --model | (none) | ONNX model path or HuggingFace ID |
| --cuda | off | Use CUDA for model inference |
| --int8 | off | INT8-quantized model (~2x faster on CPU) |
| --parallel | 1 | Parallel ONNX sessions (recommended: 8-25 for throughput) |
| --batch-size | auto | Batch size per session (auto: 32 on CPU, 64 on GPU, 2 when parallel) |
| --threads | auto | Threads per ONNX session (auto: 1 when parallel) |
| --query-length | 48 | Max query length in tokens |
| --document-length | 300 | Max document length in tokens |
| --model-pool-size | 1 | Number of model worker instances for concurrent encoding |

# Embeddings-only (no model)
next-plaid-api -p 3000 -d /data/indices

# CPU with model
next-plaid-api --model lightonai/answerai-colbert-small-v1-onnx --int8 --parallel 16

# GPU
next-plaid-api --model lightonai/GTE-ModernColBERT-v1 --cuda --batch-size 128

# Debug logging
RUST_LOG=debug next-plaid-api --model ./models/colbert

Architecture

flowchart TD
    subgraph API["REST API (Axum)"]
        H["/health"]
        I["/indices/*"]
        S["/search"]
        E["/encode"]
        R["/rerank"]
    end

    subgraph Middleware
        RL["Rate Limiter
token bucket · optional"] CL["Concurrency Limiter"] TR["Tracing
X-Request-ID"] TO["Timeout
30s health · 300s ops"] end subgraph Workers["Background Workers"] UQ["Update Batch Queue
per index"] DQ["Delete Batch Queue
per index"] EQ["Encode Batch Queue
global"] end subgraph Core["Core (next-plaid)"] NP["MmapIndex
IVF + PQ + MaxSim"] SQ["SQLite
Metadata Filtering"] end subgraph Model["Model (next-plaid-onnx)"] OX["ONNX Runtime
ColBERT Encoder"] end API --> Middleware I --> UQ I --> DQ E --> EQ UQ --> NP DQ --> NP UQ --> SQ DQ --> SQ S --> NP S --> SQ EQ --> OX R --> OX style API fill:#4a90d9,stroke:#357abd,color:#fff style Middleware fill:#50b86c,stroke:#3d9956,color:#fff style Workers fill:#e8913a,stroke:#d07a2e,color:#fff style Core fill:#9b59b6,stroke:#8445a0,color:#fff style Model fill:#e74c3c,stroke:#c0392b,color:#fff

Concurrency Design

The API uses lock-free reads and batched writes for high throughput:

flowchart LR
    R1["Request 1"] --> BQ["Batch Queue"]
    R2["Request 2"] --> BQ
    R3["Request 3"] --> BQ
    BQ -->|"collect until
300 docs or 100ms"| BW["Batch Worker"] BW -->|"acquire lock"| IDX["Index Update"] IDX --> META["Metadata Update"] META --> EVICT["Eviction Check"] EVICT --> RELOAD["Atomic Reload
(ArcSwap)"] style BQ fill:#e8913a,stroke:#d07a2e,color:#fff style BW fill:#e8913a,stroke:#d07a2e,color:#fff style IDX fill:#9b59b6,stroke:#8445a0,color:#fff style META fill:#9b59b6,stroke:#8445a0,color:#fff style EVICT fill:#9b59b6,stroke:#8445a0,color:#fff style RELOAD fill:#50b86c,stroke:#3d9956,color:#fff

Rate Limiting

Rate limiting is optional and disabled by default. Enable it by setting RATE_LIMIT_ENABLED=true. When enabled, the API applies a token bucket algorithm to a subset of routes:

| Scope | Rate limited? | Why |
|---|---|---|
| /health, / | No | Monitoring must always work |
| GET /indices, GET /indices/{name} | No | Clients poll during async operations |
| POST /indices/{name}/update* | No | Has per-index semaphore protection |
| DELETE /indices/{name}, DELETE /indices/{name}/documents | No | Has internal batching |
| /encode, /rerank* | No | Has internal backpressure via queue |
| Everything else | Yes | Standard token bucket limits apply |

Environment Variables

Rate Limiting & Concurrency

| Variable | Default | Description |
|---|---|---|
| RATE_LIMIT_ENABLED | false | Enable rate limiting (true, 1, or yes to enable) |
| RATE_LIMIT_PER_SECOND | 50 | Sustained requests/second (when enabled) |
| RATE_LIMIT_BURST_SIZE | 100 | Max burst size (when enabled) |
| CONCURRENCY_LIMIT | 100 | Max concurrent in-flight requests |

Document Batching

| Variable | Default | Description |
|---|---|---|
| MAX_QUEUED_TASKS_PER_INDEX | 10 | Max pending updates per index (503 when full) |
| MAX_BATCH_DOCUMENTS | 300 | Documents per batch before processing |
| BATCH_CHANNEL_SIZE | 100 | Buffer for document batch queue |

Encode Batching

| Variable | Default | Description |
|---|---|---|
| MAX_BATCH_TEXTS | 64 | Texts per encoding batch |
| ENCODE_BATCH_CHANNEL_SIZE | 256 | Buffer for encode batch queue |

Delete Batching

| Variable | Default | Description |
|---|---|---|
| DELETE_BATCH_MIN_WAIT | 500 | Min wait (ms) after first delete before processing |
| DELETE_BATCH_MAX_WAIT | 2000 | Max wait (ms) for accumulating deletes |
| MAX_DELETE_BATCH_CONDITIONS | 200 | Max conditions per delete batch |

Logging

| Variable | Default | Description |
|---|---|---|
| RUST_LOG | info | Log level (debug, info, warn, error) |
| HF_TOKEN | (none) | HuggingFace token for private model downloads |

Feature Flags

| Feature | Description |
|---|---|
| (default) | Core API, no BLAS, no model support |
| openblas | OpenBLAS for matrix operations (Linux) |
| accelerate | Apple Accelerate (macOS) |
| model | ONNX model encoding (/encode, *_with_encoding) |
| cuda | CUDA acceleration (implies model) |

# Embeddings-only API
cargo build --release -p next-plaid-api

# With model support (CPU, Linux)
cargo build --release -p next-plaid-api --features "openblas,model"

# With CUDA
cargo build --release -p next-plaid-api --features "cuda"

Modules

| Module | Lines | Description |
|---|---|---|
| handlers/documents | 1,638 | Index CRUD, update batching, delete batching, eviction, auto-repair |
| models | 759 | All request/response JSON schemas with OpenAPI annotations |
| handlers/encode | 549 | Encode worker pool, batch grouping by input type, ONNX inference |
| state | 488 | AppState, IndexSlot (ArcSwap), model pool, config caching |
| handlers/search | 449 | Search + filtered search, metadata enrichment, text-to-search pipeline |
| handlers/metadata | 484 | Metadata CRUD: check, query, get, count, update |
| handlers/rerank | 292 | ColBERT MaxSim scoring, text and embedding reranking |
| error | 138 | Error types with HTTP status code mapping |
| tracing_middleware | 115 | Request tracing via X-Request-ID header |
| main | 887 | CLI argument parsing, router construction, Swagger UI, server startup |
| lib | 44 | PrettyJson response type, module re-exports |

Dependencies

| Crate | Purpose |
|---|---|
| next-plaid | Core PLAID index (IVF + PQ + MaxSim) |
| next-plaid-onnx | ColBERT ONNX encoding (optional) |
| axum 0.8 | Web framework |
| tokio | Async runtime |
| tower / tower-http | Middleware (CORS, tracing, timeout, concurrency) |
| tower_governor | Rate limiting (token bucket) |
| utoipa / utoipa-swagger-ui | OpenAPI generation + Swagger UI |
| arc-swap | Lock-free index swapping |
| parking_lot | Fast read-write locks |
| sysinfo | Process memory usage for /health |
| uuid | Request trace IDs |
| ndarray | N-dimensional arrays |
| serde / serde_json | Serialization |

License

Apache-2.0