pylate-onnx-export

Export HuggingFace ColBERT models to ONNX format for high-performance inference.

Installation

pip install pylate-onnx-export

Requirements: Python 3.10-3.12

Dependencies:

torch, pylate, onnx, onnxruntime, and huggingface_hub (installed automatically by pip).

CLI Usage

Export a Model

# Export ColBERT model to ONNX (creates both FP32 and INT8 quantized versions by default)
pylate-onnx-export lightonai/GTE-ModernColBERT-v1

# Export FP32 only (skip quantization)
pylate-onnx-export lightonai/GTE-ModernColBERT-v1 --no-quantize

# Export to custom directory
pylate-onnx-export lightonai/GTE-ModernColBERT-v1 -o ./my-models

# Force re-export (overwrites existing)
pylate-onnx-export lightonai/GTE-ModernColBERT-v1 --force

Push to HuggingFace Hub

# Export and push to Hub
pylate-onnx-export lightonai/GTE-ModernColBERT-v1 --push-to-hub myorg/my-onnx-model

# Push as private repository
pylate-onnx-export lightonai/GTE-ModernColBERT-v1 --push-to-hub myorg/my-model --private

Quantize Existing Model

colbert-quantize ./models/GTE-ModernColBERT-v1

CLI Reference

pylate-onnx-export

Usage: pylate-onnx-export [OPTIONS] MODEL_NAME

Arguments:
  MODEL_NAME              HuggingFace model name or local path

Options:
  -o, --output-dir DIR    Output directory (default: ./models/<model-name>)
  --no-quantize           Skip INT8 quantization (by default, both FP32 and INT8 are created)
  -f, --force             Force re-export even if exists
  --push-to-hub REPO_ID   Push to HuggingFace Hub
  --private               Make Hub repository private
  --quiet                 Suppress progress messages
  --version               Show version

colbert-quantize

Usage: colbert-quantize [OPTIONS] MODEL_DIR

Arguments:
  MODEL_DIR               Directory containing model.onnx

Options:
  --quiet                 Suppress progress messages

Python API

export_model

Export a ColBERT model from HuggingFace to ONNX format.

from colbert_export import export_model

output_dir = export_model(
    model_name="lightonai/GTE-ModernColBERT-v1",
    output_dir="./models",      # Optional, defaults to ./models/<model-name>
    quantize=True,              # Also create INT8 model
    verbose=True,               # Print progress
    force=False,                # Skip if exists
)

Parameters:

Parameter    Type          Default    Description
model_name   str           required   HuggingFace model name or local path
output_dir   Path | None   None       Output directory
quantize     bool          False      Create INT8 quantized version
verbose      bool          True       Print progress messages
force        bool          False      Re-export even if exists

Returns: Path to output directory

quantize_model

Apply INT8 dynamic quantization to an existing ONNX model.

from colbert_export import quantize_model

quantized_path = quantize_model(
    model_dir="./models/GTE-ModernColBERT-v1",
    verbose=True,
)

Parameters:

Parameter   Type   Default    Description
model_dir   Path   required   Directory containing model.onnx
verbose     bool   True       Print progress messages

Returns: Path to quantized model (model_int8.onnx)
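
Under the hood, INT8 dynamic quantization of an ONNX file typically comes down to a single onnxruntime call. A minimal sketch of the equivalent operation (not necessarily the package's exact code):

from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="./models/GTE-ModernColBERT-v1/model.onnx",
    model_output="./models/GTE-ModernColBERT-v1/model_int8.onnx",
    weight_type=QuantType.QInt8,  # store weights as signed INT8
)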

Quantization benefits:

Roughly 4x smaller model file (INT8 weights instead of FP32 weights)
Faster inference on CPU, typically with only a small loss in retrieval accuracy

push_to_hub

Push an exported ONNX model to HuggingFace Hub.

from colbert_export import push_to_hub

repo_url = push_to_hub(
    model_dir="./models/GTE-ModernColBERT-v1",
    repo_id="myorg/my-onnx-model",
    private=False,
    verbose=True,
)

Parameters:

Parameter   Type   Default    Description
model_dir   Path   required   Directory containing exported model
repo_id     str    required   HuggingFace Hub repository ID
private     bool   False      Make repository private
verbose     bool   True       Print progress messages

Returns: str URL of the uploaded model
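
The upload is equivalent to creating the repository and uploading the export directory with huggingface_hub; a hedged sketch of the same flow:

from huggingface_hub import HfApi

api = HfApi()  # authenticates via your cached token (huggingface-cli login)
api.create_repo("myorg/my-onnx-model", private=False, exist_ok=True)
api.upload_folder(
    folder_path="./models/GTE-ModernColBERT-v1",
    repo_id="myorg/my-onnx-model",
)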

Uploaded files:

Everything in the export directory: model.onnx, model_int8.onnx (if present), tokenizer.json, and onnx_config.json.


Output Structure

models/<model-name>/
├── model.onnx                        # FP32 ONNX model
├── model_int8.onnx                   # INT8 quantized (created by default)
├── tokenizer.json                    # HuggingFace fast tokenizer
└── onnx_config.json                  # Model configuration

onnx_config.json Schema

{
  "model_type": "ColBERT",
  "model_name": "lightonai/GTE-ModernColBERT-v1",
  "model_class": "ModernBertModel",
  "uses_token_type_ids": false,
  "query_prefix": "[Q] ",
  "document_prefix": "[D] ",
  "query_length": 32,
  "document_length": 180,
  "do_query_expansion": true,
  "attend_to_expansion_tokens": false,
  "skiplist_words": [".", ",", "!", "?", "..."],
  "embedding_dim": 128,
  "mask_token_id": 50264,
  "pad_token_id": 50283,
  "query_prefix_id": 50281,
  "document_prefix_id": 50282,
  "do_lower_case": false
}

Field                        Type        Description
model_type                   str         Always "ColBERT"
model_name                   str         Source HuggingFace model name
model_class                  str         Transformer class (e.g., ModernBertModel)
uses_token_type_ids          bool        Whether the model uses token type IDs (BERT: true, ModernBERT: false)
query_prefix                 str         Prefix for queries (e.g., "[Q] ")
document_prefix              str         Prefix for documents (e.g., "[D] ")
query_length                 int         Maximum query sequence length
document_length              int         Maximum document sequence length
do_query_expansion           bool        Expand queries with MASK tokens
attend_to_expansion_tokens   bool        Whether other tokens attend to the MASK expansion tokens
skiplist_words               list[str]   Punctuation tokens to filter from documents
embedding_dim                int         Output embedding dimension
mask_token_id                int         MASK token ID for query expansion
pad_token_id                 int         PAD token ID
query_prefix_id              int         Token ID for query prefix
document_prefix_id           int         Token ID for document prefix
do_lower_case                bool        Lowercase text before tokenization
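
These fields drive tokenization at inference time. A minimal sketch of query-side preprocessing, assuming the exported tokenizer.json and onnx_config.json sit side by side (preprocess_query is an illustrative helper, not part of the package):

import json
from tokenizers import Tokenizer

def preprocess_query(text, model_dir="./models/GTE-ModernColBERT-v1"):
    cfg = json.load(open(f"{model_dir}/onnx_config.json"))
    tok = Tokenizer.from_file(f"{model_dir}/tokenizer.json")

    if cfg["do_lower_case"]:
        text = text.lower()
    # One plausible use of the prefix: prepend it before tokenizing.
    ids = tok.encode(cfg["query_prefix"] + text).ids[: cfg["query_length"]]

    n_real = len(ids)
    pad_id = cfg["mask_token_id"] if cfg["do_query_expansion"] else cfg["pad_token_id"]
    ids += [pad_id] * (cfg["query_length"] - n_real)  # expand/pad to fixed length

    attend_pads = cfg["do_query_expansion"] and cfg["attend_to_expansion_tokens"]
    mask = [1] * n_real + ([1] if attend_pads else [0]) * (len(ids) - n_real)
    return ids, mask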

ONNX Model Specification

Inputs

Name             Shape              Type    Description
input_ids        [batch, seq_len]   int64   Tokenized input IDs
attention_mask   [batch, seq_len]   int64   Attention mask (1 = attend, 0 = pad)
token_type_ids   [batch, seq_len]   int64   Token type IDs (BERT only)

Note: token_type_ids is only present for BERT-based models. ModernBERT models do not use this input.

Output

Name     Shape                   Type      Description
output   [batch, seq_len, dim]   float32   L2-normalized token embeddings
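
A minimal inference sketch with onnxruntime, assuming a ModernBERT-based export (no token_type_ids) and the preprocess_query helper sketched above:

import numpy as np
import onnxruntime as ort

model_dir = "./models/GTE-ModernColBERT-v1"
session = ort.InferenceSession(f"{model_dir}/model.onnx")

ids, mask = preprocess_query("what is late interaction?", model_dir)
inputs = {
    "input_ids": np.array([ids], dtype=np.int64),
    "attention_mask": np.array([mask], dtype=np.int64),
}
# BERT-based exports additionally require token_type_ids (all zeros).

(embeddings,) = session.run(None, inputs)
print(embeddings.shape)  # (1, 32, 128) for the example config above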

Export Details


Export Pipeline

The export process:

HuggingFace Model
       ↓
PyLate ColBERT (adds [Q]/[D] tokens, extends embeddings)
       ↓
ColBERTForONNX wrapper
  ├── Transformer backbone
  ├── Linear projection layer(s)
  └── L2 normalization
       ↓
torch.onnx.export (opset 14, dynamic axes)
       ↓
ONNX verification
       ↓
INT8 quantization (default)
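
For orientation, a hedged sketch of what the wrapper and export stages amount to (the actual ColBERTForONNX implementation may differ):

import torch
import torch.nn.functional as F

class ColBERTWrapper(torch.nn.Module):  # illustrative stand-in for ColBERTForONNX
    def __init__(self, backbone, projection):
        super().__init__()
        self.backbone = backbone      # transformer backbone
        self.projection = projection  # linear layer(s) down to embedding_dim

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        return F.normalize(self.projection(hidden), p=2, dim=-1)  # per-token L2 norm

def export_colbert(backbone, projection, path="model.onnx"):
    wrapper = ColBERTWrapper(backbone, projection).eval()
    ids = torch.ones(1, 8, dtype=torch.int64)   # dummy example inputs for tracing
    mask = torch.ones(1, 8, dtype=torch.int64)
    dyn = {0: "batch", 1: "seq_len"}
    torch.onnx.export(
        wrapper, (ids, mask), path, opset_version=14,
        input_names=["input_ids", "attention_mask"], output_names=["output"],
        dynamic_axes={"input_ids": dyn, "attention_mask": dyn, "output": dyn},
    )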

Supported Models

Any PyLate-compatible ColBERT model from HuggingFace can be exported, including models with ready-made ONNX exports:

lightonai/GTE-ModernColBERT-v1 (text retrieval, lightweight)
lightonai/answerai-colbert-small-v1-onnx (text retrieval, more accurate)
lightonai/LateOn-Code-edge (code search, lightweight)
lightonai/LateOn-Code (code search, more accurate)

License

Apache-2.0