pylate-rs
PyLate is a powerful tool for research and training with ColBERT, developed at LightOn. It carries a
heavy set of dependencies. That's fine for most environments, especially for training state-of-the-art
information retrieval models, but it can be a real headache when you just want to run inference in a live
application and spawn your model in milliseconds.
That's why we built pylate-rs.
The main
difference is that we've completely removed the PyTorch and Transformers
dependencies. Instead, we built it with Candle,
the deep-learning crate written in Rust.
The goal was to create a focused, lightweight tool that does one thing well: compute ColBERT
embeddings.
Time to import pylate_rs, initialize the model on the target device, and start computing embeddings with Python:
- CPU: -97%
- CUDA (GPU): -46%
- MPS (Apple silicon): -96%
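For example, here is a minimal sketch of computing embeddings in Python; it assumes the same models.ColBERT API used in the indexing example later in this post:

from pylate_rs import models

# Load the model on the target device ("cpu", "mps" or "cuda").
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    device="cpu",
)

# One embedding matrix per sentence, one embedding per token.
embeddings = model.encode(
    sentences=["pylate-rs computes ColBERT embeddings."],
    is_query=False,
)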
Along with the Python release, we published a dedicated crate that can be used in any Rust project and compiled to WebAssembly for use in the browser.
If you're not familiar with ColBERT, it's a model for computing sentence embeddings. As an
encoder-based model, it
generates an embedding for each token in a sentence. The output from the final
Transformer layer is
a matrix of shape (embedding_dimension, num_tokens), for instance, 768 x num_tokens. A linear layer then reduces the embedding dimension, resulting in a 128 x num_tokens matrix. In contrast, sentence transformers don't output per-token embeddings. Instead, they aggregate all token embeddings, a process called pooling, using methods like mean or max pooling. This produces a single vector representation for the entire sentence, with a shape like 768 x 1.
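To make the shapes concrete, here is a small illustrative sketch with random numbers; the projection matrix and the 12-token sentence are placeholders, not real model weights:

import numpy as np

# Output of the final Transformer layer for a 12-token sentence.
num_tokens, hidden_dim = 12, 768
token_embeddings = np.random.rand(hidden_dim, num_tokens)  # 768 x num_tokens

# ColBERT: a linear layer projects each token embedding down to 128 dimensions.
projection = np.random.rand(128, hidden_dim)
colbert_embeddings = projection @ token_embeddings  # 128 x num_tokens

# Sentence transformer: mean pooling collapses the tokens into a single vector.
pooled = token_embeddings.mean(axis=1, keepdims=True)  # 768 x 1

print(colbert_embeddings.shape, pooled.shape)  # (128, 12) (768, 1)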
ColBERT often outperforms sentence transformers
because it allows for
more fine-grained weight updates during training. If the model generates a poor
representation for
a specific
token, the weight adjustments can target that specific token's embedding. This is unlike standard
models that update
the entire sentence representation, even if only a small part was incorrect.
This per-token approach enables a powerful "late interaction" mechanism. Instead of
comparing two
fixed sentence
vectors, ColBERT calculates the similarity between each query token and all document
tokens. These
fine-grained scores are then aggregated to determine the final relevance score.
Formally, given a query Q with token embeddings q_i and a document D with token embeddings d_j,
the MaxSim score is calculated by finding the maximum similarity for each query token across
all document tokens, and then summing these maximums:

S(Q, D) = \sum_{i=1}^{|Q|} \max_{1 \leq j \leq |D|} (q_i \cdot d_j)
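In code, the same scoring can be sketched with numpy as follows; this is an illustration rather than the pylate-rs internals, and it assumes one row per token:

import numpy as np

def maxsim(query_embeddings: np.ndarray, document_embeddings: np.ndarray) -> float:
    """MaxSim: for each query token, take its best-matching document token,
    then sum these maxima. Shapes: (|Q|, dim) and (|D|, dim)."""
    similarities = query_embeddings @ document_embeddings.T  # (|Q|, |D|)
    return float(similarities.max(axis=1).sum())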
Every interactive chart below runs in the browser using WebAssembly, via the wasm bindings of pylate-rs.
As you can see, the MaxSim operation is a sum of maximum similarities, not an
average. Consequently,
the final score is
not bounded within a fixed range like [0, 1]
. Its magnitude scales with the number of
tokens in the
query, making it
difficult to apply a universal similarity threshold. The score's scale also depends on the specific
query and document context: one domain might on average yield higher scores than another.
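A rough illustration of this scaling, using random L2-normalized vectors (placeholders, not real model outputs): the longer query reaches a higher score against the very same document.

import numpy as np

rng = np.random.default_rng(0)

def random_unit_vectors(n, dim=128):
    vectors = rng.normal(size=(n, dim))
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def maxsim(query_embeddings, document_embeddings):
    return float((query_embeddings @ document_embeddings.T).max(axis=1).sum())

document = random_unit_vectors(30)
print(maxsim(random_unit_vectors(4), document))   # smaller score
print(maxsim(random_unit_vectors(16), document))  # larger score, same document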
This token-centric design also allows for visualizing the interactions between a query and a
document. In our
implementation, the token embeddings are L2-normalized. As a result, the similarity score for any
token pair is
bounded. The overall relevance score is then computed by
summing the maximum similarity score for each query token over all document tokens.
By summing the similarity scores of the document tokens that contribute to the MaxSim calculation,
we can visualize the
weight of each token in the final score. These are the specific document tokens that had the highest
similarity for each
corresponding query token. The visualization is freely inspired by the excellent demo from Jo Kristian Bergum.
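A minimal sketch of how those per-token weights can be derived (an illustration, not the demo's exact code): credit each query token's maximum similarity to the document token that produced it.

import numpy as np

def token_contributions(query_embeddings, document_embeddings):
    """Weight of each document token in the MaxSim score.
    The contributions sum to the overall relevance score."""
    similarities = query_embeddings @ document_embeddings.T  # (|Q|, |D|)
    best_tokens = similarities.argmax(axis=1)  # best document token per query token
    contributions = np.zeros(document_embeddings.shape[0])
    for query_index, document_index in enumerate(best_tokens):
        contributions[document_index] += similarities[query_index, document_index]
    return contributions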
However, it's crucial to remember that the token embeddings are contextualized.
This means a token
can indirectly
influence the score even if it is not highlighted as a top-scoring match. It does so by altering the
embeddings of its
neighboring tokens, which may be the ones that are directly measured. The final score is the sum of
the scores from the
highlighted tokens, but their values are shaped by the entire context.
It may seem paradoxical to praise ColBERT for its token-level
granularity only to
then find ways to reduce the number of token embeddings we use. In reality, this reflects a
practical
trade-off between computational cost and representational
detail.
pylate-rs
implements a token reduction strategy following the article
by Benjamin Clavié and Antoine Chaffin. The core idea is to find a balance between
using all tokens
and using
only the most
salient ones.
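In pylate-rs this is exposed through the pool_factor argument of encode, which is also used in the indexing example below. Here is a minimal sketch; with a pool_factor of 2, the number of document token embeddings is roughly halved:

import numpy as np
from pylate_rs import models

model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    device="cpu",
)

document = ["18th Arrondissement: Montmartre, Sacré-Cœur, Moulin Rouge, artistic, historic."]

full = model.encode(sentences=document, is_query=False)
pooled = model.encode(sentences=document, is_query=False, pool_factor=2)

# The pooled matrix keeps roughly half as many token embeddings.
print(np.asarray(full[0]).shape, np.asarray(pooled[0]).shape)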
Use the slider below to adjust this pooling factor. You will see how it simplifies the document
representation and
affects the final similarity score, illustrating the direct trade-off between
performance and
accuracy.
In June 2025, we released fast-plaid at LightOn, a Rust
implementation of the PLAID algorithm for
efficient nearest-neighbor search. Paired with pylate-rs
, fast-plaid
offers a
lightweight solution for running ColBERT as a retriever in Python. Currently, fast-plaid is
immutable, meaning the
index must be rebuilt to add new documents. For use cases requiring mutable indexes, we recommend
exploring
solutions like Weaviate's
implementation of MUVERA.
We plan to add mutability and filtering to fast-plaid in the future. Any contribution is welcome!
Here is sample code for running ColBERT with pylate-rs and fast-plaid. This is the fastest way
to create a multi-vector index, short of calling pylate-rs from Rust ⚡️. It is
compatible with both CUDA and CPU, and it is batch-oriented.
import torch
from fast_plaid import search
from pylate_rs import models
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    device="cpu",  # mps or cuda
)

documents = [
    "1st Arrondissement: Louvre, Tuileries Garden, Palais Royal, historic, tourist.",
    "2nd Arrondissement: Bourse, financial, covered passages, Sentier, business.",
    "3rd Arrondissement: Marais, Musée Picasso, galleries, trendy, historic.",
    "4th Arrondissement: Notre-Dame, Marais, Hôtel de Ville, LGBTQ+.",
    "5th Arrondissement: Latin Quarter, Sorbonne, Panthéon, student, intellectual.",
    "6th Arrondissement: Saint-Germain-des-Prés, Luxembourg Gardens, chic, artistic, cafés.",
    "7th Arrondissement: Eiffel Tower, Musée d'Orsay, Les Invalides, affluent, prestigious.",
    "8th Arrondissement: Champs-Élysées, Arc de Triomphe, luxury, shopping, Élysée.",
    "9th Arrondissement: Palais Garnier, department stores, shopping, theaters.",
    "10th Arrondissement: Gare du Nord, Gare de l'Est, Canal Saint-Martin.",
    "11th Arrondissement: Bastille, nightlife, Oberkampf, revolutionary, hip.",
    "12th Arrondissement: Bois de Vincennes, Opéra Bastille, Bercy, residential.",
    "13th Arrondissement: Chinatown, Bibliothèque Nationale, modern, diverse, street-art.",
    "14th Arrondissement: Montparnasse, Catacombs, residential, artistic, quiet.",
    "15th Arrondissement: Residential, family, populous, Parc André Citroën.",
    "16th Arrondissement: Trocadéro, Bois de Boulogne, affluent, elegant, embassies.",
    "17th Arrondissement: Diverse, Palais des Congrès, residential, Batignolles.",
    "18th Arrondissement: Montmartre, Sacré-Cœur, Moulin Rouge, artistic, historic.",
    "19th Arrondissement: Parc de la Villette, Cité des Sciences, canals, diverse.",
    "20th Arrondissement: Père Lachaise, Belleville, cosmopolitan, artistic, historic.",
]

# Encoding documents
documents_embeddings = model.encode(
    sentences=documents,
    is_query=False,
    pool_factor=2,  # Let's divide the number of embeddings by 2.
)

# Creating the FastPlaid index
fast_plaid = search.FastPlaid(index="index")
fast_plaid.create(
    documents_embeddings=[torch.tensor(embedding) for embedding in documents_embeddings]
)
We can then load the existing index and search for the most relevant documents:
import torch
from fast_plaid import search
from pylate_rs import models

# Load the same model used at indexing time to encode the queries.
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    device="cpu",  # mps or cuda
)

# Load the existing index.
fast_plaid = search.FastPlaid(index="index")

queries = [
    "arrondissement with the Eiffel Tower and Musée d'Orsay",
    "Latin Quarter and Sorbonne University",
    "arrondissement with Sacré-Cœur and Moulin Rouge",
    "arrondissement with the Louvre and Tuileries Garden",
    "arrondissement with Notre-Dame Cathedral and the Marais",
]

# Encoding queries
queries_embeddings = model.encode(
    sentences=queries,
    is_query=True,
)

scores = fast_plaid.search(
    queries_embeddings=torch.tensor(queries_embeddings),
    top_k=3,
)

print(scores)
At LightOn, we are developing generative models, encoders,
ColBERT models and state-of-the-art
RAG pipelines. We released PyLate, an optimized solution for training ColBERT
models on hardware ranging from a single CPU to a multi-GPU node.
In partnership with AnswerAI, LightOn released ModernBERT, a new
state-of-the-art encoder. We later fine-tuned it to
create a state-of-the-art ColBERT model: GTE-ModernColBERT.
LightOn also released Reason-ModernColBERT,
which achieves state-of-the-art results on
the BRIGHT
benchmark. Reason-ModernColBERT outperforms models 45x larger on the gold standard for
reasoning-intensive
retrieval. Both GTE-ModernColBERT and Reason-ModernColBERT were trained with
PyLate
and are compatible with pylate-rs
.
You can find all compatible models on the Hugging Face Hub under the PyLate tag.
For more information, visit the PyLate and pylate-rs repositories on GitHub and leave a ⭐️
if you find them useful!
Raphael Sourty, AI @ LightOn - July 3, 2025
PyLate is being built with my amazing co-maintainer, Antoine Chaffin.