Search

Note

Before we can search for documents, we need to upload them to DuckDB. The upload.documents function takes a list of dictionaries, one per document, and stores them in the database.
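As a rough sketch of that upload step, the snippet below prepares three documents whose fields mirror the search results shown on this page. The key, fields, and dtypes arguments are assumptions about the upload.documents signature and should be checked against the upload documentation; upload_sample is a hypothetical helper name.

```python
# Sample documents; field values mirror the search results shown on this page.
documents = [
    {"id": 0, "title": "Hotel California", "style": "rock",
     "date": "1977-02-22", "popularity": 9},
    {"id": 1, "title": "Here Comes the Sun", "style": "rock",
     "date": "1969-06-10", "popularity": 10},
    {"id": 2, "title": "Alive", "style": "electro, punk",
     "date": "2007-11-19", "popularity": 9},
]


def upload_sample(database: str = "ducksearch.duckdb") -> None:
    """Hypothetical helper: upload the sample documents to a DuckDB database."""
    # Imported lazily so the sample data can be inspected without ducksearch.
    from ducksearch import upload

    # Assumption: upload.documents accepts key, fields, documents and dtypes.
    upload.documents(
        database=database,
        key="id",  # field that uniquely identifies a document
        fields=["title", "style", "date", "popularity"],
        documents=documents,
        dtypes={"date": "DATE", "popularity": "INT"},
    )
```

Calling upload_sample() once should make the documents available to the search functions under the same database name.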

Every search function takes the DuckDB database name as its first argument. It must be the same name that was passed to upload.documents, since that is where the documents are stored. Additional parameters control the search behavior: the number of documents to return, the number of documents to score for each query token, the number of parallel jobs, and optional SQL filters.

Documents

Once the documents are uploaded, we can search them with the search.documents function. It returns a list of lists of documents: one inner list per query, in the order the queries were given, with the documents in each list sorted by decreasing BM25 score.

from ducksearch import search

search.documents(
    database="ducksearch.duckdb",
    queries=["daft punk", "rock"],
    top_k=10,
    top_k_token=10_000,
    batch_size=32,
    n_jobs=-1,
)

[
    [
        {
            "id": "2",
            "title": "Alive",
            "style": "electro, punk",
            "date": Timestamp("2007-11-19 00:00:00"),
            "popularity": 9,
            "score": 0.16131360828876495,
        }
    ],
    [
        {
            "id": "1",
            "title": "Here Comes the Sun",
            "style": "rock",
            "date": Timestamp("1969-06-10 00:00:00"),
            "popularity": 10,
            "score": 0.09199773520231247,
        },
        {
            "id": "0",
            "title": "Hotel California",
            "style": "rock",
            "date": Timestamp("1977-02-22 00:00:00"),
            "popularity": 9,
            "score": 0.07729987800121307,
        },
    ],
]
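Because the result is a list of lists aligned with the queries, it can be walked with zip. The sketch below uses a trimmed copy of the output above as plain data (only id, title, and score are kept); best_hits is a hypothetical helper, not part of ducksearch.

```python
queries = ["daft punk", "rock"]

# Trimmed copy of the output shown above: one inner list per query.
results = [
    [{"id": "2", "title": "Alive", "score": 0.16131360828876495}],
    [
        {"id": "1", "title": "Here Comes the Sun", "score": 0.09199773520231247},
        {"id": "0", "title": "Hotel California", "score": 0.07729987800121307},
    ],
]


def best_hits(queries, results):
    """Map each query to the title of its top document, or None if it has no hits."""
    return {
        query: (docs[0]["title"] if docs else None)
        for query, docs in zip(queries, results)
    }


top = best_hits(queries, results)

# Print every hit with its rank within its own query.
for query, docs in zip(queries, results):
    for rank, doc in enumerate(docs, start=1):
        print(f"{query!r} #{rank}: {doc['title']} ({doc['score']:.4f})")
```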
Info

Searches run in parallel, with the number of jobs set by the n_jobs parameter. The top_k parameter sets the number of documents returned for each query, and top_k_token sets the number of documents scored for each query token. Reducing top_k_token can further speed up the search but may lower the quality of the results.
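To make the trade-off concrete, here is a toy model of per-token pruning. It is not ducksearch's implementation: the scores are invented, and toy_search is a hypothetical function that only illustrates how keeping fewer candidates per token can change the final ranking.

```python
from collections import defaultdict

# Invented per-token scores: token -> {document id: score}.
token_scores = {
    "daft": {"a": 0.9, "b": 0.4, "c": 0.1},
    "punk": {"b": 0.8, "c": 0.7, "a": 0.2},
}


def toy_search(query_tokens, top_k, top_k_token):
    """Sum per-token scores, keeping only the best top_k_token documents per token."""
    totals = defaultdict(float)
    for token in query_tokens:
        scored = sorted(
            token_scores.get(token, {}).items(),
            key=lambda item: item[1],
            reverse=True,
        )
        # Documents beyond the per-token cutoff contribute nothing for this token.
        for doc_id, score in scored[:top_k_token]:
            totals[doc_id] += score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]


full = toy_search(["daft", "punk"], top_k=2, top_k_token=3)
pruned = toy_search(["daft", "punk"], top_k=2, top_k_token=1)
```

With every candidate kept, document "b" ranks first because it scores reasonably on both tokens. With top_k_token=1, "b" loses its contribution from "daft" and "a" moves ahead, which is the kind of quality loss a small top_k_token can cause.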

Filters

We can restrict a search with the filters parameter. A filter is a SQL expression over the document fields, and only documents that satisfy it appear in the results.

from ducksearch import search

search.documents(
    database="ducksearch.duckdb",
    queries=["rock", "california"],
    top_k=10,
    top_k_token=10_000,
    batch_size=32,
    filters="YEAR(date) <= 1990 AND YEAR(date) >= 1970",
    n_jobs=-1,
)

[
    [
        {
            "score": 0.07729987800121307,
            "id": "0",
            "title": "Hotel California",
            "style": "rock",
            "date": Timestamp("1977-02-22 00:00:00"),
            "popularity": 9,
        }
    ],
    [
        {
            "score": 0.16131360828876495,
            "id": "0",
            "title": "Hotel California",
            "style": "rock",
            "date": Timestamp("1977-02-22 00:00:00"),
            "popularity": 9,
        }
    ],
]
Info

The filters are evaluated by DuckDB, so all DuckDB functions are available for use in the filters. You can find more information about DuckDB functions in the DuckDB documentation.
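Since a filter is a plain SQL string, several conditions can be combined before the call. The sketch below uses a hypothetical helper, and_filters, that is not part of ducksearch; the popularity condition is only an illustration based on the fields shown on this page.

```python
def and_filters(*conditions: str) -> str:
    """Join non-empty SQL conditions with AND, parenthesizing each one."""
    parts = [f"({c.strip()})" for c in conditions if c and c.strip()]
    return " AND ".join(parts)


# Same date range as the example above, plus a condition on popularity.
filters = and_filters("YEAR(date) BETWEEN 1970 AND 1990", "popularity >= 9")
print(filters)
```

The resulting string can be passed as filters=filters to search.documents. Build filter strings only from trusted values, since the expression is handed to DuckDB as SQL.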