Upload
When working with DuckSearch, the first step is to upload documents to DuckDB using the upload.documents function. The documents are stored in a DuckDB database, and the selected fields are indexed with BM25. DuckSearch does not re-index a document that already exists in the database; when new documents are uploaded, the index is updated to include them.
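The skip-if-present behaviour described above can be pictured with a small, library-free sketch. This is only a conceptual illustration and not DuckSearch's implementation: the helper name split_new_documents and the in-memory set of existing keys are assumptions made for the example.

```python
def split_new_documents(documents, existing_keys, key="id"):
    """Partition documents into (new, skipped) based on keys already stored."""
    seen = set(existing_keys)
    new, skipped = [], []
    for doc in documents:
        if doc[key] in seen:
            skipped.append(doc)  # already stored: would not be re-indexed
        else:
            new.append(doc)
            seen.add(doc[key])  # also guards against duplicates within the batch
    return new, skipped


existing = {0, 1}  # keys assumed to be in the database already
batch = [
    {"id": 1, "title": "Here Comes the Sun"},
    {"id": 2, "title": "Alive"},
    {"id": 2, "title": "Alive"},
]
new_docs, skipped_docs = split_new_documents(batch, existing)
print([d["id"] for d in new_docs])      # [2]
print([d["id"] for d in skipped_docs])  # [1, 2]
```

In practice this means an upload script can be re-run on an overlapping batch: only documents with unseen keys would add work to the index.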
Upload documents
The following example demonstrates how to upload a list of documents:
from ducksearch import upload

documents = [
    {
        "id": 0,
        "title": "Hotel California",
        "style": "rock",
        "date": "1977-02-22",
        "popularity": 9,
    },
    {
        "id": 1,
        "title": "Here Comes the Sun",
        "style": "rock",
        "date": "1969-06-10",
        "popularity": 10,
    },
    {
        "id": 2,
        "title": "Alive",
        "style": "electro, punk",
        "date": "2007-11-19",
        "popularity": 9,
    },
]

upload.documents(
    database="ducksearch.duckdb",
    key="id",  # unique document identifier
    fields=["title", "style", "date", "popularity"],  # list of fields to index
    documents=documents,
    stopwords="english",
    stemmer="porter",
    lower=True,
    strip_accents=True,
    dtypes={
        "date": "DATE",
        "popularity": "INT",
    },
)
Info
stopwords: List of stop words to filter. Defaults to 'english', a pre-defined list of 571 English stopwords.
stemmer: Stemmer to use. Defaults to 'porter' for the Porter stemmer. Possible values are: 'arabic', 'basque', 'catalan', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hindi', 'hungarian', 'indonesian', 'irish', 'italian', 'lithuanian', 'nepali', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'serbian', 'spanish', 'swedish', 'tamil', 'turkish', or None if no stemming is to be used.
lower: Whether to convert the text to lowercase. Defaults to True.
strip_accents: Whether to strip accents from the text. Defaults to True.
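To make the lower, strip_accents, and stopwords options more concrete, here is a rough, library-free sketch of what such preprocessing typically does to a piece of text. It is an illustration under assumptions, not DuckSearch's actual pipeline: the normalize function and the tiny STOPWORDS set are hypothetical, and stemming is left out.

```python
import re
import unicodedata

# Tiny illustrative stop word list; the 'english' option uses a much larger one.
STOPWORDS = {"the", "a", "an", "of", "to", "and"}


def normalize(text, lower=True, strip_accents=True, stopwords=STOPWORDS):
    """Return the tokens that would remain after basic text normalization."""
    if strip_accents:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if lower:
        text = text.lower()
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t not in stopwords]


print(normalize("Here Comes the Sun"))  # ['here', 'comes', 'sun']
print(normalize("Café de Flore"))       # ['cafe', 'de', 'flore']
```

With normalization of this kind, "Café" in a document and "cafe" in a query end up as the same token, which is usually what a search index wants.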
HuggingFace
The upload.documents function can also index HuggingFace datasets directly from a URL.
The following example demonstrates how to index the FineWeb dataset from HuggingFace:
from ducksearch import upload

upload.documents(
    database="fineweb.duckdb",
    key="id",
    fields=["text", "url", "date", "language", "token_count", "language_score"],
    documents="https://huggingface.co/datasets/HuggingFaceFW/fineweb/resolve/main/sample/10BT/000_00000.parquet",
    dtypes={
        "date": "DATE",
        "token_count": "INT",
        "language_score": "FLOAT",
    },
    limit=1000,  # demonstrate with a small dataset
)
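Both calls above pass a dtypes mapping. When the documents are an in-memory list, it can be useful to check beforehand that the values can actually be read as the declared types. The following is a minimal sketch of such a pre-check; the find_dtype_errors helper and the CASTERS table are hypothetical and not part of DuckSearch, and only the three type names used on this page are covered.

```python
from datetime import date

# Hypothetical mapping from the type names used above to Python casts.
CASTERS = {
    "DATE": date.fromisoformat,
    "INT": int,
    "FLOAT": float,
}


def find_dtype_errors(documents, dtypes, key="id"):
    """Return (key, field, value) triples for values that fail the declared type."""
    errors = []
    for doc in documents:
        for field, dtype in dtypes.items():
            if field not in doc or doc[field] is None:
                continue  # missing values are left for the database to handle
            try:
                CASTERS[dtype](str(doc[field]))
            except ValueError:
                errors.append((doc[key], field, doc[field]))
    return errors


documents = [
    {"id": 0, "date": "1977-02-22", "popularity": 9},
    {"id": 1, "date": "1969-13-10", "popularity": "ten"},
]
errors = find_dtype_errors(documents, {"date": "DATE", "popularity": "INT"})
print(errors)  # [(1, 'date', '1969-13-10'), (1, 'popularity', 'ten')]
```

Python's casts are only an approximation of how DuckDB parses values, so a clean result here is a sanity check rather than a guarantee.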