documents
Upload documents to DuckDB, create necessary schema, and index using BM25.
Parameters
-
database (str)
Name of the DuckDB database.
-
key (str)
Key identifier for the documents. The key will be renamed to id in the database.
-
fields (str | list[str])
List of fields to upload from each document. If a single field is provided as a string, it will be converted to a list.
-
documents (list[dict] | str)
Documents to upload. Can be a list of dictionaries or a Hugging Face (HF) URL string pointing to a dataset.
-
k1 (float) – defaults to 1.5
BM25 k1 parameter, controls term saturation.
-
b (float) – defaults to 0.75
BM25 b parameter, controls document length normalization.
-
stemmer (str) – defaults to porter
Stemming algorithm to use. One of 'arabic', 'basque', 'catalan', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hindi', 'hungarian', 'indonesian', 'irish', 'italian', 'lithuanian', 'nepali', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'serbian', 'spanish', 'swedish', 'tamil', 'turkish', or 'none' to disable stemming.
-
stopwords (str | list[str]) – defaults to None
List of stopwords to exclude from indexing. Can be a custom list or a language string.
-
ignore (str) – defaults to (\.|[^a-z])+
Regular expression pattern matching the characters to ignore when indexing. By default, punctuation and non-alphabetic characters are ignored.
-
strip_accents (bool) – defaults to True
Whether to remove accents from characters during indexing.
-
lower (bool) – defaults to True
-
batch_size (int) – defaults to 30000
Number of documents to process per batch.
-
n_jobs (int) – defaults to -1
Number of parallel jobs to use for uploading documents. By default, all available processors are used.
-
dtypes (dict[str, str] | None) – defaults to None
-
config (dict | None) – defaults to None
Optional configuration dictionary for the DuckDB connection and other settings.
-
limit (int | None) – defaults to None
-
tqdm_bar (bool) – defaults to True
Whether to display a progress bar when uploading documents.
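
The text-processing and BM25 parameters above can be easier to reason about with a small standalone sketch. The code below does not call this library or DuckDB. The helper names normalize and bm25_term_score are hypothetical, the order of the clean-up steps is an assumption, and the scoring function is a common textbook form of Okapi BM25 that may differ in detail from what the database computes.

```python
import math
import re
import unicodedata


def normalize(text, ignore=r"(\.|[^a-z])+", strip_accents=True, lower=True):
    """Hypothetical helper: mimic the effect of lower, strip_accents and ignore."""
    if lower:
        text = text.lower()
    if strip_accents:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Each run of characters matched by `ignore` is treated as a separator.
    return re.sub(ignore, " ", text).split()


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.5, b=0.75):
    """Hypothetical helper: textbook BM25 contribution of one term to one document."""
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # b = 0 disables length normalization, b = 1 applies it fully.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    # k1 bounds how much repeated occurrences of a term can add to the score.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


print(normalize("Café. Crème-Brûlée 2024!"))  # ['cafe', 'creme', 'brulee']

# Term saturation: going from 1 to 2 occurrences adds more than going from 19 to 20.
stats = dict(doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=50)
early_gain = bm25_term_score(tf=2, **stats) - bm25_term_score(tf=1, **stats)
late_gain = bm25_term_score(tf=20, **stats) - bm25_term_score(tf=19, **stats)
print(early_gain > late_gain)
```

In this sketch the default ignore pattern only keeps the characters a to z, so with lower set to False any uppercase letter would be treated as a separator. That interaction is a property of the sketch and is worth checking against the real index before changing lower or ignore independently.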