dataframes-haystack is an extension for Haystack 2 that enables integration with dataframe libraries.
The dataframe libraries currently supported are:
- pandas
- polars
The library offers several custom converter components to transform dataframes into Haystack `Document` objects:
- `DataFrameFileToDocument` is the main, generic converter: it reads files using a dataframe backend and converts them into `Document` objects.
- `FileToPandasDataFrame` and `FileToPolarsDataFrame` read files and convert them into dataframes.
- `PandasDataFrameConverter` and `PolarsDataFrameConverter` convert data stored in dataframes into Haystack `Document` objects.
dataframes-haystack supports reading files in various formats:
- csv, json, parquet, excel, html, xml, orc, pickle, fixed-width format for pandas. See the pandas documentation for more details.
- csv, json, parquet, excel, avro, delta, ipc for polars. See the polars documentation for more details.
# for pandas (pandas is already included in `haystack-ai`)
pip install dataframes-haystack
# for polars
pip install "dataframes-haystack[polars]"
Tip: See the Example Notebooks for complete examples.
You can leverage both pandas and polars backends (thanks to narwhals) to read your data!
from dataframes_haystack.components.converters import DataFrameFileToDocument
converter = DataFrameFileToDocument(content_column="text_str")
documents = converter.run(files=["file1.csv", "file2.csv"])
Result:
>>> documents
{'documents': [
Document(id=0, content: 'Hello world', meta: {}),
Document(id=1, content: 'Hello everyone', meta: {})
]}
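If you are building an indexing pipeline, the converter can be plugged into a standard Haystack 2 Pipeline. The following is a minimal sketch, not taken from the library docs: it assumes the component's input is named files and its output socket is named documents, consistent with the run() call and result shown above.

# Sketch: use DataFrameFileToDocument inside a Haystack indexing pipeline.
# Assumes a "files" input and a "documents" output, as suggested by the example above.
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

from dataframes_haystack.components.converters import DataFrameFileToDocument

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", DataFrameFileToDocument(content_column="text_str"))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter.documents", "writer.documents")

pipeline.run({"converter": {"files": ["file1.csv", "file2.csv"]}})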
from dataframes_haystack.components.converters.pandas import FileToPandasDataFrame
converter = FileToPandasDataFrame(file_format="csv")
output_dataframe = converter.run(
file_paths=["data/doc1.csv", "data/doc2.csv"]
)
Result:
>>> output_dataframe
{'dataframe': <pandas.DataFrame>}
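The returned dataframe can then be handed to PandasDataFrameConverter (shown in the next example) to produce Document objects. A small sketch, assuming the CSV files contain a "text" column to use as content:

# Sketch: chain the two pandas components manually.
# Assumes the CSV files have a "text" column (adjust content_column to your data).
from dataframes_haystack.components.converters.pandas import (
    FileToPandasDataFrame,
    PandasDataFrameConverter,
)

file_converter = FileToPandasDataFrame(file_format="csv")
df = file_converter.run(file_paths=["data/doc1.csv", "data/doc2.csv"])["dataframe"]

doc_converter = PandasDataFrameConverter(content_column="text")
documents = doc_converter.run(df)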
import pandas as pd
from dataframes_haystack.components.converters.pandas import PandasDataFrameConverter
df = pd.DataFrame({
"text": ["Hello world", "Hello everyone"],
"filename": ["doc1.txt", "doc2.txt"],
})
converter = PandasDataFrameConverter(content_column="text", meta_columns=["filename"])
documents = converter.run(df)
Result:
>>> documents
{'documents': [
Document(id=0, content: 'Hello world', meta: {'filename': 'doc1.txt'}),
Document(id=1, content: 'Hello everyone', meta: {'filename': 'doc2.txt'})
]}
from dataframes_haystack.components.converters.polars import FileToPolarsDataFrame
converter = FileToPolarsDataFrame(file_format="csv")
output_dataframe = converter.run(
file_paths=["data/doc1.csv", "data/doc2.csv"]
)
Result:
>>> output_dataframe
{'dataframe': <polars.DataFrame>}
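The same component can read the other polars formats listed above by changing file_format. For example, a sketch assuming "parquet" is an accepted value:

# Sketch: reading parquet files with the polars backend.
# Assumes "parquet" is a valid file_format value, per the format list above.
from dataframes_haystack.components.converters.polars import FileToPolarsDataFrame

converter = FileToPolarsDataFrame(file_format="parquet")
output_dataframe = converter.run(file_paths=["data/doc1.parquet"])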
import polars as pl
from dataframes_haystack.components.converters.polars import PolarsDataFrameConverter
df = pl.DataFrame({
"text": ["Hello world", "Hello everyone"],
"filename": ["doc1.txt", "doc2.txt"],
})
converter = PolarsDataFrameConverter(content_column="text", meta_columns=["filename"])
documents = converter.run(df)
Result:
>>> documents
{'documents': [
Document(id=0, content: 'Hello world', meta: {'filename': 'doc1.txt'}),
Document(id=1, content: 'Hello everyone', meta: {'filename': 'doc2.txt'})
]}
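The resulting documents can be written to any Haystack document store. A minimal sketch using the in-memory store that ships with haystack-ai:

# Sketch: store the converted documents in an in-memory document store.
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
document_store.write_documents(documents["documents"])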
Do you have an idea for a new feature? Did you find a bug that needs fixing?
Feel free to open an issue or submit a PR!
Requirements: hatch, pre-commit
- Clone the repository
- Run `hatch shell` to create and activate a virtual environment
- Run `pre-commit install` to install the pre-commit hooks. This enforces the linting and formatting checks.
- Linting and formatting checks: `hatch run lint:fmt`
- Unit tests: `hatch run test-cov-all`
dataframes-haystack is distributed under the terms of the MIT license.