Dataframes Haystack


📃 Description

dataframes-haystack is an extension for Haystack 2 that enables integration with dataframe libraries.

The dataframe libraries currently supported are pandas and polars.

The library offers several custom converter components that transform files and dataframes into Haystack Document objects:

  • DataFrameFileToDocument is the main generic converter: it reads files using a dataframe backend and converts them into Document objects.
  • FileToPandasDataFrame and FileToPolarsDataFrame read files and convert them into dataframes.
  • PandasDataFrameConverter and PolarsDataFrameConverter convert data stored in dataframes into Haystack Document objects.

dataframes-haystack supports reading files in various formats:

  • csv, json, parquet, excel, html, xml, orc, pickle, fixed-width format for pandas. See the pandas documentation for more details.
  • csv, json, parquet, excel, avro, delta, ipc for polars. See the polars documentation for more details.
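Because the file-reading converters in the usage examples take an explicit file_format argument, it can be handy to derive that value from a file extension. The snippet below is a minimal sketch of such a helper. Both guess_file_format and the EXTENSION_TO_FORMAT mapping are hypothetical names, not part of dataframes-haystack, and the sketch assumes the format strings match the names listed above for the formats shared by both backends.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the format names listed above.
# Only formats supported by both pandas and polars are included.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".xlsx": "excel",
}


def guess_file_format(path: str) -> str:
    """Return a format name for `path` based on its extension (illustrative helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_TO_FORMAT:
        raise ValueError(f"No known format for extension {suffix!r}")
    return EXTENSION_TO_FORMAT[suffix]


if __name__ == "__main__":
    print(guess_file_format("data/doc1.csv"))
```

The returned string could then be passed as file_format when constructing a converter, provided the installed version accepts that value.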

🛠️ Installation

# for pandas (pandas is already included in `haystack-ai`)
pip install dataframes-haystack

# for polars
pip install "dataframes-haystack[polars]"

💻 Usage

Tip

See the Example Notebooks for complete examples.

DataFrameFileToDocument

Complete example

You can leverage both pandas and polars backends (thanks to narwhals) to read your data!

from dataframes_haystack.components.converters import DataFrameFileToDocument

converter = DataFrameFileToDocument(content_column="text_str")
documents = converter.run(files=["file1.csv", "file2.csv"])

Result:

>>> documents
{'documents': [
    Document(id=0, content: 'Hello world', meta: {}),
    Document(id=1, content: 'Hello everyone', meta: {})
]}
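To try the DataFrameFileToDocument example locally, the two input files need a text_str column. The following is a minimal sketch that writes such files using only the Python standard library. The write_sample_csvs helper and the one-column layout are assumptions for illustration; real data will usually have more columns.

```python
import csv
from pathlib import Path

# Hypothetical sample data: one "text_str" value per file,
# matching the content_column used in the example.
SAMPLES = {
    "file1.csv": ["Hello world"],
    "file2.csv": ["Hello everyone"],
}


def write_sample_csvs(directory: str = ".") -> list[Path]:
    """Write the sample CSV files into `directory` and return their paths."""
    paths = []
    for filename, texts in SAMPLES.items():
        path = Path(directory) / filename
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["text_str"])  # header row
            writer.writerows([text] for text in texts)  # one row per text
        paths.append(path)
    return paths


if __name__ == "__main__":
    write_sample_csvs()
```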

pandas Converters

Complete example

FileToPandasDataFrame

from dataframes_haystack.components.converters.pandas import FileToPandasDataFrame

converter = FileToPandasDataFrame(file_format="csv")

output_dataframe = converter.run(
    file_paths=["data/doc1.csv", "data/doc2.csv"]
)

Result:

>>> output_dataframe
{'dataframe': <pandas.DataFrame>}

PandasDataFrameConverter

import pandas as pd

from dataframes_haystack.components.converters.pandas import PandasDataFrameConverter

df = pd.DataFrame({
    "text": ["Hello world", "Hello everyone"],
    "filename": ["doc1.txt", "doc2.txt"],
})

converter = PandasDataFrameConverter(content_column="text", meta_columns=["filename"])
documents = converter.run(df)

Result:

>>> documents
{'documents': [
    Document(id=0, content: 'Hello world', meta: {'filename': 'doc1.txt'}),
    Document(id=1, content: 'Hello everyone', meta: {'filename': 'doc2.txt'})
]}
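Conceptually, the converter maps each row to one document: the content_column value becomes the content and the meta_columns values become the metadata. The sketch below illustrates that mapping with plain dictionaries standing in for dataframe rows and Document objects. It is not the library's implementation, and rows_to_documents is a hypothetical name.

```python
from typing import Any, Iterable, Sequence


def rows_to_documents(
    rows: Iterable[dict[str, Any]],
    content_column: str,
    meta_columns: Sequence[str] = (),
) -> list[dict[str, Any]]:
    """Map each row to a dict with 'content' and 'meta' keys (illustrative only)."""
    return [
        {
            "content": row[content_column],
            "meta": {column: row[column] for column in meta_columns},
        }
        for row in rows
    ]


# Same data as the dataframe in the example, expressed row by row.
rows = [
    {"text": "Hello world", "filename": "doc1.txt"},
    {"text": "Hello everyone", "filename": "doc2.txt"},
]
docs = rows_to_documents(rows, content_column="text", meta_columns=["filename"])
```

Columns that are named in neither content_column nor meta_columns are simply left out of the result in this sketch.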

polars Converters

Complete example

FileToPolarsDataFrame

from dataframes_haystack.components.converters.polars import FileToPolarsDataFrame

converter = FileToPolarsDataFrame(file_format="csv")

output_dataframe = converter.run(
    file_paths=["data/doc1.csv", "data/doc2.csv"]
)

Result:

>>> output_dataframe
{'dataframe': <polars.DataFrame>}

PolarsDataFrameConverter

import polars as pl

from dataframes_haystack.components.converters.polars import PolarsDataFrameConverter

df = pl.DataFrame({
    "text": ["Hello world", "Hello everyone"],
    "filename": ["doc1.txt", "doc2.txt"],
})

converter = PolarsDataFrameConverter(content_column="text", meta_columns=["filename"])
documents = converter.run(df)

Result:

>>> documents
{'documents': [
    Document(id=0, content: 'Hello world', meta: {'filename': 'doc1.txt'}),
    Document(id=1, content: 'Hello everyone', meta: {'filename': 'doc2.txt'})
]}

🤝 Contributing

Do you have an idea for a new feature? Did you find a bug that needs fixing?

Feel free to open an issue or submit a PR!

Setup development environment

Requirements: hatch, pre-commit

  1. Clone the repository
  2. Run hatch shell to create and activate a virtual environment
  3. Run pre-commit install to install the pre-commit hooks. The hooks enforce the linting and formatting checks on every commit.

Run tests

  • Linting and formatting checks: hatch run lint:fmt
  • Unit tests: hatch run test-cov-all

✍️ License

dataframes-haystack is distributed under the terms of the MIT license.