feat: Add metadata to ingestion routes #2008
base: main
self,
file_name: str,
file_data: Path,
file_metadata: dict[str, str] | None = None,
I would say it is a better idea to type this as dict[str, Any] | None.
This is something I thought about because of LlamaIndex's restriction on documents: metadata values can only be flat (string, int, float; see https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_documents/#metadata). By using str I made it "foolproof". But let me know if you want Any and I will change it.
I believe it is a better idea to convert the data when we store it in the vector store than to compromise the whole application with this decision. What do you think, @imartinez?
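To make the second option concrete, here is a minimal sketch of coercing metadata at storage time instead of restricting the type at the API boundary. The helper name `coerce_metadata` is hypothetical and is not part of PrivateGPT or LlamaIndex. It assumes that "flat" means str, int, or float, as described in the docs linked above, and it chooses to JSON-encode anything else, which is only one of several possible policies.

```python
from __future__ import annotations

import json
from typing import Any

# Assumption: the value types accepted as flat metadata, per the linked docs.
FLAT_TYPES = (str, int, float)


def coerce_metadata(metadata: dict[str, Any] | None) -> dict[str, str | int | float]:
    """Hypothetical helper: return a copy of `metadata` with only flat values."""
    flat: dict[str, str | int | float] = {}
    for key, value in (metadata or {}).items():
        if isinstance(value, FLAT_TYPES):
            # Already flat (bool also passes, since it subclasses int).
            flat[key] = value
        else:
            # Lists, dicts, None, etc. are serialized to a JSON string.
            flat[key] = json.dumps(value, default=str)
    return flat
```

With a helper like this, the route signature could accept dict[str, Any] | None while the stored document still receives only flat values.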
files = {
    "file": (path.name, path.open("rb")),
    "metadata": (None, json.dumps(metadata)),
}
Same as below. Ideally, I would say it should be something like this:
files = {
    "file": (path.name, path.open("rb")),
    "metadata": metadata
}
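For context on the trade-off: each part of a multipart form carries text or bytes, so a dict has to be serialized somewhere before it is sent and parsed again on the server. The sketch below shows that round trip using only the standard library. `build_multipart_fields` and `parse_metadata_field` are hypothetical names, not PrivateGPT functions, and the `(None, value)` tuple simply mirrors the shape used in the diff above.

```python
from __future__ import annotations

import io
import json
from typing import Any, BinaryIO


def build_multipart_fields(
    name: str, content: BinaryIO, metadata: dict[str, Any] | None
) -> dict[str, tuple]:
    """Hypothetical client-side helper: build the fields for a multipart upload."""
    fields: dict[str, tuple] = {"file": (name, content)}
    if metadata is not None:
        # Serialize the dict so it can travel as a plain text form field.
        fields["metadata"] = (None, json.dumps(metadata))
    return fields


def parse_metadata_field(raw: str | None) -> dict[str, Any] | None:
    """Hypothetical server-side helper: decode the text field back into a dict."""
    if raw is None or raw == "":
        return None
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("metadata must be a JSON object")
    return parsed


# Usage: round-trip a metadata dict through the text form field.
fields = build_multipart_fields("doc.txt", io.BytesIO(b"hello"), {"source": "wiki"})
restored = parse_metadata_field(fields["metadata"][1])
```

Keeping the serialization in one helper would let the tests pass a plain dict, as suggested, while the wire format stays a JSON string.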
Description
This PR adds the ability to specify additional metadata in text and file ingestion routes.
This feature might be useful for people building custom UIs with PrivateGPT who want to attach additional information to document sources, or for people using a custom ingestion pipeline whose readers may use the metadata. It could also be extended to bulk ingestion, but this is a good start.
This is an important part of the project, so I wanted to make sure nothing breaks.
I added two unit tests, one for the text ingestion route and one for the file ingestion route.
I also tested all of the ingestion modes to make sure nothing breaks.
Checklist:
make check; make test to ensure mypy and tests pass