Integrate llama-index parsers #51
Comments
If you have any complex PDF files, please test it and let me know the results. If the results are satisfactory, I'll open a PR for this.
llama-index has SimpleDirectoryReader (https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/) as part of its open source offerings, and it can read a whole pile of file formats.
It happily slurps up a whole directory tree with subdirectories and all if you ask it. Not sure exactly what it does with audio/video/images. Whisper speech-to-text and OCR?
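For anyone who wants to try this, here is a minimal sketch of what wrapping SimpleDirectoryReader could look like. The function name `llama_index_directory_parser` and its "path in, list of strings out" shape are assumptions made for illustration, not the parser API of this repo. It also assumes llama-index is installed (`pip install llama-index`), which is why the import sits inside the function.

```python
from typing import List


def llama_index_directory_parser(path: str, recursive: bool = True) -> List[str]:
    """Hypothetical wrapper: read every supported file under `path` as text.

    The signature is an assumption for illustration only; it is not the
    parser interface defined by this project.
    """
    # Imported lazily so that merely defining this function does not
    # require llama-index to be installed.
    from llama_index.core import SimpleDirectoryReader

    # recursive=True makes the reader descend into subdirectories.
    reader = SimpleDirectoryReader(input_dir=path, recursive=recursive)
    documents = reader.load_data()

    # Each loaded Document carries its text plus metadata; keep only the text.
    return [doc.get_content() for doc in documents]
```

Calling `llama_index_directory_parser("./docs")` would return one string per loaded document. Note that some formats are split into several documents (for example, one per PDF page), so the list can be longer than the number of files.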
I'm thinking that now, with the plugin architecture, this could be done in a separate package, "docetl-llama-index-parsers" or some such, if it's too off topic for the main library.
I am also starting to wonder whether a plugin system for file-readers might be a good idea. I appreciate this might be an early optimisation though... 🤔
@AntoineDao So, I already built a plugin system, and the PR above could easily be moved to a separate repo. However, the API for parsers is a bit limited (see #72), and it would be good to address that first... I think parsers and loaders are where there is a near-infinite set of libraries that could be useful to integrate with, and doing so directly in this repo would end up very messy. Hence plugins :)
I agree that we'll eventually want a plugin system, but it currently feels a bit premature to build one... If anyone has a concrete use case, please react and/or comment here, and we can bump up the priority for this.
Well, the entrypoint stuff I wrote /is/ a simple plugin system... I think that's good enough for now.
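To make the entry-point idea concrete, here is a rough sketch of how parsers registered by separate packages could be discovered with the standard library. The group name `docetl.parser` is a placeholder chosen for illustration; the group this repo actually uses may differ.

```python
from importlib.metadata import entry_points


def discover_parsers(group: str = "docetl.parser") -> dict:
    """Return {name: entry point} for everything registered under `group`.

    The default group name is a hypothetical placeholder, not necessarily
    the one this project registers its parsers under.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry_points() returns an object with .select().
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
        selected = eps.get(group, [])
    # Entry points are returned unloaded; call .load() on one to import it.
    return {ep.name: ep for ep in selected}
```

A separate package would then declare its parsers under that group in its own packaging metadata, and the main library would call `discover_parsers()[name].load()` only when a pipeline asks for that parser. That keeps heavy dependencies like llama-index out of the core install.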
Wrap all llama-index parsers to get easy access to a lot of file formats (PDFs, Wikipedia, etc.)