Integrate llama-index parsers #51

Closed
redhog opened this issue Oct 2, 2024 · 7 comments · Fixed by #71

Comments

@redhog
Collaborator

redhog commented Oct 2, 2024

Wrap all llama-index parsers to get easy access to a lot of file formats (PDFs, Wikipedia, etc.).

@staru09
Contributor

staru09 commented Oct 4, 2024

  1. As far as I know, llama_index doesn't offer any other parser apart from this.
    Please correct me if I am wrong.

  2. Not sure about other file types, but for PDFs this can be used. Here is the sample code that converts a PDF file into a Markdown file.
    I tested it with this paper and here is the output_file.

If you have any complex PDF files, do test them and let me know the results. If the results are satisfactory, I'll open a PR for this.

@redhog
Collaborator Author

redhog commented Oct 4, 2024

llama-index has SimpleDirectoryReader (https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/), which is part of the open source offering and can read a whole pile of file formats:

.csv - comma-separated values
.docx - Microsoft Word
.epub - EPUB ebook format
.hwp - Hangul Word Processor
.ipynb - Jupyter Notebook
.jpeg, .jpg - JPEG image
.mbox - MBOX email archive
.md - Markdown
.mp3, .mp4 - audio and video
.pdf - Portable Document Format
.png - Portable Network Graphics
.ppt, .pptm, .pptx - Microsoft PowerPoint

It happily slurps up a whole directory tree with subdirectories and all if you ask it. Not sure exactly what it does with audio/video/images. Whisper speech-to-text and OCR?
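To make the idea concrete, here is a minimal sketch of what wrapping the reader as a parser function could look like. The function name `llama_index_directory_parser`, its parameters, and the "return a list of strings" shape are assumptions for illustration, not the project's actual parser API; only `SimpleDirectoryReader(input_dir=..., recursive=...)` and `load_data()` come from llama-index itself.

```python
from typing import Callable, List, Optional


def llama_index_directory_parser(
    input_dir: str,
    recursive: bool = True,
    reader_factory: Optional[Callable[..., object]] = None,
) -> List[str]:
    """Hypothetical wrapper: return the text of every document found under input_dir.

    reader_factory is an illustrative hook (not part of any real API) that lets
    callers or tests substitute another reader class with the same interface.
    """
    if reader_factory is None:
        # Imported lazily so llama-index stays an optional dependency.
        from llama_index.core import SimpleDirectoryReader

        reader_factory = SimpleDirectoryReader

    # recursive=True asks the reader to descend into subdirectories as well.
    reader = reader_factory(input_dir=input_dir, recursive=recursive)

    # load_data() yields Document objects; keep only their text content.
    return [doc.text for doc in reader.load_data()]
```

With llama-index installed, `llama_index_directory_parser("./docs")` would return one string per loaded document. How images, audio, and video are handled depends on the reader's own optional dependencies, so this sketch does not try to answer that question.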

@redhog
Collaborator Author

redhog commented Oct 4, 2024

I'm now thinking that, with the plugin architecture, this could be done in a separate package, "docetl-llama-index-parsers" or some such, if it's too off-topic for the main library.

@AntoineDao

I am also starting to wonder whether a plugin system for file-readers might be a good idea. I appreciate this might be an early optimisation though... 🤔

@redhog
Collaborator Author

redhog commented Oct 5, 2024

@AntoineDao So, I already built a plugin system, and the PR above could easily be moved to a separate repo.

However, the API for parsers is a bit limited, see #72, and it would be good to address that first...

I think parsing and loading are where there is a near-infinite set of libraries that could be useful to integrate with, and doing so directly in this repo would end up very messy. Hence plugins :)

@shreyashankar
Collaborator

I agree that we'll eventually want a plugin system, but it currently feels a bit premature to do so...

If anyone has a concrete use case, please react and/or comment here. We can bump up the priority for this.

@redhog
Collaborator Author

redhog commented Oct 6, 2024

Well, the entrypoint stuff I wrote /is/ a simple plugin system... I think that's good enough for now.
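For readers unfamiliar with the approach, an entry-point-based plugin system can be sketched with the standard library alone. The group name `docetl.parser` and the helper `discover_parsers` below are assumptions for illustration; the thread does not spell out the actual group name or loading code.

```python
from importlib.metadata import entry_points
from typing import Callable, Dict

# Assumed entry-point group name; the real one may differ.
PARSER_GROUP = "docetl.parser"


def discover_parsers(group: str = PARSER_GROUP) -> Dict[str, Callable]:
    """Return a name -> callable mapping for every parser registered under group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a select() method for filtering by group.
        selected = eps.select(group=group)
    else:
        # Older versions return a plain dict keyed by group name.
        selected = eps.get(group, [])

    # ep.load() imports the target module and returns the referenced object.
    return {ep.name: ep.load() for ep in selected}
```

A separate package would then register its parser functions under that group in its own packaging metadata, and the main library would pick them up at runtime without importing the plugin package directly.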

@redhog redhog closed this as completed Oct 6, 2024