Integrate llama-index parsers #51

redhog · 2024-10-02T22:53:42Z

Wrap all llama-index parsers to get easy access to a lot of file formats (PDFs, Wikipedia, etc.).

staru09 · 2024-10-04T14:59:51Z

As far as I know llama_index doesn't offer any other parser apart from this.
Please correct me if I am wrong.
I'm not sure about other file types, but for PDF this can be used. Here is the sample code that converts a PDF file into a Markdown file.
I tested it with this paper and here is the output_file.

If you have any complex PDF file, do test it and let me know the results. If the results are satisfactory I'll add a PR for this.

redhog · 2024-10-04T17:59:43Z

llama-index has https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/ that is part of the open source offerings, and it can read a whole pile of file formats:

.csv - comma-separated values
.docx - Microsoft Word
.epub - EPUB ebook format
.hwp - Hangul Word Processor
.ipynb - Jupyter Notebook
.jpeg, .jpg - JPEG image
.mbox - MBOX email archive
.md - Markdown
.mp3, .mp4 - audio and video
.pdf - Portable Document Format
.png - Portable Network Graphics
.ppt, .pptm, .pptx - Microsoft PowerPoint

It happily slurps up a whole directory tree with subdirectories and all if you ask it. Not sure exactly what it does with audio/video/images. Whisper speech-to-text and OCR?
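For reference, reading a directory tree this way might look like the minimal sketch below. `SimpleDirectoryReader`, its `input_dir` and `recursive` arguments, and `load_data()` come from llama-index; the helper names `documents_to_records` and `read_directory`, and the choice of a flat dict per document, are assumptions made for illustration and are not docetl's actual parser API.

```python
from typing import Any, Dict, Iterable, List


def documents_to_records(documents: Iterable[Any]) -> List[Dict[str, Any]]:
    """Flatten llama-index style documents into plain dicts.

    Each document is assumed to expose `.text` and an optional
    `.metadata` dict, as llama-index Document objects do.
    """
    records = []
    for doc in documents:
        # Copy the metadata so the original document is not mutated.
        record = dict(getattr(doc, "metadata", None) or {})
        record["text"] = doc.text
        records.append(record)
    return records


def read_directory(path: str, recursive: bool = True) -> List[Dict[str, Any]]:
    """Hypothetical wrapper: load every supported file under `path`."""
    # Imported lazily so documents_to_records works without llama-index installed.
    from llama_index.core import SimpleDirectoryReader

    reader = SimpleDirectoryReader(input_dir=path, recursive=recursive)
    return documents_to_records(reader.load_data())
```

Calling `read_directory("some/folder")` would then return one record per loaded document, with the file metadata next to the extracted text. Which extra dependencies are needed depends on the file types involved.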

redhog · 2024-10-04T18:00:38Z

I'm thinking now with the plugin arch this could be done in a separate package "docetl-llama-index-parsers" or some such if it's too off topic for the main library.

AntoineDao · 2024-10-05T20:27:05Z

I am also starting to wonder whether a plugin system for file-readers might be a good idea. I appreciate this might be an early optimisation though... 🤔

redhog · 2024-10-05T20:34:23Z

@AntoineDao So, I already built a plugin system, and the PR above could easily be moved to a separate repo.

However, the API for parsers is a bit limited, see #72, and it would be good to address that first...

I think parsing and loading is an area with a near-infinite set of libraries that could be useful to integrate with, and doing so directly in this repo would get very messy. Hence plugins :)

shreyashankar · 2024-10-05T23:42:17Z

I agree that we'll eventually want a plugin system, but it currently feels a bit premature to do so...

If anyone has a concrete use case, please react and/or comment here. We can bump up the priority for this.

redhog · 2024-10-06T07:50:28Z

Well, the entrypoint stuff I wrote /is/ a simple plugin system... I think that's good enough for now.
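To illustrate what an entry-point based plugin system generally looks like, here is a rough sketch using only the standard library. The group name `docetl.parser` and the function `load_parser_plugins` are hypothetical, chosen for this example; the actual group name and loading code in the PR may differ.

```python
from importlib.metadata import entry_points
from typing import Callable, Dict

# Hypothetical entry-point group name, used here only for illustration.
PARSER_GROUP = "docetl.parser"


def load_parser_plugins(group: str = PARSER_GROUP) -> Dict[str, Callable]:
    """Discover parser callables that installed packages registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() method.
        selected = eps.select(group=group)
    else:
        # Older Pythons return a plain dict keyed by group name.
        selected = eps.get(group, [])
    # ep.load() imports the target module and returns the referenced object.
    return {ep.name: ep.load() for ep in selected}
```

A separate package (say, the "docetl-llama-index-parsers" idea above) would register its parser functions under the same group in its packaging metadata, and they would show up in the returned dict once that package is installed, with no changes to the main repo.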

redhog mentioned this issue Oct 5, 2024

Added llama-index based parsers #71

Merged

redhog closed this as completed Oct 6, 2024
