Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring all PDF loader and parser #28652

Closed

Conversation

pprados
Copy link
Contributor

@pprados pprados commented Dec 10, 2024

WIP

Copy link

vercel bot commented Dec 10, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Dec 30, 2024 3:29pm

Copy link

vercel bot commented Dec 11, 2024

Deployment failed with the following error:

The provided GitHub repository does not contain the requested branch or commit reference. Please ensure the repository is not empty.

@efriis
Copy link
Member

efriis commented Dec 12, 2024

Hey @pprados! I don't think this work is going to result in a PR that is reviewable if it's only partially done and already adding 7000 lines.

What is your goal in this work?

@pprados
Copy link
Contributor Author

pprados commented Dec 13, 2024

@efriis

I'm well aware of that. That's why a meeting is to be organized with LangChain (via AXA France), normally next week, to see how best to proceed, with @eyurtsev.

We're sorry, it may take you several hours to validate it. The changes are important and cannot be published one after the other, as everything is linked. It's going to be difficult to cut the code into 12 successive PRs, and end up with the same result. And that's going to take months. All this work is validated by two matrix tests, ensuring the consistency of all modifications.

In order to qualify all the code, we worked on a separate project, using the langchain-common structure. In this way, we can compare the results of the historical implementation with the new ones.

We understand that it's important to ensure that changes don't have a significant impact on existing code. That's why we used a parallel project, using the langchain-common structure, to test PDF readings before and after modifications. This allows us to compare results. You'll find all the files here. The only difference is the name to import classes.

We prepare the PR and its description. Look here to understand our work. We welcome any suggestions you may have to help us integrate it.

You can now pre-view the description. The final version won't be far off.

The aim is to submit the PR in early 2025.

@efriis
Copy link
Member

efriis commented Dec 13, 2024

ah got it - is there an issue or discussion of proposed changes? It might be easier to discuss ideas than these code changes

@pprados
Copy link
Contributor Author

pprados commented Dec 14, 2024

@efriis
90% of my customers work with PDF files and don't have a satisfactory solution at the moment. They cobble together solutions outside langchain (pdf processing outside loaders/parsers), sacrificing a good part of the benefits of this framework. Seeing the same problems over and over again, and the same bad solutions, I couldn't let them go on like that. I had to deal with the problem in the best possible way, for my customers and all LangChain users.

I've been funded by my client to simultaneously help projects in Belgium, Switzerland, Spain, Italy and France. I couldn't wait for a discussion on the subject. In the end, what I'm proposing is a no-brainer:

  • Integrate tables where possible
  • to indicate what you want to do with the images (invoke a multimodal LLM, for example)
  • standardize the various solutions to eventually enable automatic selection of the parser according to document characteristics

@efriis
Copy link
Member

efriis commented Dec 15, 2024

will let you and eugene discuss in the time you have scheduled.

@efriis
Copy link
Member

efriis commented Dec 19, 2024

hey @pprados! Would it be ok if we closed this until it was ready for review in january? If you need a PR for tracking changes while you're working, would recommend opening one against your own fork's master branch

Let me know! We're cleaning up our review process in target of faster review times and generally improved contributor experiences, and one of the aspects of it is trying to keep the inbox clear.

@pprados pprados force-pushed the pprados/refactor_pdf_loaders branch from ee99b86 to 24b1add Compare December 30, 2024 15:21
@pprados pprados closed this Dec 30, 2024
@pprados pprados deleted the pprados/refactor_pdf_loaders branch December 30, 2024 15:31
@pprados pprados restored the pprados/refactor_pdf_loaders branch December 30, 2024 15:32
@pprados pprados deleted the pprados/refactor_pdf_loaders branch December 30, 2024 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Closed
Development

Successfully merging this pull request may close these issues.

3 participants