Refactoring all PDF loader and parser #28652

pprados · 2024-12-10T15:23:41Z

WIP

vercel · 2024-12-10T15:24:01Z

The latest updates on your projects.

Name	Status	Updated (UTC)
langchain	✅ Ready	Dec 30, 2024 3:29pm

vercel · 2024-12-11T13:33:38Z

Deployment failed with the following error:

The provided GitHub repository does not contain the requested branch or commit reference. Please ensure the repository is not empty.

efriis · 2024-12-12T00:46:38Z

Hey @pprados! I don't think this work is going to result in a PR that is reviewable if it's only partially done and already adding 7000 lines.

What is your goal in this work?

pprados · 2024-12-13T12:37:15Z

@efriis

I'm well aware of that. That's why a meeting with LangChain and @eyurtsev is being organized (via AXA France), normally next week, to see how best to proceed.

We're sorry; it may take you several hours to review it. The changes are substantial and cannot be published one at a time, because everything is interlinked. It would be difficult to cut the code into 12 successive PRs and end up with the same result, and doing so would take months. All this work is validated by two matrix tests, which ensure that the modifications are consistent with each other.

We understand that it's important to ensure that the changes don't have a significant impact on existing code. To qualify all the code, we worked on a separate, parallel project that uses the langchain-common structure and tested PDF reading before and after the modifications. This lets us compare the results of the historical implementation with the new ones. You'll find all the files here. The only difference is the module name used to import the classes.
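As an illustration only, a before/after comparison of this kind could be sketched as follows. Nothing here is taken from the PR: `compare_page_texts`, the `threshold` value, and the sample page texts are hypothetical, and the sketch assumes each implementation has already produced one text string per page.

```python
from difflib import SequenceMatcher


def compare_page_texts(legacy_pages, new_pages, threshold=0.9):
    """Return (page_index, similarity) for pages whose text diverges.

    Both arguments are lists of page texts. How they are produced
    (which loader, which import path) is outside this sketch.
    """
    if len(legacy_pages) != len(new_pages):
        raise ValueError("page counts differ")
    regressions = []
    for index, (old, new) in enumerate(zip(legacy_pages, new_pages)):
        # ratio() is 1.0 for identical strings and drops toward 0.0
        # as the two texts share fewer characters in order.
        ratio = SequenceMatcher(None, old, new).ratio()
        if ratio < threshold:
            regressions.append((index, ratio))
    return regressions


# Hypothetical page texts standing in for real loader output.
legacy = ["Invoice 42\nTotal: 10 EUR", "Terms and conditions"]
refactored = ["Invoice 42\nTotal: 10 EUR", "Completely different text"]
print(compare_page_texts(legacy, refactored))
```

A similarity score rather than strict equality is used here on the assumption that small whitespace differences between implementations should not count as regressions; the threshold would need tuning on real documents.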

We are preparing the PR and its description. Look here to understand our work. We welcome any suggestions that would help us integrate it.

You can now preview the description. The final version won't be far off.

The aim is to submit the PR in early 2025.

efriis · 2024-12-13T21:47:33Z

ah got it - is there an issue or discussion of proposed changes? It might be easier to discuss ideas than these code changes

pprados · 2024-12-14T08:12:08Z

@efriis

90% of my customers work with PDF files and don't have a satisfactory solution at the moment. They cobble together solutions outside LangChain (PDF processing outside the loaders/parsers), sacrificing a good part of the benefits of the framework. Seeing the same problems over and over again, and the same bad solutions, I couldn't let them go on like that. I had to deal with the problem in the best possible way, for my customers and for all LangChain users.

I've been funded by my client to simultaneously help projects in Belgium, Switzerland, Spain, Italy and France. I couldn't wait for a discussion on the subject. In the end, what I'm proposing is a no-brainer:

Integrate tables where possible
Let users indicate what to do with images (invoke a multimodal LLM, for example)
Standardize the various solutions so that a parser can eventually be selected automatically according to document characteristics
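
To make the last two points concrete, here is a minimal sketch of what characteristic-based parser selection and a pluggable image handler might look like. It is not the PR's design or any LangChain API: `PdfTraits`, `choose_parser`, `render_image`, and the parser labels are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PdfTraits:
    """Hypothetical summary of a document, computed before parsing."""
    has_text_layer: bool
    has_tables: bool
    image_count: int


def choose_parser(traits: PdfTraits) -> str:
    """Return a label for the parser family suited to the document."""
    if not traits.has_text_layer:
        return "ocr"          # scanned pages: text must be recognized
    if traits.has_tables:
        return "table_aware"  # keep table structure in the output
    return "plain_text"


def render_image(image: bytes,
                 describe: Optional[Callable[[bytes], str]] = None) -> str:
    """Turn an embedded image into text, or drop it if no handler is given.

    `describe` is where a call to a multimodal LLM could be plugged in.
    """
    return describe(image) if describe is not None else ""


traits = PdfTraits(has_text_layer=True, has_tables=True, image_count=3)
print(choose_parser(traits))
print(render_image(b"\x89PNG", describe=lambda data: f"[image, {len(data)} bytes]"))
```

Passing the image handler as a callable keeps the choice (ignore, OCR, multimodal LLM) with the caller, which is one way to read "indicate what you want to do with the images".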

efriis · 2024-12-15T21:38:16Z

will let you and eugene discuss in the time you have scheduled.

efriis · 2024-12-19T17:21:30Z

hey @pprados! Would it be ok if we closed this until it's ready for review in January? If you need a PR for tracking changes while you're working, I'd recommend opening one against your own fork's master branch.

Let me know! We're cleaning up our review process with the goal of faster review times and a generally better contributor experience, and one aspect of that is trying to keep the inbox clear.

efriis self-assigned this Dec 12, 2024

efriis assigned eyurtsev and unassigned efriis Dec 16, 2024

pprados force-pushed the pprados/refactor_pdf_loaders branch from de2b623 to 7a0a9ec Compare December 30, 2024 13:24

pprados force-pushed the pprados/refactor_pdf_loaders branch from 7a0a9ec to 8f5d453 Compare December 30, 2024 13:38

pprados force-pushed the pprados/refactor_pdf_loaders branch 2 times, most recently from e80fe2f to ee99b86 Compare December 30, 2024 15:16

Refactoring all PDF loader and parser

24b1add

pprados force-pushed the pprados/refactor_pdf_loaders branch from ee99b86 to 24b1add Compare December 30, 2024 15:21

pprados closed this Dec 30, 2024

pprados deleted the pprados/refactor_pdf_loaders branch December 30, 2024 15:31

pprados restored the pprados/refactor_pdf_loaders branch December 30, 2024 15:32

pprados deleted the pprados/refactor_pdf_loaders branch December 30, 2024 15:33
