MathpixPDFLoader never finishes #11249

fincha opened this issue Sep 30, 2023 · 6 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)

Comments

@fincha

fincha commented Sep 30, 2023

System Info

langchain==0.0.305
MacOS M1 Silicon

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

I have a basic method for loading a PDF via Mathpix:

from langchain.document_loaders import MathpixPDFLoader

def load_pdf(path):
    # Send the PDF to Mathpix and return the parsed documents
    loader = MathpixPDFLoader(path)
    pages = loader.load()
    return pages

pages = load_pdf("example_data/DOF-230519AFYXBECVECGY-0028747724.pdf")

But when I run this, I get only this output:

Status: None, waiting for processing to complete
Status: None, waiting for processing to complete
Status: None, waiting for processing to complete
Status: None, waiting for processing to complete
Status: None, waiting for processing to complete
Status: None, waiting for processing to complete
Status: None, waiting for processing to complete
...

In the Mathpix backend I can see that the document has already been parsed.

Any idea on this?

Expected behavior

Expect the document to return.

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Sep 30, 2023
@dosubot

dosubot bot commented Sep 30, 2023

🤖

Hello,

Thank you for reaching out and providing detailed information about your issue. Based on the context provided, it seems like there was a similar issue reported in the past where the MathpixPDFLoader was not working as expected. The issue was due to a typo in the processed_file_format variable in the __init__ function of the MathpixPDFLoader class. The variable was set to "mmd" by default, which was causing the issue. Changing it to "md" resolved the issue. This issue was fixed in LangChain version 0.0.305. You can find more details about this issue here.

However, since you're using LangChain version 0.0.305, this should not be the problem. Another possible reason could be a timeout error. The wait_for_processing function in the MathpixPDFLoader waits for a maximum of max_wait_time_seconds for the processing to complete. If the processing is not completed within this time, a TimeoutError is raised. If the document appears to be already parsed in the Mathpix backend but the loader is still waiting, it's possible that the processing is taking longer than max_wait_time_seconds. You might want to consider increasing this value if possible, or checking if there are any issues with the Mathpix backend that could be causing delays.
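The polling behaviour described above can be sketched roughly as follows. This is an illustration only, not the actual LangChain implementation: the function name `wait_until_done`, the injected `fetch_status` callable, and the `poll_interval_seconds` and `sleep` parameters are hypothetical. Only the `max_wait_time_seconds` name, the "completed" status, and the log message come from this thread.

```python
import time
from typing import Callable, Optional


def wait_until_done(
    fetch_status: Callable[[], Optional[str]],
    max_wait_time_seconds: float = 500,
    poll_interval_seconds: float = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll fetch_status() until it returns "completed" or the time budget runs out.

    fetch_status is a hypothetical callable standing in for the HTTP request
    that asks the service for the processing status of a PDF.
    """
    waited = 0.0
    while waited < max_wait_time_seconds:
        status = fetch_status()
        if status == "completed":
            return
        # Any other value (including None) is treated as "still processing".
        print(f"Status: {status}, waiting for processing to complete")
        sleep(poll_interval_seconds)
        waited += poll_interval_seconds
    raise TimeoutError(
        f"Processing did not complete within {max_wait_time_seconds} seconds"
    )
```

Note that a loop of this shape treats a status of None like any other unfinished status, so a response that never carries a status at all ends in a TimeoutError after the full wait rather than in an immediate error.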

If none of the above solutions work, could you please provide more information about the document you're trying to process? For example, the size of the document, the complexity of the content, etc. This information could help us better understand the issue and find a solution.

Best,
Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@fincha
Author

fincha commented Sep 30, 2023

71681972252-1.pdf

I am parsing this PDF. The timeout is not the issue: the default is 500 seconds, and the result is visible in the Mathpix backend:



{
  "text": "\\section{Invoice}\n\nInvoice No: 6382196379516891971\nFor questions go to\n\nhttp://support.microsoft.com/ and choose your product for support.\n\nDate: \\(10 / 06 / 2023\\)\n\n\\title{\nGermany\n}\n\n\\begin{tabular}{|c|c|c|c|}\n\\hline Item & Qty & Amount & \\\\\n\\hline \\multirow[t]{3}{*}{ Xbox Game Pass Ultimate } & \\multirow[t]{3}{*}{1,0} & Total Pre-Tax Charges & 10,92 \\\\\n\\hline & & VAT \\((19,00 \\%)\\) & 2,07 \\\\\n\\hline & & Total Amount (EUR) & 12,99 \\\\\n\\hline\n\\end{tabular}\n\nAmount Due (EUR)",
  "confidence": 0,
  "is_printed": true,
  "request_id": "2023_09_30_94336ce084b4d02296b5g",
  "is_handwritten": false,
  "confidence_rate": 0,
  "auto_rotate_degrees": 0,
  "auto_rotate_confidence": 0,
  "version": "RSK-M119p2"
}

@sonnyjon

sonnyjon commented Oct 7, 2023

Something changed between v0.0.300 and v0.0.305 in pdf.py that is causing a TimeoutError:

  File "C:\Workspace\python\langchain\demo-projects\pdf-loaders\venv\lib\site-packages\langchain\document_loaders\pdf.py", line 435, in load
    contents = self.get_processed_pdf(pdf_id)
  File "C:\Workspace\python\langchain\demo-projects\pdf-loaders\venv\lib\site-packages\langchain\document_loaders\pdf.py", line 405, in get_processed_pdf
    self.wait_for_processing(pdf_id)
  File "C:\Workspace\python\langchain\demo-projects\pdf-loaders\venv\lib\site-packages\langchain\document_loaders\pdf.py", line 402, in wait_for_processing
    raise TimeoutError
TimeoutError

I messaged Mathpix support and they indicated that nothing changed on their end:

Hello Sonny, thank you for your report
Could you please provide both PDFs you tried to convert (the one that worked before and the one that's currently not)
We didn't make changes that should affect the process

Waiting for your reply, thanks!

Hi Sergey,

Hmm, I guess the problem must be on LangChain's side then. I upgraded from v0.0.300 to v0.0.305.

Please find my PDFs attached.

I guess I may have to forward my stack trace to the LangChain engineers, but it would definitely help me if you have an idea of what the problem might be (I don't know if you're familiar with their codebase).

Thanks,
Sonny

Hello, yes I've just checked and those two files are getting converted fine on our side
This is most likely caused on LangChain side.. but without any logs it's difficult to say what can be wrong there
Please keep us updated as well, once there will be news from LangChain

Thanks!

This is not caused by jreinjr's fix, since that solution was working in v0.0.300; the problem must be somewhere else.

@sonnyjon

sonnyjon commented Oct 14, 2023

Okay, I had to do some testing to determine which version change caused the TimeoutError. Looking at my own code, I see that it was working in v0.0.274 (with jreinjr's fix) and that I upgraded from there. As it turns out, I was wrong about it working in v0.0.300.

I thought it might have something to do with langsmith, since at some point upgrading langchain forces an upgrade to langsmith, but it still runs okay at langchain v0.0.288 and langsmith v0.0.43, so apparently has nothing to do with langsmith.

Using jreinjr's fix, the last version in which this worked was actually 0.0.288. Something changed between 0.0.288 and 0.0.289 (rows are listed in the order I tested them):

langchain | langsmith | result
0.0.275 | 0.0.26 | ok
0.0.275 | 0.0.27 | ok
0.0.275 | 0.0.28 | ok
0.0.275 | 0.0.29 | ok
0.0.281 | 0.0.29 | ok
0.0.290 | 0.0.29 | timeout
0.0.289 | 0.0.29 | timeout
0.0.285 | 0.0.29 | ok
0.0.286 | 0.0.29 | ok
0.0.287 | 0.0.29 | ok
0.0.288 | 0.0.29 | ok
0.0.288 | 0.0.43 | ok

Please look at what changed between 0.0.288 and 0.0.289 that might have caused this timeout. Thanks

@chadnorvell
Contributor

chadnorvell commented Nov 27, 2023

My observation is that the request is not really timing out; as you pointed out, the PDF does get processed. But in this section of the PDF loader code, we look for { "status": "error" } or { "status": "completed" } in the response body. That is correct for the processing status, but at least in what I've observed, a failed request does not come back in that shape. For example, the response I get is { "error": "400 Bad Request" }. There's no "status" key, so we just keep polling until the timeout limit is reached.

So I think there are two issues here:

  1. The code I linked above needs to correctly handle the response body format, so if we get back something like { "error": "<whatever>" } we raise the error instead of misleadingly timing out.
  2. Something's wrong with the way we're sending authorization data to the API, so we're not able to get PDF processing status or download the processed data.
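As a rough sketch of the first point (not the code from the eventual fix), response handling could distinguish the three cases like this. The function name `classify_response` and the returned labels "done" and "pending" are hypothetical; the only response shapes assumed are the ones quoted in this comment.

```python
from typing import Any, Dict


def classify_response(body: Dict[str, Any]) -> str:
    """Map a status-poll response body to "done" or "pending", raising on errors.

    Assumes only the shapes mentioned in this thread:
    {"status": "completed"}, {"status": "error"} and {"error": "<message>"}.
    """
    # A request-level failure has an "error" key and no "status" key at all.
    if "error" in body:
        raise ValueError(f"Mathpix request failed: {body['error']}")

    status = body.get("status")
    if status == "error":
        raise ValueError("Mathpix reported an error while processing the PDF")
    if status == "completed":
        return "done"
    # Anything else is treated as still in progress, so the caller keeps polling.
    return "pending"
```

With a check like this, a body such as { "error": "400 Bad Request" } fails on the first poll instead of surfacing as a TimeoutError after the full wait.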

I'm working on a fix for this, but I just wanted to drop my notes here first.

@chadnorvell
Contributor

I've got a fix working at #13949.

hwchase17 pushed a commit that referenced this issue Dec 3, 2023
- **Description:** Fixes the Mathpix PDF loader API integration.
Specifically, ensures that Mathpix auth headers are provided for every
request, and ensures that we recognize all errors that can occur during
a request. Also, the option to provide API keys as kwargs never actually
worked before, but now that's fixed too.
  - **Issue:** #11249
  - **Dependencies:** None
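For illustration, building the auth headers once and attaching them to every request might look like the sketch below. This is not the code from the commit. The helper name `mathpix_headers` is hypothetical, and the header names `app_id` and `app_key` and the environment variable names `MATHPIX_API_ID` and `MATHPIX_API_KEY` are assumptions that should be checked against the Mathpix and LangChain documentation.

```python
import os
from typing import Dict, Optional


def mathpix_headers(
    api_id: Optional[str] = None, api_key: Optional[str] = None
) -> Dict[str, str]:
    """Build auth headers from explicit arguments, falling back to the environment.

    The header and environment variable names are assumptions, not confirmed
    by this thread.
    """
    api_id = api_id or os.environ.get("MATHPIX_API_ID")
    api_key = api_key or os.environ.get("MATHPIX_API_KEY")
    if not api_id or not api_key:
        raise ValueError("Mathpix credentials are missing")
    return {"app_id": api_id, "app_key": api_key}
```

The same dictionary would then be passed as the headers of the upload request, each status poll, and the final download, so that no call goes out unauthenticated.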
@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 27, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 13, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Mar 13, 2024