MathpixPDFLoader never finishes #11249

fincha · 2023-09-30T12:12:22Z

System Info

langchain==0.0.305
MacOS M1 Silicon

Who can help?

No response

Information

The official example notebooks/scripts
My own modified scripts

Related Components

Reproduction

I have a basic method for loading a PDF via Mathpix:

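from langchain.document_loaders import MathpixPDFLoader

def load_pdf(path):
    # Upload the PDF to Mathpix and wait for the processed result.
    loader = MathpixPDFLoader(path)
    pages = loader.load()
    return pages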

pages = load_pdf("example_data/DOF-230519AFYXBECVECGY-0028747724.pdf")

But when I run this, I only get this output:

Status: None, waiting for processing to complete
Status: None, waiting for processing to complete
Status: None, waiting for processing to complete
Status: None, waiting for processing to complete
Status: None, waiting for processing to complete
Status: None, waiting for processing to complete
Status: None, waiting for processing to complete
...

In the Mathpix backend, I can see that the document has already been parsed.

Any idea on this?

Expected behavior

Expect the document to return.

dosubot · 2023-09-30T12:17:12Z

🤖

Hello,

Thank you for reaching out and providing detailed information about your issue. Based on the context provided, it seems like there was a similar issue reported in the past where the MathpixPDFLoader was not working as expected. The issue was due to a typo in the processed_file_format variable in the __init__ function of the MathpixPDFLoader class. The variable was set to "mmd" by default, which was causing the issue. Changing it to "md" resolved the issue. This issue was fixed in LangChain version 0.0.305. You can find more details about this issue here.

However, since you're using LangChain version 0.0.305, this should not be the problem. Another possible reason could be a timeout error. The wait_for_processing function in the MathpixPDFLoader waits for a maximum of max_wait_time_seconds for the processing to complete. If the processing is not completed within this time, a TimeoutError is raised. If the document appears to be already parsed in the Mathpix backend but the loader is still waiting, it's possible that the processing is taking longer than max_wait_time_seconds. You might want to consider increasing this value if possible, or checking if there are any issues with the Mathpix backend that could be causing delays.

If none of the above solutions work, could you please provide more information about the document you're trying to process? For example, the size of the document, the complexity of the content, etc. This information could help us better understand the issue and find a solution.

Best,
Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

fincha · 2023-09-30T12:41:19Z

71681972252-1.pdf

I am parsing this PDF. The timeout is not the issue: the default is 500 seconds, and the result is visible in the Mathpix backend:



{
  "text": "\\section{Invoice}\n\nInvoice No: 6382196379516891971\nFor questions go to\n\nhttp://support.microsoft.com/ and choose your product for support.\n\nDate: \\(10 / 06 / 2023\\)\n\n\\title{\nGermany\n}\n\n\\begin{tabular}{|c|c|c|c|}\n\\hline Item & Qty & Amount & \\\\\n\\hline \\multirow[t]{3}{*}{ Xbox Game Pass Ultimate } & \\multirow[t]{3}{*}{1,0} & Total Pre-Tax Charges & 10,92 \\\\\n\\hline & & VAT \\((19,00 \\%)\\) & 2,07 \\\\\n\\hline & & Total Amount (EUR) & 12,99 \\\\\n\\hline\n\\end{tabular}\n\nAmount Due (EUR)",
  "confidence": 0,
  "is_printed": true,
  "request_id": "2023_09_30_94336ce084b4d02296b5g",
  "is_handwritten": false,
  "confidence_rate": 0,
  "auto_rotate_degrees": 0,
  "auto_rotate_confidence": 0,
  "version": "RSK-M119p2"
}

sonnyjon · 2023-10-07T17:05:35Z

Something changed between v0.0.300 and v0.0.305 in pdf.py that is causing a TimeoutError:

  File "C:\Workspace\python\langchain\demo-projects\pdf-loaders\venv\lib\site-packages\langchain\document_loaders\pdf.py", line 435, in load
    contents = self.get_processed_pdf(pdf_id)
  File "C:\Workspace\python\langchain\demo-projects\pdf-loaders\venv\lib\site-packages\langchain\document_loaders\pdf.py", line 405, in get_processed_pdf
    self.wait_for_processing(pdf_id)
  File "C:\Workspace\python\langchain\demo-projects\pdf-loaders\venv\lib\site-packages\langchain\document_loaders\pdf.py", line 402, in wait_for_processing
    raise TimeoutError
TimeoutError

I messaged Mathpix support and they indicated that nothing changed on their end:

Hello Sonny, thank you for your report
Could you please provide both PDFs you tried to convert (the one that worked before and the one that's currently not)
We didn't make changes that should affect the process

Waiting for your reply, thanks!

Hi Sergey,

Hmm, I guess the problem must be on LangChain's side then. I upgraded from v0.0.300 to v0.0.305.

Please find my PDFs attached.

I guess I may have to forward my stack trace to the LangChain engineers, but it would definitely help me if you have an idea of what the problem might be (I don't know if you're familiar with their codebase).

Thanks,
Sonny

Hello, yes I've just checked and those two files are getting converted fine on our side
This is most likely caused on LangChain side.. but without any logs it's difficult to say what can be wrong there
Please keep us updated as well, once there will be news from LangChain

Thanks!

This is not caused by jreinjr's fix, since their solution was working in v0.0.300; the problem must be somewhere else.

sonnyjon · 2023-10-14T21:48:29Z

Okay, I had to do some testing to determine which version change caused the TimeoutError issue. From looking at my own code, I see that it was working in v0.0.274 (with jreinjr's fix) and I upgraded from there. As it turns out, I was wrong about it working in v0.0.300.

I thought it might have something to do with langsmith, since at some point upgrading langchain forces an upgrade to langsmith, but it still runs okay at langchain v0.0.288 and langsmith v0.0.43, so apparently has nothing to do with langsmith.

Using jreinjr's fix, the last version in which this worked was actually 0.0.288. Something changed between 0.0.288 and 0.0.289:

langchain | 0.0.275 | langsmith | 0.0.26 | ok
langchain | 0.0.275 | langsmith | 0.0.27 | ok
langchain | 0.0.275 | langsmith | 0.0.28 | ok
langchain | 0.0.275 | langsmith | 0.0.29 | ok
langchain | 0.0.281 | langsmith | 0.0.29 | ok
langchain | 0.0.290 | langsmith | 0.0.29 | timeout
langchain | 0.0.289 | langsmith | 0.0.29 | timeout
langchain | 0.0.285 | langsmith | 0.0.29 | ok
langchain | 0.0.286 | langsmith | 0.0.29 | ok
langchain | 0.0.287 | langsmith | 0.0.29 | ok
langchain | 0.0.288 | langsmith | 0.0.29 | ok
langchain | 0.0.288 | langsmith | 0.0.43 | ok

Please look at what changed between 0.0.288 and 0.0.289 that might have caused this timeout. Thanks
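The manual version hunt above can also be done as a binary search. The following is a minimal sketch, not something from the thread: `first_bad_version` and `is_ok` are hypothetical names, and it assumes the versions are sorted oldest to newest, the first one works, the last one fails, and the failure persists once it appears. The results dictionary reuses the outcomes reported in the table above.

```python
from typing import Callable, Sequence


def first_bad_version(versions: Sequence[str], is_ok: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_ok() is False.

    Assumes versions[0] is known good and versions[-1] is known bad.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] ok, versions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_ok(versions[mid]):
            lo = mid
        else:
            hi = mid
    return versions[hi]


# Outcomes copied from the table in this comment (True means "ok").
observed = {
    "0.0.275": True, "0.0.281": True, "0.0.285": True, "0.0.286": True,
    "0.0.287": True, "0.0.288": True, "0.0.289": False, "0.0.290": False,
}
versions = list(observed)  # already ordered oldest to newest
culprit = first_bad_version(versions, lambda v: observed[v])
print(culprit)
```

In practice `is_ok` would install the given version in a fresh environment and run the loader; here it only looks up the recorded result.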

chadnorvell · 2023-11-27T22:16:11Z

My observation is that the request is not really timing out; as you pointed out, the PDF does get processed. But in this section of the PDF loader code, we look for { "status": "error" } or { "status": "completed" } in the response body. That is the correct format for the processing status, but at least in what I've observed, a failed request does not come back in that format. For example, the response I get is { "error": "400 Bad Request" }. There's no "status" key, so we just keep polling until the timeout limit is reached.

So I think there are two issues here:

1. The code I linked above needs to correctly handle the response body format, so if we get back something like { "error": "<whatever>" } we raise the error instead of misleadingly timing out.
2. Something's wrong with the way we're sending authorization data to the API, so we're not able to get PDF processing status or download the processed data.
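To illustrate the first point, here is a rough sketch of a polling loop that raises on an error body instead of polling until the timeout. It is not the actual LangChain code or the eventual fix: `fetch_status`, `poll_interval_seconds`, and the injectable `sleep` are assumptions made so the example is self-contained, and only `max_wait_time_seconds`, the "status" values, and the { "error": ... } shape come from this thread.

```python
import time
from typing import Any, Callable, Dict


def wait_for_processing(
    fetch_status: Callable[[], Dict[str, Any]],  # hypothetical: returns the parsed response body
    max_wait_time_seconds: int = 500,
    poll_interval_seconds: float = 5.0,          # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until processing completes, raising on any error response."""
    waited = 0.0
    while waited < max_wait_time_seconds:
        body = fetch_status()
        # A failed request may carry a top-level "error" key and no "status" key.
        if "error" in body:
            raise ValueError(f"Mathpix request failed: {body['error']}")
        status = body.get("status")
        if status == "completed":
            return
        if status == "error":
            raise ValueError("Mathpix reported a processing error")
        # Any other status (or none at all) means processing is still pending.
        sleep(poll_interval_seconds)
        waited += poll_interval_seconds
    raise TimeoutError(
        f"Processing did not complete within {max_wait_time_seconds} seconds"
    )
```

With this shape, a body like { "error": "400 Bad Request" } surfaces immediately as an exception that names the real problem, and TimeoutError is reserved for requests that are genuinely still pending.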

I'm working on a fix for this, but I just wanted to drop my notes here first.

chadnorvell · 2023-11-28T05:27:01Z

I've got a fix working at #13949.

- **Description:** Fixes the Mathpix PDF loader API integration. Specifically, ensures that Mathpix auth headers are provided for every request, and ensures that we recognize all errors that can occur during a request. Also, the option to provide API keys as kwargs never actually worked before, but now that's fixed too.
- **Issue:** #11249
- **Dependencies:** None
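As a rough illustration of the auth part of that description, the sketch below resolves credentials from kwargs first and the environment second, then builds one headers dict that can be attached to every request. This is not the code from the PR: `mathpix_headers` is a hypothetical helper, and the kwarg names (`mathpix_api_id`, `mathpix_api_key`), environment variable names (`MATHPIX_API_ID`, `MATHPIX_API_KEY`), and header names (`app_id`, `app_key`) are assumptions to check against the loader source and the Mathpix API documentation.

```python
import os
from typing import Dict, Mapping, Optional


def mathpix_headers(
    kwargs: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Build auth headers, preferring explicit kwargs over environment variables."""
    env = os.environ if env is None else env
    # Assumed kwarg and environment variable names; verify before relying on them.
    app_id = kwargs.get("mathpix_api_id") or env.get("MATHPIX_API_ID")
    app_key = kwargs.get("mathpix_api_key") or env.get("MATHPIX_API_KEY")
    if not app_id or not app_key:
        raise ValueError("Mathpix credentials missing: set kwargs or environment variables")
    # Assumed header names; the same dict should accompany every API request.
    return {"app_id": app_id, "app_key": app_key}
```

Building the headers once and passing them to the upload, status, and download requests alike avoids the situation described earlier in the thread, where status polling failed with a 400 response.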

dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Sep 30, 2023

chadnorvell mentioned this issue Nov 28, 2023

Fix Mathpix PDF loader integration #13949

Merged

dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 27, 2024

dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 13, 2024

dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Mar 13, 2024
