RecursionError: maximum recursion depth exceeded #404

Closed
jjhuff opened this issue Mar 24, 2020 · 8 comments

Comments

@jjhuff

jjhuff commented Mar 24, 2020

Example url: http://www.accessdata.fda.gov/drugsatfda_docs/label/2019/021107s029lbl.pdf

Backtrace:

  File "amplion/utils/pdf.py", line 29, in extract_pdf_text_from_fp
    interpreter.process_page(page)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 896, in process_page
    self.device.end_page(page)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/converter.py", line 49, in end_page
    self.cur_item.analyze(self.laparams)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/layout.py", line 733, in analyze
    group.analyze(laparams)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/layout.py", line 501, in analyze
    LTTextGroup.analyze(self, laparams)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/layout.py", line 350, in analyze
    obj.analyze(laparams)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/layout.py", line 501, in analyze
    LTTextGroup.analyze(self, laparams)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/layout.py", line 350, in analyze
    obj.analyze(laparams)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/layout.py", line 501, in analyze
    LTTextGroup.analyze(self, laparams)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/layout.py", line 350, in analyze
    obj.analyze(laparams)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/layout.py", line 501, in analyze
    LTTextGroup.analyze(self, laparams)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/layout.py", line 350, in analyze
    obj.analyze(laparams)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/layout.py", line 501, in analyze
    LTTextGroup.analyze(self, laparams)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/layout.py", line 350, in analyze
    obj.analyze(laparams)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/layout.py", line 501, in analyze
    LTTextGroup.analyze(self, laparams)
.....
@pietermarsman
Member

Hi @jjhuff, thanks for raising this issue!

What version of pdfminer.six are you using? And what command are you using?

@jjhuff
Author

jjhuff commented Mar 24, 2020

I was using 20200124
The code is more or less this, but I believe it should also reproduce with pdf2txt:

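    import io

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    # fp is an open binary file object for the PDF
    laparams = LAParams()
    rsrcmgr = PDFResourceManager()
    with io.StringIO() as outfp:
        device = TextConverter(rsrcmgr, outfp, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
        device.close()
        text = outfp.getvalue()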

@pietermarsman
Member

I can replicate this using pdf2txt.py. The problem happens on page 5.

You can fix your problem by increasing the line_margin. Setting it to 2.0 fixes it for me. This merges more lines into text-boxes and makes the next analyse step (the one with the recursion error) easier.
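
A minimal sketch of that workaround, assuming the high-level `extract_text` helper and `LAParams` from pdfminer.six. The helper names `extract_with_retry` and `pdfminer_extract` and the sequence of margins are hypothetical, chosen only for illustration. It retries with a progressively larger `line_margin` whenever a `RecursionError` is raised:

```python
from typing import Callable, Iterable


def extract_with_retry(
    extract: Callable[[float], str],
    line_margins: Iterable[float] = (0.5, 1.0, 2.0),
) -> str:
    """Call extract(line_margin) with growing margins until one succeeds."""
    last_error = None
    for margin in line_margins:
        try:
            return extract(margin)
        except RecursionError as exc:
            # Layout analysis nested too deeply; try a larger margin next.
            last_error = exc
    if last_error is None:
        raise ValueError("line_margins must not be empty")
    raise last_error


def pdfminer_extract(path: str, line_margin: float) -> str:
    """Hypothetical wrapper around pdfminer.six for a single attempt."""
    # Imported lazily so the retry helper itself has no third-party dependency.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    return extract_text(path, laparams=LAParams(line_margin=line_margin))
```

It could be called as `extract_with_retry(lambda m: pdfminer_extract("021107s029lbl.pdf", m))`. A larger `line_margin` changes how lines are merged into text boxes, so the extracted text may differ from what the default settings would produce.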

I'm not sure what causes the problem so I would like to keep this issue open until we figure out if there is something we can do about it.

@metalogueur

Just ran into that same bug using version 20200726... Will wait for the issue to be solved... Thanks for your hard work!

@jonaswinkler

jonaswinkler commented Feb 20, 2021

I've also encountered this issue on a document after calling extract_text on version 20201018:

  File "/home/jonas/.local/share/virtualenvs/paperless-RxYopU_u/lib/python3.9/site-packages/pdfminer/high_level.py", line 121, in extract_text
    interpreter.process_page(page)
  File "/home/jonas/.local/share/virtualenvs/paperless-RxYopU_u/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 896, in process_page
    self.device.end_page(page)
  File "/home/jonas/.local/share/virtualenvs/paperless-RxYopU_u/lib/python3.9/site-packages/pdfminer/converter.py", line 50, in end_page
    self.cur_item.analyze(self.laparams)
  File "/home/jonas/.local/share/virtualenvs/paperless-RxYopU_u/lib/python3.9/site-packages/pdfminer/layout.py", line 814, in analyze
    group.analyze(laparams)
  File "/home/jonas/.local/share/virtualenvs/paperless-RxYopU_u/lib/python3.9/site-packages/pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "/home/jonas/.local/share/virtualenvs/paperless-RxYopU_u/lib/python3.9/site-packages/pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "/home/jonas/.local/share/virtualenvs/paperless-RxYopU_u/lib/python3.9/site-packages/pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "/home/jonas/.local/share/virtualenvs/paperless-RxYopU_u/lib/python3.9/site-packages/pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "/home/jonas/.local/share/virtualenvs/paperless-RxYopU_u/lib/python3.9/site-packages/pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)

I believe that this is the same issue and the different line numbers are caused by changes somewhere else in that file.

@rain01

rain01 commented Mar 4, 2022

Recently I have started seeing this same issue in my error logs.
I am using pdfminer.six==20211012 with:

<LAParams: char_margin=2.0, line_margin=0.1, word_margin=0.1 all_texts=True>

Increasing line_margin helped as a temporary solution. Any idea if there is a better way?
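
One other stopgap, which nobody in this thread has suggested or verified against these PDFs, is to raise Python's recursion limit only while the extraction runs. The sketch below uses just the standard library; the name `recursion_limit` is hypothetical. Setting the limit very high can exhaust the C stack and crash the interpreter, so treat it as an assumption to test rather than a fix:

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit: int):
    """Temporarily raise the interpreter's recursion limit, then restore it."""
    previous = sys.getrecursionlimit()
    # Never lower the limit below what was already configured.
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)
```

The extraction call would then go inside `with recursion_limit(5000): ...`, where 5000 is an arbitrary illustrative value.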

@pietermarsman
Member

Fixed in the current version.

The issue was partially fixed by @0xabu in #659. After that PR, pdf2txt.py ran indefinitely on page 5.

This issue was then completely fixed by @jwyawney in #689. I guess that by ignoring empty characters the layout algorithm generates fewer LTTextLine objects, and therefore the grouping is easier. Or something along those lines.
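
To illustrate why fewer text lines can make the grouping step easier, here is a toy model, not pdfminer's actual code. It assumes groups are merged pairwise, so the nesting depth can grow with the number of lines, and that `analyze` recurses once per nesting level as in the tracebacks above. `ToyGroup`, `build_chain`, and `analyze_ok` are hypothetical names:

```python
import sys


class ToyGroup:
    """Toy stand-in for a nested text group; not pdfminer's LTTextGroup."""

    def __init__(self, children):
        self.children = children

    def analyze(self):
        # One Python stack frame per nesting level.
        for child in self.children:
            if isinstance(child, ToyGroup):
                child.analyze()


def build_chain(n_lines):
    """Merge n_lines leaves pairwise into a left-deep tree of groups."""
    group = ToyGroup(["line 0", "line 1"])
    for i in range(2, n_lines):
        group = ToyGroup([group, f"line {i}"])
    return group


def analyze_ok(n_lines):
    """Return True if analyzing a chain of n_lines stays within the limit."""
    try:
        build_chain(n_lines).analyze()
        return True
    except RecursionError:
        return False
```

In this model a page with a few dozen lines is analyzed without trouble, while a chain deeper than `sys.getrecursionlimit()` fails with the same `RecursionError`.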

@Rumpelcita

Hi, I'm still running into this issue on version 20220524 of pdfminer.six. I can't provide the pdf that causes the issue since it needs to be anonymized, but these are the LAParams we're using for pdfminer:

LAParams(
    line_overlap=0.5,
    char_margin=1.1,
    word_margin=0.2,
    line_margin=0.5,
    boxes_flow=0.5,
    all_texts=True,
)

Traceback:

RecursionError: maximum recursion depth exceeded
(245 additional frame(s) were not displayed)
...
  File "pdfminer/layout.py", line 705, in analyze
    super().analyze(laparams)
  File "pdfminer/layout.py", line 439, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 705, in analyze
    super().analyze(laparams)
  File "pdfminer/layout.py", line 439, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 439, in analyze
    obj.analyze(laparams)

maximum recursion depth exceeded
