LTTextLineHorizontal nested immediately under LTPage #763

Open
lifepillar opened this issue Jun 3, 2022 · 7 comments
Labels
component: converter Related to any PDFLayoutAnalyzer good first issue A good first issue for first-time contributors status: accepted type: bug

Comments

@lifepillar

lifepillar commented Jun 3, 2022

With the demo PDF from this page (direct link to PDF), pdfminer.six places a few LTTextLineHorizontal objects directly under the LTPage object. I don't think this is expected: for instance, it breaks the script in your documentation:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
for page_layout in extract_pages("demo1.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        print(character.fontname)
                        print(character.size)

It fails with:

TypeError: 'LTChar' object is not iterable

Edit: completed the report with error message.
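
A possible workaround, while the hierarchy is not guaranteed, is to stop assuming a fixed nesting depth and walk the layout tree recursively. The sketch below is illustrative and not taken from the pdfminer.six documentation: `iter_instances` and `print_fonts` are hypothetical helper names, and only `extract_pages` and `LTChar` come from the library.

```python
from typing import Any, Iterator, Type


def iter_instances(node: Any, wanted: Type) -> Iterator[Any]:
    """Yield every `wanted` instance at or below `node`, depth first."""
    if isinstance(node, wanted):
        yield node
        return
    # Strings are iterable but never hold layout objects; skipping them
    # also avoids infinite recursion on single-character strings.
    if isinstance(node, (str, bytes)):
        return
    try:
        children = iter(node)
    except TypeError:
        return  # non-iterable object of another type
    for child in children:
        yield from iter_instances(child, wanted)


def print_fonts(pdf_path: str) -> None:
    """Print font name and size of every LTChar, at any nesting depth."""
    # Imported lazily so the generic helper above has no hard dependency.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar

    for page_layout in extract_pages(pdf_path):
        for character in iter_instances(page_layout, LTChar):
            print(character.fontname)
            print(character.size)
```

Calling `print_fonts("demo1.pdf")` should then visit characters whether their text line sits inside a text box or directly under the page.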

@lifepillar
Author

I have found a related issue: #526.

Btw, I am using pdfminer.six v20220524.

@hellpanderrr

This started happening with version 20220319, possibly related to #659.

Reproducible example:

import urllib.request
from io import BytesIO

import pdfminer.high_level

pdf_url = 'https://www.orimi.com/pdf-test.pdf'
pdfminer_page = list(
    pdfminer.high_level.extract_pages(BytesIO(urllib.request.urlopen(pdf_url).read()))
)[0]

text_boxes = [i for i in pdfminer_page if hasattr(i, "get_text")]
print(text_boxes)

Output before version 20220319:
20211012:

[<LTTextBoxHorizontal(0) 197.400,660.468,200.736,672.468 ' \n'>,
 <LTTextBoxHorizontal(1) 72.000,455.448,532.853,661.368 ' \nPDF Test File \n \nCongratulations, your computer is equipped with a PDF (Portable Document Format) \nreader!  You should be able to view any of the PDF documents and forms available on \nour site.  PDF forms are indicated by these icons: \n \nYukon Department of Education \nBox 2703 \nWhitehorse,Yukon \nCanada \nY1A 2C6 \n \nPlease visit our website at:  http://www.education.gov.yk.ca/\n   \n'>,
 <LTTextBoxHorizontal(2) 348.900,579.588,372.962,591.588 '  or  \n'>,
 <LTTextBoxHorizontal(3) 384.900,579.588,398.284,591.588 '.   \n'>]

20191107 (same structure; only the top y-coordinates of the boxes differ):

[<LTTextBoxHorizontal(0) 197.400,660.468,200.736,676.440 ' \n'>,
 <LTTextBoxHorizontal(1) 72.000,455.448,532.853,665.340 ' \nPDF Test File \n \nCongratulations, your computer is equipped with a PDF (Portable Document Format) \nreader!  You should be able to view any of the PDF documents and forms available on \nour site.  PDF forms are indicated by these icons: \n \nYukon Department of Education \nBox 2703 \nWhitehorse,Yukon \nCanada \nY1A 2C6 \n \nPlease visit our website at:  http://www.education.gov.yk.ca/\n   \n'>,
 <LTTextBoxHorizontal(2) 348.900,579.588,372.962,595.560 '  or  \n'>,
 <LTTextBoxHorizontal(3) 384.900,579.588,398.284,595.560 '.   \n'>]

After (20220319):

[<LTTextBoxHorizontal(0) 72.000,635.568,148.627,647.568 'PDF Test File \n'>,
 <LTTextBoxHorizontal(1) 72.000,579.588,532.853,619.968 'Congratulations, your computer is equipped with a PDF (Portable Document Format) \nreader!  You should be able to view any of the PDF documents and forms available on \nour site.  PDF forms are indicated by these icons: \n'>,
 <LTTextBoxHorizontal(2) 348.900,579.588,372.962,591.588 '  or  \n'>,
 <LTTextBoxHorizontal(3) 384.900,579.588,398.284,591.588 '.   \n'>,
 <LTTextBoxHorizontal(4) 72.000,496.848,245.380,564.048 'Yukon Department of Education \nBox 2703 \nWhitehorse,Yukon \nCanada \nY1A 2C6 \n'>,
 <LTTextBoxHorizontal(5) 72.000,469.248,389.460,481.248 'Please visit our website at:  http://www.education.gov.yk.ca/\n'>,
 <LTTextLineHorizontal 197.400,660.468,200.736,672.468 ' \n'>,
 <LTTextLineHorizontal 72.000,649.368,75.336,661.368 ' \n'>,
 <LTTextLineHorizontal 72.000,621.768,75.336,633.768 ' \n'>,
 <LTTextLineHorizontal 72.000,565.848,75.336,577.848 ' \n'>,
 <LTTextLineHorizontal 72.000,483.048,75.336,495.048 ' \n'>,
 <LTTextLineHorizontal 72.000,455.448,82.061,467.448 '   \n'>]

20220524:

[<LTTextBoxHorizontal(0) 72.000,635.568,148.627,647.568 'PDF Test File \n'>,
 <LTTextBoxHorizontal(1) 72.000,579.588,532.853,619.968 'Congratulations, your computer is equipped with a PDF (Portable Document Format) \nreader!  You should be able to view any of the PDF documents and forms available on \nour site.  PDF forms are indicated by these icons: \n'>,
 <LTTextBoxHorizontal(2) 348.900,579.588,372.962,591.588 '  or  \n'>,
 <LTTextBoxHorizontal(3) 384.900,579.588,398.284,591.588 '.   \n'>,
 <LTTextBoxHorizontal(4) 72.000,496.848,245.380,564.048 'Yukon Department of Education \nBox 2703 \nWhitehorse,Yukon \nCanada \nY1A 2C6 \n'>,
 <LTTextBoxHorizontal(5) 72.000,469.248,389.460,481.248 'Please visit our website at:  http://www.education.gov.yk.ca/\n'>,
 <LTTextLineHorizontal 197.400,660.468,200.736,672.468 ' \n'>,
 <LTTextLineHorizontal 72.000,649.368,75.336,661.368 ' \n'>,
 <LTTextLineHorizontal 72.000,621.768,75.336,633.768 ' \n'>,
 <LTTextLineHorizontal 72.000,565.848,75.336,577.848 ' \n'>,
 <LTTextLineHorizontal 72.000,483.048,75.336,495.048 ' \n'>,
 <LTTextLineHorizontal 72.000,455.448,82.061,467.448 '   \n'>]
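
To compare versions without reading the full repr output, the top-level elements can be tallied by class name. This is a minimal sketch; `count_top_level_types` is a hypothetical helper and relies only on the elements being ordinary Python objects.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_top_level_types(elements: Iterable[Any]) -> Dict[str, int]:
    """Count elements by class name, e.g. how many boxes versus lines."""
    return dict(Counter(type(element).__name__ for element in elements))
```

Applied to the `text_boxes` list above, the listings correspond to 4 LTTextBoxHorizontal entries for 20211012, and to 6 LTTextBoxHorizontal plus 6 LTTextLineHorizontal entries for 20220319 and 20220524.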

@pietermarsman
Member

This was introduced by: 43c8fc8

@pietermarsman
Member

This happens because these text lines contain only whitespace. Previously, all text lines with zero width or height were added directly under the page object. After the change, text lines containing only whitespace are also added directly under the page.

I guess it is preferable if the hierarchy is always the same: always LTPage -> LTTextBox -> LTTextLine -> LTChar.

The empty text lines on this line (https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/layout.py#L949) need to be wrapped in an LTTextBox.
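
Until that is fixed in the library, the same invariant can be approximated on the caller's side by wrapping each stray line in its own box after extraction. This is only a sketch under assumptions, not the proposed patch to layout.py: `wrap_stray_lines` and `normalized_page_elements` are hypothetical names, and the sketch assumes that `LTTextBoxHorizontal()` can be constructed without arguments and filled through its `add` method.

```python
from typing import Any, Callable, Iterable, List, Type


def wrap_stray_lines(
    elements: Iterable[Any],
    line_type: Type,
    box_factory: Callable[[], Any],
) -> List[Any]:
    """Return the elements with every top-level line wrapped in a new box."""
    normalized = []
    for element in elements:
        if isinstance(element, line_type):
            box = box_factory()
            box.add(element)  # the box now contains exactly this one line
            normalized.append(box)
        else:
            normalized.append(element)  # boxes, figures, etc. pass through
    return normalized


def normalized_page_elements(page_layout: Any) -> List[Any]:
    """Apply the wrapping to one LTPage using the pdfminer.six classes."""
    # Imported lazily so the generic helper above has no hard dependency.
    from pdfminer.layout import LTTextBoxHorizontal, LTTextLineHorizontal

    return wrap_stray_lines(page_layout, LTTextLineHorizontal, LTTextBoxHorizontal)
```

Each whitespace-only line ends up in a separate box here, which differs from grouping them the way the layout analyzer groups regular lines; the wrapped boxes also keep whatever default index the constructor assigns.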

@pietermarsman pietermarsman added type: bug component: converter Related to any PDFLayoutAnalyzer labels Jun 25, 2022
@pietermarsman pietermarsman added the good first issue A good first issue for first-time contributors label Aug 8, 2022
@KunalGehlot
Contributor

I'm picking up this issue.

KunalGehlot pushed a commit to KunalGehlot/pdfminer.six that referenced this issue Aug 24, 2022
@KunalGehlot
Contributor

KunalGehlot commented Aug 24, 2022

This commit fixes the issue, but I'm unsure if it's the ideal way.

I tested the code with @lifepillar's script and manually checked the hierarchy of the LT objects.

However, running nox fails with tests/test_layout.py:130: AssertionError and tests/test_layout.py:148: AssertionError. Those tests hardcode assert len(textboxes) == 3, so they now raise AssertionError: assert 7 == 3.

Update: I've removed the branch to avoid confusion.
All I did was add these two lines after textboxes = list(self.group_textlines(laparams, textlines)):

empties = list(self.group_textlines(laparams, empties))
textboxes.extend(empties)

@pietermarsman
Copy link
Member

@KunalGehlot Can you create a PR with that specific commit so that I can review and merge it?


4 participants