LTTextLineHorizontal nested immediately under LTPage #763
Comments
I have found a related issue: #526. Btw, I am using Pdfminer.six v20220524. |
This started happening from Reproducible example:
Before
After (
|
This was introduced by: 43c8fc8 |
This happens because these text lines contain only whitespace. Previously, all text lines with zero width or height were added directly under the page object. After the change, text lines containing only whitespace are also added directly under the page. I guess it is preferable if the hierarchy is always the same: always LTPage -> LTTextBox -> LTTextLine -> LTChar. The empty text lines on this line (https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/layout.py#L949) need to be wrapped in an LTTextBox. |
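For anyone who wants to check whether a given PDF is affected, here is a minimal sketch that lists the text lines sitting directly under a page instead of inside a text box. The helper names `stray_text_lines` and `report_stray_lines` are made up for illustration and are not part of pdfminer.six; the sketch only assumes the public `extract_pages` function and the `LTTextLine` class.

```python
def stray_text_lines(page, line_type):
    """Return the direct children of `page` that are text lines.

    In the expected hierarchy, text lines only appear inside text boxes,
    so any text line found at this level is one of the "stray" lines
    described above.
    """
    return [element for element in page if isinstance(element, line_type)]


def report_stray_lines(pdf_path):
    """Print every stray text line found in the PDF at `pdf_path`."""
    # Imported lazily so stray_text_lines() can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextLine

    for page in extract_pages(pdf_path):
        for line in stray_text_lines(page, LTTextLine):
            # repr() makes whitespace-only text visible in the output.
            print(page.pageid, repr(line.get_text()), line.bbox)
```

If the explanation above is right, every line reported this way should contain only whitespace or have zero width or height.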
I'm picking up this issue. |
This commit fixes the issue, but I'm unsure if it's the ideal way. I tested the code with @lifepillar's code and manually checked the hierarchy of the LT objects. But I'm getting
Update: I've removed the branch to avoid confusion.
    empties = list(self.group_textlines(laparams, empties))
    textboxes.extend(empties) |
@KunalGehlot Can you create a PR with that specific commit such that I can review and merge it? |
With the demo PDF from this page (direct link to PDF), pdfminer.six parses a few LTTextLineHorizontal objects immediately under the LTPage object. I don't think this is expected: for instance, it breaks the script in your documentation: with:
Edit: completed the report with error message.
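Until the hierarchy is consistent, one possible workaround is to walk the layout tree recursively instead of assuming a fixed nesting depth. This is only a sketch: `iter_instances` and `char_fonts` are hypothetical helper names, not pdfminer.six API, and the sketch assumes that containers are iterable while leaf objects such as LTChar are not.

```python
def iter_instances(node, wanted):
    """Yield every object of type `wanted` found at any depth under `node`."""
    if isinstance(node, wanted):
        yield node
        return
    try:
        children = iter(node)
    except TypeError:
        # Not a container and not the type we want: nothing to yield.
        return
    for child in children:
        yield from iter_instances(child, wanted)


def char_fonts(pdf_path):
    """Yield (fontname, size) for each character, whatever its nesting depth."""
    # Imported lazily so iter_instances() can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar

    for page in extract_pages(pdf_path):
        for char in iter_instances(page, LTChar):
            yield char.fontname, char.size
```

Because the walk does not care whether a text line sits under a text box or directly under the page, it should visit the characters in both cases.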