Added in checks for spurious lines in malformed PDFs #689

Merged
merged 14 commits into from
Feb 22, 2022

Conversation

@jwyawney jwyawney commented Nov 9, 2021

This pull request fixes issue #449. I had a similar issue with PDFs containing lines that consist only of spaces, which were grouped with adjacent lines based on the layout thresholds. Fixing it required checking for newlines in other lines within the LTTextLineHorizontal class, and also excluding these spurious lines in the LTLayoutContainer class.
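To illustrate the failure mode described above, here is a minimal, self-contained sketch. It does not use pdfminer.six: `Line`, `is_spurious` and `group_lines` are hypothetical stand-ins, and the grouping rule (a single vertical gap threshold) is a deliberate simplification of the real layout thresholds. It only shows how a whitespace-only line can bridge two lines that would otherwise be kept apart.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical stand-in for a text line: vertical position plus its text."""
    y: float
    text: str


def is_spurious(line: Line) -> bool:
    # A line is treated as spurious when it holds nothing but whitespace.
    return line.text.strip() == ""


def group_lines(lines: List[Line], max_gap: float,
                skip_spurious: bool = True) -> List[List[Line]]:
    """Group lines top to bottom; a line joins the current group when the
    vertical gap to the previous line is at most max_gap (assumed rule)."""
    if skip_spurious:
        lines = [line for line in lines if not is_spurious(line)]
    groups: List[List[Line]] = []
    for line in sorted(lines, key=lambda item: -item.y):
        if groups and groups[-1][-1].y - line.y <= max_gap:
            groups[-1].append(line)
        else:
            groups.append([line])
    return groups


lines = [Line(100, "Heading\n"), Line(90, "   \n"), Line(80, "Body text\n")]

# With the whitespace-only line kept, it bridges the two real lines.
bridged = group_lines(lines, max_gap=12, skip_spurious=False)

# With it excluded, the 20-unit gap exceeds the threshold and the lines split.
separated = group_lines(lines, max_gap=12)
```

Here `bridged` contains a single group of three lines, while `separated` contains two groups of one line each.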

I have tested this against my own files, which were causing the errors, and have also added the contributed samples of malformed PDFs along with the tests/test_malformed.py unittest file. Please run it from the top-level directory using:

python -m unittest tests/test_malformed.py

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial version
  • I have updated the README.md or I have verified that this is not necessary
  • I have updated the readthedocs documentation or I verified that this is not necessary
  • I have added a concise human-readable description of the change to CHANGELOG.md

@pietermarsman
Member

@jwyawney I simplified the code a bit by moving the empty check to the already existing LTTextLine.is_empty().

I think that has the same effect. Can you check if this achieves the same thing?
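For readers following along, one plausible shape for such a check is sketched below. This is an assumption for illustration, not the merged code: `Component` and `TextLine` are hypothetical stand-ins rather than the pdfminer.six classes, and the rule shown (empty bounding box, or text that is only whitespace) is a guess at what "moving the empty check" amounts to.

```python
class Component:
    """Hypothetical stand-in for a layout component with a bounding box."""

    def __init__(self, width: float, height: float) -> None:
        self.width = width
        self.height = height

    def is_empty(self) -> bool:
        # Geometric check: a box with no area holds nothing.
        return self.width <= 0 or self.height <= 0


class TextLine(Component):
    """Hypothetical stand-in for a text line that also carries its text."""

    def __init__(self, width: float, height: float, text: str) -> None:
        super().__init__(width, height)
        self.text = text

    def get_text(self) -> str:
        return self.text

    def is_empty(self) -> bool:
        # Assumed rule: empty if the box is empty OR the text is only
        # whitespace. Note that "".isspace() is False, so a line with no
        # text at all is only caught by the geometric check.
        return super().is_empty() or self.get_text().isspace()


whitespace_line = TextLine(10, 5, "   \n")
normal_line = TextLine(10, 5, "abc\n")
zero_width_line = TextLine(0, 5, "abc\n")
```

Putting the whitespace test inside `is_empty()` means any caller that already skips empty components would skip whitespace-only lines too, without a separate filter at each call site.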
