Version pdfminer.six 20191107 incorrectly orders some text #334
Hi @lithiumFlower, thanks for raising this issue. I think this is caused by PR #315, which sped up layout analysis by 20% to 500%, depending on the PDF. It should not have degraded the results, but in this case it clearly did. I will try to pinpoint what's going wrong.
pietermarsman added the "type: bug" and "component: converter" (Related to any PDFLayoutAnalyzer) labels on Nov 8, 2019
It is fixed in the latest version!
I'm not used to open source moving quickly - thanks @pietermarsman for the fix
👍
After upgrading pdfminer from version 20181108 to 20191107, some words are parsed out of order.
In version 20181108 the ordering was correct, see first output below.
In version 20191107 the ordering is incorrect, see second output below.
In this PDF: http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf
The fourth bullet point ends with the word "anyone". When parsing, "anyone" now ends up at the end of the third bullet point instead (directly following "incompatibilities").
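The symptom above can be checked mechanically once text has been extracted by whatever means. The following is only a minimal sketch: the helper name `word_follows` and both sample strings are hypothetical stand-ins made up for illustration, not actual pdfminer output and not the parsing logic used in this report.

```python
import re


def word_follows(text: str, first: str, second: str) -> bool:
    """Return True if `second` is the very next word after any occurrence of `first`.

    Hypothetical helper: words are compared case-insensitively and
    punctuation between them is ignored.
    """
    words = re.findall(r"\w+", text.lower())
    first, second = first.lower(), second.lower()
    # Compare each word with its immediate successor.
    return any(a == first and b == second for a, b in zip(words, words[1:]))


# Made-up strings standing in for extracted text (assumption, not real output).
expected_order = "third point about incompatibilities. fourth point is for anyone."
misordered = "third point about incompatibilities anyone. fourth point is for"

print(word_follows(expected_order, "incompatibilities", "anyone"))
print(word_follows(misordered, "incompatibilities", "anyone"))
```

Running such a check against the text produced by each installed version would flag the regression when "anyone" lands directly after "incompatibilities".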
Sample parsing logic:
20181108 output
20191107 output