[performance] Disable advanced layout analysis #50

Closed
jstockwin opened this issue Mar 17, 2020 · 1 comment · Fixed by #88

Comments

@jstockwin
Owner

jstockwin commented Mar 17, 2020

I noticed that setting boxes_flow outside its documented range (-1.0 to +1.0) actually disables PDFMiner's advanced layout analysis.

We don't need the advanced analysis, since we have no hierarchy of text boxes and we order them ourselves, and leaving it out is quite a performance gain.
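To illustrate why pdfminer's own ordering is redundant for us, here is a rough sketch of ordering boxes purely by position. The `Box` type and `order_boxes` helper are hypothetical names for illustration only, and the sort key (top to bottom, then left to right) is an assumption, not necessarily the exact ordering this project uses.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical stand-in for a text box with a PDF bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def order_boxes(boxes: List[Box]) -> List[Box]:
    # PDF coordinates start at the bottom-left of the page, so a larger y1
    # means the box sits higher up. Sort top to bottom, then left to right.
    return sorted(boxes, key=lambda box: (-box.y1, box.x0))


boxes = [
    Box("footer", 0, 0, 10, 10),
    Box("top right", 50, 90, 60, 100),
    Box("top left", 0, 90, 10, 100),
]
ordered = [box.text for box in order_boxes(boxes)]
```

Since an ordering like this only needs each box's coordinates, nothing in it depends on the grouping that the advanced layout analysis produces.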

I've filed an issue (and fix) to update the documentation and also allow boxes_flow to be passed as None to explicitly disable this: pdfminer/pdfminer.six#395

Once that's merged, we should either default or hard-code our boxes_flow layout analysis (la) param to None. It feels like we should allow it to be overridden, but since we ignore the resulting analysis anyway, perhaps there's no point and we should simply hard-code it to None.
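For the "default but overridable" option, a minimal sketch could look like the following. `DEFAULT_LA_PARAMS` and `build_la_params` are hypothetical names, not existing parts of this project or of pdfminer, and passing `boxes_flow=None` through assumes a pdfminer.six release that accepts it.

```python
from typing import Any, Dict, Optional

# Assumed default: disable advanced layout analysis unless the caller says otherwise.
DEFAULT_LA_PARAMS: Dict[str, Any] = {"boxes_flow": None}


def build_la_params(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge caller-supplied layout analysis params over our defaults."""
    params = dict(DEFAULT_LA_PARAMS)  # copy so the defaults are never mutated
    if overrides:
        params.update(overrides)
    return params


# The result would then be handed to pdfminer, e.g.
#   from pdfminer.layout import LAParams
#   la_params = LAParams(**build_la_params())
# (left as a comment so the sketch runs without pdfminer installed).
default_params = build_la_params()
overridden_params = build_la_params({"boxes_flow": 0.5})
```

Hard-coding would instead mean dropping the `overrides` argument and always passing `boxes_flow=None`, which is simpler but removes the escape hatch.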

@jstockwin
Owner Author

There's a new pdfminer release (20200517) which fixes the bug that was blocking this (pdfminer/pdfminer.six#411).

We should bump the pdfminer version and then go ahead and implement the above.
