Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix slowdown in extract_words on long words (#483) #497

Merged
merged 1 commit into from
Oct 15, 2021
Merged

Conversation

jsvine
Copy link
Owner

@jsvine jsvine commented Sep 2, 2021

This commit/PR fixes the slowdown in .extract_words(...) and WordExtractor.iter_chars_to_words(...) which occurred for very long "words", and which was caused by repeatedly re-calculating bounding box. See #483 for discussion and example.

It also adds utils.merge_bboxes(bboxes), which returns the smallest bounding box that contains all bounding boxes in the bboxes argument, and which we use in the fix.

@jsvine jsvine changed the base branch from stable to develop September 2, 2021 02:16
@jsvine jsvine requested a review from samkit-jain September 2, 2021 02:16
This commit fixes the slowdown in `.extract_words(...)` and
`WordExtractor.iter_chars_to_words(...)` which occurred for very long
"words", and which was caused by repeatedly re-calculating bounding box.

See #483 for discussion and example.

It also adds `utils.merge_bboxes(bboxes)`, which returns the smallest
bounding box that contains all bounding boxes in the `bboxes` argument,
and which we use in the fix.
Repository owner deleted a comment from codecov bot Sep 2, 2021
@codecov
Copy link

codecov bot commented Sep 2, 2021

Codecov Report

Merging #497 (2f598b4) into develop (9019854) will increase coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head 2f598b4 differs from pull request most recent head f8d5e70. Consider uploading reports for the commit f8d5e70 to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #497      +/-   ##
===========================================
+ Coverage    98.28%   98.30%   +0.01%     
===========================================
  Files           10       10              
  Lines         1228     1236       +8     
===========================================
+ Hits          1207     1215       +8     
  Misses          21       21              
Impacted Files Coverage Δ
pdfplumber/utils.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9019854...f8d5e70. Read the comment docs.

@samkit-jain
Copy link
Collaborator

@jsvine Do we also have any time comparison stats to validate that it is now indeed faster? As per https://stackoverflow.com/a/19527487/7760998, we can also look at adding a test for it based on the PDF that was shared in the issue/discussion.

@jsvine jsvine merged commit 7a65785 into develop Oct 15, 2021
@jsvine
Copy link
Owner Author

jsvine commented Oct 15, 2021

@samkit-jain Apologies for missing your comment here. I conducted some informal tests noted in #483 (reply in thread) but nothing very rigorous. Certainly open to adding a test!

@samkit-jain samkit-jain deleted the fix/483 branch October 17, 2021 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants