v0.9.0 #862
Came across this bit of code, which helps to solve some of the mystery in issues #461 and #842: https://git.ghostscript.com/?p=mupdf.git;a=blob;f=source/pdf/pdf-font.c;h=6322cedf2c26cfb312c0c0878d7aff97b4c7470e;hb=HEAD#l774

Now, for every char's fontname, we:
- Check whether it's a `str` or `bytes`
- If the latter, we check whether it's one of the well-known codes from the link above
  - If so, we use that (preserving the part, if present, before the `+`)
  - If not, we just cast to str
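A minimal sketch of that fontname logic, for illustration only. The helper name `normalize_fontname` and the table name `KNOWN_BYTE_FONTNAMES` are hypothetical, the two table entries are recalled from the linked MuPDF source and should be checked against it, and stripping the `b'...'` wrapper after casting is an assumption about what "cast to str" should produce:

```python
# Illustrative subset only: byte strings that are GBK/CP936 encodings of
# common Chinese font names, mapped to ASCII names. Check the linked MuPDF
# source for the authoritative table.
KNOWN_BYTE_FONTNAMES = {
    b"\xcb\xce\xcc\xe5": "SimSun,Regular",  # "宋体" encoded as GBK
    b"\xba\xda\xcc\xe5": "SimHei,Regular",  # "黑体" encoded as GBK
}


def normalize_fontname(fontname):
    """Return a str fontname whether given str or bytes (hypothetical helper)."""
    if isinstance(fontname, str):
        return fontname

    # Keep any prefix such as b"ABCDEF+" intact and only look up the rest.
    if b"+" in fontname:
        split_at = fontname.index(b"+") + 1
        prefix, suffix = fontname[:split_at], fontname[split_at:]
    else:
        prefix, suffix = b"", fontname

    # str(b"abc") is "b'abc'"; slicing off the wrapper is an assumption here.
    fallback = str(suffix)[2:-1]
    return str(prefix)[2:-1] + KNOWN_BYTE_FONTNAMES.get(suffix, fallback)
```

Plain `str` fontnames pass through untouched, so the helper can be applied to every char regardless of how its fontname was stored.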
With regular expressions, patterns made up entirely of optional groups (e.g., r"(dfsdgfwerw)?") can return match objects that are, effectively, empty. These were being treated as real search results, and throwing errors in the process. Now they're being treated as non-results.

Separately but relatedly, whitespace-only searches were throwing errors. This was due to (a) how PDFs generally represent whitespace (implicitly, rather than as explicit space characters), and (b) how pdfplumber internally represents those spaces while performing layout analysis. As a result, whitespace-only search results had no explicit bounding box, which is what raised the errors. Now, similarly to empty search results, all-whitespace search results are treated as non-results.
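The empty-match behavior is easy to reproduce with the standard `re` module alone. The sketch below is not pdfplumber's code; `non_empty_matches` is a hypothetical helper showing one way to discard empty and all-whitespace matches:

```python
import re


def non_empty_matches(pattern, text):
    """Return only matches whose text is neither empty nor all whitespace."""
    return [
        m
        for m in re.finditer(pattern, text)
        if m.group(0).strip()  # "" and "   " both strip down to ""
    ]


# An all-optional pattern still "matches", but every match is empty.
raw = list(re.finditer(r"(dfsdgfwerw)?", "hello"))
kept = non_empty_matches(r"(dfsdgfwerw)?", "hello")
```

Here `raw` holds several zero-length match objects, while `kept` is empty, which is the "non-result" behavior described above.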
Calculating the bounding boxes of the words is, upon reflection and testing, not necessary. Instead, all we need is the latest character in the current word.
This commit aims to fix two things:
- The semi-crypticness of the previous version of char_begins_new_word
- The inconsistency (vs. the rest of the approach) in how the method was comparing "top" to "bottom" for interline comparisons, instead of "top" to "top", as rightly and helpfully pointed out by @bellma-lilly in #840

Based on the unit tests, this shouldn't change the output of `pdfplumber` in the vast majority of use cases. It might affect some output in edge-cases. I apologize for any inconvenience there, and hope it is outweighed by the long-run benefits of this more consistent approach.
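To make the "top" to "top" comparison concrete, here is a deliberately simplified sketch for left-to-right text. It is not the actual `char_begins_new_word` implementation; the function name, the dict-based chars, and the default tolerances of 3 are assumptions for illustration:

```python
def begins_new_word(prev_char, curr_char, x_tolerance=3, y_tolerance=3):
    """Decide whether curr_char starts a new word, given only the previous char."""
    # Different line: compare "top" to "top", not "top" to "bottom".
    if abs(curr_char["top"] - prev_char["top"]) > y_tolerance:
        return True
    # Horizontal gap after the previous char is larger than the tolerance.
    if curr_char["x0"] > prev_char["x1"] + x_tolerance:
        return True
    # Current char starts to the left of the previous one (text jumped back).
    if curr_char["x0"] < prev_char["x0"]:
        return True
    return False


a = {"x0": 0, "x1": 5, "top": 10}
b = {"x0": 5.5, "x1": 10, "top": 10}   # adjacent to a on the same line
c = {"x0": 20, "x1": 25, "top": 10}    # same line, but after a wide gap
d = {"x0": 0, "x1": 5, "top": 30}      # next line
```

Only the most recent character of the current word is consulted, so no word-level bounding box is needed for the decision.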
Inspired by #852, it turns out that .search(...) gets us most of the way, already. Added a few params (`main_group`, `return_groups`, `return_chars`) to .search(...) to enable this, which also make that method more generally flexible.
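As a rough illustration of what parameters like `main_group` and `return_groups` can mean for a regex search, here is a sketch over a plain string using only `re`. It is not pdfplumber's implementation: `search_text` is a hypothetical function, and it returns character spans rather than bounding boxes or chars:

```python
import re


def search_text(pattern, text, main_group=0, return_groups=True):
    """Search text, reporting the chosen regex group as the result."""
    results = []
    for m in re.finditer(pattern, text):
        matched = m.group(main_group)
        # Skip groups that did not participate, and empty/whitespace matches.
        if not matched or not matched.strip():
            continue
        result = {"text": matched, "span": m.span(main_group)}
        if return_groups:
            result["groups"] = m.groups()
        results.append(result)
    return results


hits = search_text(r"Total: (\d+)", "Total: 42", main_group=1)
```

With `main_group=1`, the reported text and span cover only the captured digits instead of the whole `Total: 42` match.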
Most of the groundwork was already there to add a PDF/Page.curve_edges property. And, inspired by #858 and related issues, we now include 0/90/180/270-degree oriented curve segments in the default table-detection strategy. As before, you can still switch to the "lines_strict" strategy to use only lines defined as such (rather than also using rect and curve edges).
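For a sense of what keeping only 0/90/180/270-degree segments involves, here is a small sketch that walks a curve's points and keeps the axis-aligned segments. The function names and the `(x, y)` tuple representation are assumptions for illustration, not the library's internals:

```python
def axis_aligned_orientation(p0, p1, tolerance=0.0):
    """Return "h", "v", or None for the segment from p0 to p1."""
    (x0, y0), (x1, y1) = p0, p1
    if (x0, y0) == (x1, y1):
        return None  # zero-length segment has no orientation
    if abs(y0 - y1) <= tolerance:
        return "h"  # 0 or 180 degrees
    if abs(x0 - x1) <= tolerance:
        return "v"  # 90 or 270 degrees
    return None


def axis_aligned_segments(points):
    """Keep only horizontal/vertical segments between consecutive points."""
    segments = []
    for p0, p1 in zip(points, points[1:]):
        orientation = axis_aligned_orientation(p0, p1)
        if orientation is not None:
            segments.append((p0, p1, orientation))
    return segments


curve_points = [(0, 0), (10, 0), (10, 5), (12, 8)]
edges = axis_aligned_segments(curve_points)
```

The diagonal final segment is dropped, while the horizontal and vertical ones survive as candidate table edges.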
Addresses issue #598
(`san-jose-pd-firearm-report.ipynb` needed a one-param fix, after not having been updated in a few versions.)
Codecov Report
@@ Coverage Diff @@
## stable #862 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 17 17
Lines 1482 1532 +50
=========================================
+ Hits 1482 1532 +50
From the changelog:
Changed
- Make word segmentation (via `WordExtractor.char_begins_new_word(...)`) more explicit and rigorous; should help in catching edge-cases in the future. (6acd580 + ebb93ea + #840)
- Use `curve_edge` objects (instead of just `line` and `rect_edge` objects) in default table-detection strategy. (6f6b465 + #858)
- Expand ligatures into their constituent letters (e.g., `ﬃ` to `ffi`), and add the `expand_ligatures` boolean parameter to text-extraction methods. (86e935d + #598)

Added
- Add `Page.extract_text_lines(...)` method. (4b37397 + #852)
- Add `main_group`, `return_groups`, `return_chars` parameters to `Page.search(...)`. (4b37397)
- Add `.curve_edges` property to `PDF` and `Page`. (6f6b465)

Fixed
- Fix handling of empty and whitespace-only results in `Page.search(...)`. (6f6b465 + #853)