v0.9.0 #862

jsvine · 2023-04-13T12:46:48Z

From the changelog:

Changed

Make word segmentation (via WordExtractor.char_begins_new_word(...)) more explicit and rigorous; should help in catching edge-cases in the future. (6acd580 + ebb93ea + #840)
Use curve_edge objects (instead of just line and rect_edge objects) in default table-detection strategy. (6f6b465 + #858)
By default, expand ligatures into their constituent letters (e.g., ﬃ to ffi), and add the expand_ligatures boolean parameter to text-extraction methods. (86e935d + #598)

Added

Add Page.extract_text_lines(...) method. (4b37397 + #852)
Add main_group, return_groups, return_chars parameters to Page.search(...). (4b37397)
Add .curve_edges property to PDF and Page. (6f6b465)

Fixed

Fix handling of bytes-typed fontnames. (9441ff7 + #461 + #842)
Fix handling of whitespace-only and empty results of Page.search(...). (6f6b465 + #853)

Came across this bit of code, which helps to solve some of the mystery in issues #461 and #842: https://git.ghostscript.com/?p=mupdf.git;a=blob;f=source/pdf/pdf-font.c;h=6322cedf2c26cfb312c0c0878d7aff97b4c7470e;hb=HEAD#l774

Now, for every char's fontname, we:

- Check whether it's a `str` or `bytes`
- If the latter, we check whether it's one of the well-known codes from the link above
- If so, we use that (preserving the part, if present, before the `+`)
- If not, we just cast to `str`
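The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `KNOWN_FONTNAMES` and `normalize_fontname` are hypothetical names, the single table entry is a placeholder (the real byte codes live in the linked MuPDF source and are not reproduced here), and decoding as Latin-1 is an assumed stand-in for "cast to str".

```python
from typing import Dict, Union

# Placeholder lookup table. The real byte codes and font names come from the
# linked MuPDF source; this single entry is hypothetical, for illustration only.
KNOWN_FONTNAMES: Dict[bytes, str] = {
    b"\x00\x01": "ExampleFont",
}


def normalize_fontname(fontname: Union[str, bytes]) -> str:
    """Return a str fontname, translating known byte codes where possible."""
    if isinstance(fontname, str):
        return fontname

    # Keep the part before the first "+" (including the "+"), if present.
    head, plus, tail = fontname.partition(b"+")
    if plus:
        prefix, suffix = head + plus, tail
    else:
        prefix, suffix = b"", fontname

    known = KNOWN_FONTNAMES.get(suffix)
    # Assumption: Latin-1 decoding stands in for "cast to str" because it
    # never raises on arbitrary bytes.
    suffix_text = known if known is not None else suffix.decode("latin-1")
    return prefix.decode("latin-1") + suffix_text
```

With the placeholder table, `normalize_fontname(b"ABCDEF+\x00\x01")` keeps the `ABCDEF+` prefix and swaps in the mapped name, while an unknown bytes value is simply decoded.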

With regular expressions, patterns made up entirely of optional groups (e.g., r"(dfsdgfwerw)?") can return match objects that are, effectively, empty. These were being treated as real search results, and throwing errors in the process. Now they're being treated as non-results.

Separately but relatedly, whitespace-only searches were throwing errors. This was due to (a) how PDFs generally represent whitespace (implicitly, rather than as explicit space characters), and (b) how pdfplumber internally represents those spaces while performing layout analysis. As a result, such search results had no explicit bounding box, which threw errors. Now, as with empty search results, we handle all-whitespace search results by treating them as non-results.
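The empty-match behavior is easy to reproduce with the standard `re` module alone. The sketch below is not pdfplumber's implementation; `meaningful_matches` is a hypothetical helper that shows the filtering idea on plain strings: matches that are empty or whitespace-only are dropped.

```python
import re
from typing import List


def meaningful_matches(pattern: str, text: str) -> List[str]:
    """Return matched strings, skipping empty and whitespace-only matches."""
    return [
        m.group(0)
        for m in re.finditer(pattern, text)
        # "".strip() and "   ".strip() are both falsy, so both kinds are skipped.
        if m.group(0).strip()
    ]


# A pattern made up entirely of an optional group still "matches", emptily.
empty = re.search(r"(dfsdgfwerw)?", "hello world")
print(repr(empty.group(0)))

print(meaningful_matches(r"(dfsdgfwerw)?", "hello world"))
print(meaningful_matches(r"\s+", "hello world"))
```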

Calculating the bounding boxes of the words is, upon reflection and testing, not necessary. Instead, all we need is the latest character in the current word.


This commit aims to fix two things:

- The semi-crypticness of the previous version of `char_begins_new_word`
- The inconsistency (vs. the rest of the approach) in how the method was comparing "top" to "bottom" for interline comparisons, instead of "top" to "top", as rightly and helpfully pointed out by @bellma-lilly in #840

Based on the unit tests, this shouldn't change the output of `pdfplumber` in the vast majority of use cases. It might affect some output in edge cases; I apologize for any inconvenience, and hope it is balanced out by the long-run benefits of this more consistent approach.
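To make the "top"-to-"top" comparison concrete, here is a simplified sketch of a word-boundary check for upright, left-to-right text. It is an assumption-laden illustration rather than the method from the commit: the function name `begins_new_word`, the default tolerances, and the exact conditions are all chosen here for clarity. It uses only the previous character, in line with the note that word bounding boxes are not needed.

```python
from typing import Dict

Char = Dict[str, float]


def begins_new_word(
    prev_char: Char,
    curr_char: Char,
    x_tolerance: float = 3.0,
    y_tolerance: float = 3.0,
) -> bool:
    """Decide whether curr_char starts a new word (upright, left-to-right text)."""
    # Horizontal gap between the previous char's right edge and this char's left edge.
    gap_too_wide = curr_char["x0"] > prev_char["x1"] + x_tolerance
    # This char starts to the left of the previous one (e.g., a new line).
    moved_backward = curr_char["x0"] < prev_char["x0"]
    # Compare "top" to "top", so the vertical check is consistent for both chars.
    different_line = abs(curr_char["top"] - prev_char["top"]) > y_tolerance
    return gap_too_wide or moved_backward or different_line


prev = {"x0": 10.0, "x1": 15.0, "top": 100.0}
print(begins_new_word(prev, {"x0": 15.5, "x1": 20.0, "top": 100.5}))  # adjacent char
print(begins_new_word(prev, {"x0": 25.0, "x1": 30.0, "top": 100.0}))  # wide gap
```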

Inspired by #852, it turns out that .search(...) gets us most of the way, already. Added a few params (`main_group`, `return_groups`, `return_chars`) to .search(...) to enable this, which also make that method more generally flexible.
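As a rough illustration of why a regex search with a selectable "main" group gets most of the way to line extraction, the sketch below works on a plain string with the standard `re` module. The pattern and the helper `find_lines` are assumptions made for this example, not pdfplumber's internals. The point is that the whole match spans the padded line, while the main group picks out the stripped text and its position.

```python
import re
from typing import Dict, List, Union

# Assumed pattern: optional padding, the line's text (group 1), optional padding,
# then a newline or the end of the string.
LINE_PATTERN = re.compile(r" *([^\n]+?) *(?:\n|$)")


def find_lines(text: str, main_group: int = 1) -> List[Dict[str, Union[str, int]]]:
    """Return each line's text and character offsets, taken from `main_group`."""
    results = []
    for m in LINE_PATTERN.finditer(text):
        start, end = m.span(main_group)
        results.append({"text": m.group(main_group), "start": start, "end": end})
    return results


for line in find_lines("  foo bar  \n baz"):
    print(line)
```

Passing `main_group=0` instead would report the padded match, which is the kind of flexibility the new parameters are described as adding.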

Most of the groundwork was already there to add a PDF/Page.curve_edges property. And, inspired by #858 and related issues, we now include 0/90/180/270-degree oriented curve segments in the default table-detection strategy. As before, you can still switch to the "lines_strict" strategy to use only lines defined as such (rather than also using rect and curve edges).
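The idea of keeping only 0/90/180/270-degree curve segments can be sketched as below. This is a simplified illustration under stated assumptions, not the library's code: `axis_aligned_segments` is a hypothetical helper, a curve is assumed to be a list of `(x, y)` points, and the output dictionaries use made-up keys rather than pdfplumber's edge schema.

```python
from typing import Dict, List, Tuple, Union

Point = Tuple[float, float]
Edge = Dict[str, Union[float, str]]


def axis_aligned_segments(points: List[Point]) -> List[Edge]:
    """Split a curve into consecutive segments and keep horizontal/vertical ones."""
    edges: List[Edge] = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if (x0, y0) == (x1, y1):
            continue  # zero-length segment, nothing to keep
        if y0 == y1:
            orientation = "h"  # 0 or 180 degrees
        elif x0 == x1:
            orientation = "v"  # 90 or 270 degrees
        else:
            continue  # diagonal segment: not useful as a table edge
        edges.append(
            {
                "x0": min(x0, x1),
                "y0": min(y0, y1),
                "x1": max(x0, x1),
                "y1": max(y0, y1),
                "orientation": orientation,
            }
        )
    return edges


curve = [(0.0, 0.0), (10.0, 0.0), (10.0, 5.0), (12.0, 8.0)]
for edge in axis_aligned_segments(curve):
    print(edge)
```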

Addresses issue #598
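For reference, the ligature expansion described in the changelog (e.g., ﬃ to ffi) amounts to a character-level substitution. The sketch below is a stand-alone illustration: the `LIGATURES` table and the `expand_ligatures` function here are written for this example and may not match the library's own table, although the boolean flag mirrors the parameter named in the changelog.

```python
# Common Latin ligatures from the Unicode "Alphabetic Presentation Forms" block.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}


def expand_ligatures(text: str, expand: bool = True) -> str:
    """Replace each ligature character with its constituent letters."""
    if not expand:
        return text
    return "".join(LIGATURES.get(ch, ch) for ch in text)


print(expand_ligatures("o\ufb03ce"))
print(expand_ligatures("o\ufb03ce", expand=False) == "o\ufb03ce")
```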

(`san-jose-pd-firearm-report.ipynb` needed a one-param fix, after not having been updated in a few versions.)

codecov · 2023-04-13T12:48:44Z

Codecov Report

Merging #862 (3e0f9d7) into stable (1d5d646) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            stable      #862   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           17        17           
  Lines         1482      1532   +50     
=========================================
+ Hits          1482      1532   +50

| Impacted Files | Coverage Δ |
|---|---|
| pdfplumber/_version.py | `100.00% <100.00%> (ø)` |
| pdfplumber/container.py | `100.00% <100.00%> (ø)` |
| pdfplumber/page.py | `100.00% <100.00%> (ø)` |
| pdfplumber/utils/geometry.py | `100.00% <100.00%> (ø)` |
| pdfplumber/utils/text.py | `100.00% <100.00%> (ø)` |


jsvine added 11 commits April 13, 2023 08:13

Simplify char_begins_new_word

ebb93ea

Calculating the bounding boxes of the words is, upon reflection and testing, not necessary. Instead, all we need is the latest character in the current word.

Label ordered_chars more explicitly as such

79ec839

Add .extract_text_lines & related .search params

4b37397

Inspired by #852, it turns out that .search(...) gets us most of the way, already. Added a few params (`main_group`, `return_groups`, `return_chars`) to .search(...) to enable this, which also make that method more generally flexible.

Remove line of code that does nothing

f042d5c

By default, expand ligatures into their letters

86e935d

Addresses issue #598

Bump version to v0.9.0

1a8603b

Rerun examples/notebooks for v0.9.0 (incl one fix)

3e0f9d7

(`san-jose-pd-firearm-report.ipynb` needed a one-param fix, after not having been updated in a few versions.)

jsvine merged commit 255eaac into stable Apr 13, 2023
