v0.9.0 #862

Merged
merged 11 commits into from
Apr 13, 2023

@jsvine jsvine commented Apr 13, 2023

From the changelog:

Changed

  • Make word segmentation (via WordExtractor.char_begins_new_word(...)) more explicit and rigorous; should help in catching edge-cases in the future. (6acd580 + ebb93ea + #840)
  • Use curve_edge objects (instead of just line and rect_edge objects) in default table-detection strategy. (6f6b465 + #858)
  • By default, expand ligatures into their constituent letters (e.g., ﬃ to ffi), and add the expand_ligatures boolean parameter to text-extraction methods. (86e935d + #598)

Added

  • Add Page.extract_text_lines(...) method. (4b37397 + #852)
  • Add main_group, return_groups, return_chars parameters to Page.search(...). (4b37397)
  • Add .curve_edges property to PDF and Page. (6f6b465)

Fixed

  • Fix handling of bytes-typed fontnames. (9441ff7 + #461 + #842)
  • Fix handling of whitespace-only and empty results of Page.search(...). (6f6b465 + #853)
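
As a rough illustration of the ligature expansion listed under "Changed", here is a minimal sketch. The `LIGATURES` table and the `expand_ligatures_in` helper are hypothetical names made up for this example, not pdfplumber's internals; only the `expand_ligatures` flag name comes from the changelog.

```python
# Illustrative mapping of Unicode ligature code points to plain letters.
# This table is an assumption for the sketch, not pdfplumber's own data.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}


def expand_ligatures_in(text: str, expand_ligatures: bool = True) -> str:
    """Replace each ligature character with its constituent letters."""
    if not expand_ligatures:
        # Leave the text untouched when expansion is switched off.
        return text
    return "".join(LIGATURES.get(ch, ch) for ch in text)
```

With the flag left at its default, a string such as `"o\ufb03ce"` comes back as `"office"`; passing `expand_ligatures=False` returns the input unchanged.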

jsvine added 11 commits April 13, 2023 08:13
Came across this bit of code, which helps to solve some of the mystery
in issues #461 and #842:

https://git.ghostscript.com/?p=mupdf.git;a=blob;f=source/pdf/pdf-font.c;h=6322cedf2c26cfb312c0c0878d7aff97b4c7470e;hb=HEAD#l774

Now, for every char's fontname, we:

- Check whether it's a `str` or `bytes`
    - If the latter, we check whether it's one of the well-known codes from
      the link above
        - If so, we use that (preserving the part, if present, before
          the `+`)
        - If not, we just cast to str
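
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in this commit: `KNOWN_FONTNAME_BYTES` and `normalize_fontname` are hypothetical names, the single table entry is a placeholder for the codes listed at the MuPDF link, and "cast to str" is taken literally as `str(...)`.

```python
# Hypothetical lookup table: raw byte sequences -> readable font names.
# The single entry is a placeholder for illustration only; the real
# list of well-known codes is in the MuPDF source linked above.
KNOWN_FONTNAME_BYTES = {
    b"\xcb\xce\xcc\xe5": "SimSun",
}


def normalize_fontname(fontname):
    """Return a str fontname for either a str or a bytes input."""
    if isinstance(fontname, str):
        return fontname

    # Separate the part before the first "+" (if any) from the rest.
    if b"+" in fontname:
        prefix, _, suffix = fontname.partition(b"+")
        prefix_str = prefix.decode("ascii", errors="replace") + "+"
    else:
        prefix_str, suffix = "", fontname

    # Well-known code: use the readable name, keeping the prefix.
    if suffix in KNOWN_FONTNAME_BYTES:
        return prefix_str + KNOWN_FONTNAME_BYTES[suffix]

    # Otherwise, just cast to str (assumption: a literal str() call).
    return str(fontname)
```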

With regular expressions, patterns made up entirely of optional groups
(e.g., r"(dfsdgfwerw)?") can return match objects that are, effectively,
empty. These were being treated as real search results, and throwing
errors in the process. Now they're being treated as non-results.

Separately but relatedly, whitespace-only searches were throwing errors.
This was due to (a) how PDFs generally represent whitespace (implicitly,
rather than explicit space characters), and (b) how pdfplumber
internally represents those spaces while performing layout analysis.
This caused search results to have no explicit bounding box, throwing
errors. Now, similarly to handling empty search results, we handle
all-whitespace search results by considering them to be non-results.
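
To make the two cases concrete, here is a minimal sketch using only the standard `re` module on a plain string. The `search_text` helper is hypothetical and stands in for Page.search(...), which works on extracted page text and character objects rather than a bare string.

```python
import re


def search_text(pattern, text):
    """Return regex matches, treating empty and all-whitespace
    matches as non-results."""
    results = []
    for m in re.finditer(pattern, text):
        matched = m.group(0)
        # An all-optional pattern can match the empty string at every
        # position; a whitespace-only match has no visible characters.
        # Both are skipped rather than reported.
        if not matched.strip():
            continue
        results.append({"text": matched, "start": m.start(), "end": m.end()})
    return results
```

Under this sketch, both `search_text(r"(dfsdgfwerw)?", "hello world")` and `search_text(r"\s+", "hello world")` return an empty list instead of raising.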

Calculating the bounding boxes of the words is, upon reflection and
testing, not necessary. Instead, all we need is the latest character in
the current word.

This commit aims to fix two things:

- The semi-crypticness of the previous version of char_begins_new_word
- The inconsistency (vs. the rest of the approach) in how the method
  was comparing "top" to "bottom" for interline comparisons, instead of
  "top" to "top", as rightly and helpfully pointed out by @bellma-lilly
  in #840

Based on the unit tests, this shouldn't change the output of
`pdfplumber` in the vast majority of use cases. It might affect some
output in edge cases; I apologize for any inconvenience, and hope it is
balanced out by this more consistent approach's benefits in the long
run.
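
For readers who want the gist of the "top"-to-"top" comparison, here is a simplified sketch. It assumes upright, left-to-right text, character dicts with `x0`, `x1`, `top` and `text` keys, and default tolerances of 3; it is not the actual WordExtractor code. The `group_chars` helper is hypothetical and also shows that only the latest character of the current word needs to be consulted.

```python
def char_begins_new_word(prev_char, curr_char, x_tolerance=3, y_tolerance=3):
    """Decide whether curr_char starts a new word (simplified sketch)."""
    ax, bx = prev_char["x0"], prev_char["x1"]
    cx = curr_char["x0"]
    ay, cy = prev_char["top"], curr_char["top"]
    return (
        cx < ax  # intraline: current char starts left of the previous one
        or cx > bx + x_tolerance  # intraline: horizontal gap is too wide
        or cy > ay + y_tolerance  # interline: compares "top" to "top"
    )


def group_chars(chars, **tolerances):
    """Group characters into word strings, tracking only the last char."""
    words, current = [], []
    for char in chars:
        # No word-level bounding box is needed: the latest character
        # of the current word is enough for the comparison.
        if current and char_begins_new_word(current[-1], char, **tolerances):
            words.append(current)
            current = []
        current.append(char)
    if current:
        words.append(current)
    return ["".join(c["text"] for c in word) for word in words]
```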

Inspired by #852, it turns out that .search(...) already gets us most
of the way to Page.extract_text_lines(...).

Added a few params (`main_group`, `return_groups`, `return_chars`) to
.search(...) to enable this, which also make that method more generally
flexible.
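
A minimal sketch of the idea, working on a plain string with the standard `re` module: line extraction becomes a regex search in which one capture group is designated as the main result. The `search` and `extract_text_lines` functions below are hypothetical stand-ins, and the regex patterns are assumptions chosen for this example; `return_chars` is left out because a bare string has no character objects to return.

```python
import re


def search(text, pattern, main_group=0, return_groups=True):
    """Regex search where `main_group` selects the reported text."""
    results = []
    for m in re.finditer(pattern, text):
        main = m.group(main_group)
        # Empty or all-whitespace main groups count as non-results.
        if main is None or not main.strip():
            continue
        result = {"text": main}
        if return_groups:
            result["groups"] = m.groups()
        results.append(result)
    return results


def extract_text_lines(text, strip=True):
    """Return the lines of `text`, built on top of search(...)."""
    # Assumed patterns: group 1 holds the line, optionally without
    # leading and trailing spaces.
    pattern = r" *([^\n]+?) *(\n|$)" if strip else r"([^\n]+)"
    hits = search(text, pattern, main_group=1, return_groups=False)
    return [hit["text"] for hit in hits]
```

Designating a main group lets the surrounding pattern consume padding and line endings without including them in the result, which is what makes the stripped variant a one-line pattern change.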

Most of the groundwork was already there to add a PDF/Page.curve_edges
property. And, inspired by #858 and related issues, we now include
0/90/180/270-degree oriented curve segments in the default
table-detection strategy. As before, you can still switch to the
"lines_strict" strategy to use only lines defined as such (rather than
also using rect and curve edges).
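
To illustrate what "0/90/180/270-degree oriented curve segments" means in practice, here is a small sketch that walks a curve's points and keeps only the perfectly horizontal or vertical segments. The `curve_to_edges` helper and the shape of the returned dicts are assumptions for this example, not the library's actual representation.

```python
def curve_to_edges(points):
    """Turn consecutive point pairs into axis-aligned edges.

    `points` is a list of (x, y) tuples along a curve. Segments that
    are neither horizontal nor vertical are dropped.
    """
    edges = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if (x0, y0) == (x1, y1):
            continue  # zero-length segment, nothing to keep
        if y0 == y1:
            orientation = "h"  # 0- or 180-degree segment
        elif x0 == x1:
            orientation = "v"  # 90- or 270-degree segment
        else:
            continue  # diagonal segment, not usable as a table edge
        edges.append(
            {
                "orientation": orientation,
                "x0": min(x0, x1),
                "x1": max(x0, x1),
                "y0": min(y0, y1),
                "y1": max(y0, y1),
            }
        )
    return edges
```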

(`san-jose-pd-firearm-report.ipynb` needed a one-param fix, after not
having been updated in a few versions.)

codecov bot commented Apr 13, 2023

Codecov Report

Merging #862 (3e0f9d7) into stable (1d5d646) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            stable      #862   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           17        17           
  Lines         1482      1532   +50     
=========================================
+ Hits          1482      1532   +50     
| Impacted Files | Coverage Δ |
|---|---|
| pdfplumber/_version.py | 100.00% <100.00%> (ø) |
| pdfplumber/container.py | 100.00% <100.00%> (ø) |
| pdfplumber/page.py | 100.00% <100.00%> (ø) |
| pdfplumber/utils/geometry.py | 100.00% <100.00%> (ø) |
| pdfplumber/utils/text.py | 100.00% <100.00%> (ø) |


@jsvine jsvine merged commit 255eaac into stable Apr 13, 2023