Add (experimental) page.search(...) feature
First proposed here: #201

Adding this feature involved refactoring and re-engineering a good chunk
of the text-layout-extraction code. As part of that, this commit
introduces two new classes, in utils.py: LayoutEngine and TextLayout.
They should be considered provisional, and may change name/approach in
the future.
jsvine committed May 13, 2022
1 parent 056b831 commit 58b1ab1
Showing 6 changed files with 285 additions and 93 deletions.
3 changes: 1 addition & 2 deletions CHANGELOG.md
@@ -8,6 +8,7 @@ All notable changes to this project will be documented in this file. The format

 - Add `"matrix"` property to `char` objects, representing the current transformation matrix.
 - Add `pdfplumber.ctm` submodule with class `CTM`, to calculate scale, skew, and translation of the current transformation matrix.
+- Add `page.search(...)`, an *experimental feature* that allows you to search a page's text via regular expressions and non-regex strings, returning the text, any regex matches, the bounding box coordinates, and the char objects themselves. ([#201](https://github.com/jsvine/pdfplumber/issues/201))

 ## [0.6.2] - 2022-05-06

@@ -28,8 +29,6 @@ All notable changes to this project will be documented in this file. The format
 - Remove `utils.filter_objects(...)` and move the functionality to within the `FilteredPage.objects` property calculation, the only part of the library that used it. ([9587cc7](https://github.com/jsvine/pdfplumber/commit/9587cc7d2292a1eae7a0150ab406f9365944266f))
 - Remove code that sets `pdfminer.pdftypes.STRICT = True` and `pdfminer.pdfinterp.STRICT = True`, since that [has now been the default for a while](https://github.com/pdfminer/pdfminer.six/commit/9439a3a31a347836aad1c1226168156125d9505f). ([9587cc7](https://github.com/jsvine/pdfplumber/commit/9587cc7d2292a1eae7a0150ab406f9365944266f))

-### Fixed
-
 ## [0.6.1] - 2022-04-23

 ### Changed
1 change: 1 addition & 0 deletions README.md
@@ -108,6 +108,7 @@ The `pdfplumber.Page` class is at the core of `pdfplumber`. Most things you'll d
 |`.dedupe_chars(tolerance=1)`| Returns a version of the page with duplicate chars — those sharing the same text, fontname, size, and positioning (within `tolerance` x/y) as other characters — removed. (See [Issue #71](https://github.com/jsvine/pdfplumber/issues/71) to understand the motivation.)|
 |`.extract_text(x_tolerance=3, y_tolerance=3, layout=False, x_density=7.25, y_density=13, **kwargs)`| Collates all of the page's character objects into a single string.<ul><li><p>When `layout=False`: Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.</p></li><li><p>When `layout=True` (*experimental feature*): Attempts to mimic the structural layout of the text on the page(s), using `x_density` and `y_density` to determine the minimum number of characters/newlines per "point," the PDF unit of measurement. All remaining `**kwargs` are passed to `.extract_words(...)` (see below), the first step in calculating the layout.</p></li></ul>|
 |`.extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, horizontal_ltr=True, vertical_ttb=True, extra_attrs=[])`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. The parameters `horizontal_ltr` and `vertical_ttb` indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words). Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) Passing a list of `extra_attrs` (e.g., `["fontname", "size"]` will restrict each words to characters that share exactly the same value for each of those [attributes](https://github.com/jsvine/pdfplumber/blob/develop/README.md#char-properties), and the resulting word dicts will indicate those attributes.|
+|`.search(pattern, regex=True, case=True, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner.|
 |`.extract_tables(table_settings)`| Extracts tabular data from the page. For more details see "[Extracting tables](#extracting-tables)" below.|
 |`.to_image(**conversion_kwargs)`| Returns an instance of the `PageImage` class. For more details, see "[Visual debugging](#visual-debugging)" below. For conversion_kwargs, see [here](http://docs.wand-py.org/en/latest/wand/image.html#wand.image.Image).|
 |`.close()`| By default, `Page` objects cache their layout and object information to avoid having to reprocess it. When parsing large PDFs, however, these cached properties can require a lot of memory. You can use this method to flush the cache and release the memory. (In version `<= 0.5.25`, use `.flush_cache()`.)|
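
The `regex` and `case` flags described in the new `.search(...)` README row can be pictured with a small standard-library sketch. It assumes the query is normalized into one compiled regular expression and then run over the page's text with `finditer`; whether the library does exactly this is not shown in the visible part of this diff. `compile_query` and `search_text` are hypothetical names used only for illustration, the result keys are placeholders rather than pdfplumber's documented keys, and bounding boxes and char objects are left out entirely.

```python
import re
from typing import Any, Dict, List, Union


def compile_query(
    pattern: Union[str, re.Pattern], regex: bool = True, case: bool = True
) -> re.Pattern:
    """Hypothetical helper: fold the three arguments into one compiled regex."""
    if isinstance(pattern, re.Pattern):
        # Already compiled: in this sketch its own flags apply unchanged.
        return pattern
    if not regex:
        # Non-regex string: escape metacharacters so they match literally.
        pattern = re.escape(pattern)
    flags = 0 if case else re.IGNORECASE
    return re.compile(pattern, flags)


def search_text(
    text: str,
    pattern: Union[str, re.Pattern],
    regex: bool = True,
    case: bool = True,
) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for searching a page's text (no geometry)."""
    compiled = compile_query(pattern, regex=regex, case=case)
    return [
        {
            "text": m.group(0),    # the matching text
            "groups": m.groups(),  # any regex group matches
            "start": m.start(),    # placeholder for position information
            "end": m.end(),
        }
        for m in compiled.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Total: 12 USD. total: 7 EUR."
    for hit in search_text(sample, r"total: (\d+)", case=False):
        print(hit)
```

With `regex=False`, a query such as `"a.b"` matches only the literal three characters, because the dot is escaped before compilation.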
39 changes: 35 additions & 4 deletions pdfplumber/page.py
@@ -1,5 +1,17 @@
 import re
-from typing import TYPE_CHECKING, Any, Callable, Dict, Generator, List, Optional, Tuple
+from functools import lru_cache
+from typing import (
+    TYPE_CHECKING,
+    Any,
+    Callable,
+    Dict,
+    Generator,
+    List,
+    Optional,
+    Pattern,
+    Tuple,
+    Union,
+)

 from pdfminer.converter import PDFPageAggregator
 from pdfminer.layout import (
@@ -287,10 +299,29 @@ def sorter(x: Table) -> Tuple[int, T_num, T_num]:

         return largest.extract(**extract_kwargs)

+    @lru_cache
+    def get_text_layout(self, **kwargs: Any) -> utils.TextLayout:
+        defaults = dict(x_shift=self.bbox[0], y_shift=self.bbox[1])
+        full_kwargs: Dict[str, Any] = {**defaults, **kwargs}
+        return utils.chars_to_layout(self.chars, **full_kwargs)
+
+    def search(
+        self,
+        pattern: Union[str, Pattern[str]],
+        regex: bool = True,
+        case: bool = True,
+        **kwargs: Any,
+    ) -> List[Dict[str, Any]]:
+        text_layout = self.get_text_layout(**kwargs)
+        return text_layout.search(pattern, regex=regex, case=case)
+
     def extract_text(self, **kwargs: Any) -> str:
-        return utils.extract_text(
-            self.chars, x_shift=self.bbox[0], y_shift=self.bbox[1], **kwargs
-        )
+        if kwargs.get("layout") is True:
+            del kwargs["layout"]
+            text_layout = self.get_text_layout(**kwargs)
+            return text_layout.to_string()
+        else:
+            return utils.extract_text(self.chars, **kwargs)

     def extract_words(self, **kwargs: Any) -> T_obj_list:
         return utils.extract_words(self.chars, **kwargs)
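
The caching behavior that `@lru_cache` gives `get_text_layout` in the hunk above can be illustrated in isolation. This is a minimal sketch, assuming only what `functools.lru_cache` documents: the cache key is built from `self` plus the keyword arguments, so every argument must be hashable. `FakePage` is a hypothetical stand-in, not pdfplumber's `Page`, and the "layout" it returns is just a tuple of the merged keyword arguments.

```python
from functools import lru_cache
from typing import Any, Tuple


class FakePage:
    """Hypothetical stand-in for a page; not pdfplumber's Page class."""

    def __init__(self, bbox: Tuple[float, float, float, float]) -> None:
        self.bbox = bbox
        self.builds = 0  # counts how often the layout is actually computed

    @lru_cache(maxsize=None)
    def get_text_layout(self, **kwargs: Any) -> Tuple[Tuple[str, Any], ...]:
        self.builds += 1
        # Same merge as in the diff: bbox-derived defaults, then caller kwargs.
        defaults = dict(x_shift=self.bbox[0], y_shift=self.bbox[1])
        full_kwargs = {**defaults, **kwargs}
        return tuple(sorted(full_kwargs.items()))


if __name__ == "__main__":
    page = FakePage((10, 20, 110, 220))
    first = page.get_text_layout(x_density=7.25)
    second = page.get_text_layout(x_density=7.25)  # served from the cache
    print(first is second, page.builds)

    try:
        # A list is unhashable, so lru_cache cannot build a key for it.
        page.get_text_layout(extra_attrs=["size"])
    except TypeError as exc:
        print("TypeError:", exc)
```

Two consequences may be worth keeping in mind when reviewing the decorated method: list-valued keyword arguments raise `TypeError` before the method body runs, and the cache holds a reference to each `self` it has seen.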
6 changes: 3 additions & 3 deletions pdfplumber/table.py
@@ -105,7 +105,7 @@ def words_to_edges_h(
     Find (imaginary) horizontal lines that connect the tops
     of at least `word_threshold` words.
     """
-    by_top = utils.cluster_objects(words, "top", 1)
+    by_top = utils.cluster_objects(words, itemgetter("top"), 1)
     large_clusters = filter(lambda x: len(x) >= word_threshold, by_top)
     rects = list(map(utils.objects_to_rect, large_clusters))
     if len(rects) == 0:
@@ -149,8 +149,8 @@ def words_to_edges_v(
     center of at least `word_threshold` words.
     """
     # Find words that share the same left, right, or centerpoints
-    by_x0 = utils.cluster_objects(words, "x0", 1)
-    by_x1 = utils.cluster_objects(words, "x1", 1)
+    by_x0 = utils.cluster_objects(words, itemgetter("x0"), 1)
+    by_x1 = utils.cluster_objects(words, itemgetter("x1"), 1)

     def get_center(word: T_obj) -> T_num:
         return float(word["x0"] + word["x1"]) / 2
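
The switch from attribute names to `itemgetter(...)` in the two hunks above suggests that the clustering helper now takes a key function, so a stored attribute and a computed value such as `get_center` can be passed the same way. The sketch below illustrates that idea under stated assumptions: `cluster_by` is a hypothetical function, not `utils.cluster_objects`, and its grouping rule (sort by key, then chain neighbors whose keys differ by at most `tolerance`) is one plausible choice rather than the library's confirmed algorithm.

```python
from operator import itemgetter
from typing import Any, Callable, Dict, List

Word = Dict[str, Any]


def cluster_by(
    objs: List[Word], key_fn: Callable[[Word], float], tolerance: float
) -> List[List[Word]]:
    """Hypothetical clustering: group objects whose sorted keys are close."""
    clusters: List[List[Word]] = []
    for obj in sorted(objs, key=key_fn):
        # Compare against the last member of the current cluster.
        if clusters and key_fn(obj) - key_fn(clusters[-1][-1]) <= tolerance:
            clusters[-1].append(obj)
        else:
            clusters.append([obj])
    return clusters


def get_center(word: Word) -> float:
    return float(word["x0"] + word["x1"]) / 2


if __name__ == "__main__":
    words = [
        {"text": "a", "x0": 10, "x1": 20, "top": 5},
        {"text": "b", "x0": 10.5, "x1": 30, "top": 40},
        {"text": "c", "x0": 50, "x1": 60, "top": 5.4},
    ]
    # A stored attribute and a computed value use the same call shape.
    for key_fn in (itemgetter("x0"), itemgetter("top"), get_center):
        groups = cluster_by(words, key_fn, 1)
        print([[w["text"] for w in g] for g in groups])
```

Accepting a callable keeps the helper agnostic about where the value comes from, which is the property the `itemgetter` call sites rely on.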