Commit

Add join_x/y_tolerance to table extract. settings
Now all tolerance settings have x/y versions as well.

This commit also changes `table.merge_edges(...)` behavior when
`join_tolerance` (and `x`/`y` variants) `<= 0`, so that joining is
attempted regardless, to handle cases of overlapping lines.
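
To illustrate why joining is now attempted even when the tolerance is `<= 0`, here is a minimal sketch on 1-D segments. The `join_segments` function is hypothetical and is not pdfplumber's `join_edge_group`; it only assumes that a segment is joined to the previous one when it starts no later than the previous end plus the tolerance.

```python
def join_segments(segments, tolerance):
    """Merge 1-D (start, end) segments whose gap is <= tolerance.

    Hypothetical helper for illustration only; not part of pdfplumber.
    """
    joined = []
    for start, end in sorted(segments):
        if joined and start <= joined[-1][1] + tolerance:
            # Overlapping, touching, or within tolerance: extend the last segment.
            last_start, last_end = joined[-1]
            joined[-1] = (last_start, max(last_end, end))
        else:
            joined.append((start, end))
    return joined


segments = [(0, 10), (5, 20), (25, 30)]
overlap_only = join_segments(segments, tolerance=0)
with_gap = join_segments(segments, tolerance=5)
```

With a tolerance of `0`, the two overlapping segments still collapse into one, while the segment separated by a gap stays separate. Skipping the join step entirely, as the old `join_tolerance > 0` guard did, would leave the overlapping pair unmerged.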
jsvine committed Dec 2, 2021
1 parent 7ed4742 commit cbb34ce
Showing 4 changed files with 60 additions and 23 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -8,13 +8,15 @@ All notable changes to this project will be documented in this file. The format
- Add `utils.merge_bboxes(bboxes)`, which returns the smallest bounding box that contains all bounding boxes in the `bboxes` argument. ([f8d5e70](https://github.com/jsvine/pdfplumber/commit/f8d5e70a509aa9ed3ee565d7d3f97bb5ec67f5a5))
- Add `--precision` argument to CLI ([#520](https://github.com/jsvine/pdfplumber/pull/520))
- Add `snap_x_tolerance` and `snap_y_tolerance` to table extraction settings. ([#51](https://github.com/jsvine/pdfplumber/pull/51) + [#475](https://github.com/jsvine/pdfplumber/issues/475)) [h/t @dustindall]
- Add `join_x_tolerance` and `join_y_tolerance` to table extraction settings.

## Changed
- Upgrade `pdfminer.six` from `20200517` to `20211012`; see [that library's changelog](https://github.com/pdfminer/pdfminer.six/blob/develop/CHANGELOG.md) for details, but a key difference is an improvement in how it assigns `line`, `rect`, and `curve` objects. (Diagonal two-point lines, for instance, are now `line` objects instead of `curve` objects.) ([#515](https://github.com/jsvine/pdfplumber/pull/515))
- Remove Decimal-ization of parsed object attributes, which are now represented with as much precision as is returned by `pdfminer.six` ([#346](https://github.com/jsvine/pdfplumber/discussions/346) + [#520](https://github.com/jsvine/pdfplumber/pull/520))
- `.extract_text(...)` returns `""` instead of `None` when character list is empty. ([#482](https://github.com/jsvine/pdfplumber/issues/482) + [cb9900b](https://github.com/jsvine/pdfplumber/commit/cb9900b49706e96df520dbd1067c2a57a4cdb20d)) [h/t @tungph]
- `.extract_words(...)` now includes `doctop` among the attributes it returns for each word. ([66fef89](https://github.com/jsvine/pdfplumber/commit/66fef89b670cf95d13a5e23040c7bf9339944c01))
- Change behavior of horizontal `text_strategy`, so that it uses the top and bottom of *every* word, not just the top of every word and the bottom of the last. ([#467](https://github.com/jsvine/pdfplumber/pull/467) + [#466](https://github.com/jsvine/pdfplumber/issues/466) + [#265](https://github.com/jsvine/pdfplumber/issues/265)) [h/t @bobluda + @samkit-jain]
- Change `table.merge_edges(...)` behavior when `join_tolerance` (and `x`/`y` variants) `<= 0`, so that joining is attempted regardless, to handle cases of overlapping lines.

### Fixed
- Fix slowdown in `.extract_words(...)`/`WordExtractor.iter_chars_to_words(...)` on very long words, caused by repeatedly re-calculating bounding box. ([#483](https://github.com/jsvine/pdfplumber/discussions/483))
4 changes: 3 additions & 1 deletion README.md
@@ -313,6 +313,8 @@ By default, `extract_tables` uses the page's vertical and horizontal lines (or r
"snap_x_tolerance": 3,
"snap_y_tolerance": 3,
"join_tolerance": 3,
"join_x_tolerance": 3,
"join_y_tolerance": 3,
"edge_min_length": 3,
"min_words_vertical": 3,
"min_words_horizontal": 1,
@@ -333,7 +335,7 @@ By default, `extract_tables` uses the page's vertical and horizontal lines (or r
|`"explicit_vertical_lines"`| A list of vertical lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers — indicating the `x` coordinate of a line the full height of the page — or `line`/`rect`/`curve` objects.|
|`"explicit_horizontal_lines"`| A list of horizontal lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers — indicating the `y` coordinate of a line the full height of the page — or `line`/`rect`/`curve` objects.|
|`"snap_tolerance"`, `"snap_x_tolerance"`, `"snap_y_tolerance"`| Parallel lines within `snap_tolerance` pixels will be "snapped" to the same horizontal or vertical position.|
|`"join_tolerance"`| Line segments on the same infinite line, and whose ends are within `join_tolerance` of one another, will be "joined" into a single line segment.|
|`"join_tolerance"`, `"join_x_tolerance"`, `"join_y_tolerance"`| Line segments on the same infinite line, and whose ends are within `join_tolerance` of one another, will be "joined" into a single line segment.|
|`"edge_min_length"`| Edges shorter than `edge_min_length` will be discarded before attempting to reconstruct the table.|
|`"min_words_vertical"`| When using `"vertical_strategy": "text"`, at least `min_words_vertical` words must share the same alignment.|
|`"min_words_horizontal"`| When using `"horizontal_strategy": "text"`, at least `min_words_horizontal` words must share the same alignment.|
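
A minimal sketch of how the new settings might be passed. Judging by the `merge_edges` change in this commit, `join_x_tolerance` is applied when joining horizontal edges and `join_y_tolerance` when joining vertical ones. The values below are illustrative only, and the commented-out call assumes a `page` object that has already been opened with pdfplumber.

```python
# Illustrative values; only the keys come from the README excerpt above.
table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "join_tolerance": 3,    # fallback for any x/y variant that is not set
    "join_x_tolerance": 1,  # stricter joining for horizontal segments
    # "join_y_tolerance" is omitted, so it is expected to fall back to 3
}

# Assumed usage, given an already-opened page:
# tables = page.extract_tables(table_settings)
```

Setting only one of the variants keeps the shared `join_tolerance` in effect for the other direction.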
41 changes: 23 additions & 18 deletions pdfplumber/table.py
@@ -51,7 +51,9 @@ def join_edge_group(edges, orientation, tolerance=DEFAULT_JOIN_TOLERANCE):
return joined


def merge_edges(edges, snap_x_tolerance, snap_y_tolerance, join_tolerance):
def merge_edges(
edges, snap_x_tolerance, snap_y_tolerance, join_x_tolerance, join_y_tolerance
):
"""
Using the `snap_edges` and `join_edge_group` methods above,
merge a list of edges into a more "seamless" list.
@@ -66,13 +68,15 @@ def get_group(edge):
if snap_x_tolerance > 0 or snap_y_tolerance > 0:
edges = snap_edges(edges, snap_x_tolerance, snap_y_tolerance)

if join_tolerance > 0:
_sorted = sorted(edges, key=get_group)
edge_groups = itertools.groupby(_sorted, key=get_group)
edge_gen = (
join_edge_group(items, k[0], join_tolerance) for k, items in edge_groups
_sorted = sorted(edges, key=get_group)
edge_groups = itertools.groupby(_sorted, key=get_group)
edge_gen = (
join_edge_group(
items, k[0], (join_x_tolerance if k[0] == "h" else join_y_tolerance)
)
edges = list(itertools.chain(*edge_gen))
for k, items in edge_groups
)
edges = list(itertools.chain(*edge_gen))
return edges


@@ -419,6 +423,8 @@ def char_in_bbox(char, bbox):
"snap_x_tolerance": None,
"snap_y_tolerance": None,
"join_tolerance": DEFAULT_JOIN_TOLERANCE,
"join_x_tolerance": None,
"join_y_tolerance": None,
"edge_min_length": 3,
"min_words_vertical": DEFAULT_MIN_WORDS_VERTICAL,
"min_words_horizontal": DEFAULT_MIN_WORDS_HORIZONTAL,
@@ -483,6 +489,8 @@ def resolve_table_settings(table_settings={}):
("text_y_tolerance", "text_tolerance"),
("snap_x_tolerance", "snap_tolerance"),
("snap_y_tolerance", "snap_tolerance"),
("join_x_tolerance", "join_tolerance"),
("join_y_tolerance", "join_tolerance"),
("intersection_x_tolerance", "intersection_tolerance"),
("intersection_y_tolerance", "intersection_tolerance"),
]:
@@ -581,15 +589,12 @@ def get_edges(self):

edges = list(v) + list(h)

if (
settings["snap_x_tolerance"] > 0
or settings["snap_y_tolerance"] > 0
or settings["join_tolerance"] > 0
):
edges = merge_edges(
edges,
snap_x_tolerance=settings["snap_x_tolerance"],
snap_y_tolerance=settings["snap_y_tolerance"],
join_tolerance=settings["join_tolerance"],
)
edges = merge_edges(
edges,
snap_x_tolerance=settings["snap_x_tolerance"],
snap_y_tolerance=settings["snap_y_tolerance"],
join_x_tolerance=settings["join_x_tolerance"],
join_y_tolerance=settings["join_y_tolerance"],
)

return utils.filter_edges(edges, min_length=settings["edge_min_length"])
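
The hunks above give `join_x_tolerance` and `join_y_tolerance` a default of `None` and pair each with `join_tolerance` in `resolve_table_settings`. The body of that loop is not shown in this diff, so the following is only a sketch of the presumed fallback, assuming a `None` value is replaced by the paired setting. The name `resolve_join_tolerances` is hypothetical.

```python
def resolve_join_tolerances(settings):
    """Fill unset x/y join tolerances from the shared `join_tolerance`.

    Hypothetical helper; assumes `None` means "not set".
    """
    resolved = dict(settings)  # leave the caller's dict untouched
    for attr, fallback in [
        ("join_x_tolerance", "join_tolerance"),
        ("join_y_tolerance", "join_tolerance"),
    ]:
        if resolved.get(attr) is None:
            resolved[attr] = resolved[fallback]
    return resolved


resolved = resolve_join_tolerances(
    {"join_tolerance": 3, "join_x_tolerance": None, "join_y_tolerance": 0}
)
```

Checking for `None` rather than truthiness matters here: an explicit `0` is a meaningful setting after this commit, since joining is still attempted for overlapping lines at zero tolerance.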
36 changes: 32 additions & 4 deletions tests/test_ca_warn_report.py
@@ -83,23 +83,47 @@ def test_edge_merging(self):
assert (
len(
table.merge_edges(
p0.edges, snap_x_tolerance=3, snap_y_tolerance=3, join_tolerance=3
p0.edges,
snap_x_tolerance=3,
snap_y_tolerance=3,
join_x_tolerance=3,
join_y_tolerance=3,
)
)
== 46
)
assert (
len(
table.merge_edges(
p0.edges, snap_x_tolerance=0, snap_y_tolerance=3, join_tolerance=3
p0.edges,
snap_x_tolerance=3,
snap_y_tolerance=3,
join_x_tolerance=3,
join_y_tolerance=0,
)
)
== 52
)
assert (
len(
table.merge_edges(
p0.edges,
snap_x_tolerance=0,
snap_y_tolerance=3,
join_x_tolerance=3,
join_y_tolerance=3,
)
)
== 94
)
assert (
len(
table.merge_edges(
p0.edges, snap_x_tolerance=3, snap_y_tolerance=0, join_tolerance=3
p0.edges,
snap_x_tolerance=3,
snap_y_tolerance=0,
join_x_tolerance=3,
join_y_tolerance=3,
)
)
== 174
@@ -108,7 +132,11 @@ def test_vertices(self):
def test_vertices(self):
p0 = self.pdf.pages[0]
edges = table.merge_edges(
p0.edges, snap_x_tolerance=3, snap_y_tolerance=3, join_tolerance=3
p0.edges,
snap_x_tolerance=3,
snap_y_tolerance=3,
join_x_tolerance=3,
join_y_tolerance=3,
)
ixs = table.edges_to_intersections(edges)
assert len(ixs.keys()) == 304 # 38x8