Fewer horizontal lines when using text strategy #265

samkit-jain · 2020-09-01T19:46:49Z

Describe the bug

In d224202, the logic for finding the horizontal and vertical lines that connect a given number of words was simplified. When finding horizontal lines, the logic now keeps the "top" of every line rect but the bottom of only the last line rect. This causes problems with table detection: the final number of horizontal lines is smaller, and when the gap between two rows is large, the results can be inconsistent when used together with snap_tolerance.

The height of a line is also out of sync with the height of the text it contains.
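To make the difference concrete, here is a minimal sketch of the two behaviors described above. It uses plain dictionaries rather than pdfplumber's internals; the function names and the row bounding boxes are made up for illustration, and this is not the library's actual code.

```python
def edges_tops_and_last_bottom(rows):
    """Behavior described for v0.5.23: every row's top, plus the last row's bottom."""
    ys = [r["top"] for r in rows]
    if rows:
        ys.append(rows[-1]["bottom"])
    return ys


def edges_tops_and_bottoms(rows):
    """Older behavior as described: both the top and the bottom of every row."""
    ys = []
    for r in rows:
        ys.extend([r["top"], r["bottom"]])
    return ys


# Hypothetical row bounding boxes (in points), with a large gap before the last row.
rows = [
    {"top": 100, "bottom": 110},
    {"top": 115, "bottom": 125},
    {"top": 300, "bottom": 310},
]

new_style = edges_tops_and_last_bottom(rows)
old_style = edges_tops_and_bottoms(rows)
print(new_style)  # [100, 115, 300, 310]
print(old_style)  # [100, 110, 115, 125, 300, 310]
```

With only the tops kept, the second row in this toy example stretches from 115 down to 300 and absorbs the whole gap, which is the kind of mismatch between line height and text height reported here.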

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open("issue-67-example.pdf")
p = pdf.pages[20]
ts = {"vertical_strategy": "lines", "horizontal_strategy": "text"}
im = p.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("image.png", format="PNG")

tables = p.extract_tables(table_settings=ts)
for table in tables:
    for row in table:
        print(row)

PDF file

The PDF file can be found here.

Expected behavior

On versions before v0.5.23
Last row of the table: ['金', '', '']

Actual behavior

On v0.5.23
Last row of the table: ['支付其他与投资活动有关的现', '', '']

Environment

pdfplumber version: 0.5.23
Python version: 3.8.2
OS: Linux

Additional context

This causes trouble when the vertical spacing between two words is large.

To some extent, the change makes sense, as a horizontal row of text is sandwiched between 2 lines and there are no consecutive empty lines (as can be found in the Expected behavior screenshot). What could be debated is where to put the line between 2 rows of text? Should it be in the middle (purple line)? Top of the bottom row (green line)? Bottom of the top row (orange line)? The current implementation has picked the top of the bottom row.
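The three candidate positions can be written down in a few lines. This is only an illustrative sketch: `separator_candidates` and the two row boxes are hypothetical, not part of pdfplumber.

```python
def separator_candidates(upper, lower):
    """Return the three candidate y-positions for a line between two text rows."""
    return {
        # Orange line in the discussion: bottom of the top row.
        "bottom_of_top_row": upper["bottom"],
        # Purple line: halfway between the two rows.
        "middle": (upper["bottom"] + lower["top"]) / 2,
        # Green line: top of the bottom row (what the current implementation picks).
        "top_of_bottom_row": lower["top"],
    }


# Hypothetical rows separated by a large vertical gap.
upper = {"top": 115, "bottom": 125}
lower = {"top": 300, "bottom": 310}

candidates = separator_candidates(upper, lower)
print(candidates)
# {'bottom_of_top_row': 125, 'middle': 212.5, 'top_of_bottom_row': 300}
```

When the gap is small the three values nearly coincide; they only diverge meaningfully in cases like the one reported here.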

I would prefer reverting to the older approach, which kept both the top of the bottom row and the bottom of the top row, and leaving it up to the user to filter, since there is no one-filter-suits-all. What are your thoughts, @jsvine?

TODO (from: @samkit-jain): I looked only at the horizontal edges, but the change may have affected vertical edges too, so I should test that as well.

samkit-jain · 2020-09-01T19:56:55Z

#265 (comment) has been edited to include information on the related table-extraction issue.

jsvine · 2020-09-02T00:07:39Z

Thank you for the beautifully detailed bug report, @samkit-jain! My hunch is that the problem stems from the "fixes" (😬) in d224202, which were released as part of 0.5.23. I'll see if I can pinpoint the specific mechanism that is causing the unexpected behavior.

samkit-jain · 2020-09-02T01:43:31Z

Yes, if I undo the changes that were made to words_to_edges_h(), it works fine for horizontal rows. Apart from simply reverting the commit, another possible solution would be to update the current method with a check that adds a new imaginary horizontal line at the bottom of the top row whenever the difference between the bottom of the top row and the top of the bottom row crosses a certain threshold. The tricky part would be selecting a global threshold value.
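A rough sketch of that threshold idea, under the assumption that rows are simple top/bottom boxes. The function name `horizontal_edges_with_gap_check` and the `gap_threshold` parameter are hypothetical and do not exist in pdfplumber.

```python
def horizontal_edges_with_gap_check(rows, gap_threshold):
    """Keep every row's top; add a row's bottom only when the gap to the
    next row exceeds gap_threshold, or when it is the last row."""
    ys = []
    for i, row in enumerate(rows):
        ys.append(row["top"])
        is_last = i == len(rows) - 1
        if is_last or rows[i + 1]["top"] - row["bottom"] > gap_threshold:
            ys.append(row["bottom"])
    return ys


# Hypothetical row boxes: a 5-point gap, then a 175-point gap.
rows = [
    {"top": 100, "bottom": 110},
    {"top": 115, "bottom": 125},
    {"top": 300, "bottom": 310},
]

edges = horizontal_edges_with_gap_check(rows, gap_threshold=20)
print(edges)  # [100, 115, 125, 300, 310]
```

The extra line at 125 closes the second row before the large gap. The value 20 is arbitrary, which illustrates why choosing one global threshold is the hard part.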

jsvine · 2020-09-02T13:02:07Z

Thanks for confirming that, @samkit-jain. I'll try to see whether it's possible to correct the fixes (rather than just reverting the commit), so that they retain the simplicity. Hopefully, it's just a matter of squashing a bug :)

jsvine · 2020-09-03T02:50:09Z

Having spent a little bit more time looking at this ... I've come to believe that the behavior in v0.5.23 isn't wrong, per se — although it is, as you note, unexpected.

In this particular case, the issue seems to stem from the page number at the bottom of the page (21) getting lumped in with the rest of the table when, in fact, the vertical lines don't extend that far down.

If you crop the page first (and increase the intersection tolerance), the table parses as expected:

import pdfplumber

pdf = pdfplumber.open("issue-67-example.pdf")
p = pdf.pages[20]

ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "intersection_x_tolerance": 10,
}
cropped = p.crop((0, 120, p.width, p.height - 70))
im = cropped.to_image(resolution=150)
im.reset().debug_tablefinder(ts)

(Last row of the table is, as expected: ['金', '', ''])

Even so, there is — as you note — probably a more robust way for pdfplumber to handle situations like these. Reverting the d224202 changes is one option, but I wonder whether there's a way still to improve it. I'm going to give it another thinking session, but am open to suggestions as well.

What could be debated is where to put the line between 2 rows of text? Should it be in the middle (purple line)? Top of the bottom row (green line)? Bottom of the top row (orange line)?

I've been puzzling over this and, unfortunately, I don't think there's an answer that will satisfy all/most cases. Even in the example in this issue, just putting the line halfway between the two rows would not solve the problem, since the 21 is so far from the vertical lines of the table.

samkit-jain · 2021-03-22T16:45:24Z

Another option could be to revert the change and introduce two new table-settings parameters, snap_x_tolerance and snap_y_tolerance, and let the user use snap_x_tolerance to combine the horizontal lines. The same can also be said for the vertical lines.
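As a rough illustration of what combining nearby lines with a tolerance could look like, here is a one-dimensional sketch. `snap_positions` is a hypothetical helper, not pdfplumber's implementation, and it takes a single generic tolerance because which of the two proposed settings would govern which axis is not settled in this thread.

```python
def snap_positions(positions, tolerance):
    """Group sorted positions whose neighbors differ by at most `tolerance`
    and replace each group with its mean."""
    snapped = []
    group = []
    for pos in sorted(positions):
        # Start a new group when the jump from the previous position is too large.
        if group and pos - group[-1] > tolerance:
            snapped.append(sum(group) / len(group))
            group = []
        group.append(pos)
    if group:
        snapped.append(sum(group) / len(group))
    return snapped


# Hypothetical line positions from keeping both the top and bottom of every row.
ys = [100, 110, 115, 125, 300, 310]

snapped = snap_positions(ys, tolerance=6)
print(snapped)  # [100.0, 112.5, 125.0, 300.0, 310.0]
```

Here the lines at 110 and 115 merge into one at 112.5, while the lines on either side of the large gap stay separate.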

jsvine · 2021-03-31T00:28:07Z

Thanks, @samkit-jain! I think giving users the option to set snap_x_tolerance and snap_y_tolerance independently is a good idea, regardless. I'll add it to my todo list. I'm having trouble, though, visualizing how these options would apply to this particular issue. Could you expand the idea slightly?

jsvine · 2021-07-15T14:04:09Z

Closed core of this issue via #466 and #467 — though adding snap_x_tolerance and snap_y_tolerance still seems like a good idea. Thanks again @samkit-jain!

samkit-jain added the bug label Sep 1, 2020

samkit-jain mentioned this issue Sep 9, 2020

Use the extract_table() method to parse out such a table #268

Closed

jsvine mentioned this issue Jul 15, 2021

Missing last row from intermediate tables, when using mixed strategy #466

Closed

jsvine closed this as completed Jul 15, 2021
