Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fewer horizontal lines when using text strategy #265

Closed
samkit-jain opened this issue Sep 1, 2020 · 8 comments
Closed

Fewer horizontal lines when using text strategy #265

samkit-jain opened this issue Sep 1, 2020 · 8 comments
Labels

Comments

@samkit-jain
Copy link
Collaborator

samkit-jain commented Sep 1, 2020

Describe the bug

In d224202, the logic for finding horizontal and vertical lines connecting n number of words was simplified. When finding the horizontal lines, the logic was updated to keep the "top" of all line rects and the bottom of only the last line rect. This is causing problems with table detection as the final number of horizontal lines has reduced and when the gap between 2 rows is big, it can provide inconsistent results when used together with snap_tolerance.

The height of the line is also not in sync with the height of text it possesses.

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open("issue-67-example.pdf")
p = pdf.pages[20]
ts = {"vertical_strategy": "lines", "horizontal_strategy": "text"}
im = p.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("image.png", format="PNG")

tables = p.extract_tables(table_settings=ts)
for table in tables:
    for row in table:
        print(row)

PDF file

The PDF file can be found here.

Expected behavior

On versions before v0.5.23
Last row of the table: ['金', '', '']
image

Actual behavior

On v0.5.23
Last row of the table: ['支付其他与投资活动有关的现', '', '']
image

Environment

  • pdfplumber version: 0.5.23
  • Python version: 3.8.2
  • OS: Linux

Additional context

Causes trouble when the vertical spacing between 2 words is big.

To some context, the change makes sense as a horizontal row of text is sandwiched between 2 lines and there are no consecutive empty lines (as can be found in the Expected Behavior's screenshot). What could be debated is where to put the line between 2 rows of text? Should it be in the middle (purple line)? Top of the bottom row (green line)? Bottom of the top row (orange line)? The current implementation has picked the top of the bottom row.
image

I would prefer that it is reverted to the older approach in which both the top of the bottom row and bottom of the top row were kept and leave it up to the user to filter since there is no one-filter-suits-all. What are your thoughts @jsvine ?

TODO (from: @samkit-jain ): I looked at the horizontal edges but perhaps it affected vertical edges as well and I should test that out as well.

@samkit-jain samkit-jain added the bug label Sep 1, 2020
@samkit-jain
Copy link
Collaborator Author

#265 (comment) has been edited with table extraction related issue info.

@jsvine
Copy link
Owner

jsvine commented Sep 2, 2020

Thank you for the beautifully detailed bug report, @samkit-jain! My hunch is that the problem stems from the "fixes" (😬) in d224202, which were released as part of 0.5.23. I'll see if I can pinpoint the specific mechanism that is causing the unexpected behavior.

@samkit-jain
Copy link
Collaborator Author

Yes, if I undo the changes that were made to words_to_edges_h(), it works fine for horizontal rows. Another possible solution apart from just reverting the commit could be to update the current method by adding a check to add a new imaginary horizontal line at the bottom of the top row when the difference between the bottom of the top row and top of the bottom row crosses a certain threshold. The tricky part in this would be selecting a global threshold value.

@jsvine
Copy link
Owner

jsvine commented Sep 2, 2020

Thanks for confirming that, @samkit-jain. I'll try seeing if it's possible to correct the fixes (rather than just reverting the commit), so that they retain the simplicity. Hopefully, it's just a matter of squashing a bug :)

@jsvine
Copy link
Owner

jsvine commented Sep 3, 2020

Having spent a little bit more time looking at this ... I've come to believe that the behavior in v0.5.23 isn't wrong, per se — although it is, as you note, unexpected.

In this particular case, the issue seems to stem from the page number at the bottom of of the page (21) getting lumped in with the rest of the table when, in fact, the vertical lines don't extend that far down.

If you crop the page first (and increase the intersection tolerance), the table parses as expected:

import pdfplumber

pdf = pdfplumber.open("issue-67-example.pdf")
p = pdf.pages[20]

ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "intersection_x_tolerance": 10,
}
cropped = p.crop((0, 120, p.width, p.height - 70))
im = cropped.to_image(resolution=150)
im.reset().debug_tablefinder(ts)

image

(Last row of the table is, as expected: ['金', '', ''])

Even so, there is — as you note — probably a more robust way for pdfplumber to handle situations like these. Reverting the d224202 changes is one option, but I wonder whether there's a way still to improve it. I'm going to give it another thinking session, but am open to suggestions as well.

What could be debated is where to put the line between 2 rows of text? Should it be in the middle (purple line)? Top of the bottom row (green line)? Bottom of the top row (orange line)?

I've been puzzling over this and, unfortunately, I don't think there's an answer that will satisfy all/most cases. Even in the example in this issue, just putting the line halfway between the two rows would not solve the problem, since the 21 is so far from the vertical lines of the table.

@samkit-jain
Copy link
Collaborator Author

Another option could also be to revert the change and introduce 2 new table settings parameters snap_x_tolerance and snap_y_tolerance and let the user use snap_x_tolerance to combine the horizontal lines. The same can also be said for the vertical lines.

@jsvine
Copy link
Owner

jsvine commented Mar 31, 2021

Thanks, @samkit-jain! I think giving users the option to set snap_x_tolerance and snap_y_tolerance independently is a good idea, regardless. I'll add it to my todo list. I'm having trouble, though, visualizing how these options would apply to this particular issue. Could you expand the idea slightly?

@jsvine
Copy link
Owner

jsvine commented Jul 15, 2021

Closed core of this issue via #466 and #467 — though adding snap_x_tolerance and snap_y_tolerance still seems like a good idea. Thanks again @samkit-jain!

@jsvine jsvine closed this as completed Jul 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants