Fewer horizontal lines when using text strategy #265
Comments
#265 (comment) has been edited with table extraction related issue info. |
Thank you for the beautifully detailed bug report, @samkit-jain! My hunch is that the problem stems from the "fixes" (😬) in d224202, which were released as part of |
Yes, if I undo the changes that were made to |
Thanks for confirming that, @samkit-jain. I'll try seeing if it's possible to correct the fixes (rather than just reverting the commit), so that they retain the simplicity. Hopefully, it's just a matter of squashing a bug :) |
Having spent a little bit more time looking at this ... I've come to believe that the behavior in
In this particular case, the issue seems to stem from the page number at the bottom of the page (
If you crop the page first (and increase the intersection tolerance), the table parses as expected:
import pdfplumber
pdf = pdfplumber.open("issue-67-example.pdf")
p = pdf.pages[20]
# Table settings: ruled lines for columns, text positions for rows
ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "intersection_x_tolerance": 10,
}
# Crop away the area above the table and the page number at the bottom
cropped = p.crop((0, 120, p.width, p.height - 70))
im = cropped.to_image(resolution=150)
# Draw the detected table lines and cells for visual inspection
im.reset().debug_tablefinder(ts)
(Last row of the table is, as expected:
Even so, there is, as you note, probably a more robust way for
I've been puzzling over this and, unfortunately, I don't think there's an answer that will satisfy all/most cases. Even in the example in this issue, just putting the line halfway between the two rows would not solve the problem, since the |
Another option could also be to revert the change and introduce 2 new table settings parameters |
Thanks, @samkit-jain! I think giving users the option to set |
Closed core of this issue via #466 and #467 — though adding |
Describe the bug
In d224202, the logic for finding horizontal and vertical lines connecting n words was simplified. When finding the horizontal lines, the logic was updated to keep the "top" of all line rects and the bottom of only the last line rect. This is causing problems with table detection: the final number of horizontal lines has been reduced, and when the gap between 2 rows is big, it can produce inconsistent results when used together with snap_tolerance. The height of the line is also not in sync with the height of the text it contains.
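The difference between the two behaviors can be sketched in plain Python. This is not pdfplumber's actual implementation: the helper names `old_horizontal_edges` and `new_horizontal_edges` and the `(top, bottom)` tuples are hypothetical, chosen only to show which y-coordinates survive under each approach as described in this report.

```python
# Hypothetical sketch, not pdfplumber's code: each text line is a
# (top, bottom) pair of y-coordinates, ordered from the top of the page down.

def old_horizontal_edges(line_rects):
    """Older behavior as described here: keep the top and bottom of every line."""
    ys = []
    for top, bottom in line_rects:
        ys.extend([top, bottom])
    return ys


def new_horizontal_edges(line_rects):
    """Behavior after d224202 as described here: keep every top,
    plus the bottom of the last line only."""
    if not line_rects:
        return []
    ys = [top for top, _ in line_rects]
    ys.append(line_rects[-1][1])
    return ys


# Three text rows; note the large gap between the second and third.
rows = [(100, 110), (120, 130), (180, 190)]
print(old_horizontal_edges(rows))  # [100, 110, 120, 130, 180, 190]
print(new_horizontal_edges(rows))  # [100, 120, 180, 190]
```

In this toy input, the newer approach leaves a single band from 120 to 180 for a row whose text only spans 120 to 130, which is the mismatch between line height and text height mentioned above.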
Code to reproduce the problem
PDF file
The PDF file can be found here.
Expected behavior
On versions before v0.5.23
Last row of the table:
['金', '', '']
Actual behavior
On v0.5.23
Last row of the table:
['支付其他与投资活动有关的现', '', '']
Environment
Additional context
Causes trouble when the vertical spacing between 2 words is big.
To some extent, the change makes sense, as a horizontal row of text is sandwiched between 2 lines and there are no consecutive empty lines (as can be found in the Expected behavior screenshot). What is debatable is where to put the line between 2 rows of text. Should it be in the middle (purple line)? At the top of the bottom row (green line)? At the bottom of the top row (orange line)? The current implementation picks the top of the bottom row.
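As a minimal illustration of the three options, the sketch below computes each candidate position for two consecutive rows. The function name `separator_candidates` and the `(top, bottom)` row representation are assumptions made for this example and are not part of pdfplumber.

```python
# Hypothetical sketch: candidate y-positions for the single separator
# between two consecutive text rows, each given as (top, bottom).

def separator_candidates(upper_row, lower_row):
    _, upper_bottom = upper_row
    lower_top, _ = lower_row
    return {
        "bottom_of_top_row": upper_bottom,          # orange line
        "middle": (upper_bottom + lower_top) / 2,   # purple line
        "top_of_bottom_row": lower_top,             # green line (current choice)
    }


print(separator_candidates((120, 130), (180, 190)))
# {'bottom_of_top_row': 130, 'middle': 155.0, 'top_of_bottom_row': 180}
```

The wider the gap between the rows, the further apart the three candidates sit, so the choice matters most in exactly the case reported here.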
I would prefer reverting to the older approach, in which both the top of the bottom row and the bottom of the top row were kept, and leaving it up to the user to filter, since there is no one-filter-suits-all. What are your thoughts, @jsvine?
TODO (from: @samkit-jain): I looked only at the horizontal edges, but the change may have affected vertical edges as well, so I should test those too.