Extracting table does not use correct borders #1244
tz850 started this conversation in Ask for help with specific PDFs
-
Hi @tz850, and thanks for providing the PDF and visual debugging output. This is an interesting edge case, where the page has a lot of other graphical objects ( ...
im.reset().debug_tablefinder({
    "snap_tolerance": 0,
})
... seems to get you what you'd want.
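For extraction rather than visual debugging, the same setting can be passed to `extract_tables`. This is a minimal sketch, assuming pdfplumber is installed; the helper name and the `page_index` default are hypothetical, since the thread does not say which page of sample.pdf holds the table.

```python
# Setting taken from the reply: disable snapping of nearby parallel lines.
TABLE_SETTINGS = {"snap_tolerance": 0}


def extract_tables_without_snapping(pdf_path, page_index=0):
    """Return the tables found on one page with snapping disabled.

    `page_index` is an assumed default; point it at the page that
    actually contains the table.
    """
    # Third-party dependency, imported lazily so this module loads without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_tables(table_settings=TABLE_SETTINGS)
```

Calling `extract_tables_without_snapping("sample.pdf")` should then return a list of tables, each a list of rows, which can be compared against the cells drawn by `debug_tablefinder` with the same settings.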
-
Describe the bug
When extracting the table content on the page, it does not follow the table borders, but seems to use the label of the chart below.
[Image: txt1]
[Image: txt2]
This problem causes the text extracted from the cell to be incorrect. For example, in the third and fourth columns of the first row, the correct text should be " ", but the actual result of the extract_tables function is " ".
This is the original page.
[Image: the original page]
Here is an image of the debug_tablefinder output. The cell borders pointed to by the arrows in the figure are not the actual borders of the table.
[Image: debug_tablefinder output (tableimage)]
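Shifted borders like these are consistent with line snapping: pdfplumber's documentation describes `snap_tolerance` as the distance within which parallel lines are "snapped" to the same position. The following is a simplified pure-Python illustration of that idea, not pdfplumber's actual code; the function name and the coordinates are made up for the example.

```python
def snap_positions(positions, tolerance):
    """Group positions whose sorted neighbours differ by at most
    `tolerance`, then move every member of a group to the group mean."""
    if not positions:
        return []

    # Visit indices in order of position so neighbours are adjacent.
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    snapped = [0.0] * len(positions)

    def flush(cluster):
        mean = sum(positions[i] for i in cluster) / len(cluster)
        for i in cluster:
            snapped[i] = mean

    cluster = [order[0]]
    for i in order[1:]:
        if positions[i] - positions[cluster[-1]] <= tolerance:
            cluster.append(i)
        else:
            flush(cluster)
            cluster = [i]
    flush(cluster)
    return snapped


# Hypothetical x-coordinates: a table border at 100, a chart line at 102,
# and another table border at 200.
edges = [100.0, 102.0, 200.0]
with_snapping = snap_positions(edges, 3)     # [101.0, 101.0, 200.0]
without_snapping = snap_positions(edges, 0)  # [100.0, 102.0, 200.0]
```

With a tolerance of 3, the table border and the nearby chart line merge at 101, so the detected border no longer matches the drawn one. With a tolerance of 0, each line keeps its own position.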
Have you tried repairing the PDF?
I ran Ghostscript directly to output a repaired PDF file, and the problem is the same.
Code to reproduce the problem
PDF file
sample.pdf
Expected behavior
I hope that the division of table cells will not be affected by other tables or charts.
Environment