Image table extraction correct but text table output not. #1061
Closed
mnigogos
started this conversation in
Ask for help with specific PDFs
-
Hi @mnigogos, and thank you for providing a detailed description of the issue you're facing. Are you able to provide the PDF itself? Without it, the problem will be difficult to diagnose directly. But one guess, based on the output: the bounding boxes of some of the misplaced characters may be much larger than the characters themselves appear, or otherwise erroneous. One way to test this:
    im = page.to_image()
    # Overlay the table lines and cells the table finder detects
    im.debug_tablefinder(
        table_settings={
            "vertical_strategy": "lines",
            "horizontal_strategy": "lines",
            "edge_min_length": 50
        }
    )
    # Outline the bounding box of every character on the page
    im.draw_rects(page.chars)
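If the drawn rectangles are hard to judge by eye, the same check can be approximated numerically. The following is only a rough sketch, not part of pdfplumber: `flag_oversized_chars` and its `ratio` threshold are hypothetical, and the sample dictionaries are hand-made stand-ins that assume each entry of `page.chars` carries `text`, `x0`, `x1`, `top` and `bottom` keys.

```python
from statistics import median


def flag_oversized_chars(chars, ratio=1.5):
    """Return the chars whose bbox height exceeds `ratio` times the median height.

    Hypothetical helper; `ratio` is an arbitrary starting threshold.
    """
    heights = [c["bottom"] - c["top"] for c in chars]
    if not heights:
        return []
    typical = median(heights)
    return [c for c, h in zip(chars, heights) if h > ratio * typical]


# Hand-made stand-ins for entries of page.chars (coordinates are illustrative).
sample_chars = [
    {"text": "M", "x0": 10, "x1": 18, "top": 100, "bottom": 110},
    {"text": "R", "x0": 18, "x1": 25, "top": 100, "bottom": 110},
    {"text": "I", "x0": 25, "x1": 28, "top": 100, "bottom": 110},
    # A symbol whose reported box is three times taller than its neighbours.
    {"text": "☢", "x0": 300, "x1": 310, "top": 95, "bottom": 125},
]

suspects = flag_oversized_chars(sample_chars)
print([c["text"] for c in suspects])
```

With a real page, passing `flag_oversized_chars(page.chars)` to `im.draw_rects(...)` instead of all of `page.chars` should highlight only the suspicious boxes, assuming the threshold suits the document's fonts.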
-
You guys rock! Thanks so much. I'll incorporate your suggestions.
-
I have had fabulous success parsing a few hundred PDF files, except for one holdout. When I use the following statement: table_finder = page.debug_tablefinder(table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines", "edge_min_length": 50}) I get a great visual identification of the table's margins. Yet when I extract the text, the last item in the list is parsed incorrectly on a few lines. Note the empty strings on lines 4 and 6, then the combined icons on lines 5 and 7. This is NOT due to their being symbols: innumerable other tables are parsed just fine. I am using the same settings for the formal text extraction as for the visual display. I notice that the 4th and 6th rows are relatively short. Might that be the culprit, and if so, how might I address it?
[Image: table]
['Procedure', 'Appropriateness Category', 'Relative Radiation Level']
['MRI cervical spine without IV contrast', 'Usually Appropriate', 'O']
['CT cervical spine without IV contrast', 'May Be Appropriate', '☢☢☢']
['Radiography cervical spine', 'May Be Appropriate (Disagreement)', '']
['MRI cervical spine without and with IV\ncontrast', 'Usually Not Appropriate', '☢☢\nO']
['Radiographic myelography cervical spine', 'Usually Not Appropriate', '']
['CT myelography cervical spine', 'Usually Not Appropriate', '☢☢☢\n☢☢☢☢']
['CT cervical spine with IV contrast', 'Usually Not Appropriate', '☢☢☢']
['CT cervical spine without and with IV\ncontrast', 'Usually Not Appropriate', '☢☢☢']
['CTA neck with IV contrast', 'Usually Not Appropriate', '☢☢☢']
['Discography cervical spine', 'Usually Not Appropriate', '☢☢']
['Facet injection/medial branch block cervical\nspine', 'Usually Not Appropriate', '☢☢']
['MRA neck with IV contrast', 'Usually Not Appropriate', 'O']
['MRA neck without IV contrast', 'Usually Not Appropriate', 'O']
['MRI cervical spine with IV contrast', 'Usually Not Appropriate', 'O']
['Bone scan whole body with SPECT or\nSPECT/CT neck', 'Usually Not Appropriate', '☢☢☢']
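One possible reading of the empty cells followed by doubled-up cells in this output can be sketched as follows. This is a guess rather than a diagnosis: it assumes text is assigned to a cell by testing whether the midpoint of each character's bounding box falls inside the cell rectangle, which is my understanding of how pdfplumber fills cells but is not confirmed in this thread. `char_in_cell` and all coordinates are hypothetical.

```python
def char_in_cell(char, cell_bbox):
    """True if the midpoint of the char's bbox lies inside cell_bbox.

    cell_bbox is (x0, top, x1, bottom), with `top` measured from the page top.
    """
    x0, top, x1, bottom = cell_bbox
    h_mid = (char["x0"] + char["x1"]) / 2
    v_mid = (char["top"] + char["bottom"]) / 2
    return x0 <= h_mid < x1 and top <= v_mid < bottom


# Two stacked cells in the last column; the upper row is short (illustrative).
short_row_cell = (300, 100, 360, 110)
next_row_cell = (300, 110, 360, 135)

# A symbol drawn in the short row whose reported bbox extends far below the glyph.
oversized = {"text": "☢", "x0": 305, "x1": 315, "top": 101, "bottom": 125}
# The same symbol with a tight bbox, for comparison.
tight = {"text": "☢", "x0": 305, "x1": 315, "top": 101, "bottom": 109}

print(char_in_cell(oversized, short_row_cell), char_in_cell(oversized, next_row_cell))
print(char_in_cell(tight, short_row_cell), char_in_cell(tight, next_row_cell))
```

Under that assumption, the oversized box has its midpoint below the short row's bottom edge, so the symbol is counted in the row beneath it. That would leave the short row's cell empty and stack two rows' symbols in the next cell, which resembles lines 4 to 7 of the output.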