Image table extraction correct but text table output not. #1061

mnigogos · 2023-12-14T03:13:44Z


I have had fabulous success parsing a few hundred PDF files, except for one holdout. When I use the following statement: table_finder = page.debug_tablefinder(table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines", "edge_min_length": 50}) I get a great visual identification of the table's borders. Yet when I extract the text, the last item in the list is parsed incorrectly on a few lines. Note the empty strings on lines 4 and 6, and the combined icons on lines 5 and 7. This is NOT due to their being symbols; innumerable other tables are parsed just fine. I am using the same settings for the formal text extraction as for the visual display. I notice that the 4th and 6th rows are relatively short. Might that be the culprit, and if so, how might I address it?

['Procedure', 'Appropriateness Category', 'Relative Radiation Level']
['MRI cervical spine without IV contrast', 'Usually Appropriate', 'O']
['CT cervical spine without IV contrast', 'May Be Appropriate', '☢☢☢']
['Radiography cervical spine', 'May Be Appropriate (Disagreement)', '']
['MRI cervical spine without and with IV\ncontrast', 'Usually Not Appropriate', '☢☢\nO']
['Radiographic myelography cervical spine', 'Usually Not Appropriate', '']
['CT myelography cervical spine', 'Usually Not Appropriate', '☢☢☢\n☢☢☢☢']
['CT cervical spine with IV contrast', 'Usually Not Appropriate', '☢☢☢']
['CT cervical spine without and with IV\ncontrast', 'Usually Not Appropriate', '☢☢☢']
['CTA neck with IV contrast', 'Usually Not Appropriate', '☢☢☢']
['Discography cervical spine', 'Usually Not Appropriate', '☢☢']
['Facet injection/medial branch block cervical\nspine', 'Usually Not Appropriate', '☢☢']
['MRA neck with IV contrast', 'Usually Not Appropriate', 'O']
['MRA neck without IV contrast', 'Usually Not Appropriate', 'O']
['MRI cervical spine with IV contrast', 'Usually Not Appropriate', 'O']
['Bone scan whole body with SPECT or\nSPECT/CT neck', 'Usually Not Appropriate', '☢☢☢']

jsvine · 2023-12-21T18:27:12Z

Maintainer

Hi @mnigogos, and thank you for providing a detailed description of the issue you're facing. Are you able to provide the PDF itself? Without it, it will be difficult to diagnose directly. But one guess, based on the output: It's possible that the bounding boxes of some of the characters that are being misplaced are either much larger than the characters themselves appear, or more generally erroneous. One way to test this:

im = page.to_image()
im.debug_tablefinder(
  table_settings={
    "vertical_strategy": "lines", 
    "horizontal_strategy": "lines",
    "edge_min_length": 50
  }
)
im.draw_rects(page.chars)
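
The visual check above can also be turned into numbers. The following is a minimal sketch, not part of pdfplumber: it compares each character's bounding-box height with its font size and flags the outliers. The dictionaries mimic keys found on entries of `page.chars` (`text`, `x0`, `x1`, `top`, `bottom`, `size`); the helper name `oversized_chars`, the 1.5 threshold, and the sample values are assumptions made up for illustration.

```python
def oversized_chars(chars, max_ratio=1.5):
    """Return chars whose bounding-box height exceeds max_ratio * font size."""
    flagged = []
    for c in chars:
        height = c["bottom"] - c["top"]
        # Skip chars with a zero or negative font size to avoid dividing by zero.
        if c["size"] > 0 and height / c["size"] > max_ratio:
            flagged.append(c)
    return flagged


# Hand-made sample data shaped like entries of page.chars (values are invented).
sample_chars = [
    {"text": "A", "x0": 10, "x1": 16, "top": 100, "bottom": 110, "size": 10},
    {"text": "☢", "x0": 300, "x1": 312, "top": 95, "bottom": 125, "size": 10},
]

for c in oversized_chars(sample_chars):
    print(c["text"], "box height:", c["bottom"] - c["top"], "font size:", c["size"])
```

With a real document, the same helper could be called as `oversized_chars(page.chars)` to list the characters worth inspecting in the debug image.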


mnigogos Dec 22, 2023
Author

Cervical Neck Pain or Cervical Radiculopathy.pdf

Thank you for a response! I've attached the file and created the following code using your suggestions (showing it to you as I am a newbie coder):

import pdfplumber
from IPython.display import Image, display

crop_box = (0, 0, 612, 730)  # (x0, top, x1, bottom)
filePath = 'Cervical Neck Pain or Cervical Radiculopathy.pdf'
table_settings = {"vertical_strategy": "lines", "horizontal_strategy": "lines", "edge_min_length": 50}

with pdfplumber.open(filePath) as pdf:
    for i, page in enumerate(pdf.pages):
        page = page.crop(crop_box)
        im = page.to_image(resolution=200)

        tables = page.find_tables(table_settings=table_settings)
        # Pass the same settings to the debug overlay so it matches the extraction
        im.debug_tablefinder(table_settings)
        # Outline every character's bounding box on top of the overlay
        im.draw_rects(page.chars)

        image_path = f"output_image_page_{i}.jpg"
        im.save(image_path, format='JPEG')
        display(Image(filename=image_path))

The second page of my output is shown below. I notice that the highlighting is offset for all of the icons in the 3rd column, but most pronounced in the 4th and 6th rows, such that most of the highlighting falls outside the defined cell border. Is that the issue, and if so, how do I fix it? As a final note, the original parsing in my first post above showed the merged icons on the following line with a \n between them. Is that generated by pdfplumber, or is that part of the problem?
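
On the \n question: text extraction in pdfplumber joins separate lines of text with a newline, so a \n inside a cell generally means characters from more than one visual line were assigned to that cell. In the first column that is just wrapped text; in the last column it suggests icons from a neighboring row landed in the wrong cell. A minimal sketch for spotting such rows in already-extracted output follows. The rows are copied from the first post; the helper name `suspicious_rows` is hypothetical.

```python
rows = [
    ["Radiography cervical spine", "May Be Appropriate (Disagreement)", ""],
    ["MRI cervical spine without and with IV\ncontrast", "Usually Not Appropriate", "☢☢\nO"],
    ["CT cervical spine with IV contrast", "Usually Not Appropriate", "☢☢☢"],
]


def suspicious_rows(rows, col=-1):
    """Return indexes of rows whose cell in `col` is empty or spans several lines."""
    hits = []
    for i, row in enumerate(rows):
        cell = row[col] or ""  # extract_table can return None for empty cells
        if cell == "" or "\n" in cell:
            hits.append(i)
    return hits


print(suspicious_rows(rows))
```

Only the last column is checked here because wrapped procedure names in the first column legitimately contain newlines.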

jsvine Jan 7, 2024
Maintainer

Thank you for sharing the PDF, code, and output. There seem to be a few different things going on. Note: I'm using the first page, rather than the second, because it seems to be more difficult.

One is that there are some stray graphical elements encoded on the page, interfering with the table extraction. I noticed this by running:

page = pdf.pages[0]
im = page.to_image()
im.reset().debug_tablefinder()

... and getting this:

There are a few ways to fix the issue, but perhaps the simplest is to adjust the snap_tolerance table setting:

im.reset().debug_tablefinder({
  "snap_tolerance": 10,
})

... produces this, which you can see cleans things up:
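
As a rough illustration of why a larger snap_tolerance helps: pdfplumber's documentation describes it as snapping parallel lines that lie within that many points of each other to the same position (the documented default is 3). The sketch below imitates that idea on a plain list of coordinates by grouping nearby values and replacing each group with its mean. It is a simplified stand-in, not pdfplumber's code; `snap_positions` and the sample coordinates are invented for illustration.

```python
def snap_positions(positions, tolerance):
    """Group positions whose sorted neighbors differ by <= tolerance,
    then replace every member of a group with the group's mean."""
    if not positions:
        return []
    ordered = sorted(positions)
    clusters = [[ordered[0]]]
    for p in ordered[1:]:
        if p - clusters[-1][-1] <= tolerance:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    snapped = {}
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        for p in cluster:
            snapped[p] = mean
    # Return values in the caller's original order.
    return [snapped[p] for p in positions]


# Invented y-coordinates of horizontal edges; 100 and 104 stand for a table
# rule and a stray graphical element sitting 4 points away from it.
edges_y = [100.0, 104.0, 130.0, 131.0, 160.0]

print(snap_positions(edges_y, tolerance=3))
print(snap_positions(edges_y, tolerance=10))
```

With the smaller tolerance the stray edge stays separate and would create an extra thin row; with the larger one it merges into the neighboring rule.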

Now running this:

page.extract_table({
    "snap_tolerance": 10,
})

... gets us almost to the desired result (note: I'm using • as a stand-in for the newline character just for easier display):

Procedure	Appropriateness Category	Relative Radiation Level
Radiography cervical spine	Usually Appropriate	☢☢
MRI cervical spine without IV contrast	May Be Appropriate (Disagreement)	O
CT cervical spine without IV contrast	May Be Appropriate	☢☢☢
CT cervical spine with IV contrast	Usually Not Appropriate	☢☢☢
MRI cervical spine without and with IV•contrast	Usually Not Appropriate	O
CT cervical spine without and with IV•contrast	Usually Not Appropriate	☢☢☢
CT myelography cervical spine	Usually Not Appropriate	☢☢☢☢
CTA neck with IV contrast	Usually Not Appropriate	☢☢☢
Discography cervical spine	Usually Not Appropriate	☢☢
Facet injection/medial branch block cervical•spine	Usually Not Appropriate	☢☢
MRA neck with IV contrast	Usually Not Appropriate	O
MRA neck without IV contrast	Usually Not Appropriate	O
MRI cervical spine with IV contrast	Usually Not Appropriate	O
Bone scan whole body with SPECT or•SPECT/CT neck	Usually Not Appropriate	☢☢☢
Radiographic myelography cervical spine	Usually Not Appropriate

As you can see, there's one glaring issue: The bottom-right cell is empty, when it should be ☢☢☢. That's because, as you can see here (via im.reset().draw_rects(page.chars)), that set of characters is more than 50% outside the cell borders:
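
The "more than 50% outside" condition can be pictured as a midpoint test: a character counts as belonging to a cell only if the center of its bounding box falls inside the cell. The sketch below assumes that rule for illustration; pdfplumber's actual assignment logic may differ in detail. The helper name `center_in_bbox` and all coordinates are invented.

```python
def center_in_bbox(char, bbox):
    """True if the midpoint of the char's box lies inside bbox (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    h_mid = (char["x0"] + char["x1"]) / 2
    v_mid = (char["top"] + char["bottom"]) / 2
    return x0 <= h_mid < x1 and top <= v_mid < bottom


cell = (400, 700, 560, 720)  # invented cell in the last table row

# A box that sits inside the cell, and one whose reported box hangs below it.
inside = {"text": "☢", "x0": 450, "x1": 462, "top": 702, "bottom": 716}
spilling = {"text": "☢", "x0": 450, "x1": 462, "top": 712, "bottom": 736}

print(center_in_bbox(inside, cell))
print(center_in_bbox(spilling, cell))
```

The second character overlaps the cell visually, but its midpoint lies below the bottom border, so under this rule it is left out of the cell's text.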

Unfortunately, this stems from something largely out of pdfplumber's control, which is the character-bounding box identifications. And unfortunately, there's not a super elegant, robust fix within pdfplumber. But here's one approach that will work in this case — taking the strategy of cropping the characters to the table boundary so that now their (trimmed) bounding boxes will be fully within the table:

cropped = page.crop(page.find_table().bbox)
cropped.extract_table({
    "snap_tolerance": 10,
})

... producing:

Procedure	Appropriateness Category	Relative Radiation Level
Radiography cervical spine	Usually Appropriate	☢☢
MRI cervical spine without IV contrast	May Be Appropriate (Disagreement)	O
CT cervical spine without IV contrast	May Be Appropriate	☢☢☢
CT cervical spine with IV contrast	Usually Not Appropriate	☢☢☢
MRI cervical spine without and with IV•contrast	Usually Not Appropriate	O
CT cervical spine without and with IV•contrast	Usually Not Appropriate	☢☢☢
CT myelography cervical spine	Usually Not Appropriate	☢☢☢☢
CTA neck with IV contrast	Usually Not Appropriate	☢☢☢
Discography cervical spine	Usually Not Appropriate	☢☢
Facet injection/medial branch block cervical•spine	Usually Not Appropriate	☢☢
MRA neck with IV contrast	Usually Not Appropriate	O
MRA neck without IV contrast	Usually Not Appropriate	O
MRI cervical spine with IV contrast	Usually Not Appropriate	O
Bone scan whole body with SPECT or•SPECT/CT neck	Usually Not Appropriate	☢☢☢
Radiographic myelography cervical spine	Usually Not Appropriate	☢☢☢
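
Why cropping helps can be sketched with plain arithmetic: trimming a character's bounding box to the table boundary pulls the box's midpoint back inside the last row. The example below assumes characters are matched to cells by their midpoint, which is an illustrative simplification rather than a statement about pdfplumber's internals. `clip_char`, `v_mid`, and every coordinate are made up.

```python
def clip_char(char, bbox):
    """Return a copy of char with its box trimmed to bbox (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    clipped = dict(char)
    clipped["x0"] = max(char["x0"], x0)
    clipped["x1"] = min(char["x1"], x1)
    clipped["top"] = max(char["top"], top)
    clipped["bottom"] = min(char["bottom"], bottom)
    return clipped


def v_mid(char):
    """Vertical midpoint of a char's bounding box."""
    return (char["top"] + char["bottom"]) / 2


table_bbox = (50, 100, 560, 720)  # invented table boundary
last_cell = (400, 700, 560, 720)  # invented bottom-right cell
char = {"text": "☢", "x0": 450, "x1": 462, "top": 712, "bottom": 736}

before = v_mid(char)
after = v_mid(clip_char(char, table_bbox))

print("midpoint before clipping:", before)
print("midpoint after clipping:", after)
print("inside last cell after clipping:", last_cell[1] <= after < last_cell[3])
```

The trade-off is that the trimmed boxes no longer describe where the glyphs were reported to be, so this is best limited to the extraction step rather than used for layout analysis.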

mnigogos · 2024-02-25T14:01:44Z

Author

You guys rock! Thanks so much. I'll incorporate your suggestions.
