Errors with table identification in PDF (false positives) #1227

kcbz · 2024-11-12T07:16:17Z

I believe this is a bug: I have a text-only PDF, and on every page pdfplumber treats the entire page content as a table. For my purposes I need to identify tables and text correctly and separately (I know that extracted text also includes table data, but I have a workaround for that). I have never seen this reverse case before, where pdfplumber finds tables when there are none.

I have tried adjusting the table_settings, but that didn't fix the issue. I also tried debug_tablefinder(), and it seems to confirm that pdfplumber treats the contents of every page as a table.

I attached the PDF below:
emeryville-ca-TITLE_10_TIDELANDS.pdf

jsvine (Maintainer) · 2024-11-21T12:52:36Z

Hi @kcbz, this doesn't appear to be a bug, but rather a tricky aspect of this PDF, which is that it contains some not-visible rects:

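import pdfplumber

# Open the PDF and outline every rect object on the first page,
# including rects that are invisible in a normal PDF viewer.
pdf = pdfplumber.open("emeryville-ca-TITLE_10_TIDELANDS.pdf")
page = pdf.pages[0]
im = page.to_image()
im.draw_rects(page.rects)
# In a notebook, evaluating `im` displays the result; in a script, use im.save("page.png").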

In cases like these, I recommend examining the page.rects objects directly. In this case, the rects are invisible because they are white (i.e., 'non_stroking_color': (1, 1, 1)) on a white background.
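
One way to examine the rects directly is to count them by fill color. The following is only a rough sketch: the helper names summarize_fill_colors and print_fill_colors are hypothetical, and it assumes each rect is a dict with a 'non_stroking_color' entry that may be a tuple, a list, a bare number, or None.

```python
from collections import Counter


def summarize_fill_colors(rects):
    """Count rect objects by their fill ('non_stroking_color') value."""
    counts = Counter()
    for rect in rects:
        color = rect.get("non_stroking_color")
        # Lists are not hashable, so normalize list/tuple colors to tuples.
        key = tuple(color) if isinstance(color, (list, tuple)) else color
        counts[key] += 1
    return counts


def print_fill_colors(path, page_number=0):
    """Print each fill color on one page with the number of rects using it."""
    import pdfplumber  # imported here so the helper above has no dependencies

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        for color, n in summarize_fill_colors(page.rects).most_common():
            print(color, n)
```

Calling print_fill_colors("emeryville-ca-TITLE_10_TIDELANDS.pdf") should list the fill colors found on the first page. If most rects are white, that would be consistent with the explanation above.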

To ignore those rects, you can use the page.filter(...) method (there are examples in a few discussions and issues, e.g., #1219 (reply in thread)).
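
As a rough sketch of that approach (not necessarily the exact code from #1219), the following drops white-filled rects before looking for tables. The helper names is_white, keep_object, and find_tables_ignoring_white_rects are hypothetical. The sketch assumes white is expressed as grayscale (1,) or RGB (1, 1, 1); other color spaces, such as CMYK, would need their own check. It has not been verified against the attached PDF.

```python
def is_white(color):
    """Return True for white given as a bare 1, grayscale (1,), or RGB (1, 1, 1)."""
    if color is None:
        return False
    values = color if isinstance(color, (list, tuple)) else (color,)
    return len(values) in (1, 3) and all(v == 1 for v in values)


def keep_object(obj):
    """Filter predicate: drop white-filled rects, keep every other object."""
    return not (
        obj.get("object_type") == "rect"
        and is_white(obj.get("non_stroking_color"))
    )


def find_tables_ignoring_white_rects(path, page_number=0):
    """Return the bounding boxes of tables found after removing white rects."""
    import pdfplumber  # imported here so the helpers above have no dependencies

    with pdfplumber.open(path) as pdf:
        # page.filter(...) returns a version of the page containing only the
        # objects for which the predicate returns True.
        page = pdf.pages[page_number].filter(keep_object)
        return [table.bbox for table in page.find_tables()]
```

If the white rects are the only cause of the false positives, find_tables_ignoring_white_rects("emeryville-ca-TITLE_10_TIDELANDS.pdf") would be expected to return an empty list for a text-only page.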
