Wrong coordinates of words when using function extract_words() #799

datdao1998 · 2023-01-30T16:00:31Z

Description

When using extract_words(), the coordinates of some extracted words are wrong. In my case, word['x0'] == word['x1'], although word['text'] is still correct.

Code to reproduce the problem

import pdfplumber

pdf_path = 'test.pdf'

with pdfplumber.open(pdf_path) as pdf:
    pages = pdf.pages
    for page in pages:
        words = page.extract_words()
        for word in words:
            print(word['x0'], word['x1'], word['text'])

Screenshots

Output

Visualize text box

Environment

pdfplumber version: 0.6.0
Python version: 3.9.12
OS: Linux

jsvine · 2023-02-01T00:21:40Z

Hi @datdao1998, could you provide the PDF that you're using? Without it, it will be very difficult to diagnose your issue.

jsvine · 2023-02-22T13:49:49Z

Hi @datdao1998, just checking back on this. Are you able to provide the PDF? You might also try repairing the PDF and seeing if that fixes the problem you've encountered.

sandzone · 2023-02-23T19:05:45Z

This is definitely happening. It's not just the extract_words() function.

.chars itself has wrong coordinates for the characters.

Some of the coordinates for these words are even outside the page's BoundingBox.

I can email you the PDF.
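For readers who want to check their own files for the symptoms described above, here is a minimal sketch. It assumes the usual pdfplumber layout, where each character or word is a dict with 'x0', 'x1', 'top', and 'bottom' keys and page.bbox is (x0, top, x1, bottom). The names find_suspect_boxes and report are hypothetical helpers, not part of pdfplumber. Some PDFs may contain legitimately zero-width glyphs, so treat a hit as a hint rather than proof of a malformed file.

```python
def find_suspect_boxes(objects, page_bbox, tol=0.0):
    """Return objects whose boxes are zero-width or fall outside the page.

    objects:   dicts with 'x0', 'x1', 'top', 'bottom' keys, such as
               page.chars or the result of page.extract_words().
    page_bbox: (x0, top, x1, bottom), such as page.bbox.
    tol:       slack, in PDF points, allowed before a box is flagged.
    """
    px0, ptop, px1, pbottom = page_bbox
    suspects = []
    for obj in objects:
        # Symptom from the original report: x0 equals x1.
        zero_width = obj["x1"] - obj["x0"] <= tol
        # Symptom from the follow-up: box lies outside the page.
        outside = (
            obj["x0"] < px0 - tol
            or obj["x1"] > px1 + tol
            or obj["top"] < ptop - tol
            or obj["bottom"] > pbottom + tol
        )
        if zero_width or outside:
            suspects.append(obj)
    return suspects


def report(pdf_path):
    """Print every suspect character in the PDF at pdf_path."""
    import pdfplumber  # imported here so the helper above has no dependencies

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for obj in find_suspect_boxes(page.chars, page.bbox):
                print(page.page_number, obj["text"], obj["x0"], obj["x1"])
```

Calling report('test.pdf') lists the affected characters page by page; passing page.extract_words() instead of page.chars performs the same check at word level.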

jsvine · 2023-02-23T20:25:05Z

Thanks, @sandzone. Please do email me the PDF; my email address is in my profile. And have you tried repairing the PDF?

sandzone · 2023-02-24T04:39:51Z

Thanks. You are correct. Repairing the PDF resolved the issue. However, Ghostscript couldn't repair it; I had to use the Poppler command-line utilities for that.

Is there a way to integrate pdf repair as a part of pdfplumber's extraction features?
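Until something like that exists, one possible workaround is to repair the file before handing it to pdfplumber. The sketch below is only an illustration: the thread does not say which Poppler utility was used, so the choice of pdftocairo with its -pdf option is an assumption, and build_repair_command and repair_pdf are hypothetical helper names.

```python
import shutil
import subprocess


def build_repair_command(src, dst, tool="pdftocairo"):
    """Build the command line that rewrites src into a fresh PDF at dst."""
    # Assumption: pdftocairo -pdf re-renders the input into a new PDF file.
    return [tool, "-pdf", str(src), str(dst)]


def repair_pdf(src, dst):
    """Rewrite src with Poppler's pdftocairo and return the output path."""
    if shutil.which("pdftocairo") is None:
        raise RuntimeError("pdftocairo (Poppler) was not found on PATH")
    # check=True raises CalledProcessError if the tool reports a failure.
    subprocess.run(build_repair_command(src, dst), check=True)
    return dst
```

The repaired file can then be opened as usual, for example pdfplumber.open(repair_pdf('test.pdf', 'test-repaired.pdf')). Rewriting a PDF this way may change fonts or layout details, so it is worth comparing the extracted text against the original.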

jsvine · 2023-02-24T19:46:22Z

Thanks for confirming, @sandzone. And that's an interesting idea. I've opened a separate issue for that here: #824

datdao1998 added the bug label Jan 30, 2023

jsvine added the awaiting-code-or-pdf label Feb 1, 2023

jsvine mentioned this issue Feb 24, 2023

Add repair method? #824

Closed

jsvine closed this as completed Feb 24, 2023
