Wrong coordinates of words when using function extract_words() #799

Closed
datdao1998 opened this issue Jan 30, 2023 · 6 comments
Labels
awaiting-code-or-pdf, bug

Comments

@datdao1998

Description

When using extract_words(), the coordinates of some extracted words are wrong. In my case, word['x0'] == word['x1'] for those words, although word['text'] is still correct.

Code to reproduce the problem

import pdfplumber

pdf_path = 'test.pdf'

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        for word in page.extract_words():
            # Print the horizontal extent and text of each extracted word
            print(word['x0'], word['x1'], word['text'])
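
To narrow the report down to only the affected words, a small filter like the following may help. It is a minimal sketch, not part of pdfplumber: `find_zero_width_words` and the `tolerance` parameter are hypothetical names, and the sample dictionaries are made-up stand-ins for the entries returned by `page.extract_words()`.

```python
def find_zero_width_words(words, tolerance=0.0):
    """Return the words whose horizontal extent (x1 - x0) is at most `tolerance`."""
    return [w for w in words if abs(w["x1"] - w["x0"]) <= tolerance]


# Hypothetical sample data shaped like extract_words() output (not from a real PDF).
sample_words = [
    {"text": "Invoice", "x0": 72.0, "x1": 110.5},
    {"text": "Total", "x0": 300.0, "x1": 300.0},  # degenerate: x0 == x1
]

suspicious = find_zero_width_words(sample_words)
print([w["text"] for w in suspicious])
```

With a real file, the same helper could be called as `find_zero_width_words(page.extract_words())` inside the loop above.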

Screenshots

Output

Screen Shot 2023-01-30 at 22 59 21

Visualize text box

Screen Shot 2023-01-30 at 22 58 33

Environment

  • pdfplumber version: 0.6.0
  • Python version: 3.9.12
  • OS: Linux
@datdao1998 datdao1998 added the bug label Jan 30, 2023
@jsvine jsvine added the awaiting-code-or-pdf label Feb 1, 2023
@jsvine
Owner

jsvine commented Feb 1, 2023

Hi @datdao1998, could you provide the PDF that you're using? Without it, it will be very difficult to diagnose your issue.

@jsvine
Owner

jsvine commented Feb 22, 2023

Hi @datdao1998, just checking back on this. Are you able to provide the PDF? You might also try repairing the PDF and seeing if that fixes the problem you've encountered.

@sandzone

This is definitely happening. It's not just the extract_words() function.

.chars itself has wrong coordinates for the characters.

Some of the coordinates for these words are even outside the page's bounding box.

image

I can email you the PDF.
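
The out-of-bounds observation above can be checked programmatically. The following is a rough sketch, assuming the character dictionaries carry `x0`, `x1`, `top`, and `bottom` keys and that the bounding box is an `(x0, top, x1, bottom)` tuple, as with pdfplumber's `page.chars` and `page.bbox`. The helper name `chars_outside_bbox` and the sample values are hypothetical.

```python
def chars_outside_bbox(chars, bbox):
    """Return the chars that extend beyond bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return [
        c for c in chars
        if c["x0"] < x0 or c["x1"] > x1 or c["top"] < top or c["bottom"] > bottom
    ]


# Hypothetical sample data: a US Letter page and two made-up characters.
sample_bbox = (0, 0, 612, 792)
sample_chars = [
    {"text": "a", "x0": 72.0, "x1": 78.0, "top": 100.0, "bottom": 112.0},
    {"text": "b", "x0": 650.0, "x1": 656.0, "top": 100.0, "bottom": 112.0},  # past the right edge
]

outliers = chars_outside_bbox(sample_chars, sample_bbox)
print([c["text"] for c in outliers])
```

On a real page this would be `chars_outside_bbox(page.chars, page.bbox)`; a non-empty result suggests the file may be worth repairing before extraction.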

@jsvine
Owner

jsvine commented Feb 23, 2023

Thanks, @sandzone. Please do email me the PDF; my email address is in my profile. And have you tried repairing the PDF?

@sandzone

Thanks. You are correct. Repairing the PDF resolved the issue. However, Ghostscript couldn't repair it; I had to use the Poppler command-line utilities for that.

Is there a way to integrate PDF repair as part of pdfplumber's extraction features?
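
For readers who want to script the repair step in the meantime, here is one possible sketch. The comment above does not say which Poppler utility was used; this example assumes `pdftocairo -pdf`, which rewrites a PDF through Poppler, and the function names are hypothetical. Whether it fixes a given file will depend on how that file is damaged.

```python
import shutil
import subprocess


def build_repair_command(src, dst):
    """Build the pdftocairo invocation that rewrites `src` as a new PDF at `dst`."""
    return ["pdftocairo", "-pdf", str(src), str(dst)]


def repair_pdf(src, dst):
    """Rewrite `src` to `dst` via Poppler; raises if pdftocairo is missing or fails."""
    if shutil.which("pdftocairo") is None:
        raise RuntimeError("pdftocairo not found on PATH; install the Poppler utilities")
    subprocess.run(build_repair_command(src, dst), check=True)
    return dst


print(build_repair_command("test.pdf", "test-repaired.pdf"))
```

After `repair_pdf("test.pdf", "test-repaired.pdf")` succeeds, the repaired file can be opened with `pdfplumber.open("test-repaired.pdf")` as usual.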

@jsvine
Owner

jsvine commented Feb 24, 2023

Thanks for confirming, @sandzone. And that's an interesting idea. I've opened a separate issue for that here: #824

@jsvine jsvine closed this as completed Feb 24, 2023