Scan with OCR: words not split #26

Open
dirksierd opened this issue Apr 13, 2020 · 1 comment
Labels: bug, pdfminer (Issue in pdfminer)

Comments

@dirksierd

With this PDF file (an OCR scan), the words are not split. I tried modifying word_margin in LAParams, to no avail. Exporting the highlights with PDF Expert (my macOS PDF reader) works fine, though: here's the expected output.
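For anyone trying to reproduce this, here is a rough sketch of sweeping word_margin and checking how much whitespace comes back. It assumes pdfminer.six is installed and uses its `extract_text` and `LAParams`; the helper names `whitespace_ratio` and `sweep_word_margin` and the path `scan.pdf` are made up for illustration and are not part of any library.

```python
from typing import Dict, Iterable


def whitespace_ratio(text: str) -> float:
    """Fraction of characters that are whitespace.

    A value near 0 suggests the extracted words were not split.
    """
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def sweep_word_margin(
    pdf_path: str,
    margins: Iterable[float] = (0.05, 0.1, 0.2, 0.5),
) -> Dict[float, float]:
    """Extract text once per word_margin value and report the whitespace ratio.

    Hypothetical helper: pdfminer.six is imported lazily so that
    whitespace_ratio() stays usable without the library installed.
    """
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    results: Dict[float, float] = {}
    for margin in margins:
        text = extract_text(pdf_path, laparams=LAParams(word_margin=margin))
        results[margin] = whitespace_ratio(text)
    return results


if __name__ == "__main__":
    # "scan.pdf" is a placeholder path, not the file attached to this issue.
    for margin, ratio in sweep_word_margin("scan.pdf").items():
        print(f"word_margin={margin}: whitespace ratio {ratio:.3f}")
```

If the ratio stays close to zero for every value, that would be consistent with what is described above: tuning word_margin alone does not bring the spaces back for this file.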

Any thoughts?

Best regards

0xabu added the bug label Jun 17, 2020
@0xabu
Owner

0xabu commented Mar 4, 2021

This is an issue in the pdfminer library. I confirmed that:

  • pdfminer's pdf2txt.py tool fails in a similar way: no spaces, and far too many characters extracted
  • Both my PDF reader and Poppler's pdftotext utility extract the text correctly

If you do report an issue (or find an existing one) on the pdfminer project, please link it here.
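When filing upstream, it may help to quantify the difference between the two extractions. The sketch below is one way to do that, assuming Poppler's `pdftotext` is on the PATH (passing `-` as the output file sends the text to stdout). The function names are hypothetical helpers, and the inputs in the `__main__` block are placeholders.

```python
import shutil
import subprocess
from typing import List, Optional


def pdftotext_command(pdf_path: str) -> List[str]:
    """Build the pdftotext invocation; '-' writes the text to stdout."""
    return ["pdftotext", pdf_path, "-"]


def extract_with_pdftotext(pdf_path: str) -> Optional[str]:
    """Return pdftotext's output, or None if the tool is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    completed = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


def non_space_len(text: str) -> int:
    """Count characters that are not whitespace."""
    return sum(not ch.isspace() for ch in text)


def char_count_ratio(candidate: str, reference: str) -> float:
    """How many times more non-space characters candidate has than reference."""
    ref_len = non_space_len(reference)
    if ref_len == 0:
        raise ValueError("reference text has no non-space characters")
    return non_space_len(candidate) / ref_len


if __name__ == "__main__":
    # Placeholder inputs: "scan.pdf" and the pdfminer output file are assumptions.
    reference = extract_with_pdftotext("scan.pdf")
    if reference is None:
        print("pdftotext not found on PATH")
    else:
        with open("pdfminer_output.txt", encoding="utf-8") as fh:
            candidate = fh.read()
        print(f"pdfminer / pdftotext char ratio: {char_count_ratio(candidate, reference):.2f}")
```

A ratio well above 1 would match the "far too many characters" symptom, and the number is easy to include in a bug report.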

0xabu added the pdfminer (Issue in pdfminer) label Mar 4, 2021