Scan with OCR: words not split #26

dirksierd · 2020-04-13T17:57:16Z

With this PDF file, the words are not split in the extracted text. It's an OCR scan. I tried modifying the word_margin in LAParams, to no avail. When exporting the highlights using PDF Expert (my macOS PDF reader) it works fine though: here's the expected output.
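For context, `word_margin` is the `LAParams` setting that controls how large a horizontal gap between neighbouring characters must be before a space is inserted. The sketch below is a simplified model of that kind of gap heuristic, not pdfminer's actual code. The `Glyph` class, the `join_glyphs` function and all coordinates are hypothetical. It shows one possible way tuning the margin could have no effect: if the character boxes in an OCR text layer overlap, there is never a positive gap to compare against. The thread does not establish that this is the cause for this particular file.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical stand-in for a character with horizontal bounds."""
    text: str
    x0: float
    x1: float


def join_glyphs(glyphs, word_margin=0.1):
    """Join glyphs left to right, inserting a space when the gap to the
    previous glyph exceeds word_margin times the current glyph's width.
    This is a simplified model, not pdfminer's implementation."""
    out = []
    prev_x1 = None
    for g in glyphs:
        width = g.x1 - g.x0
        if prev_x1 is not None and g.x0 - prev_x1 > word_margin * width:
            out.append(" ")
        out.append(g.text)
        prev_x1 = g.x1
    return "".join(out)


# Made-up boxes: a visible gap between "b" and "c".
normal = [Glyph("a", 0, 5), Glyph("b", 5, 10), Glyph("c", 13, 18), Glyph("d", 18, 23)]

# Made-up boxes that overlap, as a badly positioned OCR layer might produce.
overlapping = [Glyph("a", 0, 6), Glyph("b", 5, 11), Glyph("c", 10, 16), Glyph("d", 15, 21)]

print(join_glyphs(normal))
print(join_glyphs(overlapping, word_margin=0.0))
```

In this model the second call never inserts a space, whatever non-negative margin is chosen, because every gap is negative.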

Any thoughts?

Best regards

0xabu · 2021-03-04T21:50:46Z

This is an issue in the pdfminer library. I confirmed that:

- pdfminer's pdf2txt.py tool fails in a similar way: no spaces and far too many chars extracted
- Both my PDF reader and Poppler's pdftotext utility extract the text correctly

If you do report an issue (or find an existing one) on the pdfminer project, please link it here.
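When reporting this upstream, it may help to quantify the two symptoms above (missing spaces, too many characters) by comparing one tool's output against another's. The following is a rough sketch, assuming both extractions have already been saved as strings. The function names and both thresholds are hypothetical and would need tuning on real files.

```python
def extraction_stats(text):
    """Count non-whitespace and whitespace characters in extracted text."""
    chars = sum(1 for ch in text if not ch.isspace())
    whitespace = sum(1 for ch in text if ch.isspace())
    return {
        "chars": chars,
        "whitespace": whitespace,
        "ratio": whitespace / max(chars, 1),
    }


def looks_unsplit(candidate, reference, min_ratio=0.5, max_growth=1.5):
    """Flag a candidate extraction that has far less whitespace or far more
    characters than a reference extraction of the same page.
    Both thresholds are arbitrary illustration values."""
    c = extraction_stats(candidate)
    r = extraction_stats(reference)
    too_few_spaces = c["ratio"] < min_ratio * r["ratio"]
    too_many_chars = c["chars"] > max_growth * r["chars"]
    return too_few_spaces or too_many_chars


# Made-up sample strings standing in for the two tools' outputs.
reference = "the quick brown fox"
candidate = "thequickbrownfox"

print(extraction_stats(candidate))
print(looks_unsplit(candidate, reference))
```

Attaching numbers like these for the same page from both tools gives the upstream maintainers a concrete comparison.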

0xabu added the bug label Jun 17, 2020

0xabu added the pdfminer (Issue in pdfminer) label Mar 4, 2021
