Why might ocrmypdf give worse results than tesseract by itself? #1270

nikitar · 2024-03-04T06:10:25Z

I'm using this PDF as a test; it is slightly blurry but still quite readable. Tesseract (through pytesseract) handles it fine on default settings.

    import cv2
    import pytesseract

    image = cv2.imread(INPUT_PATH_PNG)
    data = pytesseract.image_to_string(image, lang='eng')
    print(data)

But when I use ocrmypdf 16.0.3, with or without `clean`, a lot of the text is mangled. For example, at the beginning of the second paragraph, "In facsimile a photocell" becomes "In facsimi le photocell". The indefinite article 'a' seems to be skipped entirely (screenshot). Am I using it incorrectly?

    import ocrmypdf
    ocrmypdf.ocr(INPUT_PATH, OUTPUT_PATH, output_type='pdf', optimize=1, clean=True)
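
To pin down differences like the "facsimile a" example, a word-level diff of the two OCR outputs can help. This is only a rough sketch using the standard library's `difflib`; `ocr_word_diff` is a hypothetical helper name, and the two strings are the example from this post rather than full page output.

```python
import difflib


def ocr_word_diff(expected: str, actual: str):
    """Return the word-level differences between two OCR outputs.

    Each item is (operation, expected_words, actual_words), where operation
    is one of difflib's opcode tags: 'replace', 'delete' or 'insert'.
    """
    a = expected.split()
    b = actual.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the spans that differ
    ]


if __name__ == "__main__":
    for change in ocr_word_diff("In facsimile a photocell",
                                "In facsimi le photocell"):
        print(change)
```

Running it over the full text from both tools would show whether the errors cluster around short words and word boundaries, as they seem to here.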

I can get slightly better results with some OpenCV pre-processing, along the lines described here, but only slightly.

I had a look at _exec/tesseract.py, but I don't know either library well enough to see which arguments might be making the difference.

nikitar · 2024-03-06T01:50:33Z

Realised the issue. I installed the OCRmyPDF-EasyOCR plugin during my first tests months ago, and ocrmypdf was using it when I was testing last week.

So the conversion quality was an EasyOCR problem. (I get roughly the same issues with their online demo.)

Would it be reasonable to print the plugin list in the `ocrmypdf --version` output? If I do end up using the project, I should be able to make a PR myself.
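
In the meantime, for anyone hitting the same thing, here is a minimal sketch for checking which plugins are installed. It assumes that pip-installed plugins such as OCRmyPDF-EasyOCR register themselves under a packaging entry-point group named `ocrmypdf`; that is my understanding rather than something confirmed in this thread. `installed_plugins` is a hypothetical helper, not part of ocrmypdf.

```python
from __future__ import annotations

from importlib.metadata import entry_points


def installed_plugins(group: str = "ocrmypdf") -> list[tuple[str, str]]:
    """List (name, module) pairs for entry points registered under `group`.

    The default group name "ocrmypdf" is an assumption about how
    OCRmyPDF discovers installed plugins.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points can be filtered by group
        selected = eps.select(group=group)
    else:
        # Older Pythons return a plain dict keyed by group name
        selected = eps.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)


if __name__ == "__main__":
    plugins = installed_plugins()
    if not plugins:
        print("No entry-point plugins found")
    for name, module in plugins:
        print(f"{name}: {module}")
```

If this prints anything, an installed plugin may be replacing or altering the default Tesseract pipeline, which is worth ruling out before comparing OCR quality.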
