Specify DPI for Tesseract through OCRmyPDF #1298

TDavLinguist · 2024-04-25T06:01:17Z

(NOTE: Using Python 3.8, so I'm stuck on ocrmypdf v14.4.0 at the moment)

First off, thanks to the developers for this amazing tool! I have used it a lot recently while trying to build a FOSS OCR workflow that can mimic most of ABBYY FineReader's behavior.

Before integrating OCRmyPDF into my workflow, I was doing things the old-fashioned way: extracting the images from the PDF and running Tesseract on each one. What I noticed when using Tesseract directly is that it often guessed the DPI incorrectly (usually defaulting to 70 DPI). When it did that, the output would often contain line breaks between words that sit on the same line. So, for example, if I wanted to search for "See dog run" but the OCR text is effectively "See\ndog\nrun", the search doesn't work. The solution was to calculate the DPI manually (usually around 200-300 for the scans I saw) and tell Tesseract to use it with --dpi 300.
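To make the manual step concrete, here is a rough sketch of how I'd estimate the DPI and build the Tesseract call. The helper names (estimate_dpi, tesseract_command, run_tesseract), the file names, and the 8.5-inch page width are just assumptions for illustration; the only real Tesseract pieces are the --dpi option and the pdf output config.

```python
import subprocess
from typing import List


def estimate_dpi(pixel_width: int, page_width_inches: float) -> int:
    """Estimate scan resolution from the image width in pixels and the
    physical page width in inches (the page width is an assumption you
    supply; it is not read from the file)."""
    if pixel_width <= 0 or page_width_inches <= 0:
        raise ValueError("dimensions must be positive")
    return round(pixel_width / page_width_inches)


def tesseract_command(image_path: str, output_base: str, dpi: int) -> List[str]:
    """Build a Tesseract command line that overrides the guessed DPI."""
    # "--dpi" tells Tesseract the input resolution; the trailing "pdf"
    # config asks for a searchable PDF as output.
    return ["tesseract", image_path, output_base, "--dpi", str(dpi), "pdf"]


def run_tesseract(image_path: str, output_base: str, dpi: int) -> None:
    """Run Tesseract on one extracted page image (requires tesseract on PATH)."""
    subprocess.run(tesseract_command(image_path, output_base, dpi), check=True)


if __name__ == "__main__":
    # Hypothetical example: a 2550-pixel-wide scan of an 8.5-inch-wide page.
    dpi = estimate_dpi(2550, 8.5)
    print(" ".join(tesseract_command("page-001.png", "page-001", dpi)))
```

For that hypothetical page, the estimate works out to 300 DPI, which matches what I was passing by hand.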

My current issue with some books OCR'd with OCRmyPDF is that nothing in the manual or in --help suggests I can pass the --dpi 300 option through to Tesseract. There are upscale and downscale options, yes, but that's not quite what I want. Even when I tried upscaling to 300, the OCR took up most of my CPU (especially with 12 jobs going at once!).

Can anyone recommend a workaround or solution that doesn't involve me going back to the imagestack?

Thanks!
