Specify DPI for Tesseract through OCRmyPDF #1298
Unanswered
TDavLinguist
asked this question in
Q&A
Replies: 0 comments
(NOTE: Using Python 3.8, so I'm stuck on ocrmypdf v14.4.0 at the moment)
First off, thanks developers for this amazing tool! I have utilized it a lot recently, as I'm trying to make a FOSS OCR workflow that can mimic most behaviors of ABBYY FineReader.
Before integrating OCRmyPDF into my workflow, I was doing things the old-fashioned way: extracting the images from the PDF and running tesseract-ocr on each one. What I noticed when using Tesseract directly is that it often guessed the DPI incorrectly (usually defaulting to 70 DPI). When it did that, the page would often have line breaks between words on the same line. So, for example, if I wanted to search for "See dog run", but the OCR is effectively "See\ndog\nrun", the search doesn't work. The solution was to manually calculate the DPI (usually around 200-300 for the scans I saw) and tell Tesseract to use it with --dpi 300.
My issue currently with some books OCR'd with OCRmyPDF is that there is nothing in the manuals or in --help that suggests I can pass the --dpi 300 option to Tesseract. There are upscale and downscale options, yes, but that's not quite what I want. Even when I tried upscaling to 300, OCR took up most of my CPU (especially with 12 jobs going at once!).
Can anyone recommend a workaround or solution that doesn't involve me going back to the image stack?
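For reference, the manual image-stack step described above (work out the DPI, then hand it to Tesseract) can be sketched roughly as follows. This is only an illustration under some assumptions: the scanned image is assumed to fill the whole PDF page, the page width is taken in PDF points (72 per inch), and the helper names, file names, and example dimensions are hypothetical. It only builds the Tesseract command line; it does not touch OCRmyPDF.

```python
import shlex

POINTS_PER_INCH = 72  # PDF page sizes are expressed in points, 72 per inch


def estimate_dpi(pixel_width: int, page_width_pt: float) -> int:
    """Estimate scan resolution from the image width in pixels and the
    PDF page width in points. Assumes the image covers the full page."""
    page_width_in = page_width_pt / POINTS_PER_INCH
    return round(pixel_width / page_width_in)


def tesseract_command(image_path: str, output_base: str, dpi: int) -> list:
    """Build a Tesseract CLI call that states the DPI explicitly
    instead of letting Tesseract guess it."""
    return ["tesseract", image_path, output_base, "--dpi", str(dpi)]


if __name__ == "__main__":
    # Hypothetical example: a 2550-pixel-wide scan on a US Letter page (612 pt wide).
    dpi = estimate_dpi(2550, 612.0)
    cmd = tesseract_command("page-001.png", "page-001", dpi)
    print(shlex.join(cmd))
    # To actually run OCR: subprocess.run(cmd, check=True)
```

The arithmetic is just pixels divided by inches, so a 1700-pixel-wide scan of the same page would come out at 200 DPI, which matches the 200-300 range mentioned above.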
Thanks!