Use OCR on a PDF and show just the OCR output text #1291

b01000100 · 2024-04-09T19:12:20Z

I'd like to apologize ahead of time - I'm not sure how to ask what I'm after.

I have some PDFs saved that I'd like to convert to just straight text. I am running a docker container on my unraid server that has ocrmypdf-auto in it. It waits for a PDF to land in an input directory. Once there is a PDF, it picks it up, processes it, saves it to an output directory, and then archives the original. It works great. I've passed in a few different types of PDFs and have gotten output PDFs that I can then copy/paste from. The problem, for lack of a better term, is that it still shows the original text in the output PDF. As I understand it, the OCR text is in an invisible layer on the page.

I currently have a config file that has --force-ocr in it, but I tried it without a config file first. I didn't think --force-ocr would do anything in my context, but I tried it anyway. I was wondering if there was a way with ocrmypdf to run the documents through and spit out a PDF containing just the OCR text.

So it would go from this:

to this:

I don't care about the look of the original; I'm more concerned with having easily read text on the page. Some of these are old books I am digitizing and some are just old documents of mine. Some of the old books have pretty rough output after I scan them in or snap a picture of them. If I'm going to read them on a tablet, I'd rather have crystal clear text over the original look of the item.

I know I can just highlight the text and paste it into another document, but I'm after a more automated way of handling that. Is this possible with OCRmyPDF? If not, could anyone point me in the right direction?

Answered by jbarlow83

jbarlow83 · 2024-04-09T19:34:00Z

Maintainer

ocrmypdf can't quite do this on its own, since it renders an invisible font.

There are some commercial OCR engines that can attempt to reconstruct a document when the font is recognized and give you an editable document as output. That's beyond what the available open source tech lets us do - we don't have an open source OCR engine that distinguishes fonts or does precision text layout. Since you're not as concerned about the exact layout, though, you have more options.

You can use pdftotext (maybe with -layout) to extract the text from the finished PDF.
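A minimal sketch of the pdftotext route, assuming Poppler's `pdftotext` is installed and on the PATH. The function names and file paths are hypothetical, chosen only for illustration:

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for Poppler's pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_text(pdf_path, keep_layout=True):
    """Run pdftotext on an already-OCRed PDF and return the .txt path."""
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
    return txt_path
```

Calling `extract_text("output/book.pdf")` would write `output/book.txt` next to the PDF, so it could run as a step after the container drops a file in the output directory.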

You can also use ocrmypdf --sidecar to generate text files containing the OCR output. Note that in a document with mixed vector/raster PDF content, only the OCR gets picked up this way, while pdftotext gets all text regardless of origin.
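A rough sketch of the `--sidecar` route, assuming the `ocrmypdf` command is available where the script runs. The helper names and paths are hypothetical, and `--force-ocr` is included only because the question mentions using it:

```python
import subprocess


def ocrmypdf_sidecar_command(input_pdf, output_pdf, sidecar_txt, force_ocr=False):
    """Build the argument list for ocrmypdf with a sidecar text file (hypothetical helper)."""
    cmd = ["ocrmypdf", "--sidecar", str(sidecar_txt)]
    if force_ocr:
        # Optional: the flag the original poster already had in their config.
        cmd.append("--force-ocr")
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


def ocr_with_sidecar(input_pdf, output_pdf, sidecar_txt, force_ocr=False):
    """Run ocrmypdf; the OCR text lands in sidecar_txt alongside the output PDF."""
    cmd = ocrmypdf_sidecar_command(input_pdf, output_pdf, sidecar_txt, force_ocr)
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(cmd, check=True)
    return sidecar_txt
```

For example, `ocr_with_sidecar("input/scan.pdf", "output/scan.pdf", "output/scan.txt")` would produce both the searchable PDF and a plain text file holding only the OCR output.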

Another option might be to use something like Calibre to convert the PDF or text file from the above programs to an epub file which would allow reflowing/repaginating on a tablet.
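One way the Calibre step might be scripted, assuming Calibre's `ebook-convert` command-line tool is installed. The helper names and file names are hypothetical, and no conversion options are set:

```python
import subprocess


def ebook_convert_command(source_path, epub_path):
    """Build the argument list for Calibre's ebook-convert (hypothetical helper)."""
    # ebook-convert infers the input and output formats from the file extensions.
    return ["ebook-convert", str(source_path), str(epub_path)]


def convert_to_epub(source_path, epub_path):
    """Convert a text file (or PDF) into an EPUB that can reflow on a tablet."""
    # check=True raises CalledProcessError if ebook-convert exits with an error.
    subprocess.run(ebook_convert_command(source_path, epub_path), check=True)
    return epub_path
```

Feeding it the text file from either option above, for example `convert_to_epub("output/scan.txt", "output/scan.epub")`, would give a reflowable file instead of a fixed-layout PDF.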

1 reply

b01000100 Apr 9, 2024
Author

Awesome, thanks for the reply!
