-
I'd like to apologize ahead of time - I'm not sure how to ask what I'm after. I have some PDFs saved that I'd like to convert to just straight text.

I am running a docker container on my unraid server that has ocrmypdf-auto in it. It waits for a PDF to land in an input directory. Once there is a PDF, it picks it up, processes it, saves it to an output directory, and then archives the original. It works great. I've passed in a few different types of PDFs and have gotten output PDFs that I can then copy/paste from.

The problem, for lack of a better term, is that it still shows the original text in the output PDF. As I understand it, the OCR text is in an invisible layer on the page. I currently have a config file that has --force-ocr in it, but I tried it without a config file first. I didn't think --force-ocr would do anything in my context, but I tried it anyway.

I was wondering if there was a way with ocrmypdf to run the documents through and spit out a PDF containing just the OCR text. So it would go from this: to this:

I don't care about the look of the original; I'm more concerned with having easily read text on the page. Some of these are old books I am digitizing and some are just old documents of mine. Some of the old books have pretty rough output after I scan them in or snap a picture of them. If I'm going to read them on a tablet, I'd rather have crystal clear text over the original look of the item.

I know I can just highlight the text and paste it into another document, but I'm after a more automated way of handling that. Is this possible with OCRmyPDF? If not, could anyone point me in the right direction?
-
ocrmypdf can't quite do this on its own, since it renders the OCR text in an invisible font.

There are some commercial OCR engines that can attempt to reconstruct a document when the font is recognized and give you an editable document as output. That's beyond what the available open source tech lets us do: we don't have an open source OCR engine that distinguishes fonts or does precision text layout. However, since you're not as concerned about the exact layout, you have more options.

You can use `pdftotext` (maybe with `-layout`) to extract the text from the finished PDF.

You can also use `ocrmypdf --sidecar` to generate text files containing the OCR output. Note that in a document with mixed vector/raster…

Another option might be to use something like Calibre to convert the PDF or text file from the above programs to an epub file, which would allow reflowing/repaginating on a tablet.
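To automate the options mentioned in this reply, here is a minimal Python sketch that builds the relevant command lines and runs `pdftotext` over a folder of already-OCRed PDFs. The helper names, the directory arguments, and the `.txt` naming scheme are all assumptions made for illustration. It also assumes `pdftotext` (from Poppler), `ocrmypdf`, and Calibre's `ebook-convert` are installed and on the `PATH`.

```python
import subprocess
from pathlib import Path


def pdftotext_cmd(pdf: Path, txt: Path, keep_layout: bool = True) -> list[str]:
    """Build a pdftotext command; -layout asks it to keep the physical layout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    return cmd + [str(pdf), str(txt)]


def sidecar_cmd(pdf: Path, out_pdf: Path, txt: Path) -> list[str]:
    """Build an ocrmypdf command that also writes the OCR text to a sidecar file."""
    return ["ocrmypdf", "--sidecar", str(txt), str(pdf), str(out_pdf)]


def epub_cmd(source: Path, epub: Path) -> list[str]:
    """Build a Calibre ebook-convert command (input file, then output file)."""
    return ["ebook-convert", str(source), str(epub)]


def extract_all(ocr_dir: Path, text_dir: Path) -> None:
    """Hypothetical batch helper: run pdftotext on every PDF found in ocr_dir."""
    text_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(ocr_dir.glob("*.pdf")):
        txt = text_dir / (pdf.stem + ".txt")
        # check=True raises CalledProcessError if pdftotext exits non-zero.
        subprocess.run(pdftotext_cmd(pdf, txt), check=True)
```

Calling `extract_all(Path("output"), Path("text"))` would, under these assumptions, leave one `.txt` file per PDF in `text/`, and each of those could then be passed to `epub_cmd` to produce a reflowable file for a tablet. The directory names here are placeholders, not paths used by ocrmypdf-auto.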