Implement optional downsampling as part of preprocessing #443

jbarlow83 · 2019-10-31T22:59:30Z

(As requested by @tcurdt)

Some users want optional downsampling of images, but we don't want to do this all the time in case the file has high resolution scans (e.g. film).

Like Acrobat, we should downsample images above some threshold.

TBD:

What to do if we don't have any reason to flatten the PDF to images (no deskew, no clean)? In this case we should probably downsample on an image by image basis rather than page by page, which suggests putting downsampling in the optimizer.
Generally it makes sense to downsample any image that would be too big for Tesseract, because at least then we can do something with it.
Test images before OCR, downsample those that are too big, and then scale them up.
To a degree it is better to aggressively lower JPEG quality than downsample, because JPEG encoders will save the bit budget for high detail areas. This starts to break down for very large images because they take up a lot of memory when decompressed and make the file unwieldy. So we may want to satisfy a request to downsample with some combination of downsampling and JPEG quality lowering.
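The threshold idea above can be sketched as a small per-image decision. This is only an illustration and not OCRmyPDF code: the function name `plan_downsample` and the default values `threshold_dpi=450` and `target_dpi=300` are assumptions chosen for the example, and the JPEG quality trade-off is not modeled.

```python
def plan_downsample(width_px, height_px, dpi, threshold_dpi=450.0, target_dpi=300.0):
    """Return (new_width, new_height, new_dpi) for a single image.

    Hypothetical helper: images at or below threshold_dpi are left alone,
    anything above it is scaled down to target_dpi.
    """
    if dpi <= threshold_dpi:
        # Below the threshold: keep the image as-is (e.g. ordinary scans).
        return width_px, height_px, dpi
    scale = target_dpi / dpi
    # Never let a dimension collapse to zero pixels.
    new_width = max(1, round(width_px * scale))
    new_height = max(1, round(height_px * scale))
    return new_width, new_height, target_dpi


if __name__ == "__main__":
    # An A4 page scanned at 600 dpi would be halved in each dimension.
    print(plan_downsample(4960, 7016, 600))
    # The same page at 300 dpi is below the threshold and is left untouched.
    print(plan_downsample(2480, 3508, 300))
```

Using a threshold that is higher than the target avoids resampling images that are only slightly above the target resolution, where the size saving would be small compared with the quality loss.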

heinrich-ulbricht · 2020-04-23T18:51:49Z

Does this ticket cover the following use case, where I need to "emulate a scanner":

take high quality photos of documents with a smartphone camera (JPG) ("scanning" the documents)
create a PDF from those images using e.g. img2pdf (note: the PDF now is quite big)
use OCRmyPDF to add the OCR text, which gives good results because the images are of such high quality (this currently is possible with OCRmyPDF)
now my wish: after doing OCR, convert the high quality images to really small monochrome ones (like monochrome TIFF with G4 compression) and save them to the PDF, replacing the original images - this can reduce the file size by a factor of 10

Expected result: a really small PDF whose OCR text was derived from the high quality images but which displays low quality images.

What is my goal? I want to delete the original images and also don't need high quality images in the PDF. I only roughly need to see what the documents look like. But I need the text for full text search.

To my understanding this is currently not possible. Either I feed the high quality PDF into OCRmyPDF and get good OCR results with some compression of the images, or I reduce the PDF/image quality before feeding it to OCRmyPDF, which results in a much smaller file but worse OCR results than with the original images.

The best thing for me would be some kind of callback where I could specify a command line that would be executed on the images that are about to be inserted into the PDF but after doing OCR. I'd like to execute a command like this (which generates a small monochrome TIFF from the JPEG):

jpegtopnm image.jpg | pamthreshold | pamtotiff -g4 > image.tiff

(This might be easy to implement if the image currently is placed somewhere in a temporary directory.)
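For illustration, the quoted pipeline could be driven from Python roughly as follows. This is a sketch under assumptions: `build_pipeline` and `run_pipeline` are hypothetical names, the Netpbm tools are assumed to be installed and on `PATH`, and the point at which OCRmyPDF would invoke such a callback is exactly the open question of this comment.

```python
import subprocess
from pathlib import Path


def build_pipeline(src: Path) -> list[list[str]]:
    """Return the argv of each stage of the Netpbm pipeline quoted above."""
    return [
        ["jpegtopnm", str(src)],   # decode the JPEG to PNM
        ["pamthreshold"],          # threshold to black and white
        ["pamtotiff", "-g4"],      # write TIFF with G4 compression
    ]


def run_pipeline(src: Path, dst: Path) -> None:
    """Run the stages, piping stdout to stdin, and write the result to dst."""
    stages = build_pipeline(src)
    procs = []
    prev_stdout = None
    with open(dst, "wb") as out:
        for index, argv in enumerate(stages):
            is_last = index == len(stages) - 1
            proc = subprocess.Popen(
                argv,
                stdin=prev_stdout,
                stdout=out if is_last else subprocess.PIPE,
            )
            if prev_stdout is not None:
                # Close our copy so the upstream stage sees a broken pipe
                # if the downstream stage exits early.
                prev_stdout.close()
            prev_stdout = proc.stdout
            procs.append(proc)
        for proc, argv in zip(procs, stages):
            if proc.wait() != 0:
                raise subprocess.CalledProcessError(proc.returncode, argv)
```

A call such as `run_pipeline(Path("image.jpg"), Path("image.tiff"))` would then correspond to the shell command above.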

Does this make sense? Is something like this possible already? Does this feature suggestion need to be placed in a new ticket?

tcurdt · 2020-04-23T19:25:38Z

@heinrich-ulbricht My use case is about reducing the embedded image size, but not necessarily completely replacing the images. The outcomes are pretty much aligned, though: archiving a PDF with OCR data and reduced image data.

jbarlow83 · 2020-04-24T06:40:13Z

@heinrich-ulbricht You are correct that what you are describing is not currently possible with OCRmyPDF.

heinrich-ulbricht · 2020-04-24T09:14:12Z

I created a ticket specific to my use case so as not to hijack this ticket further: #541

jbarlow83 · 2023-10-10T08:06:01Z

ocrmypdf now picks the output resolution more carefully, so this is partially done.

jbarlow83 added the enhancement label Oct 31, 2019

heinrich-ulbricht mentioned this issue Apr 24, 2020

Introduce a way to radically reduce the output file size (sacrificing image quality) #541

Closed

jbarlow83 closed this as completed Oct 10, 2023
