Implement optional downsampling as part of preprocessing #443

Closed
jbarlow83 opened this issue Oct 31, 2019 · 5 comments


@jbarlow83
Collaborator

(As requested by @tcurdt)

Some users want optional downsampling of images, but we don't want to do this all the time in case the file has high resolution scans (e.g. film).

Like Acrobat, we should downsample only images above some resolution threshold.

TBD:

  • What to do if we have no reason to flatten the PDF to images (no deskew, no clean): in this case we should probably downsample image by image rather than page by page, which suggests putting downsampling in the optimizer
  • Generally it makes sense to downsample any image that would be too big for Tesseract, because at least then we can do something with it
  • Test each image before OCR, downsample the images that are too big, and then scale them up
  • To a degree it is better to aggressively lower JPEG quality than downsample, because JPEG encoders will save the bit budget for high detail areas. This starts to break down for very large images because they take up a lot of memory when decompressed and make the file unwieldy. So we may want to satisfy a request to downsample with some combination of downsampling and JPEG quality lowering.
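
A minimal sketch of the threshold idea above, in plain Python. This is not OCRmyPDF code: the function names and the default values (450 DPI threshold, 300 DPI target) are assumptions chosen only for illustration. It leaves images at or below the threshold untouched, and also estimates the decompressed size that makes very large images unwieldy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DownsamplePlan:
    width: int      # planned width in pixels
    height: int     # planned height in pixels
    dpi: float      # resolution after the plan is applied
    changed: bool   # True if the image would be resampled


def plan_downsample(width_px, height_px, dpi,
                    threshold_dpi=450.0, target_dpi=300.0):
    """Downsample to target_dpi only when dpi exceeds threshold_dpi.

    The default values are hypothetical, not OCRmyPDF settings.
    """
    if target_dpi > threshold_dpi:
        raise ValueError("target_dpi must not exceed threshold_dpi")
    if dpi <= threshold_dpi:
        # At or below the threshold: keep the image as it is.
        return DownsamplePlan(width_px, height_px, dpi, False)
    scale = target_dpi / dpi
    new_width = max(1, round(width_px * scale))
    new_height = max(1, round(height_px * scale))
    return DownsamplePlan(new_width, new_height, target_dpi, True)


def decoded_bytes(width_px, height_px, channels=3):
    """Approximate memory for the decompressed image at 8 bits per channel."""
    return width_px * height_px * channels


if __name__ == "__main__":
    # A US Letter page scanned at 600 DPI in RGB.
    print(plan_downsample(5100, 6600, 600))
    print(decoded_bytes(5100, 6600))
```

A real implementation would probably combine such a plan with a lower JPEG quality setting, as the last bullet suggests, rather than rely on resampling alone.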
@heinrich-ulbricht

heinrich-ulbricht commented Apr 23, 2020

Does this ticket cover the following use case, where I need to "emulate a scanner":

  • take high quality photos of documents with a smartphone camera (JPG) ("scanning" the documents)
  • create a PDF from those images using e.g. img2pdf (note: the PDF now is quite big)
  • use OCRmyPDF to add the OCR text, which gives good results because the images are of such high quality (this currently is possible with OCRmyPDF)
  • now my wish: after doing OCR, convert the high quality images to really small monochrome ones (like monochrome TIFF with G4 compression) and save them to the PDF, replacing the original images - this can reduce the file size by a factor of 10

Expected result: a really small PDF in which text OCR'd from the high quality images is shown over low quality images.

What is my goal? I want to delete the original images and also don't need high quality images in the PDF. I only roughly need to see what the documents look like. But I need the text for full text search.

To my understanding this is currently not possible. Either I feed the high quality PDF into OCRmyPDF and get good OCR results with some compression of the images, or I reduce the PDF/image quality before feeding them to OCRmyPDF, which results in a much smaller file but worse OCR results than with the original images.

The best thing for me would be some kind of callback where I could specify a command line that would be executed on the images that are about to be inserted into the PDF but after doing OCR. I'd like to execute a command like this (which generates a small monochrome TIFF from the JPEG):

jpegtopnm image.jpg | pamthreshold | pamtotiff -g4 > image.tiff

(This might be easy to implement if the image currently is placed somewhere in a temporary directory.)
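
For illustration, the same pipeline could be driven from a script roughly as follows. This is only a sketch: OCRmyPDF does not offer such a callback, the function names `build_pipeline` and `jpeg_to_g4_tiff` are hypothetical, and it assumes the Netpbm tools from the command above are installed and on the PATH.

```python
import subprocess


def build_pipeline(src):
    """Return the argv list for each stage of the command shown above."""
    return [
        ["jpegtopnm", str(src)],
        ["pamthreshold"],
        ["pamtotiff", "-g4"],
    ]


def jpeg_to_g4_tiff(src, dst):
    """Run the stages as a pipe and write the final output to dst."""
    stages = build_pipeline(src)
    procs = []
    with open(dst, "wb") as out:
        prev_stdout = None
        for index, argv in enumerate(stages):
            is_last = index == len(stages) - 1
            proc = subprocess.Popen(
                argv,
                stdin=prev_stdout,
                stdout=out if is_last else subprocess.PIPE,
            )
            if prev_stdout is not None:
                # Close our copy so the upstream stage sees a broken pipe
                # if the downstream stage exits early.
                prev_stdout.close()
            prev_stdout = proc.stdout
            procs.append(proc)
        for argv, proc in zip(stages, procs):
            if proc.wait() != 0:
                raise subprocess.CalledProcessError(proc.returncode, argv)


if __name__ == "__main__":
    jpeg_to_g4_tiff("image.jpg", "image.tiff")
```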

Does this make sense? Is something like this already possible? Does this feature suggestion need to go in a new ticket?

@tcurdt

tcurdt commented Apr 23, 2020

@heinrich-ulbricht My use case is about reducing the size of the embedded images, not necessarily replacing them completely. But the outcomes are pretty much aligned: archiving a PDF with OCR data and reduced image data.

@jbarlow83
Collaborator Author

@heinrich-ulbricht You are correct that what you are describing is not currently possible with OCRmyPDF.

@heinrich-ulbricht

heinrich-ulbricht commented Apr 24, 2020

I created a ticket specific to my use case so as not to hijack this ticket further: #541

@jbarlow83
Collaborator Author

ocrmypdf now picks the output resolution more carefully, so this is partially done.
