Implement optional downsampling as part of preprocessing #443
Comments
Does this ticket cover the following use case, where I need to "emulate a scanner":
Expected result: a really small PDF whose OCR text comes from the high-quality images but whose visible pages are low-quality images.
What is my goal? I want to delete the original images, and I don't need high-quality images in the PDF. I only need a rough view of what the documents look like, but I need the text for full-text search.
To my understanding this is currently not possible. Either I feed the high-quality PDF into OCRmyPDF and get good OCR results with some compression of the images, or I reduce the PDF/image quality before feeding it to OCRmyPDF, which gives a much smaller file but worse OCR results than with the original images.
The best thing for me would be some kind of callback where I could specify a command line to be executed on the images after OCR is done but before they are inserted into the PDF. I'd like to execute a command like this (which generates a small monochrome TIFF from the JPEG):
(This might be easy to implement if the image is currently placed somewhere in a temporary directory.) Does this make sense? Is something like this already possible? Does this feature suggestion need to go in a new ticket?
@heinrich-ulbricht My use case is about reducing the size of the embedded images, not necessarily replacing them completely. But the outcomes are pretty much aligned: archiving a PDF with OCR data and reduced image data.
@heinrich-ulbricht You are correct that what you are describing is not currently possible with OCRmyPDF.
I created a ticket specific to my use case so as not to hijack this ticket further: #541
ocrmypdf now picks the output resolution more carefully, so this is partially done. |
(As requested by @tcurdt)
Some users want optional downsampling of images, but we don't want to do this all the time in case the file has high-resolution scans (e.g. film).
Like Acrobat, OCRmyPDF should downsample images above some threshold.
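To make the threshold idea concrete, here is a minimal sketch of how the decision could look. It is not OCRmyPDF's implementation or API: the function names and the `threshold_dpi` and `output_dpi` values are hypothetical, and it assumes the threshold is expressed as image resolution in DPI.

```python
from typing import Tuple


def target_dpi(image_dpi: float,
               threshold_dpi: float = 450.0,   # hypothetical default
               output_dpi: float = 300.0) -> float:  # hypothetical default
    """Return the resolution an image should end up at.

    Images at or below the threshold are left alone, so ordinary scans
    are not resampled; only images above it are brought down to output_dpi.
    """
    if image_dpi <= 0:
        raise ValueError("image_dpi must be positive")
    if image_dpi > threshold_dpi:
        return output_dpi
    return image_dpi


def downsampled_size(width_px: int, height_px: int, image_dpi: float,
                     threshold_dpi: float = 450.0,
                     output_dpi: float = 300.0) -> Tuple[int, int]:
    """Compute the pixel dimensions after optional downsampling."""
    new_dpi = target_dpi(image_dpi, threshold_dpi, output_dpi)
    scale = new_dpi / image_dpi
    # Never let a dimension collapse to zero pixels.
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))


if __name__ == "__main__":
    # A 600 DPI page is above the threshold and is halved to 300 DPI.
    print(downsampled_size(4800, 6600, 600))
    # A 300 DPI page is below the threshold and is left untouched.
    print(downsampled_size(2550, 3300, 300))
```

Keeping the threshold separate from the output resolution means images only slightly above the target are not resampled for a negligible saving. Users with film scans could raise the threshold or leave the option off entirely.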
TBD: