Images unnecessarily compressed? #163

Closed
Jmuccigr opened this issue May 3, 2017 · 2 comments

@Jmuccigr
Contributor

Jmuccigr commented May 3, 2017

I'm starting a workflow with my own (sometimes badly) scanned PDFs from books, which I convert to pgm, then feed to unpaper to split the two visible pages into one document each (which I don't think ocrmypdf can use unpaper to do, right?). Then I clean these up a little and run img2pdf output.pfg | ocrmypdf --image-dpi 150 - result.pdf.

What I see with this is that ocrmypdf is converting the input images to jpeg, which I thought it would do only if it has to force-ocr them. Do I misunderstand? Is it possible to leave the original images (which is #125)?

(I didn't quite understand if a solution was found here.)

@jbarlow83
Collaborator

> which I don't think ocrmypdf can use unpaper to do, right?

No, it can't do anything fancy with unpaper.

> Then I clean these up a little and run img2pdf output.pfg | ocrmypdf --image-dpi 150 - result.pdf.

I think this could be a recent Ghostscript behavior change. If you don't care about PDF/A you can use --output-type pdf on the current version.

I'll make sure Ghostscript gets explicit directions about recompression in the next release.

@jbarlow83
Collaborator

Ghostscript's default behavior (that is, -dAutoFilter{Color,Gray}Images=true) is to automatically select what it considers an appropriate encoding for each image. It seems to have a heuristic that decides whether to use JPEG (DCTEncode) or lossless (FlateEncode) on an image-by-image basis. It's probably looking at the number of colors or something similar. Whatever it is doing is fairly reliable, enough that it escaped detection until now, although I have found images that it chooses to DCT-encode when it probably shouldn't.
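
One rough way to see which encoding ended up in an output file is to count the filter names in the raw PDF bytes. The sketch below is only an illustration: the function names `count_stream_filters` and `report` are hypothetical, and the byte scan is a crude stand-in for a real PDF parser. It relies on the fact that a stream written with DCTEncode is stored with `/Filter /DCTDecode`, and one written with FlateEncode is stored with `/Filter /FlateDecode`.

```python
import re
from collections import Counter

# DCTDecode marks JPEG-compressed streams; FlateDecode marks zlib (lossless) streams.
FILTER_NAME = re.compile(rb"/(DCTDecode|FlateDecode)\b")


def count_stream_filters(pdf_bytes: bytes) -> Counter:
    """Count occurrences of /DCTDecode and /FlateDecode in raw PDF bytes."""
    return Counter(
        match.group(1).decode("ascii") for match in FILTER_NAME.finditer(pdf_bytes)
    )


def report(path: str) -> Counter:
    """Read a PDF from disk (hypothetical helper) and count its filter names."""
    with open(path, "rb") as handle:
        return count_stream_filters(handle.read())
```

Only the DCTDecode count says much about images: FlateDecode is also used for page content and other non-image streams, so a nonzero FlateDecode count does not by itself mean the images were kept lossless. Comparing `report("result.pdf")` against the img2pdf output would show whether JPEG streams appeared that were not in the input.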

-dAutoFilterColorImages=false is ignored unless -sColorImageFilter is also set. When the latter is set, the specified encoding is applied to all images on all pages. Ghostscript does not seem to have an option that always uses the input image type. Probably pdfwrite does not preserve this information.

Because of this I will add a new argument to choose the output image type: auto, jpeg, or lossless. With auto, Ghostscript decides; when --output-type pdf (qpdf output) is selected, auto always keeps the input image type. The jpeg and lossless options will allow one to force the desired encoding for either output type.
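
A minimal sketch of how the three proposed modes could translate into Ghostscript arguments, assuming the pairing described above (AutoFilter switched off together with an explicit filter). The function name `ghostscript_image_args` is hypothetical and this is not OCRmyPDF's actual implementation. The filter is written here in the name-valued form `-dColorImageFilter=/FlateEncode`, which is an assumption about the exact spelling Ghostscript expects on the command line.

```python
from typing import List

# Hypothetical mapping from the proposed mode names to pdfwrite encoders.
ENCODERS = {"jpeg": "/DCTEncode", "lossless": "/FlateEncode"}


def ghostscript_image_args(mode: str) -> List[str]:
    """Return extra pdfwrite arguments for an image-compression mode."""
    if mode == "auto":
        # Leave Ghostscript's default (AutoFilter{Color,Gray}Images=true) alone.
        return []
    if mode not in ENCODERS:
        raise ValueError(f"unknown mode: {mode!r}")
    encoder = ENCODERS[mode]
    args = []
    for kind in ("Color", "Gray"):
        # Disabling AutoFilter only takes effect when a filter is also named.
        args.append(f"-dAutoFilter{kind}Images=false")
        args.append(f"-d{kind}ImageFilter={encoder}")
    return args


print(ghostscript_image_args("lossless"))
```

Returning an empty list for auto keeps the default path identical to the current behavior, so only users who explicitly pick jpeg or lossless would see different Ghostscript output.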
