Images unnecessarily compressed? #163

Jmuccigr · 2017-05-03T21:23:56Z

I'm starting a workflow with my own (sometimes badly) scanned PDFs from books, which I convert to pgm, then feed to unpaper to split the two visible pages into one document each (which I don't think ocrmypdf can use unpaper to do, right?). Then I clean these up a little and run img2pdf output.pgm | ocrmypdf --image-dpi 150 - result.pdf.

What I see is that ocrmypdf converts the input images to JPEG, which I thought it would do only if it had to force-OCR them. Am I misunderstanding? Is it possible to keep the original images (which is #125)?

(I didn't quite understand if a solution was found here.)


jbarlow83 · 2017-05-03T22:17:13Z

> which I don't think ocrmypdf can use unpaper to do, right?

No, it can't do anything fancy with unpaper.

> Then I clean these up a little and run img2pdf output.pgm | ocrmypdf --image-dpi 150 - result.pdf.

I think this could be a recent Ghostscript behavior change. If you don't care about PDF/A, you can use --output-type pdf on the current version.

I'll make sure Ghostscript gets explicit directions about recompression in the next release.
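For anyone scripting this, the workaround can be sketched as a small Python helper that assembles the same pipeline with --output-type pdf added. This is only an illustration: the function names `build_pipeline` and `as_shell` and the file names are hypothetical, and the flags are the ones quoted in this thread.

```python
import shlex


def build_pipeline(image, output_pdf, dpi=150, skip_pdfa=True):
    """Return the two argv lists for `img2pdf IMAGE | ocrmypdf ... - OUTPUT`.

    Hypothetical helper; it only builds the commands and does not run them.
    """
    producer = ["img2pdf", image]  # writes the PDF to stdout
    consumer = ["ocrmypdf", "--image-dpi", str(dpi)]
    if skip_pdfa:
        # Workaround from this thread: give up PDF/A output so the
        # images are not re-encoded on the way out.
        consumer += ["--output-type", "pdf"]
    consumer += ["-", output_pdf]  # "-" reads the PDF from stdin
    return producer, consumer


def as_shell(producer, consumer):
    """Render the two commands as a single shell pipeline string."""
    return f"{shlex.join(producer)} | {shlex.join(consumer)}"


if __name__ == "__main__":
    print(as_shell(*build_pipeline("output.pgm", "result.pdf")))
```

To actually run it, the two lists could be passed to `subprocess.Popen` with the first process's stdout connected to the second's stdin.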


jbarlow83 · 2017-05-08T22:03:19Z

Ghostscript's default behavior (that is, -dAutoFilter{Color,Gray}Images=true) is to automatically select what it considers an appropriate encoding for each image. It seems to have a heuristic that decides whether to use JPEG (DCTEncode) or lossless (FlateEncode) on an image-by-image basis. It's probably looking at the number of colors or something similar. Whatever it is doing is fairly reliable, enough that it escaped detection until now, although I have found images that it chooses to DCT encode when it probably shouldn't.

-dAutoFilterColorImages=false is ignored unless -sColorImageFilter is also set. When the latter is set, the specified encoding is applied to all images on all pages. Ghostscript does not seem to have an option that always uses the input image type. Probably pdfwrite does not preserve this information.

Because of this I will add a new argument to choose the output image type: auto, jpeg, or lossless. Auto will let Ghostscript decide, or when --output-type pdf (qpdf output) is selected, always use the input image type. The new options will allow one to force the encoding to the desired option for either output type.
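A rough sketch of how the three choices might map onto Ghostscript arguments is below. It is not the code from the fix: the function name `ghostscript_image_args` is hypothetical, and the `-dColorImageFilter=/FlateEncode` spelling of the filter switch is an assumption about Ghostscript's command-line syntax (the comment above writes it as -sColorImageFilter). The --output-type pdf case, where auto keeps the input image type, does not pass through these arguments and is not modeled.

```python
def ghostscript_image_args(compression="auto"):
    """Map a compression choice to extra Ghostscript pdfwrite arguments.

    Hypothetical helper. "auto" adds nothing, so Ghostscript's per-image
    heuristic stays in charge; "jpeg" and "lossless" turn the heuristic
    off and name one filter for every color and grayscale image.
    """
    filters = {"jpeg": "/DCTEncode", "lossless": "/FlateEncode"}
    if compression == "auto":
        return []
    if compression not in filters:
        raise ValueError(f"unknown compression: {compression!r}")
    args = []
    for kind in ("Color", "Gray"):
        # Disabling AutoFilter is only honored when a filter is also named.
        args.append(f"-dAutoFilter{kind}Images=false")
        # Assumed switch spelling; check it against the Ghostscript docs.
        args.append(f"-d{kind}ImageFilter={filters[compression]}")
    return args


if __name__ == "__main__":
    for mode in ("auto", "jpeg", "lossless"):
        print(mode, ghostscript_image_args(mode))
```

Returning an empty list for auto keeps the current default untouched, so only users who ask for jpeg or lossless see a change in behavior.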

jbarlow83 added the bug label May 3, 2017

jbarlow83 pushed a commit that referenced this issue May 7, 2017

Fix issue #163, color and grayscale images JPEG compressed when not needed

93e802f

jbarlow83 closed this as completed in 01a1c2b May 10, 2017
