Introduce a way to radically reduce the output file size (sacrificing image quality) #541
I would go with modifying ocrmypdf, and:
Instead of forcing PNG input, you could also uncomment the optimize.py:523 "try pngifying the jpegs" section, which, as the name suggests, speculatively converts JPEGs to PNGs. I believe this had a few corner cases to be worked out and is too costly in performance in the typical case, but you could try it, especially if you are forcing everything to JBIG2 anyway.
I'm giving it a try and am having some success. @jbarlow83 A question: for me, the code referenced above leads to only one of multiple images being handled in a multi-page PDF where each page contains one image (since the loop cannot finish). And one (related?) curiosity: I managed to modify the conversion pipeline such that I now have multiple 1 bpp PNGs waiting in the temp folder to be handled. If there is only one such PNG the resulting PDF looks fine. If there are multiple such images the resulting PDF is distorted. Looking at the images in the temp folder I got:
Then the code converts those TIFs to JBIG2 file(s) by invoking the jbig2 tool. This seems to be erroneous if there are multiple TIFs (leading to distortions in the final PDF). It works for one TIF though. So the question is: do you have a test in place checking that PDFs with multiple 1 bpp images can correctly be converted to the JBIG2 format? Or could this be a bug? Note: I suspect the issue mentioned above is related. (But this might also be me not understanding how the final JBIG2 handling works. I might have broken something with my modifications.) Edit: the debug output shows the command line that OCRmyPDF uses to invoke the jbig2 tool.
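(The exact command from the log is not preserved in this thread. As a rough sketch, a typical jbig2enc symbol-mode invocation over several page TIFs looks something like this; the basename and file names are placeholders:)

```sh
# Encode all page TIFs in symbol mode (-s), producing PDF-ready fragments (-p):
# output.sym holds the shared symbol dictionary, and output.0000, output.0001, ...
# hold the per-page JBIG2 data that gets embedded into the PDF.
jbig2 -s -p -v -b output page0001.tif page0002.tif
```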
The TIF files look good.
I found the reason why my PDF containing the 1 bpp JBIG2 images was distorted. The color space of the embedded images was not correct: it was still /DeviceRGB. I was able to quick-fix this by setting the color space to /DeviceGray. Hypothesis: it was never intended to change the color space during image optimization?
- adding option to run user-provided shell script for image transformation
- fixing ColorSpace not being set on G4 conversion
- adding generated directories to gitignore
I implemented and pushed a solution that works for me and is basically a shortcut to TIF generation (see the commit linked above). I added a new user-script option that can be used to run arbitrary shell commands on images. This user script takes the source and destination file paths as input parameters and must convert the source image to a 1 bpp TIF. The shell script that works for me looks like this:
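(The original script is not preserved in this thread. Below is a minimal sketch of such a script, assuming ImageMagick's convert plus netpbm's pgmtopbm/pnmtotiff; the exact tools and threshold are placeholders.)

```sh
#!/bin/sh
# Hypothetical reconstruction: convert the source image ($1) to a 1 bpp
# Group 4 TIFF ($2). The threshold value is arbitrary and may need tuning.
set -e
convert "$1" -colorspace Gray pgm:- \
  | pgmtopbm -threshold -value 0.6 \
  | pnmtotiff -g4 > "$2"
```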
This requires ImageMagick and netpbm-progs to be installed. But one could use other conversion tools here as well. The command that I used to test looks like this:
I'm not opening a pull request since the solution is very specific to my use case. And right now it only handles JPEG images. But maybe somebody finds this useful as a starting point.
You are correct about that.
Also correct. /DeviceGray is not correct in general, but probably suitable for your use case. Some files will specify a complex colorspace instead of /DeviceRGB and changing to /DeviceGray may not be correct, so optimize tries to avoid changing colorspace. It is also possible to specify a 1-bit color colorspace, e.g. 0 is blue and 1 is red.
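For illustration, here is a minimal pikepdf-style sketch of such a 1-bit /Indexed colorspace (the palette values are made up; this is not code from the optimizer):

```python
import pikepdf

# Two-entry RGB palette: index 0 -> blue, index 1 -> red.
palette = b"\x00\x00\xff" b"\xff\x00\x00"

indexed_cs = pikepdf.Array([
    pikepdf.Name("/Indexed"),    # colorspace family
    pikepdf.Name("/DeviceRGB"),  # base colorspace the palette entries live in
    1,                           # highest palette index (two entries: 0 and 1)
    pikepdf.String(palette),     # lookup table
])

# A 1 bpp image using indexed_cs selects palette colors with its bits;
# rewriting its ColorSpace to /DeviceGray would silently change the colors.
```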
Agreed - that's a lot of new dependencies to add.
- adding option to run user-provided shell script for image transformation
- fixing ColorSpace not being set on G4 conversion
- adding generated directories to gitignore
I also needed exactly this! I tried to rebase onto master, missed some things in the manual merges required, and added them afterwards, so my branch doesn't look so clean right now. But here it is: https://github.com/andersjohansson/OCRmyPDF/tree/feature/github-541-reduce-output-file-size-v10 It works fine now though! Thanks!
userscript.py could be structured as a plugin instead (a new feature in 10.x). You'd need to create a new hook as well by adding it to the plugin specification.
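For reference, a plugin hook implementation has roughly this shape. The sketch below reuses the existing filter_page_image hook purely to illustrate the mechanism; the image-transformation hook proposed here would first need to be added to the plugin specification, and the script name and output suffix are assumptions:

```python
# userscript_plugin.py - a sketch only, not a drop-in replacement for userscript.py
from pathlib import Path
import subprocess

from ocrmypdf import hookimpl


@hookimpl
def filter_page_image(page, image_filename):
    """Run a user-supplied shell script on the page image and return the new file."""
    output = Path(image_filename).with_suffix(".tif")
    # "userscript.sh" stands in for the user-provided conversion script, which
    # receives the source and destination paths as its two arguments.
    subprocess.run(["./userscript.sh", str(image_filename), str(output)], check=True)
    return output
```

A plugin module like this would then be enabled with ocrmypdf --plugin userscript_plugin input.pdf output.pdf.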
If @heinrich-ulbricht or anyone else is interested in looking more into this in the future, see also the comments that @jbarlow83 added here: andersjohansson@4e5b68f
What about using MRC compression to keep the file visually as close to the original as possible while losing a lot of file size, as @jbarlow83 mentioned here:
You could just look at how the closed-source DjVuSolo 3.1 reaches astonishing sizes with really legible results, even keeping color in the JBIG2-like JB2 format. With DjVuToy you can transform those DjVus into PDFs that are only about twice as big. With https://github.com/jwilk/didjvu there has been an attempt to open-source this MRC mechanism, however with some inconveniences that keep files too big for it to be a serious candidate to replace the old DjVuSolo 3.1 in the Russian user group. However, many DjVu patents have expired, so there might be some valuable MRC knowledge in those patents, as @jsbien suggested.
@rmast This is interesting information and could be helpful if I ever get the opportunity to implement this. Thanks.
(Found this through @rmast) -- If you're looking for an MRC implementation, https://github.com/internetarchive/archive-pdf-tools does this when it creates PDFs with text layers (it's mostly like OCRmyPDF but doesn't attempt to do OCR and requires that to be done externally) - the MRC code can also be used as a library, although I probably need to make the API a bit more ... accessible. @jbarlow83 - If you're interested in this I could try to make a more accessible API. Alternatively, I could look at improving the "pdf recoding" method somewhat, where the software compresses an existing PDF by replacing the images with MRC-compressed images, so one could just run recode_pdf after OCRmyPDF has done its thing.
@MerlijnWajer Thanks for the suggestion - that is impressive work. Unfortunately it's license-incompatible (AGPL) and also uses PyMuPDF as its PDF generator. I like PyMuPDF and used it previously, but it relies on libmupdf which is only released as a static library and doesn't promise a stable interface, meaning that Linux distributions won't include it. But setting it up through a plugin interface, calling recode_pdf by command line, would certainly be doable.
I'll try to implement this mode (modifying the images of a PDF without touching most other parts) in the next week or so and report back, then we could maybe look at the plugin path. (Actually, give me more like two weeks, I'll have to do some refactoring to support this recoding mode)
It looks like you/archive.org may be the sole copyright holder. If you're willing to contribute portions of your existing code to ocrmypdf under its MPL2 license, we could also work it in that way.
Right - I'll have to think about that (and also ask). For now I will try to get a tool to recode an existing PDF working first, since I've been wanting to add/implement that for a long time anyway, and this is a great motivation to do it. I'll also make the MRC API more usable (current code is heavily optimised for performance, not for API usability), though, so we could revisit the potential license situation once that is done.
@blaueente @v217 I understand license (in)compatibility is inhibiting progress. I was also looking into didjvu to understand the MRC compression over there. MRC is achieved in that tool by a Gamera-based didjvu binarizer, followed by C44 from the djvulibre tooling for both the foreground and background, so the license of didjvu is probably less important than the licenses of Gamera and C44. Do you have experience with bringing products with such incompatible licenses to life? Would the answer be different when trying to get GScan2PDF (GPLv3) to use MRC?
Didjvu itself mainly deals with organizing everything, so I guess one couldn't use code from it directly anyways.
Regarding licenses, I can't really help you. The approach of @MerlijnWajer sounds great though. Talk about what can be shared, and what can be just re-used as separate interfacing binaries.
I was experimenting with a script a while ago but couldn't get it to fully work on oddball PDFs and then gave up for a bit. But I think I just realised that at least for PDFs generated by OCRmyPDF, this is a non-issue. Does anyone have some sample/test PDFs created by OCRMyPDF that I could run my script on?
OK, I installed it on a Debian machine and ran a few tests. It seems to work, at least for my basic testing (see attached files: input image, ocrmypdf output given the input image, MRC-compressed PDF). The text layer and document metadata seem untouched, and the pdfimages output seems sensible:
Sorry for the delay, but it looks like this is workable, so I could clean up the code and we can do some more testing?
VeraPDF also doesn't seem to complain:
Here is my compression script from a few months back. It's very much a work in progress, so please don't use it for any production purposes (but of course, please test and report back): https://archive.org/~merlijn/recode-existing-pdf-WORKINPROGRESS.py (apologies for the mess, it is a -test- script). The only argument is the input PDF, and it will save the compressed PDF to out.pdf. If this test code/script seems to do the job, I can extend it to also support conversion to bitonal CCITT/JBIG2 (as mentioned in #906) given a flag or something, and tidy it up. As stated earlier, complex PDFs with many images and transparency don't work well yet, but for that I'd have to look at the transformations of the pages, the images, transparency, etc... which I don't think is an issue for OCRmyPDF compression use cases?
One thing that I'd like to add is extracting the text layer from a PDF to hOCR, so that it can be used as input for the script and it knows where the text areas are. This is actually not far off at all; I already have some local code for it, so depending on the feedback here I could try to integrate that.
I tried your script on a newly arrived two-page ABN AMRO letter. The resulting out.pdf is 129 kB, and the letters ABN AMRO at the top are quite vague. DjVuSolo 3.1/DjVuToy reach 46 kB with sharper ABN AMRO letters and less fuzz around the pricing table. I had to compile Leptonica 1.72, as the Leptonica 1.68 suggested by jbig2enc didn't compile right with libpng-dev. I used an Ubuntu 20 image on Azure.
Dan Bloomberg and Chris Donlan recently discussed a memory issue with Leptonica while researching the CFFI connection: DanBloomberg/leptonica#603
I'm quite new to Python, so the subject of linking C/C++ is also new to me. Have you ever seen all the options in this overview? https://realpython.com/python-bindings-overview/
The main issue I had with Leptonica is that it writes error messages to stderr. As explained in this source comment, this caused problems and occasionally deadlocks. I asked Dan Bloomberg to add an API that had Leptonica call a "write error message" function instead, so that the application could capture the error messages more sensibly. That worked fine for a few months until Apple Silicon came along. The callback function needed to provide a Python error handler had to go in write+execute memory, which Apple Silicon forbids. The only option was to switch to CFFI API bindings, which allow the callback function to be pre-allocated and compiled in read-only memory. That would mean all of ocrmypdf would need binary wheels, or I'd have to spin off leptonica so that just it would have binary wheels, and I'd have two binary-wheel projects to maintain to support ocrmypdf.

For a while I toyed with releasing an improved version of ocrmypdf's leptonica as a separate project rather than killing the module. I came up with a (I think, anyway) clever way to automatically construct class bindings by introspecting function signatures (e.g. pixWhatever becomes a Python method Pix.whatever), since Leptonica is quite internally consistent. Likewise it would wrap all Python data types in appropriate cffi objects before passing them to the C function. I never released this code but if you want I'll post it... but hold on a sec.

I decided against a separate leptonica because binary wheels are too much extra work. It's worse now that everyone wants ARM wheels too and GitHub Actions doesn't provide ARM runners. I think I spend more time on release engineering for pikepdf than I do writing code.

I've tried all the Python binding methods at some point. CFFI is terrific for applications like the demo code in that issue, when you want a little utility to call a C library for you with minimal pain. For anything C++, use pybind11. Binding works great for using libraries, but if you also want multiprocessing, multithreading, stability, nice error handling... that's where it gets difficult. Memory leaks when binding libraries are just inevitable. You have two different memory models, two different memory allocators, sharing a process. You'll end up with reference cycles that cross the language boundary. And that's not even getting into multithreading issues.

For almost anything image-processing related in Python, I'd say just use OpenCV or scikit-image. scikit-image has a nice API. OpenCV is really comprehensive. Neither is as "document image oriented" as Leptonica, but they're already ported to Python. But JBIG2 falls under the "almost anything" exception, because Leptonica has the necessary functions for JBIG2; to my knowledge the other two libraries don't.

What I think may be best here is to bind jbig2enc directly instead of Leptonica - along with any improvements you have for it. jbig2enc for Python would be a very useful contribution to open source because it's not widely packaged. It also means you'd be providing a library with a small number of entry points rather than taking on a library as massive as Leptonica.
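For context, the difference between the two CFFI callback styles mentioned above looks roughly like this (a minimal sketch, not the actual leptonica binding code):

```python
import cffi

ffi = cffi.FFI()

# ABI mode: the trampoline for this callback is generated at runtime in
# writable+executable memory, which Apple Silicon disallows.
@ffi.callback("void(char *)")
def on_error(msg):
    print("leptonica:", ffi.string(msg).decode())

# API mode instead declares the handler as extern "Python" in cdef() and
# compiles it ahead of time via ffi.set_source()/ffi.compile(), so the
# callback lives in the compiled extension - no runtime W+X allocation,
# but binary wheels become necessary.
```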
Adapting jbig2enc and probably trying to Cython it had already come to my mind. I'll explore that path further.
I wondered whether I would go as far as reading hOCR from within jbig2enc, but as interpreting hOCR seems to be a moving target I'll probably have to stay close to Merlijn's source.
@jbarlow83, the hOCR contained in the PDF is missing confidence data. An adapted jbig2enc might profit from it. Could you add the original hOCR to the plugin interface?
There's no way to attach confidence data to a PDF that I know of, but if you run in hOCR output mode, you could just read the .hocr file for the page from any plugin in the temporary folder. The context object tells you where to find the temporary folder.
Thanks! That will do.
I have a script djvu2pdf in my repos, and there you can find Python code that substitutes an image in a PDF with another image. Then you just split the PDF into pages, re-OCR every page, use ImageMagick to convert your large-size PDF to monochrome PNG like I do (you could plausibly go as low as 2000 px and the text is still readable), insert that PNG back into the file via that Python snippet, run OCRmyPDF with maximum compression, no OCR, and all compression options (no deskew though), and you'll go as low as 20 kB per page, or about 20 MB per large tome.
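For anyone who wants to try this without the djvu2pdf repo, a minimal pikepdf sketch of swapping a page's image for a recompressed grayscale version might look like this (the file names and the /Im0 image key are assumptions for illustration):

```python
import zlib

import pikepdf
from PIL import Image

pdf = pikepdf.open("input.pdf")
page = pdf.pages[0]
imobj = page.images["/Im0"]          # the image XObject to replace

# Replacement image prepared externally (e.g. a downscaled monochrome PNG).
with Image.open("replacement.png") as im:
    im = im.convert("L")             # 8-bit grayscale keeps the sketch simple
    width, height = im.size
    pixels = im.tobytes()

# Overwrite the existing image stream and keep its dictionary consistent.
imobj.write(zlib.compress(pixels), filter=pikepdf.Name("/FlateDecode"))
imobj.Width = width
imobj.Height = height
imobj.ColorSpace = pikepdf.Name("/DeviceGray")
imobj.BitsPerComponent = 8
pdf.save("output.pdf")
```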
I don't see how it can handle MRC DjVus with JB2 bitonal masks in them?
I don't even know what that means, but it works on my files so far; the conversion is progressing. I did see JB2 in the djvudump output and I assume it's converting correctly.
Can anyone tell if the write+execute memory problem also exists with ctypes? FWIW, pdfium also has a callback mechanism to write into a buffer, which seems fairly similar to this error handler callback. pypdfium2 uses this with ctypes and hasn't received an issue report yet.
At a rough guess, ctypes has created a general-purpose trampoline for handling all callbacks, so it doesn't need to allocate W+X memory, while cffi tries to allocate a handler for each @ffi.callback. I think the cffi developers really don't like their ABI interface - the documentation steers you towards setting up an API.
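For comparison, the ctypes callback style (the mechanism pypdfium2 uses) looks roughly like this; the library and its registration function are hypothetical, only CFUNCTYPE itself is standard:

```python
import ctypes

# Build a C-compatible callback type: void (*handler)(const char *msg)
HANDLER = ctypes.CFUNCTYPE(None, ctypes.c_char_p)

@HANDLER
def on_error(msg):
    print("error:", msg.decode())

# lib = ctypes.CDLL("libexample.so")   # hypothetical library
# lib.set_error_handler(on_error)      # hypothetical registration call
# Keep a reference to on_error alive for as long as the library may call it.
```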
So it might be possible to keep the leptonica bindings in ABI mode and avoid the pain of platform- and Python-version-specific wheels by using ctypes instead of cffi?
Yes, but it still won't be possible to keep the leptonica ABI bindings while avoiding the pain of ABI bindings. :)
Closing this issue since the third party plugin https://github.com/rmast/OCRmyPDF_DjVu_Optimize_Plugin largely achieves the desired objective.
Yes, it's fine by me that ocrmypdf has resolved this in its leptonica-free way.
May I ask if you just mean the necessity of carrying around bindings / a C extension in general, or further technical concerns with ABI mode?
That's a good question. I think the concerns are more social - unless a library is something like glibc that actively tests its own ABI for unintended breaking changes, it's probably better to stick to API binding for stability. Understandably most libraries don't.
Isn't that more of a question of static vs. dynamic linking (resp. library loading)? Please correct me if I've misunderstood something; I'm uneducated on the subject after all.
Suppose we have a C API like `int foreign(int x, int y, int z)`, and in a new release the developer inserts a new parameter: `int foreign(int x, int y, void *newarg, int z)`.

If we are using API binding, the caller declares the function signatures it expects to be found in the target library. (For CFFI, a small C extension library is compiled that does this, so binary wheels are required.) The platform's executable loader will refuse to load the target library if there is a mismatch and terminates the process. So we do rely on a stable ABI for the extension to work correctly, but when it fails, it fails safely. The failure detection is also future-proof, modulo bugs: a future API change that breaks the ABI, by either caller or callee, will fail safely.

If we are using ABI binding (or any sort of dynamic loading), then a change in API signature will not be noticed at runtime. It might even work for a long time; perhaps the caller always used z=0 and newarg=nullptr is permitted. We just go ahead and blindly throw binary data over the wall and hope nothing blows up. Even if the library developer documents the change, and even if you fix the issue, all of your old code that has outdated bindings will break if someone accidentally runs it against the new library. Hopefully library developers know that changing parameter order in public functions can be messy downstream, but it happens, sometimes unintentionally too, from autogenerated code, changes to struct definitions, etc.

CFFI's ABI mode tries to add a little more checking to the situation compared to ctypes: it parses the C #include headers and uses them to generate the signatures, similar to ctypesgen I imagine. But it doesn't have the ability to check at runtime whether the generated signatures are compatible. These checks have their limits - they are not full C compilers and they can't follow some code constructs.
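As a concrete sketch of the ABI-mode hazard (the library and function are hypothetical):

```python
import cffi

ffi = cffi.FFI()

# Outdated declaration: suppose the installed library was rebuilt with
#   int foreign(int x, int y, void *newarg, int z)
ffi.cdef("int foreign(int x, int y, int z);")
lib = ffi.dlopen("libforeign.so")   # loads anyway; nothing checks the signature

# The call is marshalled according to the stale declaration, so the library
# receives garbage for newarg/z and no error is raised at this point.
result = lib.foreign(1, 2, 3)
```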
Thanks for the superb explanation! IIUC it comes down to API mode being inherently safe against ABI violations, whereas for ABI mode it all depends on the correctness of the provided bindings interface, which needs to be an exact match of the target binary's ABI. However, I believe it should be possible to achieve a reasonable degree of ABI safety by generating bindings from the binary's compile headers and tying them together (e.g. by bundling in wheels), and upstreams being careful not to change existing functions, but instead add new ones and eventually deprecate/remove the others, ideally accompanied by semantic versioning.
Is this plugin supposed to work with the recent OCRmyPDF version? I tried to use it with the docker version jbarlow83/ocrmypdf-alpine and it raises an error. Also I wonder whether the features (--grayscale-ocr and --mono-page) from the example plugin misc/example_plugin.py are "production-ready"?
My plugin was a proof of concept. Not at all production ready.
Since then my attention has shifted to finding a better OCR, as the bugs in Tesseract leave holes in the compression result of the plugin.
I solved some of them in Tesseract, but Tesseract's testing capacity for accepting my pull requests is too low. Scribe OCR maintains its own fork of Tesseract:
https://github.com/scribeocr/scribeocr
The newest open-source OCR I have found but not yet tried is https://github.com/Ucas-HaoranWei/GOT-OCR2.0
I also follow the progress around PaddlePaddle and Surya OCR. They are developing to become more content-aware.
@rmast I'm very much interested in your search for a better OCR and your improvements to Tesseract. What about kraken? For the time being I'm satisfied with the OCR embedded in the tools I use (Tesseract, I think), but some time in the future I would like to OCR some 16th century prints (training will be necessary) without using closed cloud services like Transkribus. As this is off the topic of this thread, you may answer in the discussion tab at https://github.com/jsbien/early_fonts_inventory or in any other place you find convenient.
Some years ago I experimented with OCR4All, and those developers pointed me to OCR-D for further development. OCR4All used Calamari, which mentioned Kraken.
I'm not very well versed in historic documents though; you'd better search the Internet.
Kraken and escriptorium seem to be maintained at European universities.
https://www.researchgate.net/publication/305995374_OCR_of_historical_printings_with_an_application_to_building_diachronic_corpora_A_case_study_using_the_RIDGES_herbal_corpus
Thanks for the link. I was told kraken/eScriptorium can be considered an alternative to Transkribus, and this made me curious. eScriptorium is available as a Docker image, so perhaps I will give it a try in the not-too-distant future. I tried several interesting systems earlier, but they appeared to be no longer maintained and/or not fully documented... Nevertheless I am still interested in your experiences with OCR engines; please share them from time to time.
Hello, I was reading about the plugin OCRmyPDF_DjVu_Optimize_Plugin, which seems to implement MRC compression using DjVu, but I don't see any example or information on its GitHub page about how to install, compile, or use that plugin with OCRmyPDF. Has anyone used it and has information about it?
@LexLuthorX: See my comment from Oct 25:
It was more a proof-of-concept than a real thing.
Is your feature request related to a problem? Please describe.
My use case is "scanning" documents with a smartphone camera, then archiving those "scans" as low-quality monochrome images. But OCR should be done beforehand on the high-quality images.
I describe this in more detail here: #443 (comment)
Furthermore I see a discussion covering a similar topic here: #293
Describe the solution you'd like
I want greater control over image quality for the images embedded into the PDF (after doing OCR). I can imagine these possible solutions (each point is a complete solution):
Additional context
I'm currently evaluating how to achieve my goal with the least effort. I see two approaches:
I'm not sure about the second approach - where would be a good point to start? One approach could be:
@jbarlow83 Does this sound right?