Introduce a way to radically reduce the output file size (sacrificing image quality) #541
I would go with modifying ocrmypdf, and:
Instead of forcing PNG input, you could also uncomment the optimize.py:523 "try pngifying the jpegs" section, which, as the name suggests, speculatively converts JPEGs to PNGs. I believe this had a few corner cases to be worked out and is too costly in performance in the typical case, but you could try it, especially if you are forcing everything to JBIG2 anyway.
I'm giving it a try and am having some success. @jbarlow83 A question: for me, the code referenced above leads to only one of multiple images being handled in a multi-page PDF where each page contains one image (since the loop cannot finish). And one (related?) curiosity: I managed to modify the conversion pipeline such that I now have multiple 1 bpp PNGs waiting in the temp folder to be handled. If there is only one such PNG the resulting PDF looks fine. If there are multiple such images the resulting PDF is distorted. Looking at the images in the temp folder I got:
Then the code converts those TIFs to JBIG2 file(s) by invoking the jbig2 tool. This seems to be erroneous if there are multiple TIFs (leading to distortions in the final PDF). It works for one TIF though. So the question is: do you have a test in place checking that PDFs with multiple 1 bpp images can correctly be converted to the JBIG2 format? Or could this be a bug? Note: I suspect the issue mentioned above is related. (But this might also be me not understanding how the final JBIG2 handling works. I might have broken something with my modifications.) Edit: the debug output shows the command line that OCRmyPDF uses to invoke the jbig2 tool.
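(The exact command from the log is not preserved in this thread. As a rough sketch, a typical jbig2enc symbol-mode invocation over several page TIFs looks something like this; the basename and file names are placeholders:)

```sh
# Encode all page TIFs in symbol mode (-s), producing PDF-ready fragments (-p):
# output.sym holds the shared symbol dictionary, and output.0000, output.0001, ...
# hold the per-page JBIG2 data that gets embedded into the PDF.
jbig2 -s -p -v -b output page0001.tif page0002.tif
```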
The TIF files look good.
I found the reason why my PDF containing the 1 bpp JBIG2 images was distorted. The color space of the embedded images was not correct: it was still /DeviceRGB. I was able to quick-fix this by setting the color space to /DeviceGray. Hypothesis: it was never intended to change the color space during image optimization?
- adding option to run user-provided shell script for image transformation
- fixing ColorSpace not being set on G4 conversion
- adding generated directories to gitignore
I implemented and pushed a solution that works for me and is basically a shortcut to TIF generation (see the commit linked above). I added a new user-script option that can be used to run arbitrary shell commands on images. This user script takes the source and destination file paths as input parameters and must convert the source image to a 1 bpp TIF. The shell script that works for me looks like this:
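(The original script is not preserved in this thread. Below is a minimal sketch of such a script, assuming ImageMagick's convert plus netpbm's pgmtopbm/pnmtotiff; the exact tools and threshold are placeholders.)

```sh
#!/bin/sh
# Hypothetical reconstruction: convert the source image ($1) to a 1 bpp
# Group 4 TIFF ($2). The threshold value is arbitrary and may need tuning.
set -e
convert "$1" -colorspace Gray pgm:- \
  | pgmtopbm -threshold -value 0.6 \
  | pnmtotiff -g4 > "$2"
```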
This requires ImageMagick and netpbm-progs to be installed. But one could use other conversion tools here as well. The command that I used to test looks like this:
I'm not opening a pull request since the solution is very specific to my use case. And right now it only handles JPEG images. But maybe somebody finds this useful as a starting point.
You are correct about that.
Also correct. /DeviceGray is not correct in general, but probably suitable for your use case. Some files will specify a complex colorspace instead of /DeviceRGB and changing to /DeviceGray may not be correct, so optimize tries to avoid changing colorspace. It is also possible to specify a 1-bit color colorspace, e.g. 0 is blue and 1 is red.
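For illustration, here is a minimal pikepdf-style sketch of such a 1-bit /Indexed colorspace (the palette values are made up; this is not code from the optimizer):

```python
import pikepdf

# Two-entry RGB palette: index 0 -> blue, index 1 -> red.
palette = b"\x00\x00\xff" b"\xff\x00\x00"

indexed_cs = pikepdf.Array([
    pikepdf.Name("/Indexed"),    # colorspace family
    pikepdf.Name("/DeviceRGB"),  # base colorspace the palette entries live in
    1,                           # highest palette index (two entries: 0 and 1)
    pikepdf.String(palette),     # lookup table
])

# A 1 bpp image using indexed_cs selects palette colors with its bits;
# rewriting its ColorSpace to /DeviceGray would silently change the colors.
```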
Agreed - that's a lot of new dependencies to add.
- adding option to run user-provided shell script for image transformation
- fixing ColorSpace not being set on G4 conversion
- adding generated directories to gitignore
I also needed exactly this! I tried to rebase onto master, missed some things in the manual merges required, and added them afterwards, so my branch doesn't look so clean right now. But here it is: https://github.com/andersjohansson/OCRmyPDF/tree/feature/github-541-reduce-output-file-size-v10 It works fine now though! Thanks!
userscript.py could be structured as a plugin instead (a new feature in 10.x). You'd need to create a new hook as well by adding it to the plugin specification.
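For reference, a plugin hook implementation has roughly this shape. The sketch below reuses the existing filter_page_image hook purely to illustrate the mechanism; the image-transformation hook proposed here would first need to be added to the plugin specification, and the script name and output suffix are assumptions:

```python
# userscript_plugin.py - a sketch only, not a drop-in replacement for userscript.py
from pathlib import Path
import subprocess

from ocrmypdf import hookimpl


@hookimpl
def filter_page_image(page, image_filename):
    """Run a user-supplied shell script on the page image and return the new file."""
    output = Path(image_filename).with_suffix(".tif")
    # "userscript.sh" stands in for the user-provided conversion script, which
    # receives the source and destination paths as its two arguments.
    subprocess.run(["./userscript.sh", str(image_filename), str(output)], check=True)
    return output
```

A plugin module like this would then be enabled with ocrmypdf --plugin userscript_plugin input.pdf output.pdf.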
If @heinrich-ulbricht or anyone else is interested in looking more into this in the future, see also the comments that @jbarlow83 added here: andersjohansson@4e5b68f
What about using MRC compression to keep the file visually as close to the original as possible while losing a lot of file size, as @jbarlow83 mentioned here:
You could just look at how the closed-source DjVuSolo 3.1 reaches astonishing sizes with really legible results, even keeping color in the JBIG2-like JB2 format. With DjVuToy you can transform those DjVus into PDFs that are only about twice as big. With https://github.com/jwilk/didjvu there has been an attempt to open-source this MRC mechanism, however with some inconveniences that keep files too big for it to be a serious candidate to replace the old DjVuSolo 3.1 in the Russian user group. However, many DjVu patents have expired, so there might be some valuable MRC knowledge in those patents, as @jsbien suggested.
@rmast This is interesting information and could be helpful if I ever get the opportunity to implement this. Thanks.
(Found this through @rmast) -- If you're looking for an MRC implementation, https://github.com/internetarchive/archive-pdf-tools does this when it creates PDFs with text layers (it's mostly like OCRmyPDF but doesn't attempt to do OCR and requires that to be done externally) - the MRC code can also be used as a library, although I probably need to make the API a bit more ... accessible. @jbarlow83 - If you're interested in this I could try to make a more accessible API. Alternatively, I could look at improving the "pdf recoding" method somewhat, where the software compresses an existing PDF by replacing the images with MRC-compressed images, so one could just run recode_pdf after OCRmyPDF has done its thing.
@MerlijnWajer Thanks for the suggestion - that is impressive work. Unfortunately it's license-incompatible (AGPL) and also uses PyMuPDF as its PDF generator. I like PyMuPDF and used it previously, but it relies on libmupdf which is only released as a static library and doesn't promise a stable interface, meaning that Linux distributions won't include it. But setting it up through a plugin interface, calling recode_pdf by command line, would certainly be doable.
I'll try to implement this mode (modifying the images of a PDF without touching most other parts) in the next week or so and report back, then we could maybe look at the plugin path. (Actually, give me more like two weeks, I'll have to do some refactoring to support this recoding mode)
It looks like you/archive.org may be the sole copyright holder. If you're willing to contribute portions of your existing code to ocrmypdf under its MPL2 license, we could also work it in that way.
Right - I'll have to think about that (and also ask). For now I will try to get a tool to recode an existing PDF working first, since I've been wanting to add/implement that for a long time anyway, and this is a great motivation to do it. I'll also make the MRC API more usable (current code is heavily optimised for performance, not for API usability), though, so we could revisit the potential license situation once that is done.
@blaueente @v217 I understand license (in)compatibility is inhibiting progress. I was also looking into didjvu to understand the MRC compression over there. MRC is achieved in that tool by a Gamera-based didjvu binarizer, followed by C44 from the djvulibre tooling for both the foreground and background, so the license of didjvu is probably less important than the licenses of Gamera and C44. Do you have experience with bringing products with such incompatible licenses to life? Would the answer be different when trying to get GScan2PDF (GPLv3) to use MRC?
Didjvu itself mainly deals with organizing everything, so I guess one couldn't use code from it directly anyways.
Regarding licenses, I can't really help you. The approach of @MerlijnWajer sounds great though. Talk about what can be shared, and what can be just re-used as separate interfacing binaries.
I was experimenting with a script a while ago but couldn't get it to fully work on oddball PDFs and then gave up for a bit. But I think I just realised that at least for PDFs generated by OCRmyPDF, this is a non-issue. Does anyone have some sample/test PDFs created by OCRMyPDF that I could run my script on?
OK, I installed it on a Debian machine and ran a few tests. It seems to work, at least for my basic testing (see attached files: input image, ocrmypdf output given the input image, MRC-compressed PDF). The text layer and document metadata seem untouched, and the pdfimages output seems sensible:
Sorry for the delay, but it looks like this is workable, so I could clean up the code and we can do some more testing?
VeraPDF also doesn't seem to complain:
Here is my compression script from a few months back. It's very much a work in progress, so please don't use it for any production purposes (but of course, please test and report back): https://archive.org/~merlijn/recode-existing-pdf-WORKINPROGRESS.py (apologies for the mess, it is a -test- script). The only argument is the input PDF, and it will save the compressed PDF to out.pdf. If this test code/script seems to do the job, I can extend it to also support conversion to bitonal CCITT/JBIG2 (as mentioned in #906) given a flag or something, and tidy it up. As stated earlier, complex PDFs with many images and transparency don't work well yet, but for that I'd have to look at the transformations of the pages, the images, transparency, etc... which I don't think is an issue for OCRmyPDF compression use cases?
One thing that I'd like to add is extracting the text layer from a PDF to hOCR, so that it can be used as input for the script and it knows where the text areas are. This is actually not far off at all; I already have some local code for it, so depending on the feedback here I could try to integrate that.
I tried your script on a newly arrived two-page ABN AMRO letter. The resulting out.pdf is 129 kB, and the letters ABN AMRO at the top are quite vague. DjVuSolo 3.1/DjVuToy reach 46 kB with sharper ABN AMRO letters and less fuzz around the pricing table. I had to compile Leptonica 1.72, as the Leptonica 1.68 suggested by jbig2enc didn't compile right with libpng-dev. I used an Ubuntu 20 image on Azure.
Dan Bloomberg and Chris Donlan recently discussed a memory issue with Leptonica while researching the CFFI connection: DanBloomberg/leptonica#603
I'm quite new to Python, so the subject of linking C/C++ is also new to me. Have you ever seen all the options in this overview? https://realpython.com/python-bindings-overview/
The main issue I had with Leptonica is that it writes error messages to stderr. As explained in this source comment, this caused problems and occasionally deadlocks. I asked Dan Bloomberg to add an API that had Leptonica call a "write error message" function instead, so that the application could capture the error messages more sensibly. That worked fine for a few months until Apple Silicon came along. The callback function needed to provide a Python error handler had to go in write+execute memory, which Apple Silicon forbids. The only option was to switch to CFFI API bindings, which allow the callback function to be pre-allocated and compiled in read-only memory. That would mean all of ocrmypdf would need binary wheels, or I'd have to spin off leptonica so that just it would have binary wheels, and I'd have two binary-wheel projects to maintain to support ocrmypdf.

For a while I toyed with releasing an improved version of ocrmypdf's leptonica as a separate project rather than killing the module. I came up with a (I think, anyway) clever way to automatically construct class bindings by introspecting function signatures (e.g. pixWhatever becomes a Python method Pix.whatever), since Leptonica is quite internally consistent. Likewise it would wrap all Python data types in appropriate cffi objects before passing them to the C function. I never released this code but if you want I'll post it... but hold on a sec.

I decided against a separate leptonica because binary wheels are too much extra work. It's worse now that everyone wants ARM wheels too and GitHub Actions doesn't provide ARM runners. I think I spend more time on release engineering for pikepdf than I do writing code.

I've tried all the Python binding methods at some point. CFFI is terrific for applications like the demo code in that issue, when you want a little utility to call a C library for you with minimal pain. For anything C++, use pybind11. Binding works great for using libraries, but if you also want multiprocessing, multithreading, stability, nice error handling... that's where it gets difficult. Memory leaks when binding libraries are just inevitable. You have two different memory models, two different memory allocators, sharing a process. You'll end up with reference cycles that cross the language boundary. And that's not even getting into multithreading issues.

For almost anything image-processing related in Python, I'd say just use OpenCV or scikit-image. scikit-image has a nice API. OpenCV is really comprehensive. Neither is as "document image oriented" as Leptonica, but they're already ported to Python. But JBIG2 falls under the "almost anything" exception, because Leptonica has the necessary functions for JBIG2; to my knowledge the other two libraries don't.

What I think may be best here is to bind jbig2enc directly instead of Leptonica - along with any improvements you have for it. jbig2enc for Python would be a very useful contribution to open source because it's not widely packaged. It also means you'd be providing a library with a small number of entry points rather than taking on a library as massive as Leptonica.
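For context, the difference between the two CFFI callback styles mentioned above looks roughly like this (a minimal sketch, not the actual leptonica binding code):

```python
import cffi

ffi = cffi.FFI()

# ABI mode: the trampoline for this callback is generated at runtime in
# writable+executable memory, which Apple Silicon disallows.
@ffi.callback("void(char *)")
def on_error(msg):
    print("leptonica:", ffi.string(msg).decode())

# API mode instead declares the handler as extern "Python" in cdef() and
# compiles it ahead of time via ffi.set_source()/ffi.compile(), so the
# callback lives in the compiled extension - no runtime W+X allocation,
# but binary wheels become necessary.
```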
Adapting jbig2enc and probably trying to Cython it had already come to my mind. I'll explore that path further.
I wondered whether I would go as far as reading hOCR from within jbig2enc, but as interpreting hOCR seems to be a moving target I'll probably have to stay close to Merlijn's source.
@jbarlow83, the hOCR contained in the PDF is missing confidence data. An adapted jbig2enc might profit from it. Could you add the original hOCR to the plugin interface?
There's no way to attach confidence data to a PDF that I know of, but if you run in hOCR output mode, you could just read the .hocr file for the page from any plugin in the temporary folder. The context object tells you where to find the temporary folder.
Thanks! That will do.
I have a script djvu2pdf in my repos, and there you can find Python code that substitutes an image in a PDF with another image. Then you just split the PDF into pages, re-OCR every page, use ImageMagick to convert your large-size PDF to monochrome PNG like I do (you could plausibly go as low as 2000 px and the text is still readable), insert that PNG back into the file via that Python snippet, run OCRmyPDF with maximum compression, no OCR, and all compression options (no deskew though), and you'll go as low as 20 kB per page, or about 20 MB per large tome.
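For anyone who wants to try this without the djvu2pdf repo, a minimal pikepdf sketch of swapping a page's image for a recompressed grayscale version might look like this (the file names and the /Im0 image key are assumptions for illustration):

```python
import zlib

import pikepdf
from PIL import Image

pdf = pikepdf.open("input.pdf")
page = pdf.pages[0]
imobj = page.images["/Im0"]          # the image XObject to replace

# Replacement image prepared externally (e.g. a downscaled monochrome PNG).
with Image.open("replacement.png") as im:
    im = im.convert("L")             # 8-bit grayscale keeps the sketch simple
    width, height = im.size
    pixels = im.tobytes()

# Overwrite the existing image stream and keep its dictionary consistent.
imobj.write(zlib.compress(pixels), filter=pikepdf.Name("/FlateDecode"))
imobj.Width = width
imobj.Height = height
imobj.ColorSpace = pikepdf.Name("/DeviceGray")
imobj.BitsPerComponent = 8
pdf.save("output.pdf")
```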
I don't see how it can handle MRC DjVus with JB2 bitonal masks in them?
I don't even know what that means, but it works on my files so far; the conversion is progressing. I did see JB2 in the djvudump output and I assume it's converting correctly.
Can anyone tell if the write+execute memory problem also exists with ctypes? FWIW, pdfium also has a callback mechanism to write into a buffer, which seems fairly similar to this error handler callback. pypdfium2 uses this with ctypes and hasn't received an issue report yet.
At a rough guess, ctypes has created a general-purpose trampoline for handling all callbacks, so it doesn't need to allocate W+X memory, while cffi tries to allocate a handler for each @ffi.callback. I think the cffi developers really don't like their ABI interface - the documentation steers you towards setting up an API.
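For comparison, the ctypes callback style (the mechanism pypdfium2 uses) looks roughly like this; the library and its registration function are hypothetical, only CFUNCTYPE itself is standard:

```python
import ctypes

# Build a C-compatible callback type: void (*handler)(const char *msg)
HANDLER = ctypes.CFUNCTYPE(None, ctypes.c_char_p)

@HANDLER
def on_error(msg):
    print("error:", msg.decode())

# lib = ctypes.CDLL("libexample.so")   # hypothetical library
# lib.set_error_handler(on_error)      # hypothetical registration call
# Keep a reference to on_error alive for as long as the library may call it.
```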
So it might be possible to keep the leptonica bindings in ABI mode and avoid the pain of platform- and Python-version-specific wheels by using ctypes instead of cffi?
Yes, but it still won't be possible to keep the leptonica ABI bindings while avoiding the pain of ABI bindings. :)
Closing this issue since the third party plugin https://github.com/rmast/OCRmyPDF_DjVu_Optimize_Plugin largely achieves the desired objective.
Yes, it's fine by me that ocrmypdf has resolved this in its leptonica-free way.
May I ask if you just mean the necessity of carrying around bindings / a C extension in general, or further technical concerns with ABI mode?
That's a good question. I think the concerns are more social - unless a library is something like glibc that actively tests its own ABI for unintended breaking changes, it's probably better to stick to API binding for stability. Understandably most libraries don't.
Isn't that more of a question of static vs. dynamic linking (resp. library loading)? Please correct me if I've misunderstood something; I'm uneducated on the subject after all.
Suppose we have a C API like `int foreign(int x, int y, int z)`, and in a new release the developer inserts a new parameter: `int foreign(int x, int y, void *newarg, int z)`.

If we are using API binding, the caller declares the function signatures it expects to be found in the target library. (For CFFI, a small C extension library is compiled that does this, so binary wheels are required.) The platform's executable loader will refuse to load the target library if there is a mismatch and terminates the process. So we do rely on a stable ABI for the extension to work correctly, but when it fails, it fails safely. The failure detection is also future-proof, modulo bugs: a future API change that breaks the ABI, by either caller or callee, will fail safely.

If we are using ABI binding (or any sort of dynamic loading), then a change in API signature will not be noticed at runtime. It might even work for a long time; perhaps the caller always used z=0 and newarg=nullptr is permitted. We just go ahead and blindly throw binary data over the wall and hope nothing blows up. Even if the library developer documents the change, and even if you fix the issue, all of your old code that has outdated bindings will break if someone accidentally runs it against the new library. Hopefully library developers know that changing parameter order in public functions can be messy downstream, but it happens, sometimes unintentionally too, from autogenerated code, changes to struct definitions, etc.

CFFI's ABI mode tries to add a little more checking to the situation compared to ctypes: it parses the C #include headers and uses them to generate the signatures, similar to ctypesgen I imagine. But it doesn't have the ability to check at runtime whether the generated signatures are compatible. These checks have their limits - they are not full C compilers and they can't follow some code constructs.
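As a concrete sketch of the ABI-mode hazard (the library and function are hypothetical):

```python
import cffi

ffi = cffi.FFI()

# Outdated declaration: suppose the installed library was rebuilt with
#   int foreign(int x, int y, void *newarg, int z)
ffi.cdef("int foreign(int x, int y, int z);")
lib = ffi.dlopen("libforeign.so")   # loads anyway; nothing checks the signature

# The call is marshalled according to the stale declaration, so the library
# receives garbage for newarg/z and no error is raised at this point.
result = lib.foreign(1, 2, 3)
```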
Thanks for the superb explanation! IIUC it comes down to API mode being inherently safe against ABI violations, whereas for ABI mode it all depends on the correctness of the provided bindings interface, which needs to be an exact match of the target binary's ABI. However, I believe it should be possible to achieve a reasonable degree of ABI safety by generating bindings from the binary's compile headers and tying them together (e.g. by bundling in wheels), and upstreams being careful not to change existing functions, but instead add new ones and eventually deprecate/remove the others, ideally accompanied by semantic versioning.
Is this plugin supposed to work with the recent OCRmyPDF version? I tried to use it with the docker version jbarlow83/ocrmypdf-alpine and it raises an error. Also I wonder whether the features (--grayscale-ocr and --mono-page) from the example plugin misc/example_plugin.py are "production-ready"?
My plugin was a proof of concept. Not at all production ready.
Since then my attention has shifted to finding a better OCR, as the bugs in Tesseract leave holes in the compression result of the plugin.
I solved some of them in Tesseract, but Tesseract's testing capacity for accepting my pull requests is too low. Scribe OCR maintains its own fork of Tesseract:
https://github.com/scribeocr/scribeocr
The newest open-source OCR I have found but not yet tried is https://github.com/Ucas-HaoranWei/GOT-OCR2.0
I also follow the progress around PaddlePaddle and Surya OCR. They are developing to become more content-aware.
@rmast I'm very much interested in your search for a better OCR and your improvements to Tesseract. What about kraken? For the time being I'm satisfied with the OCR embedded in the tools I use (Tesseract, I think), but some time in the future I would like to OCR some 16th century prints (training will be necessary) without using closed cloud services like Transkribus. As this is off the topic of this thread, you may answer in the discussion tab at https://github.com/jsbien/early_fonts_inventory or in any other place you find convenient.
Some years ago I experimented with OCR4All, and those developers pointed me to OCR-D for further development. OCR4All used Calamari, which mentioned Kraken.
I'm not very well versed in historic documents though; you'd better search the Internet.
Kraken and escriptorium seem to be maintained at European universities.
https://www.researchgate.net/publication/305995374_OCR_of_historical_printings_with_an_application_to_building_diachronic_corpora_A_case_study_using_the_RIDGES_herbal_corpus
Thanks for the link. I was told kraken/eScriptorium can be considered an alternative to Transkribus, and this made me curious. eScriptorium is available as a Docker image, so perhaps I will give it a try in the not-too-distant future. I tried several interesting systems earlier, but they appeared to be no longer maintained and/or not fully documented... Nevertheless I am still interested in your experiences with OCR engines; please share them from time to time.
Hello, I was reading about the plugin OCRmyPDF_DjVu_Optimize_Plugin, which seems to implement MRC compression using DjVu, but I don't see any example or information on its GitHub page about how to install, compile, or use that plugin with OCRmyPDF. Has anyone used it and has information about it?
@LexLuthorX: See my comment from Oct 25:
It was more a proof-of-concept than a real thing.
Is your feature request related to a problem? Please describe.
My use case is "scanning" documents with a smartphone camera, then archiving those "scans" as low-quality monochrome images. But OCR should be done beforehand on the high-quality images.
I describe this in more detail here: #443 (comment)
Furthermore I see a discussion covering a similar topic here: #293
Describe the solution you'd like
I want greater control over image quality for the images embedded into the PDF (after doing OCR). I can imagine these possible solutions (each point is a complete solution):
Additional context
I'm currently evaluating how to achieve my goal with the least effort. I see two approaches:
I'm not sure about the second approach - where would be a good point to start? One approach could be:
@jbarlow83 Does this sound right?