Build a front end for OCRmyPDF (development suggestion) #75

e42mercury · 2024-06-08T12:40:00Z

I love that I can use this directly in Zotero, but it has few options and sometimes creates really huge files. Many of the feature requests and errors people are asking about here have already been addressed by another program, OCRmyPDF. It is based on the same Tesseract engine but also handles rotation, compression, languages... really an excellent program all around. Instead of rewriting a scaled-down version of the same program as a Zotero plugin, I'd suggest a plugin that is just a front end for OCRmyPDF: a window inside Zotero that would help you 1) install everything and 2) choose among the many options in OCRmyPDF, described in its excellent documentation, while it runs in the background.

The only problem is that it's command-line only. So it would definitely be a contribution to the user community if someone made it usable from a Zotero window. Or a standalone GUI. Just a suggestion.
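To make the suggestion concrete, here is a minimal sketch of what such a front end would do underneath: turn the options picked in a window into an `ocrmypdf` command line and start it in the background. The function names `build_ocrmypdf_command` and `run_in_background` are hypothetical, invented for illustration; only the `-l`, `--rotate-pages`, and `--optimize` options come from OCRmyPDF itself. This is not how the Zotero OCR plugin works today.

```python
import shutil
import subprocess
from typing import List, Optional


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    languages: Optional[List[str]] = None,
    rotate_pages: bool = False,
    optimize: Optional[int] = None,
) -> List[str]:
    """Translate GUI-style choices into an ocrmypdf argument list.

    Hypothetical helper: the parameter names are assumptions,
    the emitted flags are OCRmyPDF command-line options.
    """
    cmd = ["ocrmypdf"]
    if languages:
        # Several languages are joined with '+', e.g. "eng+deu".
        cmd += ["-l", "+".join(languages)]
    if rotate_pages:
        # Ask OCRmyPDF to fix pages that are rotated the wrong way.
        cmd.append("--rotate-pages")
    if optimize is not None:
        # Optimization level; higher levels compress more aggressively.
        cmd += ["--optimize", str(optimize)]
    cmd += [input_pdf, output_pdf]
    return cmd


def run_in_background(cmd: List[str]) -> subprocess.Popen:
    """Start the command without blocking the caller (hypothetical helper)."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("ocrmypdf is not installed or not on PATH")
    return subprocess.Popen(cmd)


if __name__ == "__main__":
    # Only prints the command; nothing is executed here.
    print(build_ocrmypdf_command("in.pdf", "out.pdf", ["eng", "deu"], True, 1))
```

Keeping the command construction separate from the process launch means the option mapping can be checked without OCRmyPDF installed, and the "help you install everything" step would hook in where the `shutil.which` check fails.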

stweil · 2024-06-08T13:14:02Z

We know OCRmyPDF very well (see for example https://ocr-bw.bib.uni-mannheim.de/anwendung/druckwerke/). OCRmyPDF is a useful tool for those who need its many options and features. But supporting it would make the Zotero OCR plugin much larger and more complex. I like small solutions which are simple for users to handle. Therefore I am not sure that supporting OCRmyPDF would be a good idea.

The large PDF files which the current Tesseract produces are also a well-known problem (see issue #42). I think that it must be addressed in the Tesseract code. The latest Tesseract release, 5.4.0, improved the text positions in the generated PDF and makes the files a little bit smaller, but much greater size reductions are possible. tesseract-ocr/tesseract#4171 shows an example. My favourite approach is better compression of the PDF code and using image formats like JPEG 2000 with high compression inside the PDF file.

e42mercury · 2024-06-08T15:03:10Z

OK, that makes sense. Thanks for the links on the latest improvements to Tesseract. I've just updated so I'll try it out. I'll keep using OCRmyPDF but I also love having the OCR plugin inside Zotero. Thanks!

stweil added the enhancement New feature or request label Jun 8, 2024

stweil added the wontfix This will not be worked on label Jul 29, 2024

stweil closed this as completed Sep 7, 2024
