Build a front end for OCRmyPDF (development suggestion) #75

Closed
e42mercury opened this issue Jun 8, 2024 · 2 comments
Labels
enhancement New feature or request wontfix This will not be worked on

Comments

@e42mercury

I love that I can use this directly in Zotero, but it has few options and sometimes creates really huge files. A lot of the feature requests and errors people are asking about here have already been addressed by another program, OCRmyPDF. It is based on the same Tesseract engine but also handles rotation, compression, languages... really an excellent program all around. Instead of rewriting a scaled-down version of the same program as a Zotero plugin, I'd suggest a plugin that is just a front end for OCRmyPDF: a window inside Zotero that would help you 1) install everything and 2) choose among the many options in OCRmyPDF, described in its excellent documentation, while it runs in the background.

The only problem is that it's command-line only. So it would definitely be a contribution to the user community if someone made it usable from a Zotero window. Or a standalone GUI. Just a suggestion.
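
To make the suggestion concrete, here is a minimal sketch in Python of what the core of such a front end might do: collect the user's choices and turn them into an OCRmyPDF command line. The `OcrOptions` class and the `build_command` and `run_ocr` helpers are hypothetical names made up for this illustration, not part of OCRmyPDF or of the Zotero plugin. The flags used (`--language`, `--optimize`, `--rotate-pages`, `--deskew`) are existing OCRmyPDF options, but the selection is only an assumption about what a small GUI would expose first.

```python
import shutil
import subprocess
from dataclasses import dataclass, field


@dataclass
class OcrOptions:
    """Hypothetical container for the choices a front-end window might expose."""
    languages: list[str] = field(default_factory=lambda: ["eng"])
    rotate_pages: bool = False
    deskew: bool = False
    optimize: int = 1  # OCRmyPDF accepts optimization levels 0 to 3


def build_command(src: str, dst: str, opts: OcrOptions) -> list[str]:
    """Translate the selected options into an ocrmypdf argument list."""
    if not 0 <= opts.optimize <= 3:
        raise ValueError("optimize must be between 0 and 3")
    cmd = [
        "ocrmypdf",
        "--language", "+".join(opts.languages),  # e.g. "eng+deu"
        "--optimize", str(opts.optimize),
    ]
    if opts.rotate_pages:
        cmd.append("--rotate-pages")
    if opts.deskew:
        cmd.append("--deskew")
    cmd += [src, dst]
    return cmd


def run_ocr(src: str, dst: str, opts: OcrOptions) -> int:
    """Run OCRmyPDF as a child process and return its exit code."""
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf was not found on PATH")
    return subprocess.run(build_command(src, dst, opts), check=False).returncode
```

A GUI would call `build_command` with whatever the user ticked, for example `build_command("in.pdf", "out.pdf", OcrOptions(languages=["eng", "deu"], rotate_pages=True))`, and pass the result to a background process. Keeping command construction separate from execution means the option handling can be checked without OCRmyPDF installed.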

@stweil stweil added the enhancement New feature or request label Jun 8, 2024
@stweil
Member

stweil commented Jun 8, 2024

We know OCRmyPDF very well (see for example https://ocr-bw.bib.uni-mannheim.de/anwendung/druckwerke/). OCRmyPDF is a useful tool for those who need its many options and features. But supporting it would make the Zotero OCR plugin much larger and more complex. I like small solutions that are simple for users to handle. Therefore I am not sure that supporting OCRmyPDF would be a good idea.

The large PDF files which the current Tesseract produces are also a well-known problem (see issue #42). I think that it must be addressed in the Tesseract code. The latest Tesseract release, 5.4.0, improved the text positions in the generated PDF and makes the files a little bit smaller, but much larger size reductions are possible. tesseract-ocr/tesseract#4171 shows an example. My favourite is better compression of the PDF code and using image formats like JPEG 2000 with high compression inside the PDF file.

@e42mercury
Author

OK, that makes sense. Thanks for the links on the latest improvements to Tesseract. I've just updated, so I'll try it out. I'll keep using OCRmyPDF, but I also love having the OCR plugin inside Zotero. Thanks!

@stweil stweil added the wontfix This will not be worked on label Jul 29, 2024
@stweil stweil closed this as completed Sep 7, 2024