avoid OCR on non-image PDF files #28

Open
jampy opened this issue Aug 19, 2020 · 1 comment · May be fixed by #58

Comments

@jampy

jampy commented Aug 19, 2020

It appears that this app runs OCR even when the PDF file is not a scanned document.

For example, I have a fresh Nextcloud installation and I see `php occ fulltextsearch:index` taking a lot of time processing Nextcloud Manual.pdf (a 99-page PDF that comes with Nextcloud) while tesseract is working hard scanning it... That's simply useless.

I would suggest checking whether the PDF contains any text nodes and skipping Tesseract in that case.

@XueSheng-GIT

This still seems to be an issue. I'm on NC23.0.3 with fulltextsearch tesseract 22.0.0.
Most of my PDF files contain a text layer, but all PDFs seem to be processed by tesseract, which is a waste of resources.
Is there an easy way to detect whether a PDF contains a text layer and just skip those files?
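
One possible way to detect a text layer, sketched here under assumptions and not taken from this app or the linked pull request: run poppler's `pdftotext` on the first few pages and only fall back to OCR when almost no text comes out. The function names, the 3-page limit and the 50-character threshold are hypothetical choices for illustration, and the sketch is in Python even though the app itself is PHP.

```python
import shutil
import subprocess


def extract_text(pdf_path: str, max_pages: int = 3) -> str:
    """Return the text that pdftotext finds in the first pages of a PDF.

    Assumes the pdftotext binary (poppler-utils) is installed on the host.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) is not installed")
    # "-l N" stops after page N; "-" as the output file writes to stdout.
    result = subprocess.run(
        ["pdftotext", "-l", str(max_pages), pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def has_text_layer(text: str, min_chars: int = 50) -> bool:
    """Heuristic: enough non-whitespace characters means a usable text layer.

    The threshold is an arbitrary assumption; scanned PDFs usually yield
    nothing but whitespace and page separators.
    """
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars


def needs_ocr(pdf_path: str, extractor=extract_text) -> bool:
    """Return True when the PDF should still be sent to Tesseract."""
    return not has_text_layer(extractor(pdf_path))
```

With `pdftotext` available, `needs_ocr("Nextcloud Manual.pdf")` would report whether that file still has to go through OCR. A character-count heuristic like this can misjudge mixed documents (for example a scanned PDF with a small text stamp on each page), so the threshold would need tuning against real files.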

@youduda youduda linked a pull request Aug 1, 2023 that will close this issue