It appears that this app runs OCR even when the PDF file is not a scanned document.
For example, on a fresh Nextcloud installation, php occ fulltextsearch:index takes a long time processing Nextcloud Manual.pdf (a 99-page PDF that ships with Nextcloud), and tesseract works hard scanning it even though the file already contains text. That is wasted effort.
I would suggest checking whether the PDF contains any text nodes and skipping Tesseract in that case.
This still seems to be an issue. I'm on NC23.0.3 with fulltextsearch tesseract 22.0.0.
Most of my PDF files contain a text layer, yet all PDFs seem to be processed by tesseract, which is a waste of resources.
Is there an easy way to detect whether a PDF contains a text layer and skip those files?
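One possible way to do the detection, as a rough sketch rather than anything this app currently does: try to extract text from the first few pages with pdftotext (from poppler-utils, assumed to be installed) and only fall back to OCR when almost nothing comes out. The function names, the 3-page sample, and the 50-character threshold are all hypothetical choices for illustration.

```python
import shutil
import subprocess


def has_enough_text(extracted: str, min_chars: int = 50) -> bool:
    """Return True if the text has at least min_chars non-whitespace characters."""
    # str.split() with no arguments drops spaces, newlines, and the form feeds
    # that pdftotext emits between pages.
    visible = "".join(extracted.split())
    return len(visible) >= min_chars


def pdf_has_text_layer(pdf_path: str, max_pages: int = 3, min_chars: int = 50) -> bool:
    """Heuristic: does the PDF yield extractable text without OCR?

    Assumes the pdftotext binary (poppler-utils) is on PATH.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) not found on PATH")

    # "-l N" stops after page N; "-" writes the extracted text to stdout.
    result = subprocess.run(
        ["pdftotext", "-l", str(max_pages), pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    if result.returncode != 0:
        # Unreadable or encrypted file: report no text layer so the caller can OCR.
        return False
    return has_enough_text(result.stdout, min_chars)
```

A caller could then skip Tesseract whenever pdf_has_text_layer(path) returns True. Sampling only the first pages keeps the check cheap, but it would misclassify mixed documents where some pages are scanned images, so a per-page check may be the safer variant.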