avoid OCR on non-image PDF files #28

jampy · 2020-08-19T13:55:48Z

It appears that this app runs OCR even when the PDF file is not a scanned document.

For example, I have a fresh Nextcloud installation and I see php occ fulltextsearch:index taking a lot of time processing Nextcloud Manual.pdf (a 99-page PDF that comes with Nextcloud) while tesseract works hard scanning it... That's simply useless.

I would suggest checking whether the PDF contains any text nodes and skipping Tesseract in that case.

XueSheng-GIT · 2022-04-02T11:32:34Z

This still seems to be an issue. I'm on NC23.0.3 with fulltextsearch tesseract 22.0.0.
Most of my PDF files do contain a text layer, but all PDFs seem to be processed by tesseract, which is a waste of resources.
Is there an easy way to detect whether a PDF contains a text layer and skip OCR for those files?
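
One possible way to detect a text layer, as a minimal sketch and not necessarily how this app works: extract text from the first few pages with the `pdftotext` CLI from poppler-utils and skip OCR when enough visible characters come back. The function names, the 5-page limit, and the 50-character threshold are assumptions chosen for illustration only.

```python
import shutil
import subprocess


def extract_text(pdf_path: str, max_pages: int = 5) -> str:
    """Return embedded text from the first pages via poppler's pdftotext CLI."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) is not installed")
    # "-l N" stops after page N; "-" writes the extracted text to stdout.
    result = subprocess.run(
        ["pdftotext", "-l", str(max_pages), pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def has_text_layer(text: str, min_chars: int = 50) -> bool:
    """Hypothetical heuristic: enough non-whitespace characters means a text layer."""
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars


def needs_ocr(pdf_path: str, extractor=extract_text) -> bool:
    """Return True when the PDF appears to have no usable text layer."""
    return not has_text_layer(extractor(pdf_path))
```

A threshold is used instead of a plain "any text at all" check because scanned PDFs sometimes carry a few stray characters; the right value would need tuning against real files. PDFs that mix scanned and text pages would also need a per-page check, which this sketch does not attempt.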

youduda linked a pull request Aug 1, 2023 that will close this issue

Add feature to skip OCR on PDF files with text #58

Open
