If PDF is enabled, any PDFs that error during /spatie/pdf-to-image conversion will be DELETED & LOST. #30

Open
jpubb opened this issue Sep 25, 2020 · 4 comments

jpubb commented Sep 25, 2020

PROBLEM:
When PDF is selected in the "Files - Tesseract OCR" options and the Elasticsearch indexing task encounters a PDF that Ghostscript (used by /spatie/pdf-to-image, which this app uses) considers "bad", that file is deleted and lost during the failed conversion.

Severity:
Critical if you enable [x] PDF within the app, because you cannot guarantee that users will not upload PDFs that Ghostscript considers "bad". Any such file will be deleted and lost during indexing.

More details
In my case I had Tesseract PSM set to 12 and the PDF page limit set to 10, though neither setting should matter here.

The error thrown by Ghostscript during indexing (when PDF is enabled in the Files - Tesseract OCR app) is something like:

**** Error: stream operator isn't terminated by valid EOL. Output may be incorrect.

If you search for this Ghostscript "warning", you can see that many people have encountered it over time, which means that many different PDF creation libraries can potentially cause it. In our case, I believe it is caused by whatever NAPS2 (https://github.com/cyanfish/naps2) uses to save as PDF.
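
To find out which uploaded PDFs trigger this message before indexing touches them, something like the following read-only check might help. It is only a sketch: it assumes Ghostscript is installed as `gs`, that the message contains the literal text `**** Error` as quoted above, and `GS_BIN`, `pdf_has_gs_errors` and `list_suspicious_pdfs` are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Sketch: flag PDFs that Ghostscript complains about, without modifying them.
# GS_BIN is a hypothetical override for the Ghostscript binary (default: gs).

pdf_has_gs_errors() {
    gs_status=0
    # The "nullpage" device parses the file but writes no output anywhere.
    output=$("${GS_BIN:-gs}" -dBATCH -dNOPAUSE -sDEVICE=nullpage "$1" 2>&1 </dev/null) \
        || gs_status=$?
    # Ghostscript may print "**** Error" and still exit 0, so check both.
    if [ "$gs_status" -ne 0 ] || printf '%s\n' "$output" | grep -q '\*\*\*\* Error'; then
        return 0
    fi
    return 1
}

# Print every PDF under directory $1 that Ghostscript reports problems with.
list_suspicious_pdfs() {
    find "$1" -type f -name '*.pdf' | while IFS= read -r f; do
        if pdf_has_gs_errors "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling `list_suspicious_pdfs` on a copy of the user data directory would give a list of files worth backing up or excluding before PDF indexing is enabled.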

Suggested Solution for Files - Tesseract OCR

We cannot assume pdf-to-image was successful. Preserve the source/input .pdf until it is confirmed that an OCR-scanned PDF of the source file has been generated.
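
The idea can be sketched in shell as follows. This is not the app's actual code (the app is PHP and uses /spatie/pdf-to-image); it only illustrates the pattern of converting a temporary copy and reporting success only when the converter exits cleanly and produced output. `PDF_CONVERTER`, `default_converter` and `convert_pdf_safely` are hypothetical names, and the Ghostscript options shown are an assumed example configuration.

```shell
#!/bin/sh
# Sketch: convert a PDF to page images without ever touching the source file.
# PDF_CONVERTER is a hypothetical hook naming a command or function that
# takes <input.pdf> <output-dir>; by default Ghostscript is used.

default_converter() {
    gs -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 \
       -sOutputFile="$2/page-%03d.png" "$1"
}

convert_pdf_safely() {
    src=$1
    out_dir=$2
    converter=${PDF_CONVERTER:-default_converter}

    work_dir=$(mktemp -d) || return 1
    # Work on a copy; the source is only ever read.
    cp -- "$src" "$work_dir/input.pdf" || { rm -rf -- "$work_dir"; return 1; }
    mkdir -p -- "$work_dir/pages"

    # Success requires both a clean exit and at least one generated page.
    if "$converter" "$work_dir/input.pdf" "$work_dir/pages" \
        && [ -n "$(ls -A "$work_dir/pages")" ]; then
        mkdir -p -- "$out_dir" && cp -- "$work_dir/pages"/* "$out_dir"/
        status=$?
    else
        echo "conversion failed, source left untouched: $src" >&2
        status=1
    fi

    rm -rf -- "$work_dir"   # removes only the temporary copy
    return "$status"
}
```

Because cleanup only ever removes the temporary directory, a failed conversion costs nothing but the wasted work; the caller can decide to skip OCR for that file and index it without content.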

ArtificialOwl self-assigned this Sep 28, 2020

W9CR commented Jun 1, 2022

I'm running into this issue. Under no circumstances should we lose data. I've disabled this now as well.

ArtificialOwl (Member) commented

Do you have anything in the Nextcloud logs regarding the issue?

Do you have the OCR app enabled?
https://apps.nextcloud.com/apps/ocr

I am looking at the code again to confirm it, but there should be no reason for the app fulltextsearch_elasticsearch to ask the app files_fulltextsearch to delete the file from the filesystem.


W9CR commented Jun 2, 2022

I apologize, I jumped the gun on this. I verified that this is not changing the files on the server by looking at the modification time. What I am seeing is that PDFs that already have OCR are being run through Tesseract regardless. Is this normal behavior?

I have some rather large PDFs (24x51" at 600 dpi), and this seems to just fail on them at the Ghostscript level.

Is there a log level that I can enable to make this easier to see?
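
For anyone who wants to repeat the modification-time check more rigorously, a checksum snapshot taken before and after an index run also catches files that were deleted or rewritten. This is a minimal sketch assuming GNU coreutils (`sha256sum`, `comm`); `snapshot_dir` and `compare_snapshots` are hypothetical helper names, and the snapshot files should live outside the scanned directory.

```shell
#!/bin/sh
# Sketch: detect PDFs that disappeared or changed between two points in time.

# snapshot_dir DIR OUT: write "checksum  path" for every PDF under DIR to OUT.
snapshot_dir() {
    find "$1" -type f -name '*.pdf' -exec sha256sum {} + | sort > "$2"
}

# compare_snapshots BEFORE AFTER: print entries present in BEFORE but not in
# AFTER, i.e. files that were removed or whose content changed.
compare_snapshots() {
    comm -23 "$1" "$2"
}

# Example (the data path is a placeholder, adjust to your installation):
#   snapshot_dir /path/to/nextcloud/data /tmp/before.txt
#   ./occ fulltextsearch:index
#   snapshot_dir /path/to/nextcloud/data /tmp/after.txt
#   compare_snapshots /tmp/before.txt /tmp/after.txt
```

Empty output from `compare_snapshots` means every PDF recorded before the run still exists with identical content.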

ArtificialOwl (Member) commented Jun 2, 2022

> What I am seeing is PDF's that already have OCR, they are being run through tesseract regardless. Is this normal behavior?

Are you talking about a file generated by the OCR app from a PDF with no text layer?

If you have a file that failed to be indexed by FullTextSearch and was deleted from Nextcloud during the index, I am really interested in reproducing this issue :)

Try this; it raises the debug level to 3 for this app:

./occ config:app:set files_fulltextsearch_tesseract debug_level --value 3

You can get more data using:

./occ config:app:set files_fulltextsearch_tesseract debug_trace --value '1'
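
Once debugging is done, the two settings above can be checked and removed again with the generic `occ` app-config commands. This is a sketch assuming it is run from the Nextcloud root as the web server user; the log path in the last line is Nextcloud's default location and may differ on your installation.

```shell
# Confirm the values that were set (prints the stored value for each key).
./occ config:app:get files_fulltextsearch_tesseract debug_level
./occ config:app:get files_fulltextsearch_tesseract debug_trace

# Remove both keys again so the app falls back to its defaults.
./occ config:app:delete files_fulltextsearch_tesseract debug_level
./occ config:app:delete files_fulltextsearch_tesseract debug_trace

# Follow the Nextcloud log while re-running the index (path is an assumption).
tail -f data/nextcloud.log
```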
