If PDF is enabled, any PDFs that error during /spatie/pdf-to-image conversion will be DELETED & LOST. #30

jpubb · 2020-09-25T20:45:46Z

PROBLEM:
When PDF is selected in the "Files - Tesseract OCR" options and the Elasticsearch indexing task encounters a PDF that Ghostscript (used by /spatie/pdf-to-image, which this app uses) considers "bad", that file is deleted and lost during the failed conversion process.

Severity:
Critical if you enable [x] PDF within the app, because you cannot guarantee that users will not upload PDFs that Ghostscript considers "bad". Any such file will be deleted and lost during indexing.

More details
In my case I had Tesseract PSM set to 12 and "limit PDF pages" set to 10, though neither setting should matter here.

The error thrown by Ghostscript during indexing (when PDF is enabled in the Files - Tesseract OCR app) looks something like this:

**** Error: stream operator isn't terminated by valid EOL. Output may be incorrect.

If you search for this "warning" from Ghostscript, you can see that many people have encountered it over time, which suggests that many different PDF creation libraries can cause it. In our case, I believe it is caused by whatever NAPS2 (https://github.com/cyanfish/naps2) uses to save as PDF.

Suggested Solution for Files - Tesseract OCR

We cannot assume that pdf-to-image was successful. Preserve the source/input .pdf until it is confirmed that an OCR-scanned PDF of the source file has been generated.
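As a rough illustration of the "preserve the source until the output is confirmed" idea, here is a minimal Python sketch. It is not the app's code (the app is written in PHP), and the names `convert_preserving_source` and `convert` are hypothetical. The assumption is that the converter is only ever handed a scratch copy, so a failed or partial conversion cannot touch the original upload.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional


def convert_preserving_source(
    source: Path,
    convert: Callable[[Path, Path], None],
) -> Optional[Path]:
    """Run `convert(input, output)` on a scratch copy of `source`.

    `convert` is a hypothetical stand-in for the pdf-to-image + OCR step.
    The original file is never modified or deleted. Returns the path of
    the confirmed output, or None if the conversion failed.
    """
    final_output = source.with_name(source.stem + ".ocr" + source.suffix)
    with tempfile.TemporaryDirectory() as scratch:
        scratch_dir = Path(scratch)
        work_copy = scratch_dir / ("input" + source.suffix)
        candidate = scratch_dir / ("output" + source.suffix)
        shutil.copy2(source, work_copy)  # the converter only sees the copy
        try:
            convert(work_copy, candidate)
        except Exception:
            return None  # conversion failed; the source is untouched
        # Treat a missing or empty output as a failure as well.
        if not candidate.is_file() or candidate.stat().st_size == 0:
            return None
        # Only a confirmed output leaves the scratch directory.
        shutil.move(str(candidate), str(final_output))
    return final_output
```

The output is written next to the source under a different name, so even a successful run leaves the original in place; whether to replace it afterwards is a separate, explicit decision.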


W9CR · 2022-06-01T15:50:46Z

I'm running into this issue. Under no circumstances should we lose data. I've disabled this now as well.

ArtificialOwl · 2022-06-01T16:35:42Z

Do you have anything in the Nextcloud logs regarding the issue?

Do you have the OCR app enabled?
https://apps.nextcloud.com/apps/ocr

I am looking at the code again to confirm it, but there should be no reason for the app fulltextsearch_elasticsearch to ask the app files_fulltextsearch to delete the file from the filesystem.

W9CR · 2022-06-02T18:26:46Z

I apologize, I jumped the gun on this. I verified, by looking at the modification times, that this is not changing the files on the server. What I am seeing is that PDFs that already have OCR are being run through Tesseract regardless. Is this normal behavior?
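For anyone who wants to repeat that kind of check, here is a minimal Python sketch that records each PDF's modification time and SHA-256 digest before an indexing run and compares them afterwards. The helper names `snapshot_pdfs` and `diff_snapshots` are hypothetical, and the sketch assumes direct read access to the data directory on the server.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple

# relative path -> (modification time in ns, SHA-256 hex digest)
Snapshot = Dict[str, Tuple[int, str]]


def file_sha256(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large PDFs are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_pdfs(root: Path) -> Snapshot:
    """Record mtime and digest for every *.pdf below `root`."""
    snap: Snapshot = {}
    for path in sorted(root.rglob("*.pdf")):
        key = str(path.relative_to(root))
        snap[key] = (path.stat().st_mtime_ns, file_sha256(path))
    return snap


def diff_snapshots(before: Snapshot, after: Snapshot) -> Dict[str, List[str]]:
    """List files that disappeared or whose mtime or content changed."""
    common = before.keys() & after.keys()
    return {
        "missing": sorted(before.keys() - after.keys()),
        "changed": sorted(name for name in common if before[name] != after[name]),
    }
```

Take one snapshot before indexing and one after; an empty "missing" list and an empty "changed" list mean the run did not delete or alter any PDF under that directory. Note that the `*.pdf` pattern is case-sensitive on Linux, so files ending in `.PDF` would need a second pattern.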

I have some rather large PDFs (24x51" at 600 dpi), and this seems to just fail on them at the Ghostscript level.

Is there a log level that I can enable to make this easier to see?

ArtificialOwl · 2022-06-02T19:02:42Z

> What I am seeing is that PDFs that already have OCR are being run through Tesseract regardless. Is this normal behavior?

Are you talking about a file generated by the OCR app from a PDF with no text layer?

If you have a file that failed to be indexed by FullTextSearch and was deleted from Nextcloud during indexing, I am really interested in reproducing this issue :)

Try this; it raises the debug level to 3 for this app:

./occ config:app:set files_fulltextsearch_tesseract debug_level --value 3

You can get more data using:

./occ config:app:set files_fulltextsearch_tesseract debug_trace --value '1'

ArtificialOwl self-assigned this Sep 28, 2020
