PROBLEM:
When PDF is selected in the "Files - Tesseract OCR" options and the Elasticsearch indexing task encounters any PDFs that Ghostscript (used by spatie/pdf-to-image, which this app depends on) considers "bad", those files are deleted and lost during the failed conversion.
Severity:
Critical if you enable [x] PDF within the app, because you cannot guarantee that users will never upload PDFs that Ghostscript considers "bad". Any such files are deleted and lost during indexing.
More details
In my case I had the Tesseract PSM (page segmentation mode) set to 12 and the PDF page limit set to 10, though neither setting should matter here.
The error thrown by Ghostscript during indexing (when PDF is enabled in the Files - Tesseract OCR app) looks like:
**** Error: stream operator isn't terminated by valid EOL. Output may be incorrect.
If you search for this Ghostscript "warning", you can see that many people have encountered it over time, which means many different PDF-creation libraries can potentially cause it to occur. In our case, I believe it is caused by whatever library NAPS2 (https://github.com/cyanfish/naps2) uses to save as PDF.
Suggested Solution for Files - Tesseract OCR
We cannot assume pdf-to-image was successful. Preserve the source/input .pdf until it is confirmed that an OCR-scanned PDF of the source file has been generated.
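The suggested fix can be sketched as a write-to-temp-then-swap pattern (illustrative Python, not the app's actual PHP code; `run_ocr` is a hypothetical stand-in for the pdf-to-image + Tesseract pipeline):

```python
import os
import shutil
import tempfile

def ocr_pdf_safely(source_path, run_ocr):
    """Run OCR on source_path, replacing it only if conversion succeeds.

    run_ocr(src, dst) is a hypothetical stand-in for the real
    pdf-to-image + Tesseract pipeline; it must write the OCR'd PDF
    to dst and raise on failure (e.g. when Ghostscript rejects the
    input). The source PDF is never touched until dst is confirmed.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    try:
        run_ocr(source_path, tmp_path)
        # Only treat the conversion as successful if a non-empty
        # output file actually exists.
        if os.path.getsize(tmp_path) == 0:
            raise RuntimeError("OCR produced an empty file")
    except Exception:
        os.unlink(tmp_path)
        raise  # source PDF is left untouched on any failure
    # Replace the source only now, after the output is confirmed.
    shutil.move(tmp_path, source_path)
```

With this shape, a Ghostscript failure propagates as an exception and the original upload survives, instead of being lost mid-conversion.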
I am looking at the code again to confirm it, but there should be no reason for the fulltextsearch_elasticsearch app to ask the files_fulltextsearch app to delete the file from the filesystem.
I apologize; I jumped the gun on this. I did verify that this is not changing the files on the server by looking at the modification times. What I am seeing is that PDFs which already have OCR are run through Tesseract regardless. Is this normal behavior?
I have some rather large PDFs (24x51" at 600 dpi), and indexing seems to simply fail on them at the Ghostscript level.
Is there a log level that I can enable to make this easier to see?
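To see whether Ghostscript itself chokes on a given file, outside of Nextcloud, one option is to ask it to parse the PDF without producing any output (a sketch; `suspect.pdf` is a placeholder filename):

```shell
# nullpage parses and renders the PDF without writing anything;
# a non-zero exit status, or warnings such as the EOL error above
# on stderr, marks a PDF that Ghostscript considers "bad".
gs -dNOPAUSE -dBATCH -dQUIET -sDEVICE=nullpage suspect.pdf
echo "gs exit status: $?"
```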
What I am seeing is that PDFs which already have OCR are run through Tesseract regardless. Is this normal behavior?
Are you talking about a file generated by the OCR app from a PDF with no text layer?
If you have a file that failed to be indexed by FullTextSearch and was deleted from Nextcloud during the index, I am really interested in reproducing this issue :)
Try this; it will raise the debug level to 3 for this app: