I'm using Ubuntu 22.04.3, NC27.0.2 and files_fulltextsearch_tesseract 27.0.0.
My machine has 4 cores (4 threads) available, but Tesseract OCR takes ages and CPU usage sometimes stays at 100% for hours. Checking with htop, I noticed that tesseract runs with multiple (4) threads but never seems to finish ;).
I'm not sure when this issue started. Judging by #14, tesseract only used one thread in the past. I've been affected by this high CPU usage for at least a couple of months.
After a short search, I came across the option to set a thread limit (https://github.com/thiagoalessio/tesseract-ocr-for-php#thread-limit). It seems there are cases (like mine) in which tesseract blocks itself when too many cores are available (see also tesseract-ocr/tesseract#898).
So I did some testing with this thread limit.
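For anyone who wants to try the same thing outside of Nextcloud, here is a minimal sketch in Python. It assumes that the thread limit boils down to setting the OpenMP environment variable `OMP_THREAD_LIMIT` for the tesseract process, which is how I understand the linked option to work. The helper name `run_with_thread_limit` and the file name `page.png` are made up for illustration.

```python
import os
import subprocess
import time


def run_with_thread_limit(cmd, thread_limit, timeout=None):
    """Run cmd with OMP_THREAD_LIMIT set; return (elapsed_seconds, result).

    Assumption: the process honours the OpenMP variable OMP_THREAD_LIMIT,
    which should be the case for OpenMP-enabled tesseract builds.
    """
    env = os.environ.copy()
    env["OMP_THREAD_LIMIT"] = str(thread_limit)

    start = time.perf_counter()
    result = subprocess.run(
        cmd, env=env, capture_output=True, text=True, timeout=timeout
    )
    elapsed = time.perf_counter() - start
    return elapsed, result


# Example call (placeholder input file, not taken from my setup):
# elapsed, result = run_with_thread_limit(["tesseract", "page.png", "stdout"], 1)
# print(f"{elapsed:.2f} seconds, exit code {result.returncode}")
```

Calling it once per limit (1 to 4) on the same input should be enough to see whether a machine shows the same behavior.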
OCR Settings:
Limit PDF pages: 20
Timeout: 60 seconds
Testfile: Nextcloud Manual.pdf
I measured the runtime of the loop in files_fulltextsearch_tesseract/lib/Service/TesseractService.php (line 252 in e1405e4).
Runtime for all pages:
threadLimit(1): 27.84 seconds
threadLimit(2): 26.21 seconds
threadLimit(3): 20.00 seconds
threadLimit(4): more than 15 minutes (it was still running while I wrote this issue). EDIT: It took 1119.51 seconds (18.66 minutes) to finish.
As you can see, limiting the number of threads improves the situation a lot. A test instance with 2 available cores shows the same behavior (tesseract blocks itself when running on all available cores).
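To put the numbers in proportion, here is a small sketch that compares each measured runtime with the fastest one. The values are the ones from my measurements above; the function name `slowdown_vs_best` is just for illustration.

```python
# Measured runtimes in seconds, keyed by the thread limit used.
runtimes = {1: 27.84, 2: 26.21, 3: 20.00, 4: 1119.51}


def slowdown_vs_best(runtimes):
    """Return each runtime as a multiple of the fastest measured runtime."""
    best = min(runtimes.values())
    return {limit: seconds / best for limit, seconds in runtimes.items()}


for limit, factor in slowdown_vs_best(runtimes).items():
    print(f"threadLimit({limit}): {runtimes[limit]:8.2f} s ({factor:.1f}x the fastest run)")
```

With these values, the run with 4 threads comes out at roughly 56 times the runtime of the run with 3 threads.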