High CPU usage on multicore #61

Open
XueSheng-GIT opened this issue Aug 23, 2023 · 0 comments · May be fixed by #62

Comments


XueSheng-GIT commented Aug 23, 2023

I'm using Ubuntu 22.04.3, Nextcloud 27.0.2 and files_fulltextsearch_tesseract 27.0.0.
My machine has 4 cores (4 threads) available, but Tesseract OCR takes ages and CPU usage sometimes stays at 100% for hours. Checking with htop, I noticed that tesseract runs with multiple (4) threads but never seems to finish ;).
I'm not sure when this issue started. Judging by #14, tesseract used only one thread in the past. I have been affected by this high CPU usage for at least a couple of months.

After a short search, I came across the option to set a thread limit (https://github.com/thiagoalessio/tesseract-ocr-for-php#thread-limit). There seem to be cases (like mine) in which tesseract blocks itself when too many cores are available (see also tesseract-ocr/tesseract#898).
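For context, here is a minimal sketch of what a thread limit can look like at the command level. It assumes the limit reaches tesseract through OpenMP's `OMP_THREAD_LIMIT` environment variable. `buildTesseractCommand` and the file name are hypothetical; they are not part of this app or of the wrapper library:

```php
<?php
declare(strict_types=1);

// Hypothetical helper (illustration only): builds a tesseract command line
// that caps the number of OpenMP threads via OMP_THREAD_LIMIT.
function buildTesseractCommand(string $image, int $threadLimit): string
{
    if ($threadLimit < 1) {
        throw new InvalidArgumentException('threadLimit must be >= 1');
    }
    // "stdout" as the output base makes tesseract print the recognized text.
    return sprintf(
        'OMP_THREAD_LIMIT=%d tesseract %s stdout',
        $threadLimit,
        escapeshellarg($image)
    );
}

// Assumed file name; the string could be passed to shell_exec() on a
// machine where tesseract is installed.
echo buildTesseractCommand('page-1.png', 1), PHP_EOL;
```

The environment variable is set only for that one command, so other processes on the machine are not affected.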

So I did some testing with this thread limit.
OCR settings:
Limit PDF pages: 20
Timeout: 60 seconds

Test file: Nextcloud Manual.pdf

I measured the runtime of this loop:

for ($i = 1; $i <= $pages; $i++) {
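A measurement like this can be reproduced with a small timing wrapper around the per-page loop. The sketch below is not the app's actual code: `timePages` is a hypothetical helper, and the closure is a stand-in for the real per-page OCR call:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: runs a per-page callback for pages 1..$pages and
// returns the elapsed wall-clock time in seconds.
function timePages(int $pages, callable $ocrPage): float
{
    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 1; $i <= $pages; $i++) {
        $ocrPage($i);
    }
    return (hrtime(true) - $start) / 1e9;
}

// Stand-in workload; replace the closure body with the real OCR call.
$seconds = timePages(20, function (int $page): void {
    usleep(1000);
});
printf("Runtime for all pages: %.2f seconds\n", $seconds);
```

`hrtime()` is used instead of `microtime()` because it is monotonic and therefore unaffected by system clock adjustments during a long run.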

Runtime for all pages:
threadLimit(1): 27.84 seconds
threadLimit(2): 26.21 seconds
threadLimit(3): 20.00 seconds
threadLimit(4): more than 15 minutes (it was still running while I wrote this issue). EDIT: It took 1119.51 seconds (18.66 minutes) to finish.
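To put these numbers in proportion, the following sketch expresses each runtime as a multiple of the threadLimit(1) runtime. `relativeRuntime` is a hypothetical helper; the values are the measurements listed above:

```php
<?php
declare(strict_types=1);

// Runtimes in seconds, copied from the measurements above (threads => seconds).
$runtimes = [1 => 27.84, 2 => 26.21, 3 => 20.00, 4 => 1119.51];

// Hypothetical helper: divides every runtime by the baseline runtime.
function relativeRuntime(array $runtimes, int $baseline): array
{
    $out = [];
    foreach ($runtimes as $threads => $seconds) {
        $out[$threads] = $seconds / $runtimes[$baseline];
    }
    return $out;
}

foreach (relativeRuntime($runtimes, 1) as $threads => $factor) {
    printf("threadLimit(%d): %.2fx the single-thread runtime\n", $threads, $factor);
}
```

With these values, threadLimit(3) needs roughly 0.72 times the single-thread runtime, while threadLimit(4) needs about 40 times as long.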

As you can see, limiting the number of threads improves the situation a lot. A check on a test instance with 2 cores available shows the same result: tesseract blocks itself when running on all available cores.
