why is Tesseract built with --disable-openmp #127
Comments
IIRC the overhead of CPU-parallelization with |
Okay, so it might be inefficient usage of excess resources, but usage nevertheless! If you have a single workspace you want to process as fast as possible, and multiple cores kicking each others' heels, then having any within-module parallelization is better than none. And you can always disable OpenMP at runtime (by setting |
That setting is there because I think it's the best for most use cases. Tesseract with OpenMP uses exactly 4 threads for parts of the total OCR process (probably for most of ocrd-tesserocr-recognize). In practice this accelerates the OCR process, but not by a factor of 4. If you are lucky, you will get a factor of 2 or 3. The total CPU usage is much higher because of the overhead mentioned by @kba, with CPUs burning energy while synchronizing threads. There remains a significant overhead even if you set
For training the situation is even worse, because training uses up to 8 threads (really bad on typical PCs which only support 4 or 6 parallel threads) and the overhead is so large that there is only a very small performance gain, if any.
On the other hand, it is really easy to use parallel Tesseract runs, and that works very efficiently without any overhead, both for recognition (single-threaded) and training (up to two threads). Even with OCR-D that can already be used with different workspaces.
Instead of changing the compiler options, it would be better to support parallelisation on the page level for selected processors. That would allow optimized usage of the available resources, from 2 or 4 threads on older hardware and 6 threads on current PCs to 32 or even up to 128 threads on recent server hardware.
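The page-level parallelisation suggested above can be sketched roughly as follows. This is only an illustration, not OCR-D code: `build_command` and `process_pages` are hypothetical names, and the per-page command is a placeholder that would have to be replaced by a real single-threaded OCR call.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_command(page_path):
    """Hypothetical per-page command (assumption, not from the thread).

    A Python one-liner stands in for the real single-threaded OCR call.
    """
    script = "import sys; print('done ' + sys.argv[1])"
    return [sys.executable, "-c", script, page_path]


def process_pages(page_paths, max_workers=None, command_builder=build_command):
    """Run one external process per page, at most max_workers at a time."""
    if max_workers is None:
        # Default to one worker per logical CPU reported by the OS.
        max_workers = os.cpu_count() or 1

    def run_one(page_path):
        result = subprocess.run(
            command_builder(page_path),
            capture_output=True,
            text=True,
            check=True,  # raise if the child process fails
        )
        return page_path, result.stdout.strip()

    # The threads only wait on child processes; the CPU work happens
    # in the children, each of which is a separate process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of page_paths.
        return list(pool.map(run_one, page_paths))
```

With this shape, `max_workers` is the single knob for matching the number of concurrent jobs to the available cores, whether that is 2 on older hardware or many more on a server.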
@bertsky, a simple solution for you could be implemented by moving Tesseract's configure options into a macro, so you could override it in your personal |
Thanks @stweil for that clarification. So we cannot opt out of the overhead with the envvar, and it breaks some use cases having OpenMP compiled in. Then I agree we should ignore this option, and focus on implementing OCR-D/core#322 as a general solution.
Ah yes, having that option would be useful I think.
I thought about that. Is there anything you know of (besides losing the benefits of a static build) that Alex' PPA builds could cause trouble with?
Nothing that I am aware of. Of course Alex' PPA builds are typically a little behind Git master (but usually not too much). And they must be told where to find the OCR-D model files for Tesseract because they use a different default path.
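A minimal sketch of pointing a packaged Tesseract at a different model directory, assuming the standard `TESSDATA_PREFIX` environment variable is the mechanism used for this. The helper name `env_with_models` and any directory passed to it are made up for illustration.

```python
import os


def env_with_models(model_dir, base_env=None):
    """Return a copy of the environment with TESSDATA_PREFIX set.

    Assumption: the Tesseract build in use honours TESSDATA_PREFIX.
    The caller's environment mapping is left unmodified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TESSDATA_PREFIX"] = model_dir
    return env
```

The returned mapping could then be passed as `env=` to `subprocess.run` when invoking the `tesseract` binary, so the override applies to that one call only.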
Right! But we can override One complication is that ocrd_tesserocr currently wraps tesseract-ocr (4.1.1) instead of tesseract-ocr-devel (5.0) in its |
No, sorry, I don't remember that. Maybe because 4.1.x is the official stable version, while there is currently no released 5.0? Version 5 is a moving target. The major version was increased from 4 to 5 because it is incompatible on the API level (for example classes and structs were optimized, header files removed, proprietary data types removed), and I would like to keep it open until more changes of that kind have been made. Otherwise we'd have to go to release 6 for the next incompatible change.
Sure, there's gotta be major for all dev work, I have always found that a good decision.
Either that or because of the dependency on And would you recommend against using 4.1.1 for OCR-D purposes, or is this difference also negligible on average? |
Git master / 5 typically has more fixes, which are only backported from time to time. It also performs better. In most cases I use OCR-D with a very recent Tesseract, but I would not recommend against using 4.1.1.
Great, then we could change ocrd_tesserocr's Or should we keep the deb-based installation variant more conservative on purpose? (Then I do indeed need a way to customize the |
This goes back to the very first commit bringing Tesseract build rules by @stweil:
ocrd_all/Makefile
Line 468 in 8fb9ee3
Disabling OpenMP means losing implicit CPU parallelization, which can speed up single-job workflows significantly.
Native system packages are usually built with OpenMP enabled.
Can we please drop --disable-openmp?