Tesseract 4 and 5 is about 100-150 times slower than 3 on my Linux system. #2611
Comments
The solution is to set |
I cannot reproduce your timing results on a recent Debian system:
With
|
Is the test image available somewhere? I would like to try it on a non-AVX system. |
It's given in the initial report: https://user-images.githubusercontent.com/45201036/62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png. Even without AVX it should not take more than a second. |
Thanks @stweil. Here are the results on my system:
Linux tesseract-ocr 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:54:50 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux
|
@Shreeshrii, it seems you have the same problem, given that your system is much more powerful. |
Test result on an ARM system:
With
The results for 4.0.0 and latest Git master are similar. |
@ripefig, your results could be explained if Tesseract cannot get 4 CPU cores. On my ARM system which has 4 cores I get a faster result with
That also reduces the huge overhead in the user time which occurs with 4 threads. |
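The setting itself did not survive in the comment above. As a minimal sketch, assuming it is the standard OpenMP environment variable OMP_THREAD_LIMIT (which later comments in this thread use), a small wrapper can cap the thread count for a single command. The helper name with_thread_limit and the file name page.png are illustrative only.

```shell
# Hypothetical helper (not part of Tesseract): run a command with the
# OpenMP thread count capped at the given limit.
with_thread_limit() {
    limit=$1
    shift
    # env sets the variable only for this one command.
    env OMP_THREAD_LIMIT="$limit" "$@"
}

# Example usage (page.png is a placeholder image name):
#   with_thread_limit 1 tesseract page.png stdout
```

Because the variable is set per command, other programs started from the same shell keep their default threading behavior.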
|
@stweil Is there any solution? Maybe limit the default number of cores to 1 (or max cores - 1) until Tesseract can reliably work with all cores? It seems completely broken for a lot of users, and the problem has persisted for years. This also breaks all the software that uses tesseract. |
The right solution depends on your hardware (number of cores, memory interface) and your use case: on some hardware using more than one core results in faster OCR (see my results above), and training is much faster with 4 cores. It is always possible to either set Because there are acceptable solutions for the speed issue, my current first priority is improving quality, not looking how to improve multithreading. If you or someone else finds a better solution for multithreading, a pull request would be welcomed. |
Out of the box, it takes about one hour to OCR a single page of text. It would take one month to OCR a textbook, and the CPU would probably fry. I think most users would consider this "completely broken," in the sense of not being usable. The issue affects both AVX and non-AVX systems. The program is capable of cutting down times by two orders of magnitude in both cases, as demonstrated in this thread. Why not just limit the core count by default until the issue is fixed? Of course, one could argue that it's up to application developers to make sure tesseract works on the target system. (I just tried a few OCR apps and most of them work fine - so it looks like they are fixing it on their end somehow). |
That's simply not true. It is slow on your notebook. On my six-year-old notebook (no AVX, 4 x Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz) the official Debian package works pretty well:
|
I didn't say it affects all systems, but it's frequent enough to warrant some kind of change. Multicore might result in 5-30% speed improvement in certain cases but it can also result in a 10000% speed decrease on many systems. Perhaps you're saying this is an Ubuntu issue? |
Based on a user suggestion and tesseract-ocr/tesseract#2611, I reviewed thread limits and found that thread limit of 3 is still beneficial, but not 4.
> time env OMP_THREAD_LIMIT=2 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
116.67user 1.67system 1:26.26elapsed 137%CPU (0avgtext+0avgdata 356752maxresident)k
2213inputs+0outputs (18major+131059minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=3 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
136.89user 1.63system 1:19.56elapsed 174%CPU (0avgtext+0avgdata 356784maxresident)k
821inputs+0outputs (0major+131080minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=4 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
161.31user 1.51system 1:18.80elapsed 206%CPU (0avgtext+0avgdata 356632maxresident)k
8477inputs+0outputs (12major+131074minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=8 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
160.30user 1.62system 1:18.01elapsed 207%CPU (0avgtext+0avgdata 356640maxresident)k
821inputs+0outputs (0major+131078minor)pagefaults 0swaps
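To put the four runs above side by side, here is a small sketch that computes the speedup relative to the OMP_THREAD_LIMIT=2 run. The elapsed and user times are copied from the output above and converted to seconds; the function name summarize is made up for this example.

```shell
# Columns: thread limit, elapsed seconds, user seconds
# (taken from the four timing runs quoted above).
timings='2 86.26 116.67
3 79.56 136.89
4 78.80 161.31
8 78.01 160.30'

# Hypothetical helper: print speedup relative to the first row.
summarize() {
    awk 'NR == 1 { base = $2 }
         { printf "limit=%s elapsed=%.2fs speedup=%.2fx user=%.2fs\n", $1, $2, base / $2, $3 }'
}

printf '%s\n' "$timings" | summarize
```

On these numbers, raising the limit from 2 to 3 shortens elapsed time by about 8%, while going from 3 to 4 gains about 1% and costs roughly 24 s more user time.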
|
Indeed, that whole multithreading thing caused more harm than good. There are a few issues around regarding this. I believe some people even compile specialized versions where OMP is completely removed since it runs more or less as fast, but with way less CPU consumption. |
closing as duplicate to #263 |
@stweil I want to compare the timing on Power8 to AVX2. I notice that the results you reported were with tesseract 4.0.0. Please rerun the test with the latest code. My current result is:
|
@Shreeshrii, my results for Power8 differ significantly when I use tessdata_fast. Did you test with tessdata_best? Power8 can be improved a lot by using SIMD. For tessdata_best that is easy to implement. Intermediate results (more results will get added later) with git master:
|
Of course it needs OpenMP (otherwise the compiler will raise an error with the current patch), so |
New results for an ARMv8 based NVIDIA Xavier running Ubuntu Bionic:
Tesseract must be called with |
The old ARMv7 results were made with
For this host, SIMD makes no difference. Tesseract uses NEON anyway. |
Can a different dot product calculation be used for Altivec - please see slide 16 in https://www.nxp.com/files-static/training_presentation/TP_ALTIVEC.pdf
|
It's done automatically with the openmp-simd code. |
@stweil, did you benchmark this against the manual code on an x86-64 machine? |
Shouldn't I added the following to my build script.
Now,
|
@stweil, It does not make sense to disable openmp and then to enable it. |
clang and gcc (>=4.9) both support the flag |
This code is more complete. |
|
|
What is obvious from the timing results is that there is a lot of variation across platforms and across options. I saw a lot of variation in time even on the same platform - see #2611 (comment) |
@stweil Which parts of the training process will be sped up by this? |
Yes, But I still do not know how |
You had also mentioned earlier, in a different thread, about unrolled loops. Should that also be implemented along with this for Power? |
Ideally loop unrolling should also be done by the compiler (try |
I set both Line 254 in 2b68898
linlsq_test failed with the following error. It works when I removed the -ffast-math .
|
Is using up to 8 cores = 8 threads?
tesseract/src/ccstruct/imagedata.h, line 315 in 944c1d9:
// A collection of DocumentData that knows roughly how much memory it is using. |
Yes, I should have written "up to 8 threads". If there are only two CPUs with hyperthreading, those 8 threads will run on 4 cores, and the performance will be rather low. Of course you can use The cache for image data works also with a separate thread, but that does not use OpenMP, so it also works when OpenMP was disabled. |
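To make the thread versus core distinction concrete, here is a rough sketch that caps a requested thread count at the number of logical CPUs reported by getconf. pick_thread_limit is a made-up helper name. Note that logical CPUs include hyperthreads, so the cap can still exceed the number of physical cores.

```shell
# Hypothetical helper: never ask for more threads than there are
# logical CPUs online. Hyperthreads count as logical CPUs here.
pick_thread_limit() {
    requested=$1
    available=$(getconf _NPROCESSORS_ONLN)
    if [ "$requested" -gt "$available" ]; then
        echo "$available"
    else
        echo "$requested"
    fi
}

# Example usage: OMP_THREAD_LIMIT=$(pick_thread_limit 8)
```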
I was asking about threads with regard to the above comment, whether multiple threads lead to slowing down. |
|
I have posted above the results of my test on power8 running tesstutorial (using scripts in shreeshrii/tess4training) with tesseract built from git master vs with SIMD patch as suggested by @stweil. I ran the scripts one by one, without any other process running in the VM so as to get results that should be comparable. The build was done using Advanced toolchain rather than the distro's gcc since
Build options in both cases included the following:
|
Questions:
Should I expect the result to be very different from
or should I use the following?
|
Updated training test results in #2611 (comment) |
Tesseract was very slow when run from a script, but I noticed it was very fast in a terminal. Using your suggestions as well, I changed the script line to "OMP_THREAD_LIMIT=1 xterm -geometry 1X1+0+0 -e tesseract file1 file2" and obtained the same speed. |
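For scripts that call tesseract repeatedly, one option is to set the limit once near the top instead of on every command line. This is only a sketch, assuming the limit of 1 suggested in this thread as the default; ocr_all is a hypothetical function name, and the output base is derived from each image name.

```shell
# Default to one OpenMP thread unless the caller already chose a value.
: "${OMP_THREAD_LIMIT:=1}"
export OMP_THREAD_LIMIT

# Hypothetical helper: OCR every image passed as an argument.
# Each tesseract child process inherits OMP_THREAD_LIMIT.
ocr_all() {
    for img in "$@"; do
        # Output base: the image path without its extension.
        tesseract "$img" "${img%.*}"
    done
}

# Example usage: ocr_all scans/*.png
```

Using `:=` keeps any value the user exported beforehand, so the script does not override a deliberate choice of thread count.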
Environment
> tesseract -v
tesseract-snap -v
tesseract-ocr-eng: 1:4.00~git30-7274cfa-1
I used the training data from the Ubuntu repos for both tesseract and tesseract-snap, since no data is provided with the snap.
Operating System: Kubuntu 19.04
KDE Plasma Version: 5.15.4
KDE Frameworks Version: 5.56.0
Qt Version: 5.12.2
Kernel Version: 5.0.0-21-generic
OS Type: 64-bit
Processors: 4 × Intel® Core™ i7-4600U CPU @ 2.10GHz
Memory: 11.6 GiB of RAM
Current Behavior:
It takes over a minute of 100% CPU load to scan an image (directly below) with two sentences:
Results for Tesseract 4:
> time tesseract -l eng 62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1
Tesseract 5:
> time tesseract-snap -l eng 62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1
I tried to OCR a one-page doc, but I had to exit the process. It would probably take one hour of full CPU load.
Unfortunately I don't have Tesseract 3 to compare, but I remember using it in an OCR screenshotting script; it felt as fast as regular copy and paste, so definitely under two seconds for this block of text.
Expected Behavior:
It shouldn't take this long to scan two sentences.
Suggested Fix
Disable multithreading by default until it's fixed.