Slow recognition due to multithreading issues in Tesseract 4 and 5 #2

ripefig · 2019-08-05T17:42:16Z

Even testing with one language, it takes ten seconds to recognize. OCR'ing the same image with tesseract via the command line takes under a second.

I have it set to only recognize one language in both cases, so I am not sure what's causing the delay in dpscreenocr.

danpla · 2019-08-05T20:33:54Z

I need some info to try to reproduce the issue:

What is your OS (if Linux, which distribution)?
Which language did you use?
Can you send me a sample image, or at least tell me its dimensions?

ripefig · 2019-08-06T00:05:28Z

Kubuntu 19.10. You can just use any block of text. I was going to record a gif, but my screen recorder is crapping out for some reason. It took me about two minutes to recognize this block of text, with ffmpeg and dpOCR each taking up ~35% of CPU throughout the process:

Even testing with one language it takes ten seconds to recognize. OCR'ing the same image with tesseract via the commandline takes under a second.

I have it set to only recognize one language in both case, so am not sure what's causing the delay in dpscreenocr.

danpla · 2019-08-06T09:56:14Z

Unfortunately, I can't reproduce it. On my Kubuntu 19.10, it takes about 4 seconds:

danpla · 2019-08-06T10:01:38Z

Is the "Run executable" action enabled?

ripefig · 2019-08-06T14:22:22Z

My env is the same as yours except that you have only one option in your languages list (English), whereas I have all the international languages. However, I only have English selected for recognition.

But maybe it's still looping through all the languages trying to find the right one?

danpla · 2019-08-06T21:18:06Z

I installed all languages with the tesseract-ocr-all package, but the result is the same:

Can you paste the contents of the ~/.config/dpscreenocr/settings.cfg file?

ripefig · 2019-08-07T20:29:40Z

action_add_to_history true
action_copy_to_clipboard false
action_copy_to_clipboard_text_separator \n\n
action_run_executable false
action_run_executable_path
action_run_executable_wait_to_complete true
history_wrap_words true
hotkey_cancel_selection Escape
hotkey_toggle_selection Alt + Print Screen
ocr_allow_queuing true
ocr_dump_debug_image false
ocr_languages eng
ocr_split_text_blocks false
ui_active_tab 0
ui_languages_sort_column 1
ui_languages_sort_descending true
ui_native_file_dialogs true
ui_window_height 627
ui_window_maximized false
ui_window_width 626
ui_window_x 136
ui_window_y 123

Operating System: Kubuntu 19.04
KDE Plasma Version: 5.15.4
KDE Frameworks Version: 5.56.0
Qt Version: 5.12.2
Kernel Version: 5.0.0-21-generic
OS Type: 64-bit
Processors: 4 × Intel® Core™ i7-4600U CPU @ 2.10GHz
Memory: 11.6 GiB of RAM

danpla · 2019-08-08T08:30:04Z

I tried to test on Kubuntu 19.04, but the result is the same. No idea why the program is so slow on your machine. Is there a chance that you have a custom version of libtesseract, for example, one compiled manually or one that comes from a third-party repository?

ripefig · 2019-08-09T00:01:28Z

Well no, it's the regular package. It can't be tesseract anyway, because my bash script takes under two seconds and it uses tesseract too.

Maybe you just have a really fast processor? My laptop is five years old, with integrated graphics. I notice that ffmpeg hovers at 30% CPU for a while. Let me know if there is any other way to diagnose what's going on inside your program.

danpla · 2019-08-09T09:59:45Z

There are some bug reports (like the following) about performance drops in Tesseract 4; the issues are related to CPU instructions (see Hardware and CPU Requirements), but I don't think that's your case because my CPU is even older than yours and only supports SSE4.

dpScreenOCR doesn't do anything special compared to the command-line tesseract, except that it preprocesses the image to improve OCR quality: it converts the image to grayscale, scales it up 4 times, and finally performs unsharp masking. Of course, the preprocessing step takes some time (almost 2 seconds for my whole 1600x900 screen), and OCRing a 4 times bigger image is slightly slower than OCRing the original, but the difference is not that large, certainly not 10 times. In fact, on my machine, OCR with the command-line tool is slower for the original (small) image than for the preprocessed (4 times bigger) one: 2.348s vs 0.656s.
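
For anyone who wants to produce a comparable input for the command-line tool, the three steps described above can be approximated with ImageMagick. This is only a rough sketch and not dpScreenOCR's own code: the function name preprocess_for_ocr, the file names, the reading of "4 times" as 400% per side, and the unsharp parameters (0x1) are all assumptions chosen for illustration.

```shell
# preprocess_for_ocr SRC DST
# Approximates the described pipeline with ImageMagick (an assumption;
# dpScreenOCR does its own processing internally).
preprocess_for_ocr() {
    src=$1
    dst=$2
    if ! command -v convert >/dev/null 2>&1; then
        echo "ImageMagick 'convert' not found in PATH" >&2
        return 127
    fi
    # 1. grayscale, 2. enlarge to 400% per side, 3. unsharp mask.
    # The unsharp values are illustrative, not taken from dpScreenOCR.
    convert "$src" -colorspace Gray -resize 400% -unsharp 0x1 "$dst"
}
```

For example, preprocess_for_ocr original.png preprocessed.png would write the enlarged grayscale copy, which can then be passed to tesseract like any other image.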

Anyway, this case is simple to test: download the following 2 images and try to send them to the command-line tesseract. The first is the original, the second is the preprocessed one:
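
A minimal sketch of that test, assuming the two images were saved as original.png and preprocessed.png (hypothetical names) and that the eng language data is installed. The helper time_ocr is made up for illustration; it relies only on the usual tesseract command-line form (image, stdout as the output base, -l for the language) and on whole-second timestamps from date, which is coarse but enough to tell one second from one minute.

```shell
# time_ocr IMAGE
# Runs the command-line tesseract on IMAGE (English only) and prints
# how many whole seconds the recognition took.
time_ocr() {
    image=$1
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found in PATH" >&2
        return 127
    fi
    if [ ! -f "$image" ]; then
        echo "no such image: $image" >&2
        return 1
    fi
    start=$(date +%s)
    status=0
    # "stdout" as the output base sends the recognized text to standard
    # output; it is discarded because only the duration matters here.
    tesseract "$image" stdout -l eng >/dev/null 2>&1 || status=$?
    end=$(date +%s)
    echo "$image: $((end - start))s"
    return "$status"
}
```

Running time_ocr original.png and then time_ocr preprocessed.png prints one line per image with the elapsed seconds, which is the comparison asked for here.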

ripefig · 2019-08-09T21:27:25Z

Yep, it's tesseract. Sorry for the confusion (apparently it got updated since I last used it in my OCR script, but I was not aware of that, so I remained under the impression that tesseract was still working fine). With version 4, tesseract is completely unusable.

Times are 1m14s for the smaller file and 1m26s for the larger file.

tesseract --version

tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE

So all the instructions are supported but for some reason it's relying on raw C++.

ripefig · 2019-08-12T00:06:25Z

@danpla, I filed the tesseract bug here. Again, apologies for not realizing tesseract was the problem sooner.

tesseract-ocr/tesseract#2611

danpla · 2019-08-14T17:26:07Z

No worries. Thanks for using dpScreenOCR.

ripefig · 2019-08-15T17:49:49Z

@danpla it looks like it may be a good idea to disable multicore: the potential gains are very small, but the potential slowdown is two orders of magnitude. I am not sure what other apps using tesseract do exactly, but they work fine on my system (e.g. gimagereader, paperwork). Presumably they prevent tesseract from maxing out the core count.
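
One way to check whether threading is the culprit is to time the same image under different OMP_THREAD_LIMIT values. The sketch below is hedged accordingly: OMP_THREAD_LIMIT is a standard OpenMP environment variable, but the helper names, the image path, and the chosen limits (1, 2, 4) are assumptions, and the variable only has an effect if the installed tesseract was built with OpenMP.

```shell
# ocr_seconds LIMIT IMAGE
# Prints the whole seconds tesseract needs for IMAGE when the OpenMP
# thread count is capped at LIMIT for that one process.
ocr_seconds() {
    start=$(date +%s)
    OMP_THREAD_LIMIT=$1 tesseract "$2" stdout -l eng >/dev/null 2>&1 || return 1
    end=$(date +%s)
    echo $((end - start))
}

# compare_thread_limits IMAGE
# Repeats the measurement for a few illustrative thread limits.
compare_thread_limits() {
    for n in 1 2 4; do
        if seconds=$(ocr_seconds "$n" "$1"); then
            echo "OMP_THREAD_LIMIT=$n: ${seconds}s"
        else
            echo "OMP_THREAD_LIMIT=$n: tesseract failed" >&2
            return 1
        fi
    done
}
```

With a hypothetical sample.png, compare_thread_limits sample.png prints one timing per limit. If the line for a limit of 1 is far faster than the others, the slowdown comes from the multithreaded code path.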

danpla · 2022-07-24T17:30:51Z

Summary for users experiencing the same problem: you need to set the OMP_THREAD_LIMIT environment variable to 1 before running dpScreenOCR. See the Troubleshooting section of the manual for details.
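
In shell terms, that amounts to something like the sketch below, which keeps the limit scoped to one program instead of the whole session. The function name run_single_threaded is made up, and dpscreenocr as the executable name is an assumption based on the settings path mentioned earlier in this thread.

```shell
# run_single_threaded COMMAND [ARGS...]
# Starts COMMAND with the OpenMP thread count capped at 1, without
# changing the environment of the calling shell.
run_single_threaded() {
    OMP_THREAD_LIMIT=1 "$@"
}
```

For example, run_single_threaded dpscreenocr would start the program with the limit applied only to that process.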

danpla · 2022-12-08T20:15:58Z

Starting from version 1.3.0, dpScreenOCR will automatically set OMP_THREAD_LIMIT on start.

Freredaran · 2023-02-18T17:07:38Z

Summary for users experiencing the same problem: you need to set the OMP_THREAD_LIMIT environment variable to 1 before running dpScreenOCR. See the Troubleshooting section of the manual for details.

Same here. After updating to Ubuntu 22.04, gImageReader became incredibly slow for me. Dev manisandro was very helpful and led me to a quick and dirty CLI solution for running on a single thread. Works wonderfully for me.
In a terminal, type:

export OMP_THREAD_LIMIT=1

If you want to check that you actually are running on one thread, type:

echo $OMP_THREAD_LIMIT

Then run gImageReader:

gimagereader-gtk

Et voilà :o)

danpla · 2023-02-18T21:02:55Z

Since dpScreenOCR 1.3.0, you no longer need to set OMP_THREAD_LIMIT manually: the program will do it automatically.

ripefig mentioned this issue Aug 9, 2019

Significant speed drop on Tesseract4 vs 3 with identical image tesseract-ocr/tesseract#1278

Closed

danpla closed this as completed Aug 14, 2019

danpla changed the title ~~Really slow.~~ Slow recognition due to multithreading issues in Tesseract 4 and 5 Apr 20, 2022
