Slow recognition due to multithreading issues in Tesseract 4 and 5 #2

Closed
ripefig opened this issue Aug 5, 2019 · 18 comments

ripefig commented Aug 5, 2019

Even testing with one language, it takes ten seconds to recognize. OCR'ing the same image with tesseract via the command line takes under a second.

I have it set to recognize only one language in both cases, so I am not sure what's causing the delay in dpscreenocr.

Owner

danpla commented Aug 5, 2019

I need some info to try to reproduce the issue:

  • What is your OS (if Linux, which distribution)?
  • Which language did you use?
  • Can you send me a sample image, or at least tell me its dimensions?

Author

ripefig commented Aug 6, 2019

Kubuntu 19.10. You can just use any block of text. I was going to record a gif, but my screen recorder is crapping out for some reason. It took me about two minutes to recognize this block of text, with ffmpeg and dpOCR taking up ~35% of CPU each throughout the process:

Even testing with one language it takes ten seconds to recognize. OCR'ing the same image with tesseract via the commandline takes under a second.

I have it set to only recognize one language in both case, so am not sure what's causing the delay in dpscreenocr.

Owner

danpla commented Aug 6, 2019

Unfortunately, I can't reproduce it. On my Kubuntu 19.10, it takes about 4 seconds:
[Animated screen capture: ImpartialThreadbareBrocketdeer]

Owner

danpla commented Aug 6, 2019

Is the "Run executable" action enabled?

Author

ripefig commented Aug 6, 2019

My env is the same as yours, except that you have only one option in your languages list (English), whereas I have all the international languages. I only have English selected for recognition, however.

But maybe it's still looping through all the languages trying to find the right one?

Owner

danpla commented Aug 6, 2019

I installed all languages with the tesseract-ocr-all package, but the result is the same:
[Animated screen capture: BruisedImprobableAsiandamselfly]

Can you paste the contents of the ~/.config/dpscreenocr/settings.cfg file?

Author

ripefig commented Aug 7, 2019

action_add_to_history true
action_copy_to_clipboard false
action_copy_to_clipboard_text_separator \n\n
action_run_executable false
action_run_executable_path
action_run_executable_wait_to_complete true
history_wrap_words true
hotkey_cancel_selection Escape
hotkey_toggle_selection Alt + Print Screen
ocr_allow_queuing true
ocr_dump_debug_image false
ocr_languages eng
ocr_split_text_blocks false
ui_active_tab 0
ui_languages_sort_column 1
ui_languages_sort_descending true
ui_native_file_dialogs true
ui_window_height 627
ui_window_maximized false
ui_window_width 626
ui_window_x 136
ui_window_y 123

Operating System: Kubuntu 19.04
KDE Plasma Version: 5.15.4
KDE Frameworks Version: 5.56.0
Qt Version: 5.12.2
Kernel Version: 5.0.0-21-generic
OS Type: 64-bit
Processors: 4 × Intel® Core™ i7-4600U CPU @ 2.10GHz
Memory: 11.6 GiB of RAM
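
For anyone comparing their own settings against the listing above, here is a minimal Python sketch that reads such a dump into a dictionary. It assumes only what the listing shows: one key per line, followed by whitespace and an optional value. The real settings.cfg format may have escaping or quoting rules that this sketch ignores, and the function name `parse_settings` is hypothetical.

```python
def parse_settings(text):
    """Parse 'key value' lines into a dict; a key without a value maps to ''."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace only, so values such as
        # "Alt + Print Screen" keep their internal spaces.
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1] if len(parts) > 1 else ""
    return settings


# A few lines copied from the listing above.
sample = """\
action_run_executable false
action_run_executable_path
hotkey_toggle_selection Alt + Print Screen
ocr_languages eng
"""
settings = parse_settings(sample)
print(settings["ocr_languages"])
```

With the dump loaded this way, it is easy to confirm that only `eng` is selected and that the "Run executable" action is off.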

Owner

danpla commented Aug 8, 2019

I tried to test on Kubuntu 19.04, but the result is the same. No idea why the program is so slow on your machine. Is there a chance that you have a custom version of libtesseract, for example, one compiled manually or one that comes from a third-party repository?
[Animated screen capture: GeneralBlindHairstreakbutterfly]

Author

ripefig commented Aug 9, 2019

Well no, it's the regular package. It can't be tesseract anyway, because my bash script takes under two seconds and it uses tesseract too.

Maybe you just have a really fast processor? My laptop is five years old, with integrated graphics. I notice that ffmpeg hovers at 30% CPU for a while. Let me know if there is any other way to diagnose what's going on inside your program.

Owner

danpla commented Aug 9, 2019

There are some bug reports (like the following) about performance drops in Tesseract 4; the issues are related to CPU instructions (see Hardware and CPU Requirements), but I don't think that's your case because my CPU is even older than yours and only supports SSE4.

dpScreenOCR doesn't do anything special compared to the command-line tesseract, except that it preprocesses the image to improve OCR quality: it converts the image to grayscale, scales it up 4 times, and finally performs unsharp masking. Of course, the preprocessing step takes some time (almost 2 seconds for my whole 1600x900 screen), and OCRing a 4 times bigger image is slightly slower than OCRing the original, but the difference is not that high, at least not 10 times. In fact, on my machine OCR with the command-line tool is slower for the original (small) image than for the preprocessed (4 times bigger) one: 2.348s vs 0.656s.
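
To make the three preprocessing steps concrete, here is a rough pure-Python sketch of the same pipeline on a tiny image stored as rows of (r, g, b) tuples. This is not dpScreenOCR's code: the luma weights, the nearest-neighbour scaling, the 3x3 box blur, and the `amount` parameter are all stand-ins chosen for illustration, since the thread does not say which filters or strengths the program uses.

```python
def to_grayscale(rgb):
    """Convert rows of (r, g, b) tuples to rows of luma values (0-255)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]


def scale_up(gray, factor=4):
    """Nearest-neighbour upscale: repeat every pixel and every row."""
    out = []
    for row in gray:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def box_blur(gray):
    """3x3 box blur; edge pixels average over the neighbours that exist."""
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [gray[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def unsharp_mask(gray, amount=1.0):
    """sharpened = original + amount * (original - blurred), clamped to 0-255."""
    blurred = box_blur(gray)
    return [[min(255, max(0, round(p + amount * (p - b))))
             for p, b in zip(row, blur_row)]
            for row, blur_row in zip(gray, blurred)]


def preprocess(rgb):
    """Grayscale, then 4x upscale, then unsharp mask, in that order."""
    return unsharp_mask(scale_up(to_grayscale(rgb)))
```

The upscale multiplies the pixel count by 16, which is why the preprocessing itself takes measurable time on a full-screen capture.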

Anyway, this case is simple to test: download the following 2 images and try to send them to the command-line tesseract. The first is the original, the second is the preprocessed one:

1
2
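
If you would rather time the two runs from a script than by hand, a small Python sketch like the following could work. It assumes the `tesseract` executable is on your PATH and uses the standard `tesseract <image> stdout -l <lang>` invocation; the helper names are hypothetical, and the image file names are whatever you saved the two downloads as.

```python
import subprocess
import time


def tesseract_command(image_path, lang="eng"):
    """Build a command that OCRs one image and prints the text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def time_command(command):
    """Run a command to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    # Output is discarded: only the run time matters for this comparison.
    subprocess.run(command, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start
```

For example, `time_command(tesseract_command("original.png"))` and `time_command(tesseract_command("preprocessed.png"))` would give the two numbers to compare, where both file names are placeholders.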

Author

ripefig commented Aug 9, 2019

Yep, it's tesseract. Sorry for the confusion (apparently it got updated since I last used it in my OCR script, but I was not aware of that, so I remained under the impression that tesseract was still working fine). With version 4, tesseract is completely unusable.

Times are 1m14s for the smaller file and 1m26s for the larger file.

tesseract --version

tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE

So all the instructions are supported but for some reason it's relying on raw C++.

Author

ripefig commented Aug 12, 2019

@danpla, I filed the tesseract bug here. Again, apologies for not realizing tesseract was the problem sooner.

tesseract-ocr/tesseract#2611

Owner

danpla commented Aug 14, 2019

No worries. Thanks for using dpScreenOCR.

danpla closed this as completed Aug 14, 2019

Author

ripefig commented Aug 15, 2019

@danpla it looks like it may be a good idea to disable multicore - the potential gains are very tiny but the potential slowdown is two orders of magnitude. I am not sure what other apps using tesseract do exactly, but they work fine on my system (e.g. gimagereader, paperwork). Presumably they prevent tesseract from maxing out the core count.

danpla changed the title from "Really slow." to "Slow recognition due to multithreading issues in Tesseract 4 and 5" Apr 20, 2022

Owner

danpla commented Jul 24, 2022

Summary for users experiencing the same problem: you need to set the OMP_THREAD_LIMIT environment variable to 1 before running dpScreenOCR. See the Troubleshooting section of the manual for details.
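
As one possible way to apply this from a launcher script rather than a shell, here is a minimal Python sketch that starts a program with OMP_THREAD_LIMIT set to 1 in its environment. The helper names are hypothetical, and the executable name passed to `launch` depends on how the program is installed on your system.

```python
import os
import subprocess


def single_thread_env(base=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set to 1."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def launch(command):
    """Start a program with the OpenMP thread limit applied; return the Popen."""
    # The variable is set only for the child process, not for this script
    # or for the rest of the session.
    return subprocess.Popen(command, env=single_thread_env())
```

Calling `launch(["dpscreenocr"])` would then start the program with the limit in place, assuming that is the name of the installed executable.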

Owner

danpla commented Dec 8, 2022

Starting from version 1.3.0, dpScreenOCR will automatically set OMP_THREAD_LIMIT on start.

@Freredaran

Summary for users experiencing the same problem: you need to set the OMP_THREAD_LIMIT environment variable to 1 before running dpScreenOCR. See the Troubleshooting section of the manual for details.

Same here. After updating to Ubuntu 22.04, gImageReader became incredibly slow for me. Dev manisandro was very helpful and led me to a quick and dirty CLI solution for running on a single thread. Works wonderfully for me.
In a terminal, type:

export OMP_THREAD_LIMIT=1

If you want to check that the variable is actually set to 1 in the current shell, type:

echo $OMP_THREAD_LIMIT

Then run gImageReader:

gimagereader-gtk

Et voilà :o)

Owner

danpla commented Feb 18, 2023

Since dpScreenOCR 1.3.0, you no longer need to set OMP_THREAD_LIMIT manually: the program will do it automatically.
