
good accuracy but too slow, how to improve Tesseract speed #263

Closed
ychtioui opened this issue Mar 10, 2016 · 91 comments

@ychtioui

I integrated Tesseract C/C++, version 3.x, to read English OCR on images.

It’s working pretty well, but very slowly. It takes close to 1000 ms (1 second) to read the attached image (00060.jpg) on my quad-core laptop.

I’m not using the Cube engine, and I’m feeding only binary images to the OCR reader.

Is there any way to make it faster? Any ideas on how to make Tesseract read faster?
Thanks.
00060

@stweil
Member

stweil commented Mar 10, 2016

You can already run 4 parallel instances of Tesseract on your quad core; then it will read 4 images in about the same time. Introducing multithreading would not help to reduce the time needed for OCR of many images. I am working on a project where OCR with Tesseract would take nearly 7 years on a single core, but luckily I can try to get many computers and use their cores, so the time can be reduced to a few days.
Using compiler settings which are optimized for your CPU helps to gain a few percent, but I am afraid that a larger gain would require different algorithms in Tesseract and its libraries.

@ychtioui
Author

Besides the OCR, we have other things that need to run on the other cores.
I believe the main issue slowing down Tesseract is the way memory is managed.
Too many memory allocations (new) and releases (delete or delete[]) slow down the reader.
In the past I used a different OCR engine that allocated large buffers up front to store all the needed data (a large buffer of blobs, a large buffer of lines, a large buffer of words and their corresponding data); the buffers were simply indexed as data was read from an image. The large buffers were allocated only once at engine initialization and released only once at engine shutdown. This memory management scheme was very efficient in terms of computation time.
Are there any Tesseract settings that are known to be computationally intensive?
Any tricks to speed up Tesseract?
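The up-front allocation scheme described above can be sketched as a simple object pool. `BlobPool` and its fields are hypothetical illustrations of the idea, not Tesseract's BLOBNBOX API:

```python
class BlobPool:
    """Fixed-size pool: allocate every slot once, then hand out indices.

    acquire()/release() are O(1) and do no heap allocation, which is
    the point of the scheme described above.
    """

    def __init__(self, capacity):
        # One allocation for the whole lifetime of the engine.
        self.slots = [{"left": 0, "top": 0, "width": 0, "height": 0}
                      for _ in range(capacity)]
        self.free = list(range(capacity - 1, -1, -1))  # stack of free indices

    def acquire(self):
        if not self.free:
            raise MemoryError("blob pool exhausted")
        return self.free.pop()

    def release(self, index):
        self.free.append(index)
```

The pool is released once, at engine shutdown, simply by dropping the `BlobPool` object.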

@tfmorris
Contributor

What evidence is your memory management speculation based on?

@ychtioui
Author

I'm not speculating. The reality is that Tesseract takes more than 3 seconds to read the image I attached above (I use VS2010). When I use the console test application that comes with Tesseract, it takes about the same time (more than 3 seconds).

Anyone would speculate a lot in 3 seconds

I have more than 20 years in machine vision and have used several OCR engines in the past. Actually, I have one in-house that reads the same image in less than 100 ms, but our engine is designed more for reading a single line of text (i.e. it returns a single line of text).

The Tesseract database is not that large. Most of the techniques used by Tesseract are quite standard in the OCR area (page layout, line extraction, candidate character extraction, word forming, and then several phases of classification). However, Tesseract manages memory usage very badly. Why else would it take more than 3 seconds to read a typical text image?

please if you're not bringing any meaningful ideas to my posting, just spare me your comment.

@stweil
Member

stweil commented Mar 11, 2016

@ychtioui, as you have spent many years in machine vision, you know quite well that there are lots of reasons why programs can be slow. Memory management is just one of them. Even with a lot of experience, I'd start by running performance analyzers to investigate performance issues. Of course I can guess what the possible reasons might be and try to improve the software based on those guesses, but improvements based on evidence (like the result of a performance analysis) are more efficient. Don't you think so, too? Do you have a chance to run a performance analysis?

@zdenop
Contributor

zdenop commented Mar 11, 2016

You can try to use the 3.02 version if you only need English. AFAIR it was significantly faster on my (old) computer.

Zdenko


@ychtioui
Author

I'm running version 3.02.
I'm going through different sections of the reader and checking which section takes the most time.

Is it typical for images such as mine (attached above) to take a few seconds to read?

thanks for your comments.

@amitdo
Collaborator

amitdo commented Mar 18, 2016

... 3.02 version ... AFAIR it was significantly faster on my (old) computer.

3.02.02 is compiled with '-O3' by default.
https://github.com/tesseract-ocr/tesseract/blob/3.02.02/configure.ac#L161

3.03 and 3.04 are compiled with '-O2' by default.
https://github.com/tesseract-ocr/tesseract/blob/3.03-rc1/configure.ac#L201
https://github.com/tesseract-ocr/tesseract/blob/3.04.01/configure.ac#L300

2.04 and 3.01 are compiled with '-O2' by default.
https://github.com/tesseract-ocr/tesseract/blob/2.04/configure.ac
https://github.com/tesseract-ocr/tesseract/blob/3.01/configure.ac
The 'configure.ac' script in these versions does not explicitly set the '-O' level, so autotools will use '-O2' as its default.

@ychtioui
Author

Thanks amitdo.
I'm using 3.02, the C/C++ version of Tesseract.
I couldn't find the -O3 setting in the source files. Where is it?

@amitdo
Collaborator

amitdo commented Mar 18, 2016

What I linked to was actually 3.02.02.

I think this is 3.02:
https://github.com/tesseract-ocr/tesseract/blob/d581ab7e12a2fac4a73ac0af4ce7ec522b8f3e42/configure.ac

You are right. It does not contain any '-On' flag, but if you are using autotools to build Tesseract, it will instruct the compiler to use '-O2' by default.

@amitdo
Collaborator

amitdo commented Mar 18, 2016

I assume you are using Tesseract on Linux / FreeBSD / Mac. On Windows + MS Visual C++ the configure.ac file is irrelevant.

@Shreeshrii
Collaborator

@ychtioui said in a post above "I use VS2010" so using Windows.

@amitdo
Collaborator

amitdo commented Mar 19, 2016

Thanks Shree.

I don't know which optimization level is used for Visual C++.

@ychtioui
Author

I use VS2010 on a Windows 7 PC.
Project settings or build options won't change the read speed much.
Tesseract was designed in research labs, and most of the key sections of the reader were not written with speed in mind.
I used some performance tools to analyze where most of the computation time is spent.
In the page layout section, the blob analyzer does a lot of new/delete, which is very time consuming. The attached image above has more than 3600 blobs, and a number of processing steps are performed on each blob (distance transform, finding the enclosing rectangle, measuring blob parameters, etc.). The allocation (new) and release (delete) of all these blobs is very time consuming.
If we used a global, up-front-allocated array of blobs (specifically the BLOBNBOX object), then whenever we needed a blob we would just take one index from the array. The array would be released only once, when the engine shuts down.
I used this concept in another single-line OCR reader and it's super fast.

@zdenop
Contributor

zdenop commented Mar 19, 2016

VS2010 uses the optimization flag /O2 (Maximize Speed); other flags are set to their defaults.
In the past there were warnings in the forum against using aggressive compiler optimization flags, as they can also affect OCR results. This is the reason the standard optimization flags are used (-O2 in autotools and /O2 in VS).

I tried to run perf tool on linux:
perf record tesseract eurotext.tif eurotext
and I got this report (perf report):

  39,77%  tesseract  libtesseract.so.3.0.4  [.] tesseract::SquishedDawg::edge_char_of
  13,98%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeCharNormArrays
  13,09%  tesseract  libtesseract.so.3.0.4  [.] IntegerMatcher::UpdateTablesForFeature
   4,22%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::PruneClasses
   2,66%  tesseract  libtesseract.so.3.0.4  [.] ScratchEvidence::UpdateSumOfProtoEvidences
   1,48%  tesseract  libtesseract.so.3.0.4  [.] ELIST_ITERATOR::forward
   1,16%  tesseract  libc-2.19.so           [.] _int_malloc
   1,15%  tesseract  libtesseract.so.3.0.4  [.] tesseract::ShapeTable::MaxNumUnichars
   1,01%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ExpandShapesAndApplyCorrections
   0,87%  tesseract  liblept.so.5.0.0       [.] rasteropLow
   0,79%  tesseract  libm-2.19.so           [.] __mul
   0,72%  tesseract  libtesseract.so.3.0.4  [.] FPCUTPT::assign
   0,71%  tesseract  libc-2.19.so           [.] _int_free
   0,71%  tesseract  libtesseract.so.3.0.4  [.] ELIST::add_sorted_and_find
   0,61%  tesseract  libtesseract.so.3.0.4  [.] tesseract::AmbigSpec::compare_ambig_specs
   0,57%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeNormMatch
   0,52%  tesseract  libc-2.19.so           [.] memset
   0,49%  tesseract  libc-2.19.so           [.] vfprintf
   0,45%  tesseract  libc-2.19.so           [.] malloc
   0,36%  tesseract  libtesseract.so.3.0.4  [.] SegmentLLSQ
   0,31%  tesseract  libm-2.19.so           [.] __ieee754_atan2_sse2
   0,31%  tesseract  libc-2.19.so           [.] malloc_consolidate
   0,30%  tesseract  libtesseract.so.3.0.4  [.] LLSQ::add
   0,29%  tesseract  libtesseract.so.3.0.4  [.] GenericVector<tesseract::ScoredFont>::operator+=
   0,29%  tesseract  libtesseract.so.3.0.4  [.] _ZN14ELIST_ITERATOR7forwardEv@plt
   0,28%  tesseract  libtesseract.so.3.0.4  [.] tesseract::ComputeFeatures
   0,25%  tesseract  liblept.so.5.0.0       [.] pixScanForForeground
   0,24%  tesseract  libtesseract.so.3.0.4  [.] GenericVector<tesseract::ScoredFont>::reserve
   0,20%  tesseract  libtesseract.so.3.0.4  [.] C_OUTLINE::increment_step
   0,20%  tesseract  [kernel.kallsyms]      [k] clear_page

According to this report, the top 3 functions consumed 66% of the time.

Then I tried a 4-page A4 TIFF (G4 compressed):

  52,24%  tesseract  libtesseract.so.3.0.4  [.] tesseract::SquishedDawg::edge_char_of
  12,06%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeCharNormArrays
  10,06%  tesseract  libtesseract.so.3.0.4  [.] IntegerMatcher::UpdateTablesForFeature
   3,57%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::PruneClasses
   1,90%  tesseract  libtesseract.so.3.0.4  [.] ScratchEvidence::UpdateSumOfProtoEvidences
...

Then I tried a non-English image: perf record tesseract hebrew.png hebrew -l heb:

  27,79%  tesseract  libtesseract.so.3.0.4  [.] IntegerMatcher::UpdateTablesForFeature
  27,34%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeCharNormArrays
   4,40%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::PruneClasses
   3,98%  tesseract  libtesseract.so.3.0.4  [.] ScratchEvidence::UpdateSumOfProtoEvidences
   3,05%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeNormMatch
   2,36%  tesseract  libtesseract.so.3.0.4  [.] tesseract::ShapeTable::MaxNumUnichars
   2,05%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ExpandShapesAndApplyCorrections
...

@zdenop
Contributor

zdenop commented Sep 13, 2016

Just for the record, as a possible improvement for this issue: interesting information was posted in the scantailor project: OpenCL alone only brings a ~2x speed-up; another ~6x speed-up comes from multi-threaded processing.

@anant-pathak

Hi @ychtioui, I am a newbie and saw in your first comment that you are able to get pretty accurate results from Tesseract. For your image I am not able to get any results; it tells me: "Can't recognize image". Can you please provide the code snippet showing how you process the image?
Thanks - Anant.

@amitdo
Collaborator

amitdo commented Nov 28, 2016

@theraysmith
What do you use in the internal Google build, -O2 or -O3?

@paladini

paladini commented Apr 8, 2017

I'm interested in the same answer, @amitdo . Can you answer the question, @theraysmith ? It really can help us :)

@stweil
Member

stweil commented Apr 8, 2017

Don't expect much difference between -O2 and -O3. I tried different optimizations, and they only have small effects on the time needed for OCR of a page. Higher optimization levels can even result in slower code, because the code gets larger (due to loop unrolling) and CPU caches become less effective. It is much more important to write good code.

@theraysmith
Contributor

theraysmith commented Apr 14, 2017 via email

@stweil
Member

stweil commented Apr 15, 2017

The improvement by using -fopenmp is useful when you want "realtime" OCR – running OCR for a single page and waiting for the result. Then it is fast because it uses more than one CPU core for some time consuming parts of the OCR process.

For mass OCR, it does not help. If many pages have to be processed, it is better to use single threaded Tesseract and run several Tesseract processes in parallel.

@amitdo
Collaborator

amitdo commented Apr 15, 2017

Stefan, what about using OpenMP for training?

@stweil
Member

stweil commented Apr 15, 2017

Yes, for training a single new model OpenMP could perhaps speed up the training process. Up to now, OpenMP is only used in ccmain/ and in lstm/. I don't know how much that part is used during training, and I have never run a performance evaluation of the training process (in fact I have only run LSTM training once, for Fraktur, and as I already said, it was not really successful).

@theraysmith
Contributor

theraysmith commented Apr 17, 2017 via email

@xlight

xlight commented Apr 19, 2017

Can I set more than 4 threads for training the LSTM?

@theraysmith
Contributor

theraysmith commented Apr 19, 2017 via email

@amitdo
Collaborator

amitdo commented Apr 19, 2017

What about machines that have only 2 cores?
Shouldn't 'num_threads' be lowered to 2 in that case?

@theraysmith
Contributor

theraysmith commented Apr 19, 2017 via email

@stweil
Member

stweil commented Jan 25, 2020

The Linux kernel and kernel parameters also have a significant effect on the performance of Tesseract (both for recognition and training). In particular, the first kernels that tried to mitigate Spectre and similar CPU bugs make it really slow. I recently noticed that Tesseract with Debian GNU/Linux (testing / bullseye) is faster when running in the Windows Subsystem for Linux: running on a Linux kernel with the default settings is slightly slower than running on the Windows kernel.

With the kernel parameters from https://make-linux-fast-again.com/, Tesseract gets faster by about 10 to 20 % and is then faster than in the Windows Subsystem for Linux.

@PratapMehra

PratapMehra commented May 16, 2020

@zdenop How can I enable the AVX, AVX2, FMA or SSE optimizations?

@stweil
Member

stweil commented May 16, 2020

They are used automatically if your computer supports them.

@stweil
Member

stweil commented May 17, 2020

For texts without inverted text, significantly faster OCR is possible when tesseract is called with -c tessedit_do_invert=0; see the timing results above.

@ViniciusLelis

Is it possible to set -c tessedit_do_invert=0 at runtime, or do we need to build Tesseract with this option?

@amitdo
Collaborator

amitdo commented May 21, 2020

It's a runtime option:

tesseract in.png out -c tessedit_do_invert=0

@ViniciusLelis

Are you aware of whether or not the pytesseract has that option available?

@amitdo
Collaborator

amitdo commented May 21, 2020

I'm not familiar with pytesseract.

@stweil
Member

stweil commented May 22, 2020

Are you aware of whether or not the pytesseract has that option available?

The answer is on the pytesseract homepage:

config String - Any additional custom configuration flags that are not available via the pytesseract function. For example: config='--psm 6'
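Following the pytesseract documentation quoted above, a Tesseract variable such as tessedit_do_invert can be passed through the config string. The small helper below is hypothetical (not part of pytesseract or Tesseract), but the '-c key=value' syntax is Tesseract's own:

```python
def tess_config(**options):
    """Build a Tesseract '-c key=value' config string, suitable for
    pytesseract's `config=` parameter or for the tesseract CLI."""
    return " ".join(f"-c {key}={value}" for key, value in options.items())

# e.g. pytesseract.image_to_string(image, config=tess_config(tessedit_do_invert=0))
```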

@skydev66

Is there any way to use Tesseract with multithreading in an Android project?

@ViniciusLelis

I managed to get faster results by upgrading Tesseract from 4.x to 5.x (I can't remember the exact versions).
I also found out that our production servers were using the 32-bit version, so we installed the 64-bit version instead.
Analysis time went from 20+ seconds to 7 to 10 seconds, which is perfectly acceptable, since we also added 2 more servers.

@amitdo
Collaborator

amitdo commented Dec 26, 2021

Tesseract 5.0.0 should be faster than 4.1.x.

@zdenop, can you update your benchmarks above?

For the tessdata model, you can add two tests using just one of the OCR engines. Test 1: --oem 0 (legacy only); test 2: --oem 1 (LSTM only).

@stweil
Member

stweil commented Dec 26, 2021

Timing test with lstm_squashed_test on Debian bullseye, AMD EPYC 7413, Tesseract Git main, -O2:

# clang, default kernel options, configure --disable-shared --disable-openmp
[       OK ] LSTMTrainerTest.TestSquashed (22778 ms)
[       OK ] LSTMTrainerTest.TestSquashed (22764 ms)

# g++, default kernel options, configure --disable-shared --disable-openmp
[       OK ] LSTMTrainerTest.TestSquashed (23722 ms)
[       OK ] LSTMTrainerTest.TestSquashed (23739 ms)

# clang, kernel options https://make-linux-fast-again.com/, configure --disable-shared --disable-openmp
[       OK ] LSTMTrainerTest.TestSquashed (22984 ms)
[       OK ] LSTMTrainerTest.TestSquashed (23062 ms)

# g++, kernel options https://make-linux-fast-again.com/, configure --disable-shared --disable-openmp
[       OK ] LSTMTrainerTest.TestSquashed (23834 ms)
[       OK ] LSTMTrainerTest.TestSquashed (23708 ms)

# clang, kernel options https://make-linux-fast-again.com/, configure --disable-shared
[       OK ] LSTMTrainerTest.TestSquashed (22844 ms)
[       OK ] LSTMTrainerTest.TestSquashed (22963 ms)

So with a recent Linux kernel, "optimized" kernel options no longer seem to have an effect on the performance.
Nor does OpenMP make that training test faster; it even has a huge negative effect because it consumes much more CPU time:

# clang, kernel options https://make-linux-fast-again.com/, configure --disable-shared --disable-openmp
time ./lstm_squashed_test
[...]
real	0m23.114s
user	0m23.049s
sys	0m0.064s

# clang, kernel options https://make-linux-fast-again.com/, configure --disable-shared
time ./lstm_squashed_test
[...]
real	0m22.972s
user	1m31.495s
sys	0m0.308s

Using -O3 has no effect in my test, but adding -ffast-math increases the performance further:

# clang, configure --disable-shared --disable-openmp
[       OK ] LSTMTrainerTest.TestSquashed (21793 ms)

@amitdo
Collaborator

amitdo commented Dec 26, 2021

For OpenMP, you can try to limit the number of threads it uses to n_cpu_cores - 1.

Edit: With your CPU, you can try to limit it to a small number of threads, say 3, and then increase/decrease the number of threads.

@stweil
Member

stweil commented Dec 26, 2021

The test was running on a CPU with 24 cores. Using more than one core always produces a huge waste of CPU time.

# OMP_THREAD_LIMIT=1
real	0m25.105s
user	0m25.048s
sys	0m0.056s

# OMP_THREAD_LIMIT=2
real	0m25.637s
user	0m51.032s
sys	0m0.188s

# OMP_THREAD_LIMIT=3
real	0m23.279s
user	1m9.493s
sys	0m0.288s

# OMP_THREAD_LIMIT=4 or larger
real	0m23.008s
user	1m31.521s
sys	0m0.348s

@wollmers

Using more than 1 CPU in the same address space always has coordination overhead, and more than ~4 is a complete waste. Boxes with 24 CPUs are made more for running VMs: something like 2 x 6C/6T serving 24 VMs and 400 websites works (with disk I/O as the bottleneck).

For tasks at 100% CPU I would first profile them to find hotspots or low-hanging fruit. Maybe change to the much faster TensorFlow. Are there benchmarks showing how much faster TensorFlow is?

Tuning the code itself is more time consuming, and in the case of well-crafted code you can maybe gain something in the range of 10%.

@zdenop
Contributor

zdenop commented Dec 27, 2021

@amitdo: what about creating a wiki page related to speed? IMO it would be more appropriate than discussing/updating a 5-year-old thread...

@amitdo
Collaborator

amitdo commented Dec 27, 2021

@zdenop,

Wiki page or a page in tessdoc?

Benchmarks?
Performance comparison?

@zdenop
Contributor

zdenop commented Jan 9, 2022

I started https://github.com/tesseract-ocr/tessdoc/blob/main/Benchmarks.md

Still missing several tests (4.1.3 with AVX, -c tessedit_do_invert=0, maybe different OEMs, OCR quality...)

@amitdo
Collaborator

amitdo commented Jan 9, 2022

Thanks Zdenko.

@amitdo
Collaborator

amitdo commented Jan 9, 2022

Conclusions:

  • OpenMP in Tesseract is very inefficient.
  • Text recognition: 5.01, using a fast LSTM model on a CPU that supports AVX2 and without OpenMP, is faster than 3.05, which uses the legacy engine.

@Freredaran

Freredaran commented Feb 18, 2023

@stweil

If many pages have to be processed, it is better to use single threaded Tesseract and run several Tesseract processes in parallel.

Same here. After updating to Ubuntu 22.04, gImageReader became incredibly slow for me. Dev manisandro was very helpful and led me to a quick-and-dirty CLI solution for running on a single thread. It works wonderfully for me.
In a terminal, type:

export OMP_THREAD_LIMIT=1

If you want to check that you actually are running on one thread, type:

echo $OMP_THREAD_LIMIT

Then run gImageReader:

gimagereader-gtk

Et voilà :o)
