
RFC: Improved performance for OCR with tessdata_best models #2106

Closed

Conversation

noahmetzger
Contributor

The improved performance is achieved by using float instead of double for the dot product. To minimize the accuracy loss, we use the Kahan summation algorithm. By default, double is used as before. To activate float, use the parameter -c dotproduct_kahan_float_mode=true.
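
For illustration, a Kahan-compensated float dot product can look roughly like this (a sketch only, not the exact code in this pull request):

// Sketch: float dot product with Kahan compensation to limit the
// accumulation error of single-precision summation.
float DotProductKahanFloat(const float* u, const float* v, int n) {
  float total = 0.0f;
  float comp = 0.0f;  // running compensation for lost low-order bits
  for (int k = 0; k < n; ++k) {
    float y = u[k] * v[k] - comp;  // apply the compensation
    float t = total + y;           // low-order bits of y may be lost here
    comp = (t - total) - y;        // recover what was lost
    total = t;                     // next iteration corrects with comp
  }
  return total;
}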

@stweil
Member

stweil commented Dec 5, 2018

This is an early version of new code to speed up recognition and training. In our tests the time for OCR with tessdata_best models was typically reduced to 75 % of the original time while the OCR output remained unchanged.

Still missing:

  • accelerated training using float+Kahan
  • support for SSE (the current code uses slower C++ code as a fallback)

The AVX implementation uses code from @RRZE-HPC, who released it under a BSD license. I'm afraid that it is not compatible with Tesseract's Apache-2.0 license. @RRZE-HPC, could we reuse your code under Apache-2.0, too?

Comments on the code and performance results comparing OCR with double and float dot product are welcome.

@@ -14,6 +14,9 @@
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
// Copyright (c) 2017, RRZE-HPC Erlangen

@syzer syzer Dec 13, 2018


Do you really want to push your licence? :)

Member

@stweil stweil Dec 13, 2018


Did you read my comment?

Member


This pull request was sent as a request for comments (RFC) and is not ready to get merged. If we don't get a compatible license from RRZE-HPC Erlangen, we can rewrite that small part of the code.

Member


Ping @RRZE-HPC because of the license issue. Could we reuse your code under Apache-2.0, too?

Collaborator


Try pinging individuals...

Member


That's only a detail (AVX dot product with float) of this pull request. I can even talk to the people in Erlangen.

For SSE we will have to provide our own code (maybe based on the assembler code generated by a highly optimizing compiler), and if needed we could replace the AVX code, too.

More important would be feedback on the performance on various platforms and whether using float instead of double changes the OCR results (it did not in our tests).

Noah is currently working on using float for the training part, too, so hopefully the training process can be accelerated a lot in the future.

@zdenop
Contributor

zdenop commented Dec 18, 2018

FYI: there are some people who intend to implement CUDA for Tesseract...

@stweil
Member

stweil commented Dec 18, 2018

I expect that CUDA or OpenCL will be faster with float instead of double, too, so such implementations can profit from the new code as well.

@stweil
Member

stweil commented Jan 3, 2019

FYI: there are some people who intend to implement CUDA for Tesseract...

Tesseract is already prepared to build with Tensorflow, which also supports CUDA.
I tried such a build, but it requires a huge number of hacks to fix dependencies, and so far I have not finished that.

@zdenop
Contributor

zdenop commented Jan 6, 2019

@noahmetzger @stweil: to push this forward, do you have a test case, so builders can use it for testing easily?

@stweil
Member

stweil commented Jan 6, 2019

The BSD license includes this clause:

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

That would be a new requirement for all binary distributions of Tesseract. I don't think that is a technical problem, but I think it makes binary distributions more complicated as they are forced to add two license files.

@stweil
Member

stweil commented Jan 6, 2019

do you have a test case?

Just run Tesseract with LSTM and a "best" model on some of your images, first with the normal dot product (-c dotproduct_kahan_float_mode=false or no such config variable), then with the float dot product (-c dotproduct_kahan_float_mode=true), and compare the time needed (float should be faster) and the OCR result (should be identical).

Noah plans to extend the implementation next week – then training can also use the faster float dot product.

@vidiecan

Our test results:

  • measured the absolute running times, changing only the Kahan on/off configuration between runs;
  • tested on a set of tens of different images;
  • our application is more complex, and the LSTM OCR accounts for "only" roughly 70% of the execution time.

Note: compare timings only within the same setup, because different architectures and Docker/native environments have different parallel settings etc.

Windows 10 / Ryzen 7 1700X / parallel execution

native run (MSVC) kahan on/off: 250 vs. 470 seconds
docker ubuntu (GCC) kahan on/off: 400 vs. 580 seconds

Windows 10 / i7-2600 CPU

native kahan (MSVC) on/off: 600 vs. 1020 seconds

Others (preliminary)

Amazon Lambda - the performance slightly decreased (less than 10%) with kahan on. Needs more investigation.
Shippable builds - the performance decreased ~15% with kahan on. Needs more investigation.

@stweil
Member

stweil commented Jan 16, 2019

Thank you, @vidiecan, for these first public timing results.

@Fahad-Alsaidi

This seems interesting. Any update on this?

@noahmetzger noahmetzger force-pushed the master branch 2 times, most recently from 1dd6361 to 4bd76c1 Compare February 5, 2019 16:20
@stweil stweil force-pushed the master branch 2 times, most recently from 1653287 to e9bf2f1 Compare February 12, 2019 17:32
@stweil
Member

stweil commented Feb 12, 2019

I updated the pull request (merge conflicts fixed).

@Shreeshrii
Collaborator

Noah plans to extend the implementation next week – then training can also use the faster float dot product.

@stweil Have changes been made for training?

Comments on the code and performance results comparing OCR with double and float dot product are welcome.

Have the unlvtests been run with both options to compare the results?

@stweil
Member

stweil commented Feb 14, 2019

Noah has published some initial work for faster training in a branch, but is still working on it.

I recently rebased and improved the current pull request and am now running the UNLV tests. It looks like the Wiki needs an update, and I also had to fix the test repository.

@Shreeshrii
Collaborator

@stweil I have updated the wiki page to point to README for instructions to run the unlvtests. Please improve the instructions in https://github.com/tesseract-ocr/test/blob/master/unlvtests/README.md after your tests.

How do the results look for kahan on/off configuration for unlvtests?

@Shreeshrii
Collaborator

I tested 15 images with Devanagari script using tessdata_best/san.traineddata and OMP_THREAD_LIMIT=1. I did 2 runs, first processing the files once and second processing them 10 times in a loop.

The results for both kahan on/off seem similar.

15 files - 1 time

*******************************
Linux tesseract-ocr 4.4.0-137-generic #163-Ubuntu SMP Mon Sep 24 13:14:57 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
*******************************
tesseract 4.0.0-313-gfc47
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
*******************************
*******************************  -c dotproduct_kahan_float_mode=false
*******************************

real    1m28.643s
user    1m27.324s
sys     0m0.628s
*******************************
*******************************  -c dotproduct_kahan_float_mode=true
*******************************

real    1m29.641s
user    1m28.276s
sys     0m0.596s
*******************************
DONE

15 files - 10 times

*******************************
Linux tesseract-ocr 4.4.0-137-generic #163-Ubuntu SMP Mon Sep 24 13:14:57 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
*******************************
tesseract 4.0.0-313-gfc47
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
*******************************
  -c dotproduct_kahan_float_mode=false
*******************************

real    14m45.163s
user    14m32.664s
sys     0m5.380s
*******************************
  -c dotproduct_kahan_float_mode=true
*******************************

real    14m59.043s
user    14m43.796s
sys     0m5.280s
*******************************
DONE

@amitdo
Collaborator

amitdo commented Feb 16, 2019

Does the tested machine have AVX?

@Shreeshrii
Collaborator

Shreeshrii commented Feb 16, 2019

Linux tesseract-ocr 4.4.0-137-generic #163-Ubuntu SMP Mon Sep 24 13:14:57 UTC 2018
ppc64le ppc64le ppc64le GNU/Linux

It is a Power8 little-endian VM from osuosl.org that I use for testing Tesseract. I don't think it has AVX or SSE.

This guide provides information to help with porting C/C++ applications that both use GNU/GCC intrinsic functions and are based on the Intel SSE instruction set for x86, to the VMX (Vector Multimedia eXtension) or the VSX (Vector Scalar eXtension) instruction set for PowerPC®. This guide considers and compares Intel Streaming SIMD Extensions (SSE), Intel MMX, and AMD 3DNow! instruction sets with AltiVec with regards to GNU/GCC 4.8.2 built-ins and data types, available with the IBM Advance Toolchain for PowerLinux 7.0-1.

From https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Intel%20SSE%20to%20PowerPC%20AltiVec%20migration

@amitdo
Collaborator

amitdo commented Feb 16, 2019

So in both of your tests, run inside a Power8 machine, plain C++ code is used, without SIMD acceleration.

@amitdo
Collaborator

amitdo commented Feb 16, 2019

A feature request for Noah and Stefan:

After you finish working on the float training code, consider adding a compile-time option to disable all the 'double' code, keeping only the 'float+Kahan' code path.

@stweil stweil closed this Feb 17, 2019
@ghost ghost removed the review label Feb 17, 2019
@stweil
Member

stweil commented Feb 17, 2019

Sorry, I accidentally removed all commits when I wanted to rebase this pull request. Waiting for @noahmetzger who is needed to fix this on Monday.

@amitdo
Collaborator

amitdo commented Feb 17, 2019

@stweil
Member

stweil commented Feb 17, 2019

@Shreeshrii, I have now also run some tests on Powerpc64. OpenMP seems to be nearly useless on that platform, but the performance is improved a lot by unrolling the loop in the dot product functions:

# Double dot product (git master).
real	1m42.841s
user	1m42.152s
sys	0m0.670s

# Double dot product with unrolled loop.
real	1m9.087s
user	1m8.586s
sys	0m0.486s

# Float dot product with unrolled loop.
real	1m7.798s
user	1m7.183s
sys	0m0.600s

The unrolled loop looks like this example:

// Computes and returns the dot product of the two n-vectors u and v.
double DotProductNative(const double* u, const double* v, int n) {
  double total = 0.0;
  const unsigned div = n / 4;
  const unsigned rem = n % 4;
  for (unsigned k = 0; k < div; ++k) {
    total += *u++ * *v++;
    total += *u++ * *v++;
    total += *u++ * *v++;
    total += *u++ * *v++;
  }
  for (unsigned k = 0; k < rem; ++k) {
    total += *u++ * *v++;
  }
  return total;
}

@Shreeshrii
Collaborator

@stweil Thanks. Is that something that could be implemented based on arch?

@stweil
Member

stweil commented Feb 18, 2019

Maybe. I was curious and also did the same test on ARM, where the unrolled loop neither harmed nor improved the performance. Perhaps the compiler option -funroll-loops would get the same effect with the current code. I'll test that today.

@Shreeshrii
Collaborator

Shreeshrii commented Feb 18, 2019 via email

@stweil
Member

stweil commented Feb 18, 2019

Yes, it can make a difference, but again that depends on the architecture, so you have to try all variants. Depending on the host / compiler combination, generic and native can be identical. I just made a test on x86_64 with the unrolled loop code. There -c dotproduct=native is faster than -c dotproduct=sse and nearly as fast as -c dotproduct=avx. When I used -funroll-loops on the original code, the loop was unrolled, too, but the resulting code was slower.

@amitdo
Collaborator

amitdo commented Feb 18, 2019

@stweil,
Try using t1, t2, t3, t4 instead of total +=, and then do total += t1 + t2 + t3 + t4.

@stweil
Member

stweil commented Feb 18, 2019

Yes, that's how the optimized code for AVX computes it. Indeed it seems to improve the timing even more for x86_64, so it becomes difficult to tell whether the special AVX code is still faster.

I also tested unrolling with 8 instead of 4 products per loop iteration, but that did not change the time in my test. 2 products per loop already improves the performance, but 4 products per loop was better for x86_64.
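
For illustration, an unrolled dot product with four independent accumulators (as suggested above) might look roughly like this sketch:

// Sketch: unrolled dot product with four independent accumulators.
// Separate accumulators break the dependency chain on a single 'total'
// variable and let the CPU overlap the multiply-adds.
double DotProductUnrolled4(const double* u, const double* v, int n) {
  double t0 = 0.0, t1 = 0.0, t2 = 0.0, t3 = 0.0;
  int k = 0;
  for (; k + 4 <= n; k += 4) {
    t0 += u[k] * v[k];
    t1 += u[k + 1] * v[k + 1];
    t2 += u[k + 2] * v[k + 2];
    t3 += u[k + 3] * v[k + 3];
  }
  double total = t0 + t1 + t2 + t3;
  for (; k < n; ++k) {  // remaining elements
    total += u[k] * v[k];
  }
  return total;
}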

@amitdo
Collaborator

amitdo commented Feb 18, 2019

The 4-variable method takes advantage of out-of-order execution support in the CPU.

@stweil
Member

stweil commented Feb 19, 2019

The latest code is at https://github.com/stweil/tesseract/tree/floatModeRecognition. I added an implementation for SSE with float. Timing results on Linux for test/testing/phototest.tif:

  • fast: 1.45 s
  • best: 2.53 s (SSE with double)
  • best: 1.75 s (SSE with float)
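
For illustration, an SSE float dot product might look roughly like this sketch (not the actual code in that branch):

#include <xmmintrin.h>  // SSE intrinsics

// Sketch: float dot product using 128-bit SSE registers (4 floats at a time).
float DotProductSSEFloat(const float* u, const float* v, int n) {
  __m128 sum = _mm_setzero_ps();
  int k = 0;
  for (; k + 4 <= n; k += 4) {
    __m128 a = _mm_loadu_ps(u + k);
    __m128 b = _mm_loadu_ps(v + k);
    sum = _mm_add_ps(sum, _mm_mul_ps(a, b));
  }
  float buf[4];
  _mm_storeu_ps(buf, sum);  // horizontal sum of the four partial sums
  float total = buf[0] + buf[1] + buf[2] + buf[3];
  for (; k < n; ++k) {  // remaining elements
    total += u[k] * v[k];
  }
  return total;
}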

@stweil
Member

stweil commented Feb 20, 2019

Pull request #2106 has a new implementation of the AVX dot product for vectors of double values which uses unrolling with out-of-order execution.
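
For illustration, such an AVX double dot product with two independent 256-bit accumulators per iteration might look roughly like this sketch (illustrative only, not the code in that pull request):

#include <immintrin.h>  // AVX intrinsics

// Sketch: double dot product using 256-bit AVX registers (4 doubles at a
// time), unrolled with two independent accumulators.
double DotProductAVXDouble(const double* u, const double* v, int n) {
  __m256d s0 = _mm256_setzero_pd();
  __m256d s1 = _mm256_setzero_pd();
  int k = 0;
  for (; k + 8 <= n; k += 8) {
    s0 = _mm256_add_pd(s0, _mm256_mul_pd(_mm256_loadu_pd(u + k),
                                         _mm256_loadu_pd(v + k)));
    s1 = _mm256_add_pd(s1, _mm256_mul_pd(_mm256_loadu_pd(u + k + 4),
                                         _mm256_loadu_pd(v + k + 4)));
  }
  double buf[4];
  _mm256_storeu_pd(buf, _mm256_add_pd(s0, s1));  // combine and spill partial sums
  double total = buf[0] + buf[1] + buf[2] + buf[3];
  for (; k < n; ++k) {  // remaining elements
    total += u[k] * v[k];
  }
  return total;
}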

@amitdo
Collaborator

amitdo commented Feb 20, 2019

Pull request #2106 has a new implementation of the AVX dot product for vectors of double values

You meant PR #2257...

@stweil
Member

stweil commented Feb 20, 2019

Yes, thank you.

@amitdo
Collaborator

amitdo commented Nov 23, 2020

@stweil,

What happened with this double to float conversion effort?

The training part of this conversion is more interesting than the inference part. For inference we have intsimdmatrixavx2.

@stweil
Member

stweil commented Nov 24, 2020

I agree. As far as I know, Tesseract is the only OCR software which uses double; all others use float. @noahmetzger implemented the inference part, but did not have enough time for the training part. He is now working on other projects.

For the inference part our results with float did not differ from double, even without the Kahan algorithm, which we first used to reduce errors in the dot product. So I expect that float would work for training as well and nearly double the speed, which is highly desirable.

One of the challenges (which Noah did not address) is improving the class structure to support int, float and double in separate classes or class templates. Currently Tesseract has several classes which use int_mode_ to determine how they are used. The float implementation simply added float components to those classes. I think that makes the classes ugly and unnecessarily large (which costs memory and maybe also performance). Therefore I'd prefer new classes without the need for int_mode_.

A simpler way to get float support would stick to the current classes, but add a configuration option which allows building either a float or a double Tesseract binary. That would only require new implementations of serialization and deserialization, and of course of the AVX2 dot product. Tesseract already has support for reading an "old" traineddata file format using float, so maybe Ray started with float (I have no idea why he used double in the released code).
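
For illustration, such a build-time switch could look roughly like this sketch (the macro name FAST_FLOAT and the alias TFloat are made up for this example, not existing Tesseract code):

// Sketch: choose the floating-point type at compile time.
#if defined(FAST_FLOAT)
using TFloat = float;    // faster, half the memory per weight
#else
using TFloat = double;   // current behaviour
#endif

// Network code would then use TFloat instead of a hard-coded double:
TFloat DotProductNative(const TFloat* u, const TFloat* v, int n) {
  TFloat total = 0;
  for (int k = 0; k < n; ++k) {
    total += u[k] * v[k];
  }
  return total;
}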

@amitdo
Collaborator

amitdo commented Nov 24, 2020

A simpler way to get float support would stick to the current classes, but add a configuration option which allows building either a float or a double

Yes, I believe it's the right way to go. BTW, clstm has this option.

@amitdo amitdo added the RFC label Mar 21, 2021