
RFC: Improved performance for OCR with tessdata_best models #2106

Closed

Conversation

noahmetzger
Contributor

The improved performance is achieved by using float instead of double for the dot product. To minimize the accuracy loss, we use the Kahan summation algorithm. By default, double is used as before. To activate float, use the parameter -c dotproduct_kahan_float_mode=true.
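
For illustration, a Kahan-compensated float dot product can look roughly like this (a sketch only, not the exact code in this pull request):

// Sketch: float dot product with Kahan compensation to limit the
// accumulation error of single-precision summation.
float DotProductKahanFloat(const float* u, const float* v, int n) {
  float total = 0.0f;
  float comp = 0.0f;  // running compensation for lost low-order bits
  for (int k = 0; k < n; ++k) {
    float y = u[k] * v[k] - comp;  // apply the compensation
    float t = total + y;           // low-order bits of y may be lost here
    comp = (t - total) - y;        // recover what was lost
    total = t;                     // next iteration corrects with comp
  }
  return total;
}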

@stweil
Member

stweil commented Dec 5, 2018

This is an early version of new code to speed up recognition and training. In our tests the time for OCR with tessdata_best models was typically reduced to 75 % of the original time while the OCR output remained unchanged.

Still missing:

  • accelerated training using float+Kahan
  • support for SSE (the current code uses slower C++ code as a fallback)

The AVX implementation uses code from @RRZE-HPC, who released it under a BSD license. I'm afraid that it is not compatible with Tesseract's Apache-2.0 license. @RRZE-HPC, could we reuse your code under Apache-2.0, too?

Comments on the code and performance results comparing OCR with double and float dot product are welcome.

@@ -14,6 +14,9 @@
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
// Copyright (c) 2017, RRZE-HPC Erlangen

@syzer syzer Dec 13, 2018


Do you really want to push your licence? :)

Member

@stweil stweil Dec 13, 2018


Did you read my comment?

Member


This pull request was sent as a request for comments (RFC) and is not ready to get merged. If we don't get a compatible license from RRZE-HPC Erlangen, we can rewrite that small part of the code.

Member


Ping @RRZE-HPC because of the license issue. Could we reuse your code under Apache-2.0, too?

Collaborator


Try pinging individuals...

Member


That's only a detail (AVX dot product with float) of this pull request. I can even talk to the people in Erlangen.

For SSE we will have to provide our own code (maybe based on the assembler code generated by a highly optimizing compiler), and if needed we could replace the AVX code, too.

More important would be feedback on the performance on various platforms and whether using float instead of double changes the OCR results (it did not in our tests).

Noah is currently working on using float for the training part, too, so hopefully the training process can be accelerated a lot in the future.

@zdenop
Contributor

zdenop commented Dec 18, 2018

FYI: there are some people who intend to implement CUDA for Tesseract...

@stweil
Member

stweil commented Dec 18, 2018

I expect that CUDA or OpenCL will be faster with float instead of double, too, so such implementations can profit from the new code as well.

@stweil
Member

stweil commented Jan 3, 2019

FYI: there are some people who intend to implement CUDA for Tesseract...

Tesseract is already prepared to build with Tensorflow, which also supports CUDA.
I tried such a build, but it requires a huge number of hacks to fix dependencies, and so far I have not finished that.

@zdenop
Contributor

zdenop commented Jan 6, 2019

@noahmetzger @stweil: to push this forward, do you have a test case, so builders can use it for testing easily?

@stweil
Member

stweil commented Jan 6, 2019

The BSD license includes this clause:

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

That would be a new requirement for all binary distributions of Tesseract. I don't think that is a technical problem, but I think it makes binary distributions more complicated as they are forced to add two license files.

@stweil
Member

stweil commented Jan 6, 2019

do you have a test case?

Just run Tesseract with LSTM and a "best" model on some of your images, first with the normal dot product (-c dotproduct_kahan_float_mode=false or no such config variable), then with the float dot product (-c dotproduct_kahan_float_mode=true), and compare the time needed (float should be faster) and the OCR result (should be identical).

Noah plans to extend the implementation next week – then training can also use the faster float dot product.

@vidiecan

Our test results:

  • measured the absolute running times, changing only the Kahan on/off configuration between runs;
  • tested on a set of tens of different images;
  • our application is more complex, and the LSTM OCR accounts for "only" roughly 70% of the execution time.

Note: compare timings only within the same setup, because different architectures and Docker/native environments have different parallel settings etc.

Windows 10 / Ryzen 7 1700X / parallel execution

native run (MSVC) kahan on/off: 250 vs. 470 seconds
docker ubuntu (GCC) kahan on/off: 400 vs. 580 seconds

Windows 10 / i7-2600 CPU

native kahan (MSVC) on/off: 600 vs. 1020 seconds

Others (preliminary)

Amazon Lambda - the performance slightly decreased (less than 10%) with kahan on. Needs more investigation.
Shippable builds - the performance decreased ~15% with kahan on. Needs more investigation.

@stweil
Member

stweil commented Jan 16, 2019

Thank you, @vidiecan, for these first public timing results.

@Fahad-Alsaidi

This seems interesting. Any update on this?

@noahmetzger noahmetzger force-pushed the master branch 2 times, most recently from 1dd6361 to 4bd76c1 Compare February 5, 2019 16:20
@stweil stweil force-pushed the master branch 2 times, most recently from 1653287 to e9bf2f1 Compare February 12, 2019 17:32
@stweil
Member

stweil commented Feb 12, 2019

I updated the pull request (merge conflicts fixed).

@Shreeshrii
Collaborator

Noah plans to extend the implementation next week – then training can also use the faster float dot product.

@stweil Have changes been made for training?

Comments on the code and performance results comparing OCR with double and float dot product are welcome.

Have the unlvtests been run with both options to compare the results?

@stweil
Member

stweil commented Feb 14, 2019

Noah has published some initial work for faster training in a branch, but is still working on it.

I recently rebased and improved the current pull request and am now running the UNLV tests. It looks like the Wiki needs an update, and I also had to fix the test repository.

@Shreeshrii
Collaborator

@stweil I have updated the wiki page to point to README for instructions to run the unlvtests. Please improve the instructions in https://github.com/tesseract-ocr/test/blob/master/unlvtests/README.md after your tests.

How do the results look for kahan on/off configuration for unlvtests?

@Shreeshrii
Collaborator

I tested 15 images with Devanagari script using tessdata_best/san.traineddata and OMP_THREAD_LIMIT=1. I did 2 runs, first processing the files once and second processing them 10 times in a loop.

The results for both kahan on/off seem similar.

15 files - 1 time

*******************************
Linux tesseract-ocr 4.4.0-137-generic #163-Ubuntu SMP Mon Sep 24 13:14:57 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
*******************************
tesseract 4.0.0-313-gfc47
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
*******************************
*******************************  -c dotproduct_kahan_float_mode=false
*******************************

real    1m28.643s
user    1m27.324s
sys     0m0.628s
*******************************
*******************************  -c dotproduct_kahan_float_mode=true
*******************************

real    1m29.641s
user    1m28.276s
sys     0m0.596s
*******************************
DONE

15 files - 10 times

*******************************
Linux tesseract-ocr 4.4.0-137-generic #163-Ubuntu SMP Mon Sep 24 13:14:57 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
*******************************
tesseract 4.0.0-313-gfc47
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
*******************************
  -c dotproduct_kahan_float_mode=false
*******************************

real    14m45.163s
user    14m32.664s
sys     0m5.380s
*******************************
  -c dotproduct_kahan_float_mode=true
*******************************

real    14m59.043s
user    14m43.796s
sys     0m5.280s
*******************************
DONE

@amitdo
Collaborator

amitdo commented Feb 16, 2019

Does the tested machine have AVX?

@Shreeshrii
Collaborator

Shreeshrii commented Feb 16, 2019

Linux tesseract-ocr 4.4.0-137-generic #163-Ubuntu SMP Mon Sep 24 13:14:57 UTC 2018
ppc64le ppc64le ppc64le GNU/Linux

It is a Power8 little-endian VM from osuosl.org that I use for testing Tesseract. I don't think it has AVX or SSE.

This guide provides information to help with porting C/C++ applications that both use GNU/GCC intrinsic functions and are based on the Intel SSE instruction set for x86, to the VMX (Vector Multimedia eXtension) or the VSX (Vector Scalar eXtension) instruction set for PowerPC®. This guide considers and compares Intel Streaming SIMD Extensions (SSE), Intel MMX, and AMD 3DNow! instruction sets with AltiVec with regards to GNU/GCC 4.8.2 built-ins and data types, available with the IBM Advance Toolchain for PowerLinux 7.0-1.

From https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Intel%20SSE%20to%20PowerPC%20AltiVec%20migration

@amitdo
Collaborator

amitdo commented Feb 16, 2019

So in both of your tests, run inside a Power8 machine, plain C++ code is used, without SIMD acceleration.

@amitdo
Collaborator

amitdo commented Feb 16, 2019

A feature request for Noah and Stefan:

After you finish working on the float training code, consider adding a compile-time option to disable all the 'double' code, keeping only the 'float+Kahan' code path.

@stweil stweil closed this Feb 17, 2019
@ghost ghost removed the review label Feb 17, 2019
@stweil
Member

stweil commented Feb 17, 2019

Sorry, I accidentally removed all commits when I wanted to rebase this pull request. Waiting for @noahmetzger who is needed to fix this on Monday.

@amitdo
Collaborator

amitdo commented Feb 17, 2019

@stweil
Member

stweil commented Feb 17, 2019

@Shreeshrii, I have now also run some tests on Powerpc64. OpenMP seems to be nearly useless on that platform, but the performance is improved a lot by unrolling the loop in the dot product functions:

# Double dot product (git master).
real	1m42.841s
user	1m42.152s
sys	0m0.670s

# Double dot product with unrolled loop.
real	1m9.087s
user	1m8.586s
sys	0m0.486s

# Float dot product with unrolled loop.
real	1m7.798s
user	1m7.183s
sys	0m0.600s

The unrolled loop looks like this example:

// Computes and returns the dot product of the two n-vectors u and v.
double DotProductNative(const double* u, const double* v, int n) {
  double total = 0.0;
  const unsigned div = n / 4;
  const unsigned rem = n % 4;
  for (unsigned k = 0; k < div; ++k) {
    total += *u++ * *v++;
    total += *u++ * *v++;
    total += *u++ * *v++;
    total += *u++ * *v++;
  }
  for (unsigned k = 0; k < rem; ++k) {
    total += *u++ * *v++;
  }
  return total;
}

@Shreeshrii
Collaborator

@stweil Thanks. Is that something that could be implemented based on arch?

@stweil
Member

stweil commented Feb 18, 2019

Maybe. I was curious and also did the same test on ARM, where the unrolled loop neither harmed nor improved the performance. Perhaps the compiler option -funroll-loops would get the same effect with the current code. I'll test that today.

@Shreeshrii
Collaborator

Shreeshrii commented Feb 18, 2019 via email

@stweil
Member

stweil commented Feb 18, 2019

Yes, it can make a difference, but again that depends on the architecture, so you have to try all variants. Depending on the host / compiler combination, generic and native can be identical. I just made a test on x86_64 with the unrolled loop code. There -c dotproduct=native is faster than -c dotproduct=sse and nearly as fast as -c dotproduct=avx. When I used -funroll-loops on the original code, the loop was unrolled, too, but the resulting code was slower.

@amitdo
Collaborator

amitdo commented Feb 18, 2019

@stweil,
Try using t1, t2, t3, t4 instead of total +=, and then do total += t1 + t2 + t3 + t4.

@stweil
Member

stweil commented Feb 18, 2019

Yes, that's how the optimized code for AVX computes it. Indeed it seems to improve the timing even more for x86_64, so it becomes difficult to tell whether the special AVX code is still faster.

I also tested unrolling with 8 instead of 4 products per loop iteration, but that did not change the time in my test. 2 products per loop already improves the performance, but 4 products per loop was better for x86_64.
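
For illustration, an unrolled dot product with four independent accumulators (as suggested above) might look roughly like this sketch:

// Sketch: unrolled dot product with four independent accumulators.
// Separate accumulators break the dependency chain on a single 'total'
// variable and let the CPU overlap the multiply-adds.
double DotProductUnrolled4(const double* u, const double* v, int n) {
  double t0 = 0.0, t1 = 0.0, t2 = 0.0, t3 = 0.0;
  int k = 0;
  for (; k + 4 <= n; k += 4) {
    t0 += u[k] * v[k];
    t1 += u[k + 1] * v[k + 1];
    t2 += u[k + 2] * v[k + 2];
    t3 += u[k + 3] * v[k + 3];
  }
  double total = t0 + t1 + t2 + t3;
  for (; k < n; ++k) {  // remaining elements
    total += u[k] * v[k];
  }
  return total;
}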

@amitdo
Collaborator

amitdo commented Feb 18, 2019

The 4-variable method takes advantage of out-of-order execution support in the CPU.

@stweil
Member

stweil commented Feb 19, 2019

The latest code is at https://github.com/stweil/tesseract/tree/floatModeRecognition. I added an implementation for SSE with float. Timing results on Linux for test/testing/phototest.tif:

  • fast: 1.45 s
  • best: 2.53 s (SSE with double)
  • best: 1.75 s (SSE with float)
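
For illustration, an SSE float dot product might look roughly like this sketch (not the actual code in that branch):

#include <xmmintrin.h>  // SSE intrinsics

// Sketch: float dot product using 128-bit SSE registers (4 floats at a time).
float DotProductSSEFloat(const float* u, const float* v, int n) {
  __m128 sum = _mm_setzero_ps();
  int k = 0;
  for (; k + 4 <= n; k += 4) {
    __m128 a = _mm_loadu_ps(u + k);
    __m128 b = _mm_loadu_ps(v + k);
    sum = _mm_add_ps(sum, _mm_mul_ps(a, b));
  }
  float buf[4];
  _mm_storeu_ps(buf, sum);  // horizontal sum of the four partial sums
  float total = buf[0] + buf[1] + buf[2] + buf[3];
  for (; k < n; ++k) {  // remaining elements
    total += u[k] * v[k];
  }
  return total;
}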

@stweil
Member

stweil commented Feb 20, 2019

Pull request #2106 has a new implementation of the AVX dot product for vectors of double values which uses unrolling with out-of-order execution.
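
For illustration, such an AVX double dot product with two independent 256-bit accumulators per iteration might look roughly like this sketch (illustrative only, not the code in that pull request):

#include <immintrin.h>  // AVX intrinsics

// Sketch: double dot product using 256-bit AVX registers (4 doubles at a
// time), unrolled with two independent accumulators.
double DotProductAVXDouble(const double* u, const double* v, int n) {
  __m256d s0 = _mm256_setzero_pd();
  __m256d s1 = _mm256_setzero_pd();
  int k = 0;
  for (; k + 8 <= n; k += 8) {
    s0 = _mm256_add_pd(s0, _mm256_mul_pd(_mm256_loadu_pd(u + k),
                                         _mm256_loadu_pd(v + k)));
    s1 = _mm256_add_pd(s1, _mm256_mul_pd(_mm256_loadu_pd(u + k + 4),
                                         _mm256_loadu_pd(v + k + 4)));
  }
  double buf[4];
  _mm256_storeu_pd(buf, _mm256_add_pd(s0, s1));  // combine and spill partial sums
  double total = buf[0] + buf[1] + buf[2] + buf[3];
  for (; k < n; ++k) {  // remaining elements
    total += u[k] * v[k];
  }
  return total;
}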

@amitdo
Collaborator

amitdo commented Feb 20, 2019

Pull request #2106 has a new implementation of the AVX dot product for vectors of double values

You meant PR #2257...

@stweil
Member

stweil commented Feb 20, 2019

Yes, thank you.

@amitdo
Collaborator

amitdo commented Nov 23, 2020

@stweil,

What happened with this double to float conversion effort?

The training part of this conversion is more interesting than the inference part. For inference we have intsimdmatrixavx2.

@stweil
Member

stweil commented Nov 24, 2020

I agree. As far as I know, Tesseract is the only OCR software which uses double; all others use float. @noahmetzger implemented the inference part, but did not have enough time for the training part. He is now working on other projects.

For the inference part our results with float did not differ from double, even without the Kahan algorithm, which we first used to reduce errors in the dot product. So I expect that float would work for training as well and nearly double the speed, which is highly desirable.

One of the challenges (which Noah did not address) is improving the class structure to support int, float and double in separate classes or class templates. Currently Tesseract has several classes which use int_mode_ to determine how they are used. The float implementation simply added float components to those classes. I think that makes the classes ugly and unnecessarily large (which costs memory and maybe also performance). Therefore I'd prefer new classes without the need for int_mode_.

A simpler way to get float support would stick to the current classes, but add a configuration option which allows building either a float or a double Tesseract binary. That would only require new implementations of serialization and deserialization, and of course of the AVX2 dot product. Tesseract already has support for reading an "old" traineddata file format using float, so maybe Ray started with float (I have no idea why he used double in the released code).
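
For illustration, such a build-time switch could look roughly like this sketch (the macro name FAST_FLOAT and the alias TFloat are made up for this example, not existing Tesseract code):

// Sketch: choose the floating-point type at compile time.
#if defined(FAST_FLOAT)
using TFloat = float;    // faster, half the memory per weight
#else
using TFloat = double;   // current behaviour
#endif

// Network code would then use TFloat instead of a hard-coded double:
TFloat DotProductNative(const TFloat* u, const TFloat* v, int n) {
  TFloat total = 0;
  for (int k = 0; k < n; ++k) {
    total += u[k] * v[k];
  }
  return total;
}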

@amitdo
Collaborator

amitdo commented Nov 24, 2020

A simpler way to get float support would stick to the current classes, but add a configuration option which allows building either a float or a double

Yes, I believe it's the right way to go. BTW, clstm has this option.

@amitdo amitdo added the RFC label Mar 21, 2021