RFC: Improved performance for OCR with tessdata_best models #2106
Conversation
This is an early version of new code to speed up recognition and training. In our tests, the time for OCR with tessdata_best models was typically reduced to 75% of the original time while the OCR output remained unchanged. Still missing:
The AVX implementation uses code from @RRZE-HPC, who released it under a BSD license. I'm afraid that it is not compatible with Tesseract's Apache-2.0 license. @RRZE-HPC, could we reuse your code under Apache-2.0, too? Comments on the code and performance results comparing OCR with double and float dot products are welcome.
src/arch/dotproductavx.cpp (outdated)
@@ -14,6 +14,9 @@
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.
+//
+// Copyright (c) 2017, RRZE-HPC Erlangen
Do you really want to push your licence? :)
Did you read my comment?
This pull request was sent as a request for comments (RFC) and is not ready to be merged. If we don't get a compatible license from RRZE-HPC Erlangen, we can rewrite that small part of the code.
Ping @RRZE-HPC because of the license issue. Could we reuse your code under Apache-2.0, too?
try pinging individuals...
That's only a detail (AVX dot product with float) of this pull request. I can even talk to the people in Erlangen.
For SSE we will have to provide our own code (maybe based on the assembler code generated by a highly optimizing compiler), and if needed we could replace the AVX code, too.
More important would be feedback on the performance on various platforms and whether using float instead of double changes the OCR results (it did not in our tests).
Noah is currently working on using float for the training part, too, so hopefully the training process can be accelerated a lot in the future.
FYI: there are some people who intend to implement CUDA for Tesseract...
I expect that CUDA or OpenCL implementations will also be faster with float instead of double, so they can benefit from the new code as well.
Tesseract is already prepared to build with TensorFlow, which also supports CUDA.
@noahmetzger @stweil: to push this forward, do you have a test case so that builders can try it easily?
The BSD license includes this clause:

Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
That would be a new requirement for all binary distributions of Tesseract. I don't think that is a technical problem, but it makes binary distributions more complicated because they are forced to ship two license files.
Just run Tesseract with LSTM and a "best" model on some of your images, first with the normal dot product (double, the default), then with the new float dot product, and compare the timing and the OCR output. Noah plans to extend the implementation next week; then training can also use the faster float dot product.
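For example, a comparison could look like this (hypothetical image name; --oem 1 selects the LSTM engine, and the configuration parameter is the one introduced by this pull request):

time tesseract phototest.tif out --oem 1 -l eng
time tesseract phototest.tif out --oem 1 -l eng -c dotproduct_kahan_float_mode=true

Comparing the two out.txt files then shows whether the float dot product changes the OCR output.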
Our test results:

Note: compare only the seconds between the same runs, because different architectures and docker/native setups use different parallel settings etc.

Windows 10 / Ryzen 7 1700X / parallel execution

Windows 10 / i7-2600 CPU

Others (preliminary): Amazon Lambda - the performance slightly decreased (less than 10%) with kahan on. Needs more investigation.
Thank you, @vidiecan, for these first public timing results.
This seems interesting. Any update on this?
Force-pushed 1dd6361 to 4bd76c1
Force-pushed 1653287 to e9bf2f1
I updated the pull request (merge conflicts fixed).
@stweil Have changes been made for training?
Have the unlvtests been run with both options to compare the results?
Noah has published some initial work for faster training in a branch, but is still working on it. I recently rebased and improved the current pull request and am now running the UNLV tests. It looks like the Wiki needs an update, and I also had to fix the test repository.
@stweil I have updated the wiki page to point to the README for instructions to run the unlvtests. Please improve the instructions in https://github.com/tesseract-ocr/test/blob/master/unlvtests/README.md after your tests. How do the results look for the kahan on/off configurations of the unlvtests?
I tested 15 images with Devanagari script. The results for both kahan on/off seem similar.

15 files - 1 time

15 files - 10 times
Does the tested machine have AVX?
Linux tesseract-ocr 4.4.0-137-generic #163-Ubuntu SMP Mon Sep 24 13:14:57 UTC 2018

It is a Power8 little-endian VM from osuosl.org that I use for testing Tesseract. I don't think it has AVX or SSE.
So in both of your tests, run on the Power8 machine, plain C++ code is used, without SIMD acceleration.
A feature request for Noah and Stefan: after you finish working on the float training code, consider adding a compile-time option to disable all the double code, keeping only the float+Kahan code path.
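Such an option could look roughly like this hypothetical sketch (FAST_FLOAT and TFloat are made-up names, not from this pull request):

// Hypothetical compile-time switch: build only one scalar type.
#if defined(FAST_FLOAT)
typedef float TFloat;   // keep only the float+Kahan code path
#else
typedef double TFloat;  // keep the traditional double code path
#endif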
Sorry, I accidentally removed all commits when I wanted to rebase this pull request. Waiting for @noahmetzger, who can fix this on Monday.
@Shreeshrii, I have now also run some tests on Powerpc64. OpenMP seems to be nearly useless on that platform, but the performance is improved a lot by unrolling the loop in the dot product functions:
The unrolled loop looks like the example below.
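A minimal sketch of such an unrolled dot product, assuming four independent partial sums (hypothetical function name; the exact code in the pull request may differ):

static double DotProductUnrolled(const double* u, const double* v, int n) {
  // Four independent accumulators break the dependency chain of a single
  // sum variable, so the CPU can keep several multiply-adds in flight.
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  int i;
  for (i = 0; i + 4 <= n; i += 4) {
    s0 += u[i] * v[i];
    s1 += u[i + 1] * v[i + 1];
    s2 += u[i + 2] * v[i + 2];
    s3 += u[i + 3] * v[i + 3];
  }
  for (; i < n; ++i) {
    s0 += u[i] * v[i];  // remaining elements
  }
  return s0 + s1 + s2 + s3;
}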
@stweil Thanks. Is that something that could be implemented based on the architecture?
Maybe. I was curious and also did the same test on ARM, where the unrolled loop neither harmed nor improved the performance. Perhaps the compiler option -funroll-loops would get the same effect with the current code. I'll test that today.
-c dotproduct=FUNCTION where FUNCTION can be one of these values:
* auto - selection based on detected hardware (default)
* generic - generic C++ code with default compiler options
* native - C++ code optimized for the build host

Should I be using native instead of the default value? Would that make a difference?
Yes, it can make a difference, but again that depends on the architecture, so you have to try all variants. Depending on the host / compiler combination, …
@stweil, …
Yes, that's how the optimized code for AVX computes it. Indeed it seems to improve the timing even more for x86_64, so it becomes difficult to tell whether the special AVX code is still faster. I also tested unrolling with 8 instead of 4 products per loop iteration, but that did not change the time in my test. 2 products per loop already improve the performance, but 4 products per loop were better for x86_64.
The 4-variable method takes advantage of out-of-order execution support in the CPU.
The latest code is at https://github.com/stweil/tesseract/tree/floatModeRecognition. I added an implementation for SSE with float. Timing results on Linux for test/testing/phototest.tif:
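An SSE dot product for float values could look roughly like this sketch (hypothetical function name; the actual code in the floatModeRecognition branch may differ):

#include <immintrin.h>

static float DotProductSSEFloat(const float* u, const float* v, int n) {
  __m128 sum = _mm_setzero_ps();
  int i;
  for (i = 0; i + 4 <= n; i += 4) {
    // multiply 4 float pairs at once and accumulate
    sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(u + i), _mm_loadu_ps(v + i)));
  }
  float buf[4];
  _mm_storeu_ps(buf, sum);  // spill the 4 lanes for a horizontal sum
  float result = buf[0] + buf[1] + buf[2] + buf[3];
  for (; i < n; ++i) {
    result += u[i] * v[i];  // remaining elements
  }
  return result;
}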
Pull request #2106 has a new implementation of the AVX dot product for vectors of double values which uses unrolling with out-of-order execution.
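Such an AVX implementation for double could look roughly like this sketch (hypothetical function name; unrolled to two independent 256-bit accumulators so that eight products are in flight per iteration; the real code may differ):

#include <immintrin.h>

static double DotProductAVXDouble(const double* u, const double* v, int n) {
  __m256d sum0 = _mm256_setzero_pd();
  __m256d sum1 = _mm256_setzero_pd();
  int i;
  for (i = 0; i + 8 <= n; i += 8) {
    sum0 = _mm256_add_pd(sum0, _mm256_mul_pd(_mm256_loadu_pd(u + i),
                                             _mm256_loadu_pd(v + i)));
    sum1 = _mm256_add_pd(sum1, _mm256_mul_pd(_mm256_loadu_pd(u + i + 4),
                                             _mm256_loadu_pd(v + i + 4)));
  }
  double buf[4];
  _mm256_storeu_pd(buf, _mm256_add_pd(sum0, sum1));  // horizontal sum
  double result = buf[0] + buf[1] + buf[2] + buf[3];
  for (; i < n; ++i) {
    result += u[i] * v[i];  // remaining elements
  }
  return result;
}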
Yes, thank you.
What happened to this double to float conversion effort? The training part of this conversion is more interesting than the inference part. For inference we have intsimdmatrixavx2.
I agree. As far as I know, Tesseract is the only OCR software which uses double. For the inference part, our results with float showed unchanged OCR output at clearly reduced runtime. One of the challenges (which Noah did not address) is improving the class structure to support both float and double. A simpler way to get float would be a compile-time option.
Yes, I believe it's the right way to go. BTW, clstm has this option. |
The improved performance is achieved by using float instead of double for the dot product. To minimize the accuracy loss, we use the Kahan summation algorithm. By default, double is used as before. To activate float, use the parameter -c dotproduct_kahan_float_mode=true.
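A float dot product with Kahan compensation could look roughly like this sketch (hypothetical function name; the actual code in the pull request may differ):

static float DotProductKahanFloat(const float* u, const float* v, int n) {
  float sum = 0.0f;
  float c = 0.0f;  // running compensation for lost low-order bits
  for (int i = 0; i < n; ++i) {
    float y = u[i] * v[i] - c;
    float t = sum + y;  // low-order bits of y are lost in this addition ...
    c = (t - sum) - y;  // ... and recovered into c for the next iteration
    sum = t;
  }
  return sum;
}

Note that such code must not be compiled with -ffast-math, because aggressive floating-point reassociation would optimize the compensation away.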