Tesseract 4 and 5 is about 100-150 times slower than 3 on my Linux system. #2611

Closed
ripefig opened this issue Aug 11, 2019 · 52 comments

Comments

@ripefig

ripefig commented Aug 11, 2019

Environment

  • Tesseract Version:

> tesseract -v

tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE

> tesseract-snap -v

tesseract 5.0.0-alpha-335-gae02
 leptonica-1.74.2
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

tesseract-ocr-eng : 1:4.00~git30-7274cfa-1

I used the training data from the Ubuntu repos for both tesseract and tesseract-snap, since no data is provided with the snap.

  • Platform:
    Operating System: Kubuntu 19.04
    KDE Plasma Version: 5.15.4
    KDE Frameworks Version: 5.56.0
    Qt Version: 5.12.2
    Kernel Version: 5.0.0-21-generic
    OS Type: 64-bit
    Processors: 4 × Intel® Core™ i7-4600U CPU @ 2.10GHz
    Memory: 11.6 GiB of RAM

Current Behavior:

It takes over a minute of 100% CPU load to scan an image (directly below) with two sentences:

[image: 62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png]

Results for Tesseract 4:
> time tesseract -l eng 62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1

Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    1m9.096s
user    3m7.484s
sys     0m0.335s

Tesseract 5:

> time tesseract-snap -l eng 62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1

Tesseract Open Source OCR Engine v5.0.0-alpha-335-gae02 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    1m13.585s
user    3m16.104s

I tried to OCR a one-page doc, but I had to exit the process. It would probably take one hour of full CPU load. Unfortunately I don't have Tesseract 3 to compare, but I remember using it in an OCR screenshotting script, where it felt as fast as regular copy and paste, so definitely under two seconds for this block of text.

Expected Behavior:

It shouldn't take this long to scan two sentences.

Suggested Fix

Disable multithreading by default until it's fixed.

ripefig changed the title from "Tesseract 4 and 5 is about 200 times slower than 3 on my Linux system." to "Tesseract 4 and 5 is about 100-200 times slower than 3 on my Linux system." on Aug 12, 2019
ripefig changed the title from "Tesseract 4 and 5 is about 100-200 times slower than 3 on my Linux system." to "Tesseract 4 and 5 is about 100-150 times slower than 3 on my Linux system." on Aug 12, 2019
@ripefig
Author

ripefig commented Aug 12, 2019

The solution is to set OMP_THREAD_LIMIT=1
Shouldn't multithreading be disabled by default until it's fixed?

#898
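For scripts that call Tesseract, the workaround can be applied per process instead of being exported globally. The following is a minimal Python sketch, assuming a `tesseract` binary on PATH; the helper names `limited_env` and `run_tesseract` are made up for this illustration and are not part of Tesseract.

```python
import os
import subprocess


def limited_env(threads=1, base=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = str(threads)
    return env


def run_tesseract(image, out_base, lang="eng", threads=1):
    """Run `tesseract -l LANG IMAGE OUT_BASE` with a capped OpenMP thread count.

    Assumes the `tesseract` executable is on PATH (hypothetical wrapper).
    """
    cmd = ["tesseract", "-l", lang, image, out_base]
    return subprocess.run(cmd, env=limited_env(threads), check=True)


# Example call (not executed here):
# run_tesseract("62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png", "1")
```

Only the child process sees the limit, so other OpenMP programs started from the same script are unaffected.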

@stweil
Member

stweil commented Aug 12, 2019

I cannot reproduce your timing results on a recent Debian system:

$ tesseract --version
tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE
$ time tesseract -l eng 62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png 1
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real	0m0,209s
user	0m0,497s
sys	0m0,024s

With OMP_THREAD_LIMIT=1, it takes a little longer:

$ time tesseract -l eng 62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png 1
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real	0m0,255s
user	0m0,247s
sys	0m0,008s

@Shreeshrii
Collaborator

Is the test image available somewhere? I would like to try it on a non-AVX system.

@stweil
Member

stweil commented Aug 12, 2019

Is the test image available somewhere? I would like to try it on a non-AVX system.

It's given in the initial report: https://user-images.githubusercontent.com/45201036/62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png.

Even without AVX it should not take more than a second.

@Shreeshrii
Collaborator

Thanks @stweil .

Here are the results on my system - Linux tesseract-ocr 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:54:50 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

ubuntu@tesseract-ocr:~/TEST$  time tesseract -l eng 62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png 1
Tesseract Open Source OCR Engine v5.0.0-alpha-332-gb839 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    0m15.212s
user    0m9.774s
sys     0m0.186s

ubuntu@tesseract-ocr:~/TEST$ OMP_THREAD_LIMIT=1 time tesseract -l eng 62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png 1
Tesseract Open Source OCR Engine v5.0.0-alpha-332-gb839 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
3.09user 0.02system 0:03.11elapsed 99%CPU (0avgtext+0avgdata 88064maxresident)k
0inputs+128outputs (0major+2175minor)pagefaults 0swaps
ubuntu@tesseract-ocr:~/TEST$ tesseract -v
tesseract 5.0.0-alpha-332-gb839
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0

IBM POWER8 - 8 CPU, 24 GB RAM

@ripefig
Author

ripefig commented Aug 12, 2019

@Shreeshrii seems like you have the same problem, given that your system is much more powerful.

@stweil
Member

stweil commented Aug 13, 2019

Test result on an ARM system:

# tessdata_fast
real	0m1.864s
user	0m4.778s
sys	0m0.147s

With export OMP_THREAD_LIMIT=1:

# tessdata_fast
real	0m2.078s
user	0m1.950s
sys	0m0.099s

The results for 4.0.0 and latest Git master are similar.

@stweil
Member

stweil commented Aug 13, 2019

@ripefig, your results could be explained if Tesseract cannot get 4 CPU cores. On my ARM system which has 4 cores I get a faster result with export OMP_THREAD_LIMIT=2:

# tessdata_fast
real	0m1.426s
user	0m1.969s
sys	0m0.159s

That also reduces the huge overhead in the user time which occurs with 4 threads.

@ripefig
Author

ripefig commented Aug 13, 2019

$ time OMP_THREAD_LIMIT=1  tesseract -l eng  62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    0m0.366s
user    0m0.346s
sys     0m0.012s

$ time OMP_THREAD_LIMIT=3  tesseract -l eng  62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    0m1.933s
user    0m3.652s
sys     0m0.037s


$ time OMP_THREAD_LIMIT=2  tesseract -l eng  62771160-3bf82880-baa5-11e9-8d39-4d9c4381a093.png 1
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    0m0.732s
user    0m0.757s
sys     0m0.032s
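To put the figures in this thread on a common footing, the bash `time` values can be converted to seconds and compared. This is a small Python sketch; `to_seconds` is a hypothetical helper, and the inputs are the default and OMP_THREAD_LIMIT=1 timings reported for the i7-4600U machine.

```python
import re


def to_seconds(stamp):
    """Convert a bash `time` value such as '1m9.096s' or '0m0,209s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)[.,](\d+)s", stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    minutes, whole, frac = match.groups()
    return int(minutes) * 60 + float(f"{whole}.{frac}")


# Figures copied from the reports in this thread (same machine, same image).
default_real = to_seconds("1m9.096s")
default_user = to_seconds("3m7.484s")
limit1_real = to_seconds("0m0.366s")

# How many times longer the default run took in wall-clock time.
slowdown = default_real / limit1_real
# Average number of cores kept busy during the default run (user / real).
busy_cores = default_user / default_real

print(f"slowdown: {slowdown:.0f}x, busy cores: {busy_cores:.1f}")
```

With these inputs the default run comes out roughly 189 times slower in wall-clock time while keeping about 2.7 cores busy, which fits the picture of threads spending their time waiting rather than working.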

@ripefig
Author

ripefig commented Aug 15, 2019

@stweil Is there any solution? Maybe limit the default number of cores to 1 (or max cores - 1) until Tesseract can reliably work with all cores? It seems completely broken for a lot of users, and the problem has persisted for years. This also breaks all the software that uses Tesseract.

@stweil
Member

stweil commented Aug 15, 2019

The right solution depends on your hardware (number of cores, memory interface) and your use case: on some hardware using more than one core results in faster OCR (see my results above), and training is much faster with 4 cores. It is always possible to either set OMP_THREAD_LIMIT or to build your own binary without multithreading. Without that, Tesseract is not "completely broken" or unreliable, but simply slow. I know that is not nice. The Windows binaries from UB Mannheim are therefore built without multithreading.

Because there are acceptable solutions for the speed issue, my current first priority is improving quality, not looking how to improve multithreading. If you or someone else finds a better solution for multithreading, a pull request would be welcomed.
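Since the best thread limit depends on the machine, one way to find it is to time the same command under several limits. A rough Python sketch follows; `time_command`, `sweep` and `fastest_limit` are hypothetical helper names, and the example command in the trailing comment uses placeholder file names.

```python
import os
import subprocess
import time


def time_command(cmd, thread_limit):
    """Return wall-clock seconds for one run of cmd under OMP_THREAD_LIMIT."""
    env = dict(os.environ, OMP_THREAD_LIMIT=str(thread_limit))
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def sweep(cmd, limits=(1, 2, 3, 4), repeats=3):
    """Map each thread limit to its best (minimum) wall-clock time."""
    return {
        limit: min(time_command(cmd, limit) for _ in range(repeats))
        for limit in limits
    }


def fastest_limit(timings):
    """Pick the thread limit with the smallest measured time."""
    return min(timings, key=timings.get)


# Example (not executed here; image name and output base are placeholders):
# timings = sweep(["tesseract", "-l", "eng", "page.png", "out"])
# print(timings, "->", fastest_limit(timings))
```

Taking the minimum over a few repeats is a design choice to dampen noise from other processes; it does not replace testing on representative pages.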

@ripefig
Author

ripefig commented Aug 15, 2019

Out of the box, it takes about one hour to OCR a single page of text. It would take one month to OCR a textbook, and the CPU would probably fry. I think most users would consider this "completely broken," in the sense of not being usable.

The issue affects both AVX and non-AVX systems. The program is capable of cutting down times by two orders of magnitude in both cases, as demonstrated in this thread. Why not just limit the core count by default until the issue is fixed?

Of course, one could argue that it's up to application developers to make sure tesseract works on the target system. (I just tried a few OCR apps and most of them work fine - so it looks like they are fixing it on their end somehow).

@stweil
Member

stweil commented Aug 15, 2019

That's simply not true. It is slow on your notebook. On my six-year-old notebook (no AVX, 4 x Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz) the official Debian package works pretty well:

$ time tesseract 62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png -
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real	0m0.327s
user	0m0.796s
sys	0m0.024s

@ripefig
Author

ripefig commented Aug 15, 2019

I didn't say it affects all systems, but it's frequent enough to warrant some kind of change. Multicore might give a 5-30% speed improvement in certain cases, but it can also make OCR around 100 times slower on many systems. The Intel® Core™ i7-4600U isn't exactly an exotic CPU.

Perhaps you're saying this is an Ubuntu issue?

jbarlow83 pushed a commit to ocrmypdf/OCRmyPDF that referenced this issue Oct 20, 2019
Based on a user suggestion and
tesseract-ocr/tesseract#2611, I reviewed thread limits and found that
thread limit of 3 is still beneficial, but not 4.

> time env OMP_THREAD_LIMIT=2 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
116.67user 1.67system 1:26.26elapsed 137%CPU (0avgtext+0avgdata 356752maxresident)k
2213inputs+0outputs (18major+131059minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=3 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
136.89user 1.63system 1:19.56elapsed 174%CPU (0avgtext+0avgdata 356784maxresident)k
821inputs+0outputs (0major+131080minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=4 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
161.31user 1.51system 1:18.80elapsed 206%CPU (0avgtext+0avgdata 356632maxresident)k
8477inputs+0outputs (12major+131074minor)pagefaults 0swaps
> time env OMP_THREAD_LIMIT=8 tesseract omp4.png stdout >/dev/null
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143
160.30user 1.62system 1:18.01elapsed 207%CPU (0avgtext+0avgdata 356640maxresident)k
821inputs+0outputs (0major+131078minor)pagefaults 0swaps
@amitdo
Collaborator

amitdo commented Oct 21, 2019

Processors: 4 × Intel® Core™ i7-4600U CPU @ 2.10GHz

https://ark.intel.com/content/www/us/en/ark/products/76616/intel-core-i7-4600u-processor-4m-cache-up-to-3-30-ghz.html

# of Cores 2

@dagnelies

Indeed, that whole multithreading thing caused more harm than good. There are a few issues around regarding this. I believe some people even compile specialized versions where OMP is completely removed since it runs more or less as fast, but with way less CPU consumption.

@zdenop
Contributor

zdenop commented Oct 30, 2019

Closing as a duplicate of #263

@Shreeshrii
Collaborator

@stweil I want to compare the timing on Power8 to AVX2. I notice that the results you reported were with tesseract 4.0.0. Please rerun the test with the latest code.

My current result is:

 time tesseract -l eng 62841051-2b65cf00-bcac-11e9-97df-bf85ff0b09bf.png 1
Tesseract Open Source OCR Engine v5.0.0-alpha-537-g6f31 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 143

real    0m1.109s
user    0m3.524s
sys     0m0.032s

@stweil
Member

stweil commented Nov 17, 2019

@Shreeshrii, my results for Power8 differ significantly when I use tessdata_fast. Did you test with tessdata_best? Power8 can be improved a lot by using SIMD. For tessdata_best that is easy to implement.

Intermediate results (more results will get added later) with git master:

# Power8, fast, configure (default options)
real	0m0.802s
user	0m1.815s
sys	0m0.030s

# Power8, fast, configure --disable-openmp --disable-shared
real	0m1.243s
user	0m1.231s
sys	0m0.012s

# Power8, best, configure (default options)
real	0m1.329s
user	0m3.804s
sys	0m0.031s

# Power8, best, configure (default options), OMP_THREAD_LIMIT=1
real	0m3.155s
user	0m3.139s
sys	0m0.019s

# Power8, best, configure (default options), SIMD
real	0m1.144s
user	0m2.748s
sys	0m0.045s

# Power8, best, configure (default options), SIMD, OMP_THREAD_LIMIT=1
real	0m1.858s
user	0m1.842s
sys	0m0.019s

# Power8, best, configure --disable-openmp --disable-shared
real	0m2.981s
user	0m2.957s
sys	0m0.024s

# Power8, best, configure --disable-openmp --disable-shared, SIMD
real	0m1.686s
user	0m1.669s
sys	0m0.016s

@stweil
Member

stweil commented Nov 18, 2019

Do any compile options also need to be changed?

Of course it needs OpenMP (otherwise the compiler will raise an error with the current patch), so -fopenmp is required for that file even if OpenMP was disabled for the rest of the code. I also added -maltivec and -mabi=altivec in my test. Maybe -mcpu=native could be added for all files.

@stweil
Member

stweil commented Nov 18, 2019

New results for an ARMv8 based NVIDIA Xavier running Ubuntu Bionic:

# best, configure --disable-openmp --disable-shared
real	0m5.502s
user	0m5.352s
sys	0m0.080s

# best, configure --disable-openmp --disable-shared, SIMD
real	0m3.534s
user	0m3.400s
sys	0m0.080s

Tesseract must be called with -c dotproduct=native to use SIMD.

@stweil
Member

stweil commented Nov 18, 2019

The old ARMv7 results were made with tessdata_fast. Here are new results:

# best, configure --disable-openmp --disable-shared
real	0m7.218s
user	0m6.859s
sys	0m0.248s

For this host, SIMD makes no difference. Tesseract uses NEON anyway.

@Shreeshrii
Collaborator

Shreeshrii commented Nov 18, 2019

Can a different dot product calculation be used for AltiVec? Please see slide 16 in https://www.nxp.com/files-static/training_presentation/TP_ALTIVEC.pdf

#include <altivec.h>

// Dot product of two arrays of AltiVec vectors (4 floats per vector).
// length is the number of vectors and must be a multiple of 4.
float FastVectorDotProduct(vector float *v1, vector float *v2, int length) {
  vector float temp = (vector float) vec_splat_s8(0);
  vector float temp2 = temp;
  vector float temp3 = temp;
  vector float temp4 = temp;
  float result;  // scalar, so it matches the return type (the slide used a vector and an int return)
  // Loop over the vectors, doing 4 vectors in parallel to fill the pipeline
  for (int i = 0; i < length; i += 4) {
    temp = vec_madd(v1[i], v2[i], temp);
    temp2 = vec_madd(v1[i + 1], v2[i + 1], temp2);
    temp3 = vec_madd(v1[i + 2], v2[i + 2], temp3);
    temp4 = vec_madd(v1[i + 3], v2[i + 3], temp4);
  }
  // Sum our temp vectors
  temp = vec_add(temp, temp2);
  temp3 = vec_add(temp3, temp4);
  temp = vec_add(temp, temp3);
  // Add across the vector; afterwards every element holds the total
  temp = vec_add(temp, vec_sld(temp, temp, 4));
  temp = vec_add(temp, vec_sld(temp, temp, 8));
  // Copy one element of the result to the stack so we can return it
  vec_ste(temp, 0, &result);
  return result;
}

@amitdo
Collaborator

amitdo commented Nov 18, 2019

It's done automatically with the openmp-simd code.

@amitdo
Collaborator

amitdo commented Nov 18, 2019

@stweil, did you benchmark this against the manual code on an x86-64 machine?

@Shreeshrii
Collaborator

Of course it needs OpenMP (otherwise the compiler will raise an error with the current patch), so -fopenmp is required for that file even if OpenMP was disabled for the rest of the code.

Shouldn't --enable-openmp set -fopenmp?

I added the following to my build script.

export CXXFLAGS="-fopenmp -maltivec -mabi=altivec -mcpu=power8"

Now tesseract --version reports OpenMP, which I haven't seen before with --enable-openmp builds.

 tesseract -v
tesseract 5.0.0-alpha-554-g9ed3
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0
 Found OpenMP 201511

@amitdo
Collaborator

amitdo commented Nov 18, 2019

best, configure --disable-openmp --disable-shared

Of course it needs OpenMP (otherwise the compiler will raise an error with the current patch), so -fopenmp is required for that file even if OpenMP was disabled for the rest of the code.

@stweil, It does not make sense to disable openmp and then to enable it.

@amitdo
Collaborator

amitdo commented Nov 18, 2019

clang and gcc (>=4.9) both support the flag -fopenmp-simd.

@amitdo
Collaborator

amitdo commented Nov 18, 2019

This code is more complete.

74f72e1#diff-d0fa47c1b7e2cb742a89b8c8f824df62R343

@Shreeshrii
Collaborator

Shreeshrii commented Nov 19, 2019

Timings are given as real / user / sys.

| System | fast, default options | fast, OMP_THREAD_LIMIT=1 |
|---|---|---|
| ripefig - intel | 1m9.096s / 3m7.484s / 0m0.335s | 0m0.366s / 0m0.346s / 0m0.012s |
| stweil - intel | 0m0.327s / 0m0.796s / 0m0.024s | |
| stweil - ARMv7 | 0m1.864s / 0m4.778s / 0m0.147s | 0m2.078s / 0m1.950s / 0m0.099s |
| stweil - Power8 | 0m0.802s / 0m1.815s / 0m0.030s | |
| stweil - Recent Debian | 0m0,209s / 0m0,497s / 0m0,024s | 0m0,255s / 0m0,247s / 0m0,008s |

@Shreeshrii
Collaborator

Timings are given as real / user / sys.

| System | best, default options | best, OMP_THREAD_LIMIT=1 | best, --disable-openmp --disable-shared | best, --disable-openmp --disable-shared, SIMD |
|---|---|---|---|---|
| ARMv7 | | | 0m7.218s / 0m6.859s / 0m0.248s | No Diff |
| ARMv8 | | | 0m5.502s / 0m5.352s / 0m0.080s | 0m3.534s / 0m3.400s / 0m0.080s |
| Power8 | 0m1.329s / 0m3.804s / 0m0.031s | 0m3.155s / 0m3.139s / 0m0.019s | 0m2.981s / 0m2.957s / 0m0.024s | 0m1.686s / 0m1.669s / 0m0.016s |

@Shreeshrii
Collaborator

What is obvious from the timing results is that there is a lot of variation across platforms and across options.

I saw a lot of variation in time even on the same platform - see #2611 (comment)

@Shreeshrii
Collaborator

The training on Power8 should be much faster with SIMD.

@stweil Which parts of the training process will be sped up by this? lstmtraining? I would like to test/benchmark with and without the suggested SIMD test patch.

@stweil
Member

stweil commented Nov 24, 2019

Yes, lstmtraining will be faster, both for the training part and for the evaluation. That process uses up to two cores when OpenMP is disabled or up to 8 cores with OpenMP.

But I still do not know how dotproduct=native can be enabled for lstmtraining.

@Shreeshrii
Collaborator

You had also mentioned earlier, in a different thread, about unrolled loops. Should that also be implemented along with this for Power?

#2106 (comment)

@stweil
Member

stweil commented Nov 24, 2019

Ideally loop unrolling should also be done by the compiler (try -O3 or -funroll-loops).

@Shreeshrii
Collaborator

I set both -O3 and -ffast-math, similar to

set(MARCH_NATIVE_FLAGS "${MARCH_NATIVE_FLAGS} -O3 -ffast-math")

and the unit test linlsq_test failed with the following error. It passes when I remove -ffast-math.


Running main() from ../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from LLSQTest
[ RUN      ] LLSQTest.BasicLines
[       OK ] LLSQTest.BasicLines (0 ms)
[ RUN      ] LLSQTest.Vectors
../../unittest/linlsq_test.cc:63: Failure
The difference between correct_vector.y() and vector.y() is 2, which exceeds tolerance, where
correct_vector.y() evaluates to 1,
vector.y() evaluates to -1, and
tolerance evaluates to 9.9999999747524271e-07.
[  FAILED  ] LLSQTest.Vectors (1 ms)
[ RUN      ] LLSQTest.RmsOrthWorksAsIntended
[       OK ] LLSQTest.RmsOrthWorksAsIntended (0 ms)
[----------] 3 tests from LLSQTest (1 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (1 ms total)
[  PASSED  ] 2 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] LLSQTest.Vectors
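One general reason results can change under -ffast-math is that it allows GCC to treat floating-point arithmetic as associative and reorder sums. Whether that is what trips LLSQTest.Vectors is not established in this thread; the Python sketch below only illustrates, with made-up values, that reordering a floating-point sum can change the result.

```python
import math

# Hypothetical values chosen to make the rounding effect visible.
values = [1e16, 1.0, -1e16, 1.0]

# Left-to-right summation, in the order the source code is written.
in_order = 0.0
for v in values:
    in_order += v

# A reordered summation of the same numbers, the kind of change
# that value-changing optimizations are allowed to make.
reordered = (values[0] + values[2]) + (values[1] + values[3])

# math.fsum gives the correctly rounded sum for reference.
print(in_order, reordered, math.fsum(values))
```

Here the in-order sum loses one of the small terms and yields 1.0, while the reordered sum and the reference sum both yield 2.0.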

@Shreeshrii
Collaborator

lstmtraining will be faster, both for the training part and for the evaluation. That process uses up to two cores when OpenMP is disabled or up to 8 cores with OpenMP.

Is using up to 8 cores = 8 threads?

// A collection of DocumentData that knows roughly how much memory it is using.
// Note that while it supports background read-ahead, it assumes that a single
// thread is accessing documents, ie it is not safe for multiple threads to
// access different documents in parallel, as one may de-cache the other's
// content.

@stweil
Member

stweil commented Nov 25, 2019

Yes, I should have written "up to 8 threads". If there are only two CPUs with hyperthreading, those 8 threads will run on 4 cores, and the performance will be rather low. Of course you can use OMP_THREAD_LIMIT=4 to handle this, but I am not sure how that will distribute the cores for training and evaluation.

The cache for image data works also with a separate thread, but that does not use OpenMP, so it also works when OpenMP was disabled.
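Because hyperthreading makes the logical CPU count larger than the number of physical cores, a script that picks a value for OMP_THREAD_LIMIT may want to count physical cores first. The Python sketch below assumes the x86 Linux layout of /proc/cpuinfo (fields `physical id` and `core id`); both function names are hypothetical, and capping at the physical core count is a heuristic, not a setting recommended anywhere in this thread.

```python
def count_physical_cores(cpuinfo_text):
    """Count distinct (physical id, core id) pairs in /proc/cpuinfo text.

    Returns 0 if the fields are missing (some architectures omit them).
    """
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = value
        elif key == "core id":
            cores.add((physical_id, value))
    return len(cores)


def suggested_thread_limit(cpuinfo_text, fallback=1):
    """Suggest an OMP_THREAD_LIMIT no larger than the physical core count."""
    return count_physical_cores(cpuinfo_text) or fallback


# Usage on Linux (not executed here):
# with open("/proc/cpuinfo") as f:
#     print(suggested_thread_limit(f.read()))
```

On a dual-core CPU with hyperthreading, such as the i7-4600U discussed earlier, this would count 2 cores even though the system reports 4 processors.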

@Shreeshrii
Collaborator

//ie it is not safe for multiple threads to
// access different documents in parallel, as one may de-cache the other's
// content.

I was asking about threads with regard to the above comment, whether multiple threads lead to slowing down.

@Shreeshrii
Collaborator

Shreeshrii commented Nov 25, 2019

Each cell gives real / user / sys, followed by the error rate where one was reported.

| Tesstutorial phase (programs) | Master | #pragma omp simd | omp simd reduction |
|---|---|---|---|
| 1-makedata (tesstrain.sh) | 87.50 / 107.15 / 4.85 | 67.93 / 107.51 / 4.88 | 67.77 / 107.19 / 5.06 |
| 2-scratch (lstmtraining, lstmeval) | 7170.90 / 21735.34 / 124.71; Error rate = 0.626 | 7244.33 / 22014.08 / 128.46; Error rate = 0.751 | 7187.00 / 21784.96 / 124.24; Error rate = 0.704 |
| 3-impact-from-small (lstmtraining, lstmeval) | 643.20 / 1893.29 / 12.16; Error rate = 0.027 | 654.52 / 1936.23 / 14.80; Error rate = 0.036 | 641.19 / 1873.26 / 11.54; Error rate = 0.059 |
| 4-impact-from-full (lstmtraining, lstmeval) | 1407.57 / 4727.62 / 16.75; Error rate = 0.307 | 1464.04 / 4887.35 / 20.09; Error rate = 0.298 | 1407.77 / 4738.13 / 17.04; Error rate = 0.269 |
| 5-makedata-plusminus (tesstrain.sh) | 91.26 / 111.95 / 4.79 | 71.00 / 124.34 / 5.82 | 68.00 / 112.19 / 4.60 |
| 6-plusminus (lstmtraining, lstmeval) | 5975.63 / 18346.83 / 60.97; Error rate = 0.013 | 7539.99 / 20610.39 / 95.87; Error rate = 0.019 | 5956.91 / 18285.98 / 62.21; Error rate = 0.025 |
| 7-layer (lstmtraining, lstmeval) | 2808.52 / 8775.01 / 61.81; Error rate = 3.946 | 2793.53 / 8661.41 / 51.00; Error rate = 4.012 | 1865.26 / 5614.33 / 20.48; Error rate = 3.886 |
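To read the figures above as relative changes against Master, the wall-clock times can be compared directly. This is a small Python sketch; `percent_change` is a hypothetical helper, and only three of the phases are copied in for brevity.

```python
def percent_change(baseline, value):
    """Relative change of value against baseline, in percent (negative = faster)."""
    return (value - baseline) / baseline * 100.0


# Wall-clock ("real") times copied from the results above:
# (Master, #pragma omp simd, omp simd reduction).
real_times = {
    "2-scratch": (7170.90, 7244.33, 7187.00),
    "6-plusminus": (5975.63, 7539.99, 5956.91),
    "7-layer": (2808.52, 2793.53, 1865.26),
}

for phase, (master, simd, reduction) in real_times.items():
    print(f"{phase}: simd {percent_change(master, simd):+.1f}%, "
          f"reduction {percent_change(master, reduction):+.1f}%")
```

By this measure the 7-layer phase ran about a third faster with the reduction variant, while 6-plusminus ran about 26% slower with plain omp simd. These are single runs, so the run-to-run variation mentioned elsewhere in this thread applies.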

@Shreeshrii
Collaborator

Shreeshrii commented Nov 25, 2019

I have posted above the results of my test on power8 running tesstutorial (using scripts in shreeshrii/tess4training) with tesseract built from git master vs with SIMD patch as suggested by @stweil.

I ran the scripts one by one, without any other process running in the VM so as to get results that should be comparable.

The build was done using the Advance Toolchain (AT) rather than the distro's gcc since

AT is highly recommended when you want to build an optimized CPU-bound application on POWER. ref: https://developer.ibm.com/linuxonpower/advance-toolchain/advtool-faq/

PATH=/opt/at12.0/bin:/opt/at12.0/sbin:$PATH gcc --version
gcc (GCC) 8.3.1 20190304 (Advance-Toolchain-at12.0) [revision 269374]

Build options in both cases included the following:

export CXXFLAGS="-O3 -maltivec -mabi=altivec -mcpu=power8 -mtune=power8 -fopenmp"

../../configure --enable-openmp --disable-debug --disable-opencl --disable-graphics --disable-shared --with-tensorflow=no 

@Shreeshrii
Collaborator

Shreeshrii commented Nov 25, 2019

Questions:

  1. Is it OK to set the build options for all programs as I have done above?

  2. I am planning also to test using @amitdo 's suggestion to use

#pragma omp simd reduction(+:total)

Should I expect the result to be very different from

#pragma omp simd

  3. Is it OK to use the Advance Toolchain?
  PATH=/opt/at12.0/bin:/opt/at12.0/sbin:$PATH gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/at12.0/libexec/gcc/powerpc64le-linux-gnu/8.3.1/lto-wrapper
Target: powerpc64le-linux-gnu
Configured with: /build/at12.0_Ubuntu16_ppc64le-ppc64le/12/at12.0-1.ubuntu-16_ppc64le_ppc64le/sources/gcc/configure --build=powerpc64le-linux-gnu --host=powerpc64le-linux-gnu --target=powerpc64le-linux-gnu --with-cpu=default64 --prefix=/opt/at12.0 --with-long-double-128 --enable-secureplt --disable-multilib --with-advance-toolchain=at12.0 --with-glibc-version=2.28 --with-local-prefix=/opt/at12.0 --enable-threads=posix --enable-languages=c,c++,fortran,go --enable-__cxa_atexit --enable-shared --enable-checking=release --enable-lto --enable-gnu-indirect-function --enable-initfini-array --enable-linker-build-id --with-system-zlib --with-gmp-include=/opt/at12.0/include --with-gmp-lib=/opt/at12.0/lib64 --with-mpfr-include=/opt/at12.0/include --with-mpfr-lib=/opt/at12.0/lib64 --with-mpc-include=/opt/at12.0/include --with-mpc-lib=/opt/at12.0/lib64 --without-ppl --without-cloog --without-libelf --with-host-libstdcxx='-L/opt/at12.0/lib64 -lstdc++ -lsupc++ -lgmp -lgmpxx -lm' --with-cpu=power8 --with-tune=power8
Thread model: posix
gcc version 8.3.1 20190304 (Advance-Toolchain-at12.0) [revision 269374] (GCC)

Or should I use the following?

 gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/powerpc64le-linux-gnu/7/lto-wrapper
Target: powerpc64le-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=powerpc64le-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-objc-gc=auto --enable-secureplt --with-cpu=power8 --enable-targets=powerpcle-linux --disable-multilib --enable-multiarch --disable-werror --with-long-double-128 --enable-checking=release --build=powerpc64le-linux-gnu --host=powerpc64le-linux-gnu --target=powerpc64le-linux-gnu
Thread model: posix
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)

@Shreeshrii
Collaborator

Updated training test results in #2611 (comment)

@Danx69

Danx69 commented Aug 14, 2021

Tesseract was very slow when run from a script, but I noticed it was very fast in a terminal. Using your suggestions, I changed the script line to "OMP_THREAD_LIMIT=1 xterm -geometry 1X1+0+0 -e tesseract file1 file2" and got the same speed as in the terminal.
