This repository has been archived by the owner on Oct 3, 2022. It is now read-only.

Multiple jobs do not work with Tesseract 4 #31

Open
ashipunov opened this issue Feb 4, 2019 · 18 comments
Comments

@ashipunov

ashipunov commented Feb 4, 2019

Today I installed Tesseract 4 from the Ubuntu PPA (ppa:alex-p/tesseract-ocr, 4.0.0+git3515-9bcfa90c-1ppa1~xenial1). Tesseract itself works normally, and ocrodjvu also works fine with the default "-j 1". However, when I specify "-j 4", ocrodjvu hangs, and when I interrupt it, I get the following:

$ time ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu 
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3
- Page #4
- Page #5
- Page #6
^Ctesseract: Tesseract Open Source OCR Engine v4.0.0-288-g9bcf with Leptonica
tesseract: Page 1
tesseract: Detected 105 diacritics
tesseract: Tesseract Open Source OCR Engine v4.0.0-288-g9bcf with Leptonica
tesseract: Page 1
tesseract: Tesseract Open Source OCR Engine v4.0.0-288-g9bcf with Leptonica
tesseract: Page 1
Exception while processing page 3:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 434, in page_thread
    result = self.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 407, in process_page
    result = self._engine.recognize(pfile, language=self._options.language, details=self._options.details, uax29=self._options.uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 296, in recognize
    return f(image, language, details=details, uax29=uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 262, in recognize_hocr
    _wait_for_worker(worker)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 69, in _wait_for_worker
    worker.wait()
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/ipc.py", line 121, in wait
    raise CalledProcessInterrupted(-return_code, self.__command)
CalledProcessInterrupted: Command 'tesseract' was interrupted by signal SIGINT
Exception while processing page 4:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 434, in page_thread
    result = self.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 407, in process_page
    result = self._engine.recognize(pfile, language=self._options.language, details=self._options.details, uax29=self._options.uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 296, in recognize
    return f(image, language, details=details, uax29=uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 262, in recognize_hocr
    _wait_for_worker(worker)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 69, in _wait_for_worker
    worker.wait()
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/ipc.py", line 121, in wait
    raise CalledProcessInterrupted(-return_code, self.__command)
CalledProcessInterrupted: Command 'tesseract' was interrupted by signal SIGINT
tesseract: Tesseract Open Source OCR Engine v4.0.0-288-g9bcf with Leptonica
tesseract: Page 1
Interrupted by user.
Exception while processing page 5:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 434, in page_thread
    result = self.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 407, in process_page
    result = self._engine.recognize(pfile, language=self._options.language, details=self._options.details, uax29=self._options.uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 296, in recognize
    return f(image, language, details=details, uax29=uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 262, in recognize_hocr
    _wait_for_worker(worker)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 69, in _wait_for_worker
    worker.wait()
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/ipc.py", line 121, in wait
    raise CalledProcessInterrupted(-return_code, self.__command)
CalledProcessInterrupted: Command 'tesseract' was interrupted by signal SIGINT
Exception while processing page 6:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 434, in page_thread
    result = self.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 407, in process_page
    result = self._engine.recognize(pfile, language=self._options.language, details=self._options.details, uax29=self._options.uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 296, in recognize
    return f(image, language, details=details, uax29=uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 262, in recognize_hocr
    _wait_for_worker(worker)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 69, in _wait_for_worker
    worker.wait()
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/ipc.py", line 121, in wait
    raise CalledProcessInterrupted(-return_code, self.__command)
CalledProcessInterrupted: Command 'tesseract' was interrupted by signal SIGINT
Intermediate files were left in the '/tmp/ocrodjvu.3_ZmXE' directory.

real	30m20.909s
user	118m6.372s
sys	0m10.420s

I know that there are issues with multi-threading, so I followed the recommendations from

tesseract-ocr/tesseract#898

and from

https://appliedmachinelearning.blog/2018/06/30/performing-ocr-by-running-parallel-instances-of-tesseract-4-0-python/

to set the environment as 'OMP_THREAD_LIMIT=1 tesseract'. However, all my attempts failed: (a) renaming the executable and replacing it with a wrapper script, (b) writing a script that contains an alias, and finally (c) changing your code to pass this environment variable.

My system info output:

$ inxi
CPU~Dual core Intel Core i7-2620M (-HT-MCP-) speed/max~799/3400 MHz Kernel~4.4.0-141-generic x86_64 Up~4:18 Mem~1518.7/7865.9MB HDD~2000.4GB(30.6% used) Procs~197 Client~Shell inxi~2.2.35 

Ocrodjvu version:

$ ocrodjvu --version
ocrodjvu 0.10.4
+ Python 2.7.12
+ subprocess32
+ python-djvulibre 0.7
+ lxml 3.5.0
+ html5lib-python 0.999
+ PyICU 1.9.2
  + ICU 55.1
    + Unicode 7.0

In the end, I reverted everything to Tesseract 3, and now it works. This means, for example, that I cannot OCR books in Armenian and Quechua, as these languages are for some reason not available in Tesseract 3.

Please help.

@jwilk
Member

jwilk commented Feb 5, 2019

> $ time ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu

Is this reproducible for you? That is, when you run this command again, does it hang too?

Unfortunately, I wasn't able to reproduce the bug here:

$ time ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3
...
- Page #12

real    1m57.306s
user    2m55.692s
sys     2m29.692s

This was on Ubuntu 16.04 (xenial), same Tesseract and ocrodjvu versions as yours, and higher-end hardware (3 cores of Intel Xeon Gold 6140).

> tesseract-ocr/tesseract#898

This is weird. Unless something else is going on, excessive usage of threads shouldn't make things "infinitely slow".

But setting OMP_THREAD_LIMIT is a good idea anyway; I'll try to make ocrodjvu set this automatically for the next release.

In the meantime, you can set this manually:

$ time OMP_THREAD_LIMIT=1 ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #4
...
- Page #12

real    0m37.206s
user    1m45.108s
sys     0m2.484s
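
The same limit could also be applied from Python instead of the shell by passing a modified environment to the child process. The following is only a minimal sketch, written for Python 3; it is not ocrodjvu's actual implementation, and the helper names limited_env and run_limited are hypothetical.

```python
import os
import subprocess
import sys


def limited_env(base_env=None, limit=1):
    """Return a copy of the environment with OMP_THREAD_LIMIT set."""
    env = dict(os.environ if base_env is None else base_env)
    env['OMP_THREAD_LIMIT'] = str(limit)
    return env


def run_limited(command, limit=1):
    """Run a command with OMP_THREAD_LIMIT set and return its stdout."""
    result = subprocess.run(
        command,
        env=limited_env(limit=limit),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


if __name__ == '__main__':
    # Stand-in child that only echoes the variable; a real caller would
    # pass a tesseract command line here instead (assumption).
    child = [sys.executable, '-c',
             'import os; print(os.environ["OMP_THREAD_LIMIT"])']
    print(run_limited(child).decode().strip())
```

Copying the environment rather than mutating os.environ keeps the limit confined to the child processes.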

@jwilk jwilk added the bug label Feb 6, 2019
@jwilk jwilk mentioned this issue Feb 8, 2019
@ashipunov
Author

Sorry, here I am again with the same issue. First, without OMP_THREAD_LIMIT=1 the situation is the same:

$ inxi
CPU~Dual core Intel Core i7-2620M (-HT-MCP-) speed/max~804/3400 MHz Kernel~4.4.0-141-generic x86_64
$ ocrodjvu --version
ocrodjvu 0.10.4
+ Python 2.7.12
+ subprocess32
+ python-djvulibre 0.7
+ lxml 3.5.0
+ html5lib-python 0.999
+ PyICU 1.9.2
  + ICU 55.1
    + Unicode 7.0
$ tesseract --version
tesseract 4.0.0-297-gec8f
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
$ # now:
$ time ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3
- Page #4
- Page #5
- Page #6
^C
...
CalledProcessInterrupted: Command 'tesseract' was interrupted by signal SIGINT
Interrupted by user.
Intermediate files were left in the '/tmp/ocrodjvu.k9gIME' directory.
real	14m34.142s
user	58m3.444s
sys	0m4.336s

However:

$ time OMP_THREAD_LIMIT=1 ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #3
...
- Page #12
real	1m29.354s
user	5m24.756s
sys	0m3.328s

Finally, everything works with "-j 1" as it should (but, to my surprise, it is only two times slower than with "-j 4").

Wild idea: does it reflect the difference between Intel Xeon and Intel Core i7? If so, I should try a different machine.

@ashipunov
Author

So, here is another machine (sorry, I do not have any Xeons, but this is an i5 with four cores and eight hardware threads):

$  inxi
CPU~Quad core Intel Core i5-8250U (-HT-MCP-) speed/max~938/3400 MHz Kernel~4.15.0-43-generic x86_64
$ time ocrodjvu --in-place -l rus+lat -j 8 nikolaev1970_diat_posjet.djvu 
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3
...
- Page #10
^C
CalledProcessInterrupted: Command 'tesseract' was interrupted by signal SIGINT
Interrupted by user.
Intermediate files were left in the '/tmp/ocrodjvu.0DU1iF' directory.
real	14m43.246s
user	117m31.202s
sys	0m1.366s
$ # but:
$ time OMP_THREAD_LIMIT=1 ocrodjvu --in-place -l rus+lat -j 8 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3 
...
- Page #12
real	0m32.009s
user	3m30.699s
sys	0m3.298s

This is really weird!
I believe that this is a bug, probably associated with Intel i* processors.
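
The timings quoted in this thread can be turned into a rough "CPUs kept busy" figure by dividing user plus sys time by real time. The sketch below is only arithmetic on the numbers above, not a diagnosis; the function names to_seconds and busy_cpus are hypothetical.

```python
import re


def to_seconds(stamp):
    """Convert a bash `time` value such as '117m31.202s' to seconds."""
    match = re.fullmatch(r'(\d+)m(\d+(?:\.\d+)?)s', stamp)
    if match is None:
        raise ValueError('unrecognized time value: %r' % stamp)
    return int(match.group(1)) * 60 + float(match.group(2))


def busy_cpus(real, user, sys_time='0m0s'):
    """Average number of CPUs kept busy: (user + sys) / real."""
    return (to_seconds(user) + to_seconds(sys_time)) / to_seconds(real)


if __name__ == '__main__':
    # Figures copied from the i5 runs reported above.
    runs = [
        ('-j 8, no limit (interrupted)', '14m43.246s', '117m31.202s', '0m1.366s'),
        ('-j 8, OMP_THREAD_LIMIT=1', '0m32.009s', '3m30.699s', '0m3.298s'),
    ]
    for label, real, user, sys_time in runs:
        print('%-30s %.2f CPUs busy' % (label, busy_cpus(real, user, sys_time)))
```

On these figures the unrestricted run kept close to eight CPUs busy for the whole 14 minutes, which suggests the tesseract processes were spinning rather than sleeping.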

@jwilk
Member

jwilk commented Feb 11, 2019

Color me baffled. :-/

I've hacked up a script to dump some information about the hanging processes:
examine-hangs.
Hopefully it'll shed some light on what's going on, but I'm not overly optimistic…

I'd like you to do the following:

  1. Disable ptrace restrictions that would prevent GDB from working:
     # sysctl kernel.yama.ptrace_scope=0
  2. Install GDB and a bunch of debug packages:
     # apt-get install gdb djvulibre-dbg libc6-dbg libgcc1-dbg libgomp1-dbg libstdc++6-5-dbg python-djvu-dbg python-lxml-dbg python2.7-dbg
  3. Run ocrodjvu with -j 4 (without OMP_THREAD_LIMIT) until it hangs.
  4. Run examine-hangs. (It's going to produce a copious amount of output on stdout, so it's best to redirect it to a file.)

Send me the file with the output by email, or zip it and attach here.
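
For a quick first look before running the full procedure, the thread count and state of each tesseract process can be read straight from /proc on Linux. This is a rough sketch only and is not the examine-hangs script mentioned above; the names parse_status and find_processes are hypothetical, and it collects far less than GDB backtraces would.

```python
import os


def parse_status(text):
    """Parse the 'Key: value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(':')
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def find_processes(name, proc_root='/proc'):
    """Return sorted (pid, thread_count, state) tuples for matching processes."""
    found = []
    if not os.path.isdir(proc_root):
        return found
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        path = os.path.join(proc_root, entry, 'status')
        try:
            with open(path, errors='replace') as status_file:
                fields = parse_status(status_file.read())
        except OSError:
            # The process exited or is not readable; skip it.
            continue
        if fields.get('Name') == name:
            found.append((int(entry),
                          int(fields.get('Threads', 0)),
                          fields.get('State', '?')))
    return sorted(found)


if __name__ == '__main__':
    for pid, threads, state in find_processes('tesseract'):
        print('%d\t%d threads\t%s' % (pid, threads, state))
```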

@jwilk
Member

jwilk commented Feb 11, 2019

> But setting OMP_THREAD_LIMIT is a good idea anyway; I'll try to make ocrodjvu set this automatically

This was implemented in 0.11.

@ashipunov
Author

ashipunov commented Feb 23, 2019 via email

@jwilk
Member

jwilk commented Mar 1, 2019

Here's the summary of the report I received from @ashipunov:

  • There's the ocrodjvu process running (with 10 threads), and 8 tesseract processes (with 4 threads each).

  • After almost 6 minutes, only the first page is done. All the tesseract processes seem to be consuming CPU:

      PID   LWP S  STARTED     ELAPSED     TIME %CPU   RSZ    VSZ COMMAND
     7493     - - 15:41:18       05:57 00:00:00  0.1 63360 1395120 /usr/bin/python /usr/local/bin/ocrodjvu --in-place -l rus+lat -j 8 nikolaev1970_diat_posjet.djvu
     7516     - - 15:41:18       05:57 00:03:28 58.5 54180 120080 tesseract /tmp/ocrodjvu.a6lLsl/000002.tif /tmp/ocrodjvu.RCQ4D3/tmp -l rus+lat /tmp/ocrodjvu.RCQ4D3/tessconf
     7517     - - 15:41:18       05:57 00:04:55 82.8 72088 137560 tesseract /tmp/ocrodjvu.a6lLsl/000005.tif /tmp/ocrodjvu.VawxBw/tmp -l rus+lat /tmp/ocrodjvu.VawxBw/tessconf
     7518     - - 15:41:18       05:57 00:06:25  107 59708 124680 tesseract /tmp/ocrodjvu.a6lLsl/000007.tif /tmp/ocrodjvu.KOh_L6/tmp -l rus+lat /tmp/ocrodjvu.KOh_L6/tessconf
     7519     - - 15:41:18       05:57 00:06:26  108 73656 138964 tesseract /tmp/ocrodjvu.a6lLsl/000003.tif /tmp/ocrodjvu.e5v4Qh/tmp -l rus+lat /tmp/ocrodjvu.e5v4Qh/tessconf
     7520     - - 15:41:18       05:57 00:06:24  107 70500 135832 tesseract /tmp/ocrodjvu.a6lLsl/000006.tif /tmp/ocrodjvu.1GJ9Kc/tmp -l rus+lat /tmp/ocrodjvu.1GJ9Kc/tessconf
     7521     - - 15:41:18       05:57 00:06:24  107 73040 138592 tesseract /tmp/ocrodjvu.a6lLsl/000004.tif /tmp/ocrodjvu.VgIIXx/tmp -l rus+lat /tmp/ocrodjvu.VgIIXx/tessconf
     7560     - - 15:41:30       05:45 00:06:10  107 70360 135840 tesseract /tmp/ocrodjvu.a6lLsl/000008.tif /tmp/ocrodjvu.96qHYI/tmp -l rus+lat /tmp/ocrodjvu.96qHYI/tessconf
     7581     - - 15:42:32       04:43 00:05:09  109 74156 139876 tesseract /tmp/ocrodjvu.a6lLsl/000009.tif /tmp/ocrodjvu.4XLOur/tmp -l rus+lat /tmp/ocrodjvu.4XLOur/tessconf
    
  • Backtraces from ocrodjvu threads look fine:

    • the main thread:
        Waiting for the GIL
        File "/usr/lib/python2.7/threading.py", line 340, in wait
          waiter.acquire()
        File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 512, in _process
          condition.wait()
        File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 549, in process
          self._process(*args, **kwargs)
        File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 567, in main
          context.process(options.path, options.pages)
        File "/usr/local/bin/ocrodjvu", line 7, in <module>
          _.main(sys.argv)
      
    • internal python-djvulibre thread:
      #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
      #1  0x00007fdb8cf085f6 in DJVU::GMonitor::wait (this=this@entry=0x1e44af0) at GThreads.cpp:576
      #2  0x00007fdb8cf47ec0 in ddjvu_message_wait (ctx=0x1e44ae0) at ddjvuapi.cpp:733
      #3  0x00007fdb8c3a7319 in __pyx_pf_4djvu_6decode__Context_message_distributor (__pyx_self=<optimized out>, __pyx_v_kwargs={'sentinel': <object at remote 0x7fdb8f2e4120>}, __pyx_v_self=<optimized out>) at build/temp.linux-x86_64-2.7/src/decode.c:15397
      #4  __pyx_pw_4djvu_6decode_1_Context_message_distributor (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at build/temp.linux-x86_64-2.7/src/decode.c:15312
      #5  0x00000000004a587e in PyObject_Call () at ../Objects/abstract.c:2546
      #6  0x00000000004c5f3d in PyEval_CallObjectWithKeywords () at ../Python/ceval.c:4219
      #7  0x0000000000589662 in t_bootstrap () at ../Modules/threadmodule.c:620
      #8  0x00007fdb8efd46ba in start_thread (arg=0x7fdb89105700) at pthread_create.c:333
      #9  0x00007fdb8ed0a41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
      
    • 8 worker threads:
        File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 67, in _wait_for_worker
          stderr = worker.stderr.readlines()
        File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 262, in recognize_hocr
          _wait_for_worker(worker)
        File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 296, in recognize
          return f(image, language, details=details, uax29=uax29)
        File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 407, in process_page
          result = self._engine.recognize(pfile, language=self._options.language, details=self._options.details, uax29=self._options.uax29)
        File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 434, in page_thread
          result = self.process_page(page)
        File "/usr/lib/python2.7/threading.py", line 754, in run
          self.__target(*self.__args, **self.__kwargs)
        File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
          self.run()
        File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
          self.__bootstrap_inner()
      
  • There are no backtraces for tesseract processes, because apparently GDB hangs on them. :-(
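
The worker-thread backtraces above end in worker.stderr.readlines(), which is what one would expect while the child is still running: reading a pipe to the end returns only once the child closes its stderr, typically by exiting. The sketch below demonstrates this with a stand-in child process instead of tesseract; it is written for Python 3, and the names CHILD_CODE and read_all_stderr are hypothetical.

```python
import subprocess
import sys
import time

# Stand-in child: writes a line, stays alive for a while, writes another.
CHILD_CODE = (
    'import sys, time\n'
    'sys.stderr.write("first\\n"); sys.stderr.flush()\n'
    'time.sleep(0.5)\n'
    'sys.stderr.write("last\\n")\n'
)


def read_all_stderr():
    """Start the child, read all of its stderr, and time how long that takes."""
    worker = subprocess.Popen([sys.executable, '-c', CHILD_CODE],
                              stderr=subprocess.PIPE)
    started = time.monotonic()
    lines = worker.stderr.readlines()  # blocks until the pipe reaches EOF
    elapsed = time.monotonic() - started
    worker.stderr.close()
    worker.wait()
    return lines, elapsed


if __name__ == '__main__':
    lines, elapsed = read_all_stderr()
    print(lines, 'after %.2f s' % elapsed)
```

The read returns both lines only after the child's half-second sleep, so a thread parked in readlines() says nothing more than that its tesseract process has not finished yet.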

@ashipunov
Author

ashipunov commented Mar 1, 2019 via email
