Multiple jobs do not work with Tesseract 4 #31
Comments
Is this reproducible for you? That is, when you run this command again, does it hang too? Unfortunately, I wasn't able to reproduce the bug here:
$ time ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3
...
- Page #12
real 1m57.306s
user 2m55.692s
sys 2m29.692s
This was on Ubuntu 16.04 (xenial), with the same Tesseract and ocrodjvu versions as yours, and higher-end hardware (3 cores of Intel Xeon Gold 6140). This is weird. Unless something else is going on, excessive usage of threads shouldn't make things "infinitely slow". But setting [...] In the meantime, you can set OMP_THREAD_LIMIT manually:
$ time OMP_THREAD_LIMIT=1 ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #4
...
- Page #12
real 0m37.206s
user 1m45.108s
sys 0m2.484s
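The same workaround can be applied when spawning Tesseract from Python yourself. This is only a minimal sketch, assuming a `tesseract` binary on PATH; `worker_env` and `run_tesseract` are hypothetical helper names and not part of ocrodjvu:

```python
import os
import subprocess


def worker_env(base_env=None, thread_limit=1):
    """Return a copy of the environment with OMP_THREAD_LIMIT set."""
    env = dict(os.environ if base_env is None else base_env)
    env['OMP_THREAD_LIMIT'] = str(thread_limit)
    return env


def run_tesseract(image_path, output_base, languages='rus+lat'):
    """Run one tesseract process limited to a single OpenMP thread.

    Hypothetical helper: the argument layout mirrors the command lines
    seen in this thread (input image, output base, -l languages).
    """
    return subprocess.run(
        ['tesseract', image_path, output_base, '-l', languages],
        env=worker_env(),
        check=True,
    )
```

The environment is copied rather than modified in place, so the limit affects only the child process and not the calling program.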
Sorry, here I am again with the same issue. First, without OMP_THREAD_LIMIT=1 the situation is the same:
However:
Finally, everything works with "-j 1" as it should (but, to my surprise, only two times slower than with "-j 4"). Wild idea: does it reflect the difference between Intel Xeon and Intel Core i7? If so, I should try a different machine.
So, on another machine (sorry, I do not have Xeons, but this is an i5 with eight cores):
This is really weird!
Color me baffled. :-/ I've hacked up a script to dump some information about the hanging processes: I'd like you to do the following:
# sysctl kernel.yama.ptrace_scope=0
# apt-get install gdb djvulibre-dbg libc6-dbg libgcc1-dbg libgomp1-dbg libstdc++6-5-dbg python-djvu-dbg python-lxml-dbg python2.7-dbg
Send me the file with the output by email, or zip it and attach it here.
This was implemented in 0.11.
Now examine_hangs.sh hangs itself ;) but it outputs something large. I'm
attaching a ZIP because the output is bulky.
AS
Sat, 23 Feb 2019 at 15:38, Alexey Shipunov <dactylorhiza@gmail.com>:
…
Sure. Five minutes.
AS
Sat, 23 Feb 2019 at 15:37, Jakub Wilk ***@***.***>:
>
> Yikes, there was a bug in the examination script that broke it almost completely. :-(
> I've fixed this in 0ca41df.
> Could you try again with the updated script?
Here's the summary of the report I received from @ashipunov:
So, to summarize the summary, it is still unclear why... I believe that my
hypothesis about a processor-related issue might be plausible.
Fri, 1 Mar 2019 at 12:02, Jakub Wilk <notifications@github.com>:
… Here's the summary:
- There's the ocrodjvu process running (with 10 threads), and 8
tesseract processes (with 4 threads each).
- After almost 6 minutes, only the first page is done. All the tesseract
processes seem to be consuming CPU:
PID LWP S STARTED ELAPSED TIME %CPU RSZ VSZ COMMAND
7493 - - 15:41:18 05:57 00:00:00 0.1 63360 1395120 /usr/bin/python /usr/local/bin/ocrodjvu --in-place -l rus+lat -j 8 nikolaev1970_diat_posjet.djvu
7516 - - 15:41:18 05:57 00:03:28 58.5 54180 120080 tesseract /tmp/ocrodjvu.a6lLsl/000002.tif /tmp/ocrodjvu.RCQ4D3/tmp -l rus+lat /tmp/ocrodjvu.RCQ4D3/tessconf
7517 - - 15:41:18 05:57 00:04:55 82.8 72088 137560 tesseract /tmp/ocrodjvu.a6lLsl/000005.tif /tmp/ocrodjvu.VawxBw/tmp -l rus+lat /tmp/ocrodjvu.VawxBw/tessconf
7518 - - 15:41:18 05:57 00:06:25 107 59708 124680 tesseract /tmp/ocrodjvu.a6lLsl/000007.tif /tmp/ocrodjvu.KOh_L6/tmp -l rus+lat /tmp/ocrodjvu.KOh_L6/tessconf
7519 - - 15:41:18 05:57 00:06:26 108 73656 138964 tesseract /tmp/ocrodjvu.a6lLsl/000003.tif /tmp/ocrodjvu.e5v4Qh/tmp -l rus+lat /tmp/ocrodjvu.e5v4Qh/tessconf
7520 - - 15:41:18 05:57 00:06:24 107 70500 135832 tesseract /tmp/ocrodjvu.a6lLsl/000006.tif /tmp/ocrodjvu.1GJ9Kc/tmp -l rus+lat /tmp/ocrodjvu.1GJ9Kc/tessconf
7521 - - 15:41:18 05:57 00:06:24 107 73040 138592 tesseract /tmp/ocrodjvu.a6lLsl/000004.tif /tmp/ocrodjvu.VgIIXx/tmp -l rus+lat /tmp/ocrodjvu.VgIIXx/tessconf
7560 - - 15:41:30 05:45 00:06:10 107 70360 135840 tesseract /tmp/ocrodjvu.a6lLsl/000008.tif /tmp/ocrodjvu.96qHYI/tmp -l rus+lat /tmp/ocrodjvu.96qHYI/tessconf
7581 - - 15:42:32 04:43 00:05:09 109 74156 139876 tesseract /tmp/ocrodjvu.a6lLsl/000009.tif /tmp/ocrodjvu.4XLOur/tmp -l rus+lat /tmp/ocrodjvu.4XLOur/tessconf
- Backtraces from ocrodjvu threads look fine:
- the main thread:
Waiting for the GIL
File "/usr/lib/python2.7/threading.py", line 340, in wait
waiter.acquire()
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 512, in _process
condition.wait()
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 549, in process
self._process(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 567, in main
context.process(options.path, options.pages)
File "/usr/local/bin/ocrodjvu", line 7, in <module>
_.main(sys.argv)
- internal python-djvulibre thread:
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fdb8cf085f6 in DJVU::GMonitor::wait (this=this@entry=0x1e44af0) at GThreads.cpp:576
#2 0x00007fdb8cf47ec0 in ddjvu_message_wait (ctx=0x1e44ae0) at ddjvuapi.cpp:733
#3 0x00007fdb8c3a7319 in __pyx_pf_4djvu_6decode__Context_message_distributor (__pyx_self=<optimized out>, __pyx_v_kwargs={'sentinel': <object at remote 0x7fdb8f2e4120>}, __pyx_v_self=<optimized out>) at build/temp.linux-x86_64-2.7/src/decode.c:15397
#4 __pyx_pw_4djvu_6decode_1_Context_message_distributor (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at build/temp.linux-x86_64-2.7/src/decode.c:15312
#5 0x00000000004a587e in PyObject_Call () at ../Objects/abstract.c:2546
#6 0x00000000004c5f3d in PyEval_CallObjectWithKeywords () at ../Python/ceval.c:4219
#7 0x0000000000589662 in t_bootstrap () at ../Modules/threadmodule.c:620
#8 0x00007fdb8efd46ba in start_thread (arg=0x7fdb89105700) at pthread_create.c:333
#9 0x00007fdb8ed0a41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
- 8 worker threads:
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 67, in _wait_for_worker
stderr = worker.stderr.readlines()
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 262, in recognize_hocr
_wait_for_worker(worker)
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 296, in recognize
return f(image, language, details=details, uax29=uax29)
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 407, in process_page
result = self._engine.recognize(pfile, language=self._options.language, details=self._options.details, uax29=self._options.uax29)
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 434, in page_thread
result = self.process_page(page)
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
- There are no backtraces for tesseract processes, because apparently
GDB hangs on them. :-(
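The thread counts quoted in the summary can be checked without GDB by reading the `Threads:` field of `/proc/<pid>/status` on Linux. The following is a rough sketch rather than the examination script used in this thread; the function names are made up for illustration:

```python
import re


def parse_thread_count(status_text):
    """Extract the 'Threads:' value from the text of /proc/<pid>/status."""
    match = re.search(r'^Threads:\s+(\d+)$', status_text, re.MULTILINE)
    if match is None:
        raise ValueError('no Threads: field found')
    return int(match.group(1))


def thread_count(pid):
    """Read the thread count of a live process (Linux only)."""
    with open('/proc/{}/status'.format(pid)) as status_file:
        return parse_thread_count(status_file.read())


def total_threads(per_process_counts):
    """Sum the thread counts over a group of processes."""
    return sum(per_process_counts)


# Figures from the summary: 8 tesseract processes with 4 threads each.
print(total_threads([4] * 8))  # prints 32
```

With the numbers reported above, the tesseract workers alone account for 32 threads, which is more than the core counts of the machines mentioned in this thread.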
Today I installed Tesseract 4 from an Ubuntu PPA (ppa:alex-p/tesseract-ocr, 4.0.0+git3515-9bcfa90c-1ppa1~xenial1). Tesseract itself works normally, and ocrodjvu also works OK with the default "j=1". However, when I specified "j=4", ocrodjvu hangs, and when I interrupt it, I get the following:
I know that there are issues with multi-threading so I used recommendations from
tesseract-ocr/tesseract#898
and from
https://appliedmachinelearning.blog/2018/06/30/performing-ocr-by-running-parallel-instances-of-tesseract-4-0-python/
to set the environment as 'OMP_THREAD_LIMIT=1 tesseract'. However, all my attempts failed, namely: (a) renaming the executable and replacing it with a script, (b) making a script which contains the alias, and finally (c) changing your code to pass this environment variable.
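For reference, approach (a) could look roughly like the wrapper below. It is only a sketch under assumptions: the real binary is assumed to have been renamed to the hypothetical path `/usr/bin/tesseract.real`, and nothing here explains why the attempts described above failed.

```python
#!/usr/bin/env python3
import os
import sys

# Hypothetical location of the renamed real tesseract binary.
REAL_TESSERACT = '/usr/bin/tesseract.real'


def build_exec_args(argv, environ):
    """Return (args, env) for running the real binary with one OpenMP thread."""
    env = dict(environ)
    env['OMP_THREAD_LIMIT'] = '1'
    # Keep the caller's arguments, but point argv[0] at the real binary.
    args = [REAL_TESSERACT] + list(argv[1:])
    return args, env


def main():
    args, env = build_exec_args(sys.argv, os.environ)
    # Replace the wrapper process with the real tesseract.
    os.execve(args[0], args, env)
```

To use it as a wrapper, the file would be installed under the name `tesseract` with a final `main()` call appended. Note that a shell alias, as in approach (b), would not normally take effect here, since aliases are not applied when a program is launched directly instead of through an interactive shell.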
My system info output:
Ocrodjvu version:
In the end, I reverted everything to Tesseract 3, and now it works. This means, for example, that I cannot OCR books in Armenian and Quechua as these languages for some reason are not in Tesseract 3.
Please help.