
tesstrain.sh script exits with error #1781

Open

wincentbalin opened this issue Jul 16, 2018 · 25 comments
@wincentbalin

wincentbalin commented Jul 16, 2018

Short description

I am trying to train Tesseract on the Akkadian language. The language-specific.sh script was modified accordingly. When converting the training text to TIFF images, the text2image program crashes.

Environment

  • Tesseract Version: 3.04.01
  • Commit Number: the standard package in Ubuntu, package version 3.04.01-4, commit unknown
  • Platform: Linux ubuntu-xenial 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

The environment was created using Vagrant. The commands are started on command line without GUI environment.

Running tesseract -v produces the following output:

tesseract 3.04.01
 leptonica-1.73
  libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

Current Behavior:

When running tesstrain.sh with this command

./tesstrain.sh --lang akk --training_text corpus-12pt.txt --tessdata_dir /usr/share/tesseract-ocr/tessdata --langdata_dir ../langdata --fonts_dir /usr/share/fonts --fontlist "CuneiformNAOutline Medium" "CuneiformOB" --output_dir .

text2image crashes on every font with this message:

cluster_text.size() == start_byte_to_box.size():Error:Assert failed:in file stringrenderer.cpp, line 541

As a result, no box files are generated, so tesstrain.sh exits with these messages:

ERROR: /tmp/tmp.XSb02nt10d/akk/akk.CuneiformOB.exp0.box does not exist or is not readable
ERROR: /tmp/tmp.XSb02nt10d/akk/akk.CuneiformNAOutline_Medium.exp0.box does not exist or is not readable

Expected Behavior:

tesstrain.sh should create the box files and proceed with training.

Attachments:

I attached all files used: akktrain.zip.

The fonts are hosted here, but for the sake of completeness the .ttf files are included in the archive; they should be moved to /usr/share/fonts.

@Shreeshrii
Collaborator

Please try with the latest version of Tesseract.

@wincentbalin
Author

Hello @Shreeshrii,

I ran tesstrain.sh with the same options under Ubuntu Bionic (in a Docker container) and got the same results, as well as the attached core dump.

The version information is

tesseract 4.0.0-beta.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found SSE

P.S.: the language-specific.sh script is attached too.

@Shreeshrii
Collaborator

Save your training text as UTF-8 (currently it seems to be in a different encoding, ANSI, which does not display correctly).
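
For illustration, a minimal re-encoding sketch in Python (the source encoding used here is only an assumption and has to be replaced with the file's actual encoding; this is not a script from this thread):

# to_utf8.py - illustrative sketch only: re-save a training text as UTF-8.
# SOURCE_ENCODING is an assumption; replace it with the file's real encoding.
import sys

SOURCE_ENCODING = "cp1252"

with open(sys.argv[1], encoding=SOURCE_ENCODING) as f:
    text = f.read()

with open(sys.argv[2], "w", encoding="utf-8") as f:
    f.write(text)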

@wincentbalin
Author

file corpus-12pt.txt reports UTF-8 encoding. Which software do you use to look at the text?

If you choose the right encoding, this image should appear:

[screenshot of the correctly rendered training text]

@Shreeshrii
Collaborator

I used Notepad++ on Windows. Yes, it displays correctly after changing the encoding to UTF-8.

I think the problem is caused by extra-long lines. I saved the file as UTF-8 and split the long lines into smaller ones, and it seems to be working OK.

Make sure you are using the new version of tesstrain.sh (uninstall the version from tesseract 3).

@Shreeshrii
Collaborator

@stweil This assert seems to be related to the length of the training_text lines. Is it possible to have a more descriptive error message, if that is indeed the case?

cluster_text.size() == start_byte_to_box.size():Error:Assert failed:in file stringrenderer.cpp, line 541

@wincentbalin
Author

Issue #765 seems to be (roughly) related to this one.

@wincentbalin
Author

wincentbalin commented Aug 1, 2018

Dear @Shreeshrii ,

thank you very much for your help! The training worked beautifully after rewrapping the corpus (I wrote a short script in Python 3, as it handles UTF-8 documents well; you can find it here).

I had to rewrap the corpus to 35 characters per line. Widths of 10, 20, 30, 40, 50, 60 and 70 characters per line did not work, but that is another issue, I think.
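
For illustration, a minimal sketch of such a rewrapping step (not the script linked above; the fixed wrap width and file handling are assumptions):

# rewrap.py - illustrative sketch, not the actual script linked above.
# Rewraps a UTF-8 training text to at most WIDTH characters per line.
import sys
import textwrap

WIDTH = 35  # the width that worked in this case

with open(sys.argv[1], encoding="utf-8") as f:
    words = f.read().split()

with open(sys.argv[2], "w", encoding="utf-8") as f:
    f.write("\n".join(textwrap.wrap(" ".join(words), width=WIDTH)) + "\n")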

To be honest, I both followed and disregarded your advice about using only the Tesseract 4 training scripts. As a matter of fact, I need to use Tesseract 3, so I tested the old training scripts first. Nevertheless, I also created a Docker container with Ubuntu Bionic and ran the training script for and with Tesseract 4. It worked just as well as with Tesseract 3.

Hence we can regard the rewrapping of a corpus file as an official workaround for this issue. Shall I edit the wiki pages about training Tesseract3 and Tesseract4?

@Shreeshrii
Collaborator

Shreeshrii commented Aug 2, 2018 via email

@amitdo
Collaborator

amitdo commented Aug 2, 2018

https://en.wikipedia.org/wiki/Cuneiform_script

It will be hard to get good accuracy on material that is more than 1,900 years old.

@wincentbalin
Author

wincentbalin commented Sep 19, 2018

What kind of accuracy are you getting with the akk traineddata with tesseract3?

Currently the WER is around 10 per cent, but sometimes I got it lower. I think it requires some tinkering.

The program I use takes random words from the wordlist, creates texts from them, and renders the text to an image using text2image. Then tesseract is used to recognize the text, and the WER of the result compared with the original text is calculated.
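
As an illustrative sketch (not the actual evaluation program), the WER can be computed as a word-level edit distance between the original and the recognized text:

# wer.py - illustrative sketch of the word error rate calculation described
# above (not the actual evaluation program). WER is the word-level
# Levenshtein distance divided by the number of reference words.
import sys

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

if __name__ == "__main__":
    original = open(sys.argv[1], encoding="utf-8").read()
    recognized = open(sys.argv[2], encoding="utf-8").read()
    print("WER: {:.2%}".format(wer(original, recognized)))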

@Shreeshrii
Collaborator

Shreeshrii commented Sep 19, 2018 via email

@zdenop
Contributor

zdenop commented Oct 5, 2018

See the note above the code section that causes the crash:

  // There is a subtle bug in the cluster text reported by the PangoLayoutIter
  // on ligatured characters (eg. The word "Lam-Aliph" in arabic). To work
  // around this, we use text reported using the PangoGlyphIter which is
  // accurate.
  // TODO(ranjith): Revisit whether this is still needed in newer versions of
  // pango.

@Shreeshrii
Collaborator

Shreeshrii commented Oct 6, 2018 via email

@stweil
Member

stweil commented Oct 13, 2018

Issue #765 is a duplicate.

@zdenop
Contributor

zdenop commented Oct 15, 2018

@stweil: can you provide the font and text that failed for you?

@Shreeshrii
Collaborator

@zdenop The first post in this issue has the info you want.

Attachments:

I attached all files used: akktrain.zip.

The fonts are hosted here, but for the sake of completeness the .ttf files are included in the archive; they should be moved to /usr/share/fonts.

@wincentbalin had a workaround for the problem too:

Hence we can regard the rewrapping of a corpus file as an official workaround for this issue. Shall I edit the wiki pages about training Tesseract3 and Tesseract4?

@Shreeshrii
Collaborator

~/tessdata_akk$ valgrind --tool=memcheck /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.opjpN7f94T --fonts_dir=../.fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=-1 --outputbase=/tmp/tmp.tH84CXo6fq/akk/akk.CuneiformOB.exp-1 --max_pages=0 --font=CuneiformOB --text=./langdata/akk/corpus-12pt.txt
==3121== Memcheck, a memory error detector
==3121== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==3121== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==3121== Command: /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.opjpN7f94T --fonts_dir=../.fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=-1 --outputbase=/tmp/tmp.tH84CXo6fq/akk/akk.CuneiformOB.exp-1 --max_pages=0 --font=CuneiformOB --text=./langdata/akk/corpus-12pt.txt
==3121==
Rendered page 0 to file /tmp/tmp.tH84CXo6fq/akk/akk.CuneiformOB.exp-1.tif
Rendered page 1 to file /tmp/tmp.tH84CXo6fq/akk/akk.CuneiformOB.exp-1.tif
Rendered page 2 to file /tmp/tmp.tH84CXo6fq/akk/akk.CuneiformOB.exp-1.tif
Rendered page 3 to file /tmp/tmp.tH84CXo6fq/akk/akk.CuneiformOB.exp-1.tif
Rendered page 4 to file /tmp/tmp.tH84CXo6fq/akk/akk.CuneiformOB.exp-1.tif
cluster_text.size() == start_byte_to_box.size():Error:Assert failed:in file ../../../src/training/stringrenderer.cpp, line 546
==3121==
==3121== Process terminating with default action of signal 5 (SIGTRAP)
==3121==    at 0x101D7844: ERRCODE::error(char const*, TessErrorLogCode, char const*, ...) const (errcode.cpp:84)
==3121==    by 0x10027487: tesseract::StringRenderer::ComputeClusterBoxes() (stringrenderer.cpp:546)
==3121==    by 0x10028617: tesseract::StringRenderer::RenderToImage(char const*, int, Pix**) (stringrenderer.cpp:806)
==3121==    by 0x10015E1B: Main() (text2image.cpp:632)
==3121==    by 0x1000A0FB: main (text2image.cpp:736)
==3121==
==3121== HEAP SUMMARY:
==3121==     in use at exit: 141,636,503 bytes in 11,076 blocks
==3121==   total heap usage: 258,218 allocs, 247,142 frees, 1,652,783,976 bytes allocated
==3121==
==3121== LEAK SUMMARY:
==3121==    definitely lost: 7,168 bytes in 11 blocks
==3121==    indirectly lost: 12,207 bytes in 474 blocks
==3121==      possibly lost: 1,514 bytes in 21 blocks
==3121==    still reachable: 141,613,430 bytes in 10,559 blocks
==3121==                       of which reachable via heuristic:
==3121==                         length64           : 80 bytes in 2 blocks
==3121==                         newarray           : 1,568 bytes in 18 blocks
==3121==         suppressed: 0 bytes in 0 blocks
==3121== Rerun with --leak-check=full to see details of leaked memory
==3121==
==3121== For counts of detected and suppressed errors, rerun with: -v
==3121== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Trace/breakpoint trap

@Shreeshrii
Collaborator

valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --log-file=./valgrind-out.txt /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.opjpN7f94T --fonts_dir=../.fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=-1 --outputbase=/tmp/tmp.tH84CXo6fq/akk/akk.CuneiformOB.exp-1 --max_pages=0 --font=CuneiformOB --text=./langdata/akk/corpus-12pt.txt

valgrind-out.txt

@stweil
Member

stweil commented Mar 31, 2019

Memory leaks after a fatal assertion are normal: the program terminates immediately without executing destructors or cleaning up memory.

@zdenop
Contributor

zdenop commented Oct 21, 2019

It also crashes with Pango 1.44.6.
Does anybody have a Latin-script-based text for reproducing this crash?

@eighttails
Contributor

The same problem occurred for me.
My workaround is to remove empty lines from the training_text.
It worked for me.
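
For illustration, a minimal sketch of that cleanup step (assuming a plain UTF-8 training text; not a script from this thread):

# strip_empty_lines.py - illustrative sketch of the workaround above:
# drop blank lines from a UTF-8 training text before running tesstrain.sh.
import sys

with open(sys.argv[1], encoding="utf-8") as f:
    lines = [line for line in f if line.strip()]

with open(sys.argv[2], "w", encoding="utf-8") as f:
    f.writelines(lines)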

@Shreeshrii
Collaborator

Please see https://github.com/Shreeshrii/tesstrain-akk, which has the LSTM training input, training steps, and resulting traineddata files.

You can change the training text and fonts to customize and further finetune the models.

@haddadymh

I have a question. When I run sh generate_training_data.sh, I get this error: 2: tesstrain.sh: not found
I went to tesseract/src/tesstrain.sh, but it does not work.
