tesstrain.sh script exits with error #1781
Please try with the latest version of tesseract.
Hello @Shreeshrii, I ran the latest version. The version information is:

P.S.: the
Save your training text as UTF-8 (currently it seems to be in a different encoding, ANSI, which does not display correctly).
I used Notepad++ on Windows. Yes, it displays correctly after changing the encoding to UTF-8. I think the problem is being caused by extra-long lines: I saved the file as UTF-8, split the long lines to a smaller size, and now it seems to be working OK.

Make sure you are using the new version of tesstrain.sh (uninstall the version from tesseract 3).
@stweil This assert seems to be related to the length of the training_text line. Is it possible to have a more descriptive error message, if that is indeed the case?
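Until a more descriptive message exists, a pre-flight check along these lines can flag suspect lines before training. This is a sketch, not part of tesstrain.sh, and the 120-character threshold is an arbitrary example rather than a documented tesseract limit:

```python
# Sketch: warn about overly long lines in a training text before running
# tesstrain.sh. The limit of 120 characters is an illustrative guess,
# not a documented tesseract constraint.
import sys

def long_lines(path, limit=120):
    """Return (line_number, length) pairs for lines exceeding `limit` characters."""
    hits = []
    with open(path, encoding="utf-8") as f:
        for no, line in enumerate(f, start=1):
            length = len(line.rstrip("\n"))
            if length > limit:
                hits.append((no, length))
    return hits

if __name__ == "__main__":
    for no, length in long_lines(sys.argv[1]):
        print(f"{sys.argv[1]}:{no}: line has {length} characters")
```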
The issue #765 seems to be (roughly) related to this one. |
Dear @Shreeshrii, thank you very much for your help! The training worked after rewrapping the corpus (I wrote a short script in Python 3, since it handles UTF-8 documents well; you can find it here: https://gist.github.com/wincentbalin/85707ce703b1e4bc0737ed569fb16bea). I had to rewrap the corpus to 35 characters per line; widths of 10, 20, 30, 40, 50, 60 and 70 characters per line did not work. But that is another issue, I think.

To be honest, I both followed and disregarded your advice about using the training scripts for Tesseract 4 only. As a matter of fact, I need to use Tesseract 3, so I tested the old training scripts first. Nevertheless, I also created a Docker container with Ubuntu Bionic and ran the training script for and with Tesseract 4. It worked just as it did with Tesseract 3.

Hence we can regard the rewrapping of the corpus file as an official workaround for this issue. Shall I edit the wiki pages about training Tesseract 3 and Tesseract 4?
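A corpus rewrapper along these lines can be sketched with Python's standard textwrap module. This is a minimal illustration under the thread's 35-character assumption, not the actual gist linked above:

```python
# Sketch: rewrap a UTF-8 corpus to a fixed line width using Python's
# standard textwrap module. Illustrative only; not the linked gist.
import textwrap

def rewrap(text, width=35):
    """Rewrap `text` so no line exceeds `width` characters.

    break_long_words=False keeps words intact, so a single word longer
    than `width` can still produce an over-long line.
    """
    joined = " ".join(text.split())  # collapse existing line breaks
    return "\n".join(textwrap.wrap(joined, width=width,
                                   break_long_words=False,
                                   break_on_hyphens=False))

if __name__ == "__main__":
    import sys
    sys.stdout.write(rewrap(sys.stdin.read()) + "\n")
```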
> we can regard the rewrapping of a corpus file as an official workaround for this issue.

OK.

> The widths of 10, 20, 30, 40, 50, 60 and 70 characters per line did not work. But this is another issue, I think. I had to rewrap the corpus by 35 characters per line.

I think 35 characters per line depends on the size of the Akkadian characters in the fonts that were used. I don't think that will be the case globally for all languages.

> As a matter of fact, I need to use Tesseract3, thus I tested the old training scripts first.

Sure. My suggestion was geared more towards using tesseract4.

What kind of accuracy are you getting with the akk traineddata with tesseract3?
See https://en.wikipedia.org/wiki/Cuneiform_script. It will be hard to get good accuracy on material that is more than 1900 years old.
Currently the WER is around 10 per cent, but sometimes I got it lower. I think it requires some tinkering. The program I use takes random words from the wordlist, creates texts from them, and saves the text into an image using text2image. Then tesseract is used to recognize the text back, and the WER of the result in comparison to the original text is calculated.
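The WER step of such an evaluation loop can be sketched as a word-level edit distance divided by the reference length. This is a generic sketch, not the evaluation program described above:

```python
# Sketch: word error rate (WER) between a reference text and an OCR
# result, computed as word-level Levenshtein distance divided by the
# number of reference words. Splitting into characters instead of
# words would give CER by the same recurrence.
def wer(reference, hypothesis):
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)
```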
Is this with tesseract 3.05, training for the legacy engine?

I had done a training run for LSTM but didn't test it. I will share it in a day or two; I am traveling now.

If you can make your test images and ground truth available some place, I can check the accuracy too.
See the note above the code section that causes the crash:
> Currently the WER is around 10 per cent, but sometimes I got it lower.

I looked at the LSTM training results that I have. They have a CER of less than 10% and a WER of 15%.
Issue #765 is a duplicate.
@stweil: can you provide the font and text that failed for you?
@zdenop The first post of this issue has the info you want, under Attachments. @wincentbalin had a workaround for the problem too.
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --log-file=./valgrind-out.txt /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.opjpN7f94T --fonts_dir=../.fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=-1 --outputbase=/tmp/tmp.tH84CXo6fq/akk/akk.CuneiformOB.exp-1 --max_pages=0 --font=CuneiformOB --text=./langdata/akk/corpus-12pt.txt
Memory leaks after a fatal assertion are normal: the program terminates immediately without executing destructors or cleaning up memory. |
It also crashes with pango 1.44.6.
The same problem occurred to me.
Please see https://github.com/Shreeshrii/tesstrain-akk, which has the LSTM training input, training steps, and resulting traineddata files. You can change the training text and fonts to customize and further fine-tune the models.
I have a question: when I run sh generate_training_data.sh
Short description

I am trying to train Tesseract on the Akkadian language. The language-specific.sh script was modified accordingly. When converting the training text to TIFF images, the text2image program crashes.

Environment

The environment was created using Vagrant. The commands are started on the command line without a GUI environment. Running tesseract -v produces the following output:

Current Behavior:

When running tesstrain.sh with these commands, text2image crashes on every font with this message:

As a result, no box files are generated, so tesstrain.sh exits with these messages:

Expected Behavior:

tesstrain.sh should create the box files and proceed with training.

Attachments:

I attached all files used: akktrain.zip. The fonts are hosted here, but for the sake of completeness the .ttf files are included in the archive; they should be moved to /usr/share/fonts.