Would like to help for Burmese/Myanmar language training? #13
Comments
What issue is there with Burmese/Myanmar language? |
We have two types of Unicode fonts: non-standard and standard. When I checked the langdata files for Burmese, most words were incorrect. I guess the data was generated from a mix of non-standard and standard Unicode content. When I try to scan an image with Burmese characters written in the Padauk font, the output is not readable. |
I think the real issue is not only about using standard or non-standard Unicode, but also the wrong method of extracting data from the source. I mean the source data need to be segmented correctly to get a correct single word. |
Please add some good sources of standard unicode fonts and sample texts and word frequency lists to #46 |
https://my.wikipedia.org/ |
@zdenop The issue is with the training data itself. The person who prepared the data does not know the Myanmar language. The majority of the training data has misspellings and is mixed with the hacked version of Myanmar Unicode, as @herzcthu said. You can imagine rice and spaghetti mixed in a bowl. Also, it is not segmented properly, as @minthanthtoo pointed out. Any suggestions on how to prepare training data? |
Please see Ray's comment at about how the training data is being built for the 4.0 LSTM training. I don't think they are using the training_text file in langdata. |
Thanks @Shreeshrii. Would ./tesstrain.sh automatically create .tif/box pairs from the langdata directory for 4.0 LSTM training? |
Yes. tesstrain.sh creates tif/box pairs that can be used for LSTM training. Please see the wiki pages for details. You need a large amount of training data for good training. See Ray's comments about the LSTM training process. |
|
Copied from #46: @herzcthu commented: Myanmar wordlists. Is this a good wordlist in standard Unicode for Myanmar? |
These are the most common words in Myanmar, but it is not a complete list. The definition of a word is itself tricky in the Myanmar language because there are many ways syllables can be combined to form a word. I am not sure how Tesseract training works, but it may be better to train on syllables instead of words (clusters of syllables). That is, each syllable must first be detected and then classified. Classifying an entire word may be too difficult, though I may not be fully aware of Tesseract's capabilities. |
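To make the syllable idea above concrete, here is a minimal Python sketch of rule-based syllable splitting. It is not what Tesseract does internally. It assumes a simplified rule: a consonant (U+1000 to U+1021) starts a new syllable unless it is stacked (preceded by virama U+1039) or killed (followed by asat U+103A or virama U+1039). The function name `split_syllables` is hypothetical, and independent vowels, digits, and punctuation are not handled.

```python
import re

# Simplifying assumption: only the basic consonants U+1000..U+1021 can
# start a syllable; independent vowels, digits and punctuation are ignored.
_CONSONANT = "[\u1000-\u1021]"

# A consonant starts a syllable unless it is preceded by virama (U+1039,
# i.e. it is the lower half of a stack) or followed by asat (U+103A) or
# virama (U+1039), in which case it closes the previous syllable.
_SYLLABLE_START = re.compile("(?<!\u1039)" + _CONSONANT + "(?![\u103A\u1039])")

_SEPARATOR = "\x00"  # private marker, assumed not to occur in the input


def split_syllables(text: str) -> list[str]:
    """Split standard-Unicode Myanmar text into orthographic syllables."""
    marked = _SYLLABLE_START.sub(lambda m: _SEPARATOR + m.group(0), text)
    pieces = (piece.strip() for piece in marked.split(_SEPARATOR))
    return [piece for piece in pieces if piece]
```

Text typed for non-standard fonts will usually produce odd fragments with a rule like this, which is one rough way to spot it, but that is only a heuristic.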
Manually Constructed Context-Free Grammar For Myanmar Syllable Structure |
Representing Myanmar in Unicode Creating and Supporting OpenType Fonts for Myanmar Script Myanmar script notes |
I used a few words from the Burmese wordlist and the landing page of Wikipedia as a small training sample to test Myanmar. Both are supposed to be in standard Unicode for Myanmar. The training text and generated unicharset are attached. I got a number of errors while building the unicharset. Maybe the Myanmar.unicharset in langdata needs to be updated?
mya.Myanmar_Text.exp0.txt |
@herzcthu @nengine @minthanthtoo Please take a look at https://github.com/tesseract-ocr/langdata/blob/master/Myanmar.unicharset Tesseract does train on syllables (for Indic languages) AFAIK. Please see https://github.com/tesseract-ocr/langdata/files/885327/mya.unicharset.txt generated from the two training files - all listed in the message above. |
@theraysmith do zwj and zwnj also have to be part of unicharset? also see http://archive.mmgeeks.com/index.php?p=/discussion/379/zwnj-and-zwj |
Syllabification, Normalization and Lexicographic Ordering http://ir.nagaokaut.ac.jp/dspace/bitstream/10649/729/1/k709.pdf |
I do not see a consistent pattern.
Myanmar.unicharset clearly does not include the syllables shown in the warnings, only consonants, vowels, etc. Is it supposed to include all syllable combinations? How does it work for Telugu, for example? |
I don't think Myanmar.unicharset is supposed to include all syllable combinations, but it should have all vowels, consonants, and vowel signs. I see three ranges for Myanmar: the first seems to be in the unicharset, part of the second, and none of the third. Can you please check whether all of these are required? http://www.alanwood.net/unicode/myanmar.html |
There are 8 major ethnic groups in Myanmar, so I believe extended A and B are added for that reason. So, for completeness I think it should be added, but you would rarely see them on the web. Unicode range 1000 - 104F is already good. |
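For checking how much of a training text falls into each of the three ranges discussed here, a small Python sketch follows. The block boundaries are the Unicode block ranges (Myanmar U+1000 to U+109F, Extended-A U+AA60 to U+AA7F, Extended-B U+A9E0 to U+A9FF). The names `MYANMAR_BLOCKS`, `block_of`, and `count_blocks` are made up for illustration.

```python
from collections import Counter

# Unicode block ranges for Myanmar script (inclusive code point bounds).
MYANMAR_BLOCKS = {
    "Myanmar": (0x1000, 0x109F),
    "Myanmar Extended-A": (0xAA60, 0xAA7F),
    "Myanmar Extended-B": (0xA9E0, 0xA9FF),
}


def block_of(ch):
    """Return the Myanmar block name for a character, or None."""
    cp = ord(ch)
    for name, (low, high) in MYANMAR_BLOCKS.items():
        if low <= cp <= high:
            return name
    return None


def count_blocks(text):
    """Count how many characters of the text fall into each Myanmar block."""
    counts = Counter()
    for ch in text:
        name = block_of(ch)
        if name is not None:
            counts[name] += 1
    return counts
```

Running `count_blocks` over a training text gives a quick indication of whether the extended ranges are represented at all.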
>you would rarely see them on the web.
What about in books / documents that need to be OCRed?
|
Yes, extended A and B should also be added for completeness, as I said, but as for training samples, they are almost nonexistent on the web.
|
Please take a look at this reference:
http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf
Table 16-3.
The text says "Characters occur in the relative order shown in Table 16-3"
which I do not believe to be completely correct.
Part of the problem is that a lot of the characters are not even in this
table!
Although it is possible to guess which group the extensions belong to, I'm
not convinced I have it correct.
I have some code that implements this table plus my guesses to add the
extensions, but it isn't ready for committing to github just yet.
The problem is that I need to exclude the incorrectly formatted text (that
uses the non-standard fonts), but be sure that no correctly formatted text
is dropped.
--
Ray.
|
I've checked characters in Myanmar.unicharset file. All characters seem correct. |
Please see tesseract-ocr/tesseract#995 (comment)
|
Please see code at:
https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.cpp
For instance, there is a big table in the unicode standard for Myanmar
(http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf), but it doesn't
cover any of the extension Myanmar characters, and isn't explicit about
whether the table represents a specific valid order or not. The existence
of a lot of legacy Myanmar text on the web that is designed for
non-compliant fonts doesn't help make it easier to determine whether the
filter is correct.
--
Ray.
|
@herzcthu @nengine @minthanthtoo Please test with the new traineddata in tessdata/best directory and provide feedback. |
I'm testing the new traineddata. It has improved a lot: almost 98% correct. I will test in more detail and provide detailed feedback later. |
I would like to test it but am not sure how. I have Windows 10 installed. Could you please point me to the documentation? |
You can use new windows binaries for 4.0 linked from
https://github.com/UB-Mannheim/tesseract/wiki
ShreeDevi
|
It would be helpful if you could point out any pattern that you notice in
the errors.
One that I notice is that words are getting dropped (missing) in the OCRed
text.
ShreeDevi
On Thu, Aug 10, 2017 at 7:13 PM, Sithu Thwin ***@***.***> wrote:
[image: both]
<https://user-images.githubusercontent.com/3231665/29173007-d4bc484e-7e07-11e7-9036-0462da3ac580.png>
I've attached first screenshot I've tested.
Upper part is image I've tested and lower part is OCR converted text.
Words between two adjacent same color points are missing or incorrect.
If you need a code point comparison between the source image and the output
text, I can provide it later.
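A code point comparison like the one offered here could be produced with a short Python sketch along these lines. It assumes the ground-truth text for the image has been typed separately. The function names `code_points` and `diff_code_points` are only illustrative.

```python
import difflib


def code_points(text):
    """Render each character as a U+XXXX label."""
    return [f"U+{ord(ch):04X}" for ch in text]


def diff_code_points(expected, actual):
    """List the non-matching code point runs between two strings.

    Each entry is (tag, expected_run, actual_run), where tag is one of
    'replace', 'delete' or 'insert' as reported by difflib.
    """
    a, b = code_points(expected), code_points(actual)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Working at the code point level helps because two Myanmar strings can look identical on screen while differing in character order or in the characters used.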
|
Hello, I would like to share what I found in the Myanmar training data. I used tesseract version 3.04. I think it still needs to improve a lot. First, my test results. I tested with Myanmar-language images: ocr_sample_1.png and ocr_sample_2.png. The test result for ocr_sample_1 is below. I marked the differences with red points. The result for the second image, ocr_sample_2, is below. Its result is completely wrong. It means "how are you" in English. Then I downloaded the Myanmar langdata from GitHub (https://github.com/tesseract-ocr/langdata). I found 7 files. After checking those files, I found that most of the contents are incorrect or misspelled. I would like to show one or two incorrect entries from one of those files, mya.training_text. #first arrow head line #Second arrow head line #third arrow head line So I would like to contribute corrections for these 7 files, and I would like to ask the following questions. Could you please explain those files to me? Thank you for your contribution. |
Please also try tesseract 4.0alpha which might have improved results.
ShreeDevi
On Thu, Oct 12, 2017 at 8:52 PM, kyawswar ***@***.***> wrote:
Hello, I would like to share what I found in the Myanmar training data.
I used tesseract version 3.04.
I think it still needs to improve a lot. First, my test results. I tested
with Myanmar-language images: ocr_sample_1.png and ocr_sample_2.png.
*The test result for ocr_sample_1 is below. I marked the differences with
red points.*
image_file
[image: ocr_sample_1]
<https://user-images.githubusercontent.com/4832700/31503684-54c914c6-af96-11e7-8cb5-ebfbe85dc1c4.jpg>
Result
[image: screenshot from ocr_sample_1]
<https://user-images.githubusercontent.com/4832700/31503452-b8ce91c2-af95-11e7-96fa-256a19394daf.png>
The result for the second image, ocr_sample_2, is below. Its result is
completely wrong. It means "how are you" in English.
Image_file
[image: ocr_sample_2]
<https://user-images.githubusercontent.com/4832700/31503742-7748882e-af96-11e7-82e1-d4189ec553d0.png>
Result
[image: screenshot from ocr_sample_2]
<https://user-images.githubusercontent.com/4832700/31503479-c62a7e26-af95-11e7-93d5-442ddb1fd637.png>
Then I downloaded the Myanmar langdata from GitHub
(https://github.com/tesseract-ocr/langdata). I found 7 files. After checking
those files, I found that most of the contents are incorrect or misspelled.
I would like to show one or two incorrect entries from one of those files,
mya.training_text.
For example,
[image: screenshot from 2017-10-12 21-01-05]
<https://user-images.githubusercontent.com/4832700/31503553-f608e970-af95-11e7-910a-af7ceeb2852d.png>
#first arrow head line
It should be "ရုတ်ရုတ်သဲသဲ".
#Second arrow head line
It should be "သစ်တောများကုန်".
#third arrow head line
should be "ပညာရေးစနစ်". so on.
So I would like to contribute corrections for these 7 files,
and I would like to ask the following questions.
-Is the exported mya.traineddata based on those files?
-How can I know which file is used for what? E.g., what is the mya.punc file?
-Where did you get that data?
-Is there any format or rule for putting data into those files?
Could you please explain those files to me?
I am also willing to help improve Myanmar language support in OCR.
Thank you for your contribution.
|
langdata repo has not been updated for 4.0x.
You can extract the wordlist from the tessdata_best traineddata file.
Use the commands (please look up the syntax)
combine_tessdata -u ....
dawg2wordlist ...
to see the version of files used for 4.0
You can compare this wordlist to the wordlist in langdata for spelling etc.
ShreeDevi
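Once both wordlists exist as plain text files (one word per line is assumed here), the comparison suggested above could be sketched in Python as below. It does not call `combine_tessdata` or `dawg2wordlist`, it only compares their text output. The function names are hypothetical.

```python
def load_words(text):
    """Collect the non-empty, whitespace-stripped lines of a wordlist."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def compare_wordlists(langdata_text, extracted_text):
    """Compare the langdata wordlist with the one extracted from traineddata."""
    old = load_words(langdata_text)
    new = load_words(extracted_text)
    return {
        "only_in_langdata": sorted(old - new),
        "only_in_extracted": sorted(new - old),
        "common": sorted(old & new),
    }
```

The `only_in_extracted` entries would be the first candidates to check by eye for misspellings.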
On Fri, Oct 13, 2017 at 8:39 AM, kyawswar ***@***.***> wrote:
Yes, I used tesseract 4.0.0-alpha.20170804. I tested with the following
image files. The test results are below.
ocr_sample_1.png
[image: ocr_sample_1]
<https://user-images.githubusercontent.com/4832700/31528553-fff13d48-aff9-11e7-9fca-987a0e68c90c.png>
Result
%%%%%
©05080×05
5082:40:82! 0=2405005$2³050
ocr_sample_2.png
[image: ocr_sample_2]
<https://user-images.githubusercontent.com/4832700/31528592-396454f2-affa-11e7-9139-f6954eba8ef4.png>
Result
ပဵနႚတ္ဂဵကာဧ်တ္အီးလာသီူး
Thanks.
|
Yes, it works perfectly with tessdata_best. But after checking the wordlist, I found many misspellings and incorrect entries. I have pointed out the most noticeable misspellings; please see the following attachment. Most of the data from the following link is not included in this tessdata_best wordlist. I would like to know where you got that data. How can I contribute corrections to the incorrect data? |
The Myanmar traineddata for 4.0 does not recognize the following characters. I unpacked it and checked the unicharset. Also, unicharset_extractor does not produce these characters. |
I'm just reading through this thread and have a few pointers, some maybe a little repetitive - hope this helps!
|
Thank you for the detailed notes. Please review the source training data in langdata_lstm repo also. |
Please test the traineddata at https://github.com/Shreeshrii/tessdata_shreetest/blob/master/mya430000.traineddata and let me know whether it is an improvement over the existing traineddata files. |
Hi Shreeshrii, BTW, I'm trying to do the training myself. I'm generating lots of box and tif files for only one font. Is it a good idea to have many files for a single font, or should I make only one box file and one tif file? Thanks and regards, |
The amount of training data you need depends on the type of training that you are planning to do, e.g. from scratch, replace a layer, plus-minus, etc. I think multiple files for a single font may be OK. How are you generating these files? You should try to keep approximately the same number of lines in each file so that all samples are used uniformly for training. |
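As a rough illustration of keeping the same number of lines in each file, the Python sketch below splits a list of text lines into a given number of chunks whose sizes differ by at most one. Each chunk could then be written to its own training text file before image generation. The function name `split_evenly` is made up for this example.

```python
def split_evenly(lines, n_files):
    """Split lines into n_files consecutive chunks of near-equal size."""
    if n_files <= 0:
        raise ValueError("n_files must be positive")
    base, extra = divmod(len(lines), n_files)
    chunks = []
    start = 0
    for index in range(n_files):
        # The first `extra` chunks get one additional line each.
        size = base + (1 if index < extra else 0)
        chunks.append(lines[start:start + size])
        start += size
    return chunks
```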
I had used Which one is a more representative font out of these for training and testing? |
I took one paragraph from Wikipedia and made screenshots with all the fonts you mentioned. I'm creating box and tif files using the text2image binary from training. |
@herzcthu Thanks for the info about the new font. If you do 'replace layer' type of training, you can get by with fewer lines. Keep posting about your progress with training. |
I'm stuck at unicharset extractor. Here is some sample which is not usual
|
Use |
I tried norm_mode 2 and 3. Both have missing vowels and medials. |
Hi, Any further progress? |
Please mention me or let me know if you need help checking or fixing Burmese datasets; I would gladly be part of it. I have some experience in Python and TypeScript. Cheers to everyone helping to improve Myanmar language support in machines! |
I am new to tesseract. Recently I tried the Myanmar language. It is not perfect yet. I searched for how the data set was trained and found this thread. However, it seems to be very old, with no recent updates. I am not familiar with how to train the data sets, but I know the language. Is there anything we can do to improve the Myanmar language? I also wish to understand how the training is done. |
The training requires training data = lots of line images (*.png) with corresponding transcription (*.gt.txt). The original training used generated (artificial) line images, but meanwhile newer trainings for other scripts are often based on real line images from scanned books or newspapers. It's also possible to use a mix of artificial and real line images. You need as many lines as possible, and the text must cover all relevant glyphs (characters). With enough lines for training, you can use tesstrain for the training. Examples of training data for Latin script: https://code.bib.uni-mannheim.de/ocr-d/GT4HistOCR/src/branch/master/dta19/1827-heine_lieder. Examples of training steps: https://github.com/UB-Mannheim/tesstrain/wiki/. Make sure to document your training process and to publish your training data if you want to submit the result for inclusion in the tesseract-ocr repositories. |
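For anyone preparing such line image and transcription pairs, a small Python sketch can check a directory for unpaired files before training. It assumes the naming convention of `name.png` next to `name.gt.txt` in one flat directory. The function name `find_unpaired` is hypothetical.

```python
from pathlib import Path


def find_unpaired(directory):
    """Return (images without transcription, transcriptions without image).

    Files are matched by base name: 'line1.png' pairs with 'line1.gt.txt'.
    """
    root = Path(directory)
    images = {p.name[: -len(".png")] for p in root.glob("*.png")}
    texts = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

Both lists being empty means every line image has a transcription and vice versa.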
Thank you for providing the steps related to training. I think I will have
to try a lot to understand the steps.
For the moment, I am more focused on providing the training data (.png and
.gt.txt). I know that even that will take a lot of pairs. However, I am
willing to do more on that if there is anyone who can use them for
training. I believe that Myanmar script (as images) is not very complicated
(unlike the German script example). Myanmar words generally have the same
shape, except that the size might increase or decrease proportionately. If
the basic ones can be identified, the results will improve. That is what I
think.
Apologies if what I said is too simple. What I meant is that Myanmar
letter shapes are quite consistent and different styles are rarely used,
and I am willing to help with (.png and .gt.txt) if I am given more
detailed requirements for providing these.
thanks and best regards,
…On Fri, 21 Jun 2024 at 17:57, Stefan Weil ***@***.***> wrote:
The training requires training data = lots of line images (*.png) with
corresponding transcription (*.gt.txt). The original training used
generated (artificial) line images, but meanwhile newer trainings for other
scripts are often based on real line images from scanned books or
newspapers. It's also possible to use a mix of artificial and real line
images. You need as many lines as possible, and the text must cover all
relevant glyphs (characters).
With enough lines for training, you can use tesstrain
<https://github.com/tesseract-ocr/tesstrain/> for the training.
Examples of training data for Latin script:
https://code.bib.uni-mannheim.de/ocr-d/GT4HistOCR/src/branch/master/dta19/1827-heine_lieder
.
Examples of training steps: https://github.com/UB-Mannheim/tesstrain/wiki/
.
|
Hello,
I would like to help. I've already cloned all the repositories. How do I start?