Would like to help for Burmese/Myanmar language training? #13

herzcthu · 2015-07-03T20:17:33Z

Hello,
I would like to help. I've already cloned all the repositories. How do I start?

zdenop · 2015-07-03T21:52:37Z

What issue is there with Burmese/Myanmar language?

herzcthu · 2015-07-04T07:32:32Z

We have two kinds of Unicode font: non-standard and standard. When I check the langdata files for Burmese, most words are incorrect. I guess the data was generated from a mix of non-standard and standard Unicode content. When I try to scan an image of Burmese text set in the Padauk font, the output is not readable.
I would like to know the method you used to generate the Burmese training files. Where did you get the original data? I can check whether it is standard Unicode content or not.

minthanthtoo · 2015-08-17T09:18:34Z

I think the real issue is not only standard versus non-standard Unicode, but also the wrong method of extracting data from the source. The source data needs to be segmented correctly to yield correct single words.
Myanmar language users do not much care about adding a space character between words. If you assume that everything between two spaces is a word, two or more words get treated as one. I found that most word lists here, especially the bi-grams, hold overly long Myanmar phrases. That makes the wordlists unusable and the results of applying them totally unpredictable.
So I think you need to extract data from a source using a dictionary-lookup approach. Of course, you need to build your own wordlist manually or use those made by others.
Also, Myanmar is a syllable-based language: one or more Myanmar letters combine to form a syllable, and one or more syllables join to form a word. So it is advisable to detect syllables first, which can considerably speed up dictionary lookup.
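
The syllable-then-dictionary idea above can be sketched roughly as follows. This is a minimal illustration, not code from Tesseract or langdata: the break rule is a simplified assumption (break before a consonant in U+1000-U+1021 unless it is stacked or syllable-final), and `split_syllables`, `segment_words`, and the tiny lexicon in the usage note are hypothetical names and data chosen for the example.

```python
import re

# Simplified assumption: a syllable starts at a consonant (U+1000-U+1021)
# that is not stacked (not preceded by virama U+1039) and is not a
# syllable-final consonant (not followed by asat U+103A or virama U+1039,
# optionally after dot-below U+1037).
SYLLABLE_START = re.compile("(?<!\u1039)[\u1000-\u1021](?!\u1037?[\u103a\u1039])")


def split_syllables(run):
    """Split a run of Myanmar text (no spaces) into syllables."""
    if not run:
        return []
    starts = [m.start() for m in SYLLABLE_START.finditer(run)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any leading material as its own chunk
    ends = starts[1:] + [len(run)]
    return [run[a:b] for a, b in zip(starts, ends)]


def segment_words(run, lexicon, max_syllables=6):
    """Greedy longest-match word segmentation over syllables.

    Unknown material falls back to single syllables.
    """
    syllables = split_syllables(run)
    words, i = [], 0
    while i < len(syllables):
        longest = min(max_syllables, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words
```

For example, with a two-entry lexicon, `segment_words("\u1019\u103c\u1014\u103a\u1019\u102c\u1005\u102c", {"\u1019\u103c\u1014\u103a\u1019\u102c", "\u1005\u102c"})` looks up candidates by whole syllables rather than by every character offset, which is where the lookup saving comes from. A real segmenter would need a much fuller rule set and lexicon.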

Shreeshrii · 2017-02-04T13:25:51Z

@herzcthu @minthanthtoo

Please add some good sources of standard unicode fonts and sample texts and word frequency lists to #46

herzcthu · 2017-02-11T13:16:29Z

https://my.wikipedia.org/
All contents on wikipedia are in standard unicode font.

nengine · 2017-02-13T22:46:52Z

@zdenop The issue is with the training data itself. The person who prepared the data does not know the Myanmar language. The majority of the training data has misspellings and is mixed with a hacked version of Myanmar Unicode, as @herzcthu said. You can imagine rice and spaghetti mixed in a bowl. Also, it is not segmented properly, as @minthanthtoo pointed out. Any suggestions on how to prepare training data?

Shreeshrii · 2017-02-14T04:53:49Z

Please see Ray's comment at
tesseract-ocr/tesseract#654 (comment)

about how the training data is being built for the 4.0 LSTM training. I don't think they are using the training_text file in langdata.

nengine · 2017-03-06T01:51:15Z

Thanks @Shreeshrii. Would ./tesstrain.sh automatically create .tif/box pairs from the langdata directory for 4.0 LSTM training?

Shreeshrii · 2017-03-10T09:43:10Z

Yes. tesstrain.sh creates tif/box pairs that can be used for LSTM training. Please see the wiki pages for details. You need a large amount of training data for good training. See Ray's comments about the LSTM training process.

Shreeshrii · 2017-03-30T12:06:23Z

tesseract-ocr/tesseract#654

will add the code to the github repo in due course, so experts/native speakers can offer suggestions/fixes to make them better. Myanmar in particular needs improvement, as the www data is littered with dotted circles, and the unicode book does not adequately describe the syntax for a well-formed grapheme in Myanmar (or any other language for that matter).

Shreeshrii · 2017-03-30T12:32:23Z

copied from #46

@herzcthu commented

Myanmar wordlists
https://github.com/kanaung/wordlists

https://github.com/kanyawtech/myanmar-karen-word-lists/blob/master/burmese-word-list.txt?raw=true

Is this a good wordlist in standard Unicode for Myanmar?

nengine · 2017-03-30T15:04:16Z

These are the most common words in Myanmar, but it is not a complete list. The definition of a word itself is tricky in the Myanmar language because there are many ways syllables can be combined to form a word. I am not so sure how Tesseract training works, but it may be better to train on syllables instead of words (clusters of syllables). That is, each syllable must first be detected and then classified. Classifying an entire word may be too difficult, unless I am not fully aware of Tesseract's capabilities.

amitdo · 2017-03-30T15:34:06Z

Manually Constructed Context-Free Grammar For Myanmar Syllable Structure
http://www.aclweb.org/anthology/E12-3004

amitdo · 2017-03-30T16:23:25Z

Representing Myanmar in Unicode
Details and Examples
http://unicode.org/notes/tn11/myanmar_uni-v2.pdf
http://www.tuninst.net/LINGUISTICS/myanmar-unicode/myanmar-unicode.htm

Creating and Supporting OpenType Fonts for Myanmar Script
https://www.microsoft.com/typography/OpenTypeDev/myanmar/intro.htm

Myanmar script notes
http://rishida.net/scripts/myanmar/#shaping

https://www.researchgate.net/publication/253745697_A_Rule-based_Syllable_Segmentation_of_Myanmar_Text

Shreeshrii · 2017-03-31T10:55:05Z

@theraysmith

I used a few words from the Burmese wordlist and the landing page of Wikipedia as a small training sample to test Myanmar. Both of these are supposed to be in standard Unicode for Myanmar.

The training text and generated unicharset are attached. I got a number of errors while building the unicharset. Maybe the Myanmar.unicharset in langdata needs to be updated?


=== Phase UP: Generating unicharset and unichar properties files ===
[Fri Mar 31 16:07:02 DST 2017] /usr/local/bin/unicharset_extractor -D /tmp/tmp.OzCvDLSWBp/mya/ /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text_Bold.exp0.box /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text.exp0.box
Extracting unicharset from /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text_Bold.exp0.box
Extracting unicharset from /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text.exp0.box
Wrote unicharset file /tmp/tmp.OzCvDLSWBp/mya//unicharset.
[Fri Mar 31 16:07:05 DST 2017] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset -O /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset -X /tmp/tmp.OzCvDLSWBp/mya/mya.xheights --script_dir=../langdata
Loaded unicharset of size 217 from file /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset
Setting unichar properties
Other case È of è is not in unicharset
Other case Ë of ë is not in unicharset
Warning: properties incomplete for index 4 = ယ်
Warning: properties incomplete for index 5 = လ်
Warning: properties incomplete for index 8 = မ်
Warning: properties incomplete for index 10 = င်
Warning: properties incomplete for index 16 = မှ
Warning: properties incomplete for index 22 = ရှ
Warning: properties incomplete for index 28 = ဖွဲ့
Warning: properties incomplete for index 30 = ည်
Warning: properties incomplete for index 36 = ပ်
Warning: properties incomplete for index 37 = ဖြ
Warning: properties incomplete for index 38 = င့်
Warning: properties incomplete for index 41 = က်
Warning: properties incomplete for index 42 = နှာ
Warning: properties incomplete for index 43 = ည်း
Warning: properties incomplete for index 44 = တ်
Warning: properties incomplete for index 45 = မှု
Warning: properties incomplete for index 47 = မ်း
Warning: properties incomplete for index 50 = ခြ
Warning: properties incomplete for index 51 = င်း
Warning: properties incomplete for index 52 = ကြော
Warning: properties incomplete for index 53 = နှို
Warning: properties incomplete for index 54 = ချွ
Warning: properties incomplete for index 63 = ပွဲ
Warning: properties incomplete for index 64 = တွေ
Warning: properties incomplete for index 65 = မှာ
Warning: properties incomplete for index 66 = ဆွေး
Warning: properties incomplete for index 67 = နွေး
Warning: properties incomplete for index 73 = ထွေ
Warning: properties incomplete for index 78 = မြ
Warning: properties incomplete for index 79 = စ်
Warning: properties incomplete for index 80 = မြို့
Warning: properties incomplete for index 83 = န်
Warning: properties incomplete for index 86 = ကွ
Warning: properties incomplete for index 89 = သွ
Warning: properties incomplete for index 92 = ဖ်
Warning: properties incomplete for index 96 = ခြေ
Warning: properties incomplete for index 100 = မျှ
Warning: properties incomplete for index 101 = ဂြို
Warning: properties incomplete for index 102 = ဟ်
Warning: properties incomplete for index 103 = တွ
Warning: properties incomplete for index 110 = ရှု
Warning: properties incomplete for index 119 = ညွှ
Warning: properties incomplete for index 120 = န်း
Warning: properties incomplete for index 123 = ကြ
Warning: properties incomplete for index 124 = ည့်
Warning: properties incomplete for index 125 = နှ
Warning: properties incomplete for index 126 = ထွ
Warning: properties incomplete for index 130 = ရှိ
Warning: properties incomplete for index 132 = ကြို
Warning: properties incomplete for index 140 = ဉ်
Warning: properties incomplete for index 150 = လှ
Warning: properties incomplete for index 151 = သွား
Warning: properties incomplete for index 153 = ထွာ
Warning: properties incomplete for index 154 = ထွား
Warning: properties incomplete for index 157 = ဖွံ့
Warning: properties incomplete for index 158 = မွ
Warning: properties incomplete for index 159 = လျော်
Warning: properties incomplete for index 162 = ပြော
Warning: properties incomplete for index 163 = ထွေး
Warning: properties incomplete for index 164 = ယှ
Warning: properties incomplete for index 168 = ဘွား
Warning: properties incomplete for index 179 = လွ
Warning: properties incomplete for index 182 = န့်
Warning: properties incomplete for index 189 = စွဲ
Warning: properties incomplete for index 192 = ပြီး
Warning: properties incomplete for index 197 = မြေ
Warning: properties incomplete for index 202 = ကွာ
Warning: properties incomplete for index 210 = ရှာ
Warning: properties incomplete for index 211 = ဖွေ
Warning: properties incomplete for index 212 = တွေ့
Warning: properties incomplete for index 214 = ပြ
Warning: properties incomplete for index 215 = ကြာ
Writing unicharset to file /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset

mya.Myanmar_Text.exp0.txt
mya.Myanmar_Text_Bold.exp0.txt
mya.unicharset.txt

Shreeshrii · 2017-03-31T11:08:54Z

@herzcthu @nengine @minthanthtoo

Please take a look at https://github.com/tesseract-ocr/langdata/blob/master/Myanmar.unicharset
in light of the above warning messages. Do you notice any pattern for the errors?

Tesseract does train on syllables (for Indic languages) AFAIK. Please see https://github.com/tesseract-ocr/langdata/files/885327/mya.unicharset.txt generated from the two training files - all listed in the message above.

Shreeshrii · 2017-03-31T11:15:15Z

@theraysmith do zwj and zwnj also have to be part of unicharset?

also see http://archive.mmgeeks.com/index.php?p=/discussion/379/zwnj-and-zwj

amitdo · 2017-03-31T11:38:26Z

https://github.com/khzaw/awesome-myanmar-unicode

Shreeshrii · 2017-03-31T11:50:10Z

Syllabification, Normalization and Lexicographic Ordering
of Myanmar Texts using Formal Approaches

http://ir.nagaokaut.ac.jp/dspace/bitstream/10649/729/1/k709.pdf

nengine · 2017-03-31T13:54:01Z

I do not see a consistent pattern.

Warning: properties incomplete for index 4 = ယ် . ယ် by itself does not have any meaning, but combined with ဘ it becomes ဘယ်, which makes sense.
Warning: properties incomplete for index 16 = မှ . မှ by itself does make sense and has a meaning, so I am not sure why it gives a warning.

Myanmar.unicharset clearly does not include the syllables shown in the warnings, just consonants, vowels, etc.

Is Myanmar.unicharset supposed to include all syllable combinations? How does it work for Telugu, for example?

Shreeshrii · 2017-03-31T14:33:23Z

I don't think it is supposed to include all syllable combinations in Myanmar.unicharset but it should have all vowels, consonants, vowel signs.

I see three ranges for Myanmar: the first seems to be in the unicharset, part of the second, and none of the third.

Can you please check whether all of these are required?

http://www.alanwood.net/unicode/myanmar.html

http://www.alanwood.net/unicode/myanmar-extended-a.html

http://www.alanwood.net/unicode/myanmar-extended-b.html

nengine · 2017-03-31T17:05:27Z

There are 8 major ethnic groups in Myanmar, so I believe extended A and B are added for that reason. So, for completeness I think it should be added, but you would rarely see them on the web. Unicode range 1000 - 104F is already good.
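
To put numbers on how rare the extended ranges are in a given corpus, one could count characters per block. The sketch below is only an illustration: the block boundaries are the published Unicode ranges (Myanmar U+1000-U+109F, Extended-A U+AA60-U+AA7F, Extended-B U+A9E0-U+A9FF), while the split of the base block at U+104F follows the comment above, and `block_counts` is a hypothetical helper name.

```python
from collections import Counter

# Block boundaries; the split of the base block at U+104F mirrors the
# "1000 - 104F is already good" remark and is not an official sub-block.
MYANMAR_BLOCKS = {
    "Myanmar (base, U+1000-U+104F)": (0x1000, 0x104F),
    "Myanmar (base, U+1050-U+109F)": (0x1050, 0x109F),
    "Myanmar Extended-A": (0xAA60, 0xAA7F),
    "Myanmar Extended-B": (0xA9E0, 0xA9FF),
}


def block_counts(text):
    """Count how many characters of `text` fall into each Myanmar block."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in MYANMAR_BLOCKS.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    return counts
```

Running this over candidate training text (for example a Wikipedia dump) would show whether the corpus contains enough Extended-A/B samples to train on at all.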

Shreeshrii · 2017-03-31T17:22:34Z

>you would rarely see them on the web.

What about in books / documents that need to be OCRed?

nengine · 2017-03-31T17:48:57Z

Yes, extended A and B should also be added for completeness as I said, but as far as training samples go, they are almost non-existent on the web.


theraysmith · 2017-04-14T23:51:48Z

Please take a look at this reference: http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf Table 16-3. The text says "Characters occur in the relative order shown in Table 16-3" which I do not believe to be completely correct. Part of the problem is that a lot of the characters are not even in this table! Although it is possible to guess which group the extensions belong to, I'm not convinced I have it correct. I have some code that implements this table plus my guesses to add the extensions, but it isn't ready for committing to github just yet. The problem is that I need to exclude the incorrectly formatted text (that uses the non-standard fonts), but be sure that no correctly formatted text is dropped.


-- Ray.

herzcthu · 2017-04-27T03:02:06Z

I've checked characters in Myanmar.unicharset file. All characters seem correct.

Shreeshrii · 2017-07-14T05:21:52Z

Please see tesseract-ocr/tesseract#995 (comment)

When I have committed the new corpus cleanup code, it would be useful to
have any experts in any of the following scripts review the code and make
comments:
Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada,
Malayalam, Sinhala, Thai, Myanmar, Khmer.
There are script-specific cleanup rules in there.
Since I plan to commit new copies of the training data (unicharsets,
wordlists, training text etc) then at that point they will match

For instance, there is a big table in the unicode standard for Myanmar, (
http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf) but it doesn't cover
any of the extension Myanmar characters, and isn't explicit about whether
the table represents a specific valid order or not. The existence of a lot
of legacy Myanmar text on the web that is designed for non-compliant fonts
doesn't help make it easier to determine whether the filter is correct.

theraysmith · 2017-07-14T18:23:08Z

Please see code at: https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.cpp


-- Ray.

Shreeshrii · 2017-08-08T01:46:30Z

@herzcthu @nengine @minthanthtoo

Please test with the new traineddata in tessdata/best directory and provide feedback.

herzcthu · 2017-08-09T17:24:35Z

I'm testing the new traineddata. It has improved a lot. Almost 98% correct. I will test in more detail and provide detailed feedback later.

nengine · 2017-08-09T20:26:51Z

I would like to test it but am not sure how to do it. I have Windows 10 installed. Could you please point me to the documentation?

Shreeshrii · 2017-08-10T03:22:05Z

You can use the new Windows binaries for 4.0 linked from https://github.com/UB-Mannheim/tesseract/wiki


herzcthu · 2017-08-10T13:43:32Z

I've attached the first screenshot from my tests.
The upper part is the image I tested and the lower part is the OCR output text.
Words between two adjacent points of the same color are missing or incorrect.
If you need a code point comparison between the source image and the output text, I can provide it later.

Shreeshrii · 2017-08-10T15:50:00Z

It would be helpful if you could point out any pattern that you notice in the errors. One that I notice is that words are getting dropped from the OCRed text (missing).


kyawswa · 2017-10-12T15:22:52Z

Hello, I would like to share what I found in Myanmar training data.

I used tesseract version 3.04.

I think it still needs a lot of improvement. First, here are my test results. I tested with Myanmar-language images: ocr_sample_1.png and ocr_sample_2.png.

The test result for the ocr_sample_1 image is below. I marked the differences with red points.
image_file

Result

The result for the second image, ocr_sample_2, is below. Its result is completely wrong. The text means "how are you" in English.

Image_file

Result

Then I downloaded the Myanmar langdata from GitHub (https://github.com/tesseract-ocr/langdata). I found 7 files. After checking those files, I found that most of the contents are incorrect or misspelled. I would like to show a few incorrect entries from one of those files, mya.training_text.
For example,

#first arrow head line
It should be "ရုတ်ရုတ်သဲသဲ".

#Second arrow head line
It should be "သစ်တောများကုန်".

#third arrow head line
It should be "ပညာရေးစနစ်", and so on.

So I would like to contribute corrections for these 7 files. I would also like to ask the following questions.
- Is the exported mya.traineddata based on those files?
- How can I know which file is used for what? E.g., what is the mya.punc file?
- Where did you get that data?
- Is there any format or rule for putting data into those files?

Could you please explain those files to me?
I am also willing to help improve Myanmar language OCR.

Thank you for your contribution.

Shreeshrii · 2017-10-12T16:24:54Z

Please also try tesseract 4.0alpha which might have improved results.


kyawswa · 2017-10-13T03:09:49Z

Yes, I used the UB Mannheim build with tesseract 4.0.0-alpha.20170804. I tested with the following image files. The test results are below.

ocr_sample_1.png

Result

%%%%%
©05080×05
5082:40:82! 0=2405005$2³050

ocr_sample_2.png

Result

ပဵနႚတ္ဂဵကာဧ်တ္အီးလာသီူး

Thanks.

Shreeshrii · 2017-10-13T08:14:02Z

The langdata repo has not been updated for 4.0x. You can extract the wordlist from the tessdata_best traineddata file. Use the commands (please look up the syntax) combine_tessdata -u ... and dawg2wordlist ... to see the version of the files used for 4.0. You can then compare this wordlist to the wordlist in langdata for spelling etc.
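
As a rough sketch of the two steps named above, the helper below builds and runs the commands from Python. It assumes `combine_tessdata -u <traineddata> <output-prefix>` and `dawg2wordlist <unicharset> <dawg> <wordlist>` as the call shapes, and it assumes the unpacked LSTM components are named `<lang>.lstm-unicharset` and `<lang>.lstm-word-dawg`; please verify both against the man pages for your Tesseract version. The function names and the output file name `wordlist.txt` are hypothetical.

```python
import os
import subprocess


def wordlist_commands(traineddata, workdir, lang="mya"):
    """Build the two command lines: unpack, then convert the word DAWG."""
    prefix = os.path.join(workdir, lang + ".")
    unpack = ["combine_tessdata", "-u", str(traineddata), prefix]
    to_wordlist = [
        "dawg2wordlist",
        prefix + "lstm-unicharset",  # assumed component name
        prefix + "lstm-word-dawg",   # assumed component name
        prefix + "wordlist.txt",     # hypothetical output file
    ]
    return [unpack, to_wordlist]


def extract_wordlist(traineddata, workdir, lang="mya"):
    """Run both commands; requires the Tesseract training tools on PATH."""
    os.makedirs(workdir, exist_ok=True)
    for cmd in wordlist_commands(traineddata, workdir, lang):
        subprocess.run(cmd, check=True)
```

Calling `extract_wordlist("mya.traineddata", "unpacked")` would leave a plain-text wordlist in `unpacked/` that can be diffed against the wordlist in langdata.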


kyawswa · 2017-10-13T10:02:03Z

Yes, it works perfectly with tessdata_best. But after checking the wordlist, I found many misspellings and incorrect entries. I have pointed out the most obvious misspellings. Please see the following attachment.

Most of the data from the following link is not included in this tessdata_best wordlist.
https://github.com/kanaung/wordlists

I would like to know where you got that data.

How can I contribute to update the incorrect data?
Thanks.

pndaza · 2018-04-02T02:12:19Z

The Myanmar traineddata for 4.0 does not recognize the following characters:
၊(u104a)
။(u104b)
၌(u104c)
၍(u104d)
၎(u104e)
၏(u104f)

I unpacked the traineddata and checked the unicharset.
I did not find these characters in the unicharset.

Also, unicharset_extractor does not produce these characters.
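
A check like the one described can be scripted. The sketch below assumes the plain-text unicharset layout in which the first line is the entry count and every following line starts with the unichar followed by a space and its properties, with the token `NULL` standing for the space character; treat that layout as an assumption and confirm it against your file. `read_unichars` and `missing_codepoints` are hypothetical helper names.

```python
def read_unichars(unicharset_text):
    """Return the unichar strings from unicharset file contents."""
    entries = []
    for line in unicharset_text.splitlines()[1:]:  # skip the count line
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]
        entries.append(" " if token == "NULL" else token)
    return entries


def missing_codepoints(unicharset_text, wanted):
    """List the characters in `wanted` that appear in no unichar entry."""
    covered = set("".join(read_unichars(unicharset_text)))
    return [ch for ch in wanted if ch not in covered]


# The six characters reported above: U+104A through U+104F.
REPORTED = [chr(cp) for cp in range(0x104A, 0x1050)]
```

With the unpacked file read as UTF-8 text, `missing_codepoints(text, REPORTED)` returns whichever of the six characters the unicharset lacks.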

kpfoley · 2018-12-12T23:50:07Z

I'm just reading through this thread and have a few pointers, some maybe a little repetitive - hope this helps!

Unicode range: just about everything in contemporary use is in the 1000-104F range. The second half of the base range is for special characters needed for writing a number of other languages spoken in Myanmar but also for some Pali words that might be needed, for example, for religious and historical texts. I'm not sure whether users in Myanmar would rather see a model that supports the full range or a more compact model trained on the labels in the first half of the range, which are MUCH more frequently used. In any case training on the 1000-104F range will probably cover > 99.9% of the data (just a guess).
Word lists - the best word list available to researchers AFAIK is a 133k word list maintained by the Myanmar Language Commission (MLC). This is basically just a list of all the words from the large official dictionary of the language. I don't think the word list is publicly available, and it doesn't have any proper nouns, but it would be a good resource if it could be made available for this project. The kanaung word list linked above by @herzcthu is much better than the mya wordlist and I think it is based on an old software release of a "correct spelling" word list and has since incorporated words and place names from other sources like the postal system. I think there are a good deal of non-words and very rare words in that list, though, which maybe limits its usefulness.
Words and spacing - as mentioned above, Burmese / Myanmar doesn't use spaces between every word in its writing system. Spaces are thrown in as needed for typography and usually between multi-word phrases. So the best options for a language model are probably either a character-based language model or a model that incorporates word segmentation (prediction of spaces) into the pipeline. I'm not sure if either of these are possible with Tesseract 4.0. As somebody already mentioned above, the lack of spacing is probably why there are >500,000 words in the mya.wordlist file. As of right now I think the state of the art in Burmese word segmentation (breaking non-spaced continuous text into words) is around 99 percent accuracy -- it's not perfect but it's pretty good.
Syllable segmentation, such as Ye Kyaw Thu's script here https://github.com/ye-kyaw-thu/myPOS/tree/master/corpus-draft-ver-1.0, might also be simpler and more effective than a character-level language model in the absence of word segmentation. A few lines of regex can capture the boundaries between syllables, in which case the language model approach can be similar to what you might use for Chinese (sequences of syllables instead of sequences of words).
Unicode and Zawgyi - the choice between Unicode and Zawgyi is controversial in Myanmar, and Zawgyi is much more popular, with Unicode maybe making some inroads. The problem with Zawgyi, in addition to being non-standard, is that it takes up random spaces in the shared Myanmar unicode range, including spaces for other languages spoken in Myanmar, so it breaks not just Burmese but also the entire extended Myanmar unicode range. It's also frustrating because it used to be difficult to cleanly convert from Zawgyi to Unicode and back, and it still doesn't work perfectly because Zawgyi hides many typing errors by superimposing the same typed letter without advancing the cursor. For this reason there is a lot of corrupted Burmese language text data on the web that may have been converted to unicode at some point (or could be converted) but it's difficult to catch all of the errors left over from the original typed Zawgyi input. Wikipedia has a good explainer here on the two encoding standards: https://my.wikipedia.org/wiki/Wikipedia:Font#Why_not_Zawgyi? The main thing to watch out for here is not to accidentally feed Zawgyi text into the training data, because with so many overlapping codepoints it could wreck the accuracy of the model.
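
One way to keep Zawgyi text out of the training data is a cheap pre-filter on each line. The sketch below is a heuristic under stated assumptions, not a real detector: it assumes Zawgyi stores the vowel sign E (U+1031) in visual order, before its consonant, whereas standard Unicode stores it after the consonant and any medials, and it assumes that code points in U+1060-U+1097 do not occur in standard-Unicode Burmese text (they are legitimate for other languages of Myanmar, so this rule only makes sense for a Burmese-only corpus). `looks_like_zawgyi` is a hypothetical name.

```python
import re

# U+1031 not preceded by a consonant (U+1000-U+1021), a medial
# (U+103B-U+103E) or great sa (U+103F): suspicious in standard Unicode.
E_VOWEL_OUT_OF_ORDER = re.compile("(?<![\u1000-\u1021\u103b-\u103f])\u1031")

# Assumption: this range is reused by Zawgyi for its own glyphs and should
# not appear in standard-Unicode Burmese.
NON_BURMESE_RANGE = re.compile("[\u1060-\u1097]")


def looks_like_zawgyi(line):
    """Heuristically flag a line of expected-Burmese text as Zawgyi."""
    return bool(
        E_VOWEL_OUT_OF_ORDER.search(line) or NON_BURMESE_RANGE.search(line)
    )
```

Lines flagged this way could be dropped or routed to a converter before training. A statistical detector would be more reliable; this only catches the most obvious ordering and code point clashes.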

Shreeshrii · 2018-12-14T15:00:17Z

Thank you for the detailed notes. Please review the source training data in langdata_lstm repo also.

Shreeshrii · 2019-02-20T12:42:54Z

Please test the traineddata at https://github.com/Shreeshrii/tessdata_shreetest/blob/master/mya430000.traineddata

and let me know whether it is an improvement over the existing traineddata files.

herzcthu · 2019-03-04T16:50:58Z

Hi Shreeshrii,
I've tested your traineddata. It is a slight improvement over the existing traineddata in Tesseract 4.0 beta.
In particular, it detects punctuation and non-Burmese characters better.

BTW, I'm trying to train a model myself, and I'm generating lots of box and tif files for only one font. Is it a good idea to have many files for a single font, or should I make just one box file and one tif file?
Currently I have more than 1000 files.

thanks and regards,
Sithu

Shreeshrii · 2019-03-05T04:03:23Z

The amount of training data you need depends on the type of training you are planning to do, e.g. from scratch, replace a layer, plus-minus, etc.

I think multiple files for a single font may be OK. How are you generating these files?

You should try to keep approximately the same number of lines in each file so that all samples are used uniformly during training.
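As a rough illustration of keeping file sizes even, the Python sketch below splits a list of text lines into chunks whose sizes differ by at most one line. The function name `balanced_chunks` and the `max_lines` parameter are made up for this example; each chunk would then be written to its own training text file.

```python
def balanced_chunks(lines, max_lines):
    """Split lines into the fewest chunks of at most max_lines each,
    with chunk sizes differing by no more than one line."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    if not lines:
        return []
    n_files = -(-len(lines) // max_lines)        # ceiling division
    base, extra = divmod(len(lines), n_files)    # first `extra` chunks get +1
    chunks, start = [], 0
    for i in range(n_files):
        size = base + (1 if i < extra else 0)
        chunks.append(lines[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    sizes = [len(c) for c in balanced_chunks(list(range(10)), 4)]
    print(sizes)  # [4, 3, 3]
```

Compared with cutting every `max_lines` lines, this avoids a short final file that would contribute far fewer samples than the others.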

Shreeshrii · 2019-03-05T04:16:07Z

I had used these fonts:
'Myanmar Khyay'
'Myanmar Sans Pro'
'Myanmar Text'
'Noto Sans Myanmar'

Which one is a more representative font out of these for training and testing?

herzcthu · 2019-03-05T14:15:17Z

I took one paragraph from Wikipedia and made screenshots with all the fonts you mentioned.
Noto Sans Myanmar gave the best result.
There is a new font that will be used in official documents. It is called Pyidaungsu.
You can download it here: https://www.unicode.today/fonts-download/

I'm creating box and tif files using the text2image binary from the training tools.
I collected 1 million Unicode text lines from Wikipedia and 3 well-known news websites, and I am creating box files from that content.
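For readers who want to reproduce this step, a text2image call can be assembled roughly as in the Python sketch below. The flags `--text`, `--outputbase`, `--font` and `--fonts_dir` are standard text2image options, but the helper name `text2image_command`, the file names and the font directory are placeholders, not the values actually used here.

```python
import subprocess


def text2image_command(text_file, output_base, font, fonts_dir):
    """Build the argument list for one text2image run (one text file, one font)."""
    return [
        "text2image",
        f"--text={text_file}",          # UTF-8 training text
        f"--outputbase={output_base}",  # prefix for the generated .tif/.box pair
        f"--font={font}",               # font name as known to fontconfig
        f"--fonts_dir={fonts_dir}",     # directory that contains the font file
    ]


if __name__ == "__main__":
    # Placeholder paths; adjust to the local setup before running.
    cmd = text2image_command(
        "mya.training_text", "mya.Pyidaungsu.exp0", "Pyidaungsu", "/usr/share/fonts"
    )
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment where text2image is installed
```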

Shreeshrii · 2019-03-06T08:30:08Z

@herzcthu Thanks for the info about the new font.

If you do 'replace layer' type of training, you can get by with fewer lines.

Keep posting about your progress with training.

herzcthu · 2019-03-08T15:57:20Z

I'm stuck at the unicharset extractor step.
I get one unicharset file, but when I open it I see some unusual combinations of characters that cannot exist in Burmese script. I wonder whether this kind of junk can affect training.
I've attached the unicharset file I got: output_unicharset.txt

Here are some sample entries that look unusual:

တ္မြ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 514 0 514 တ္မြ	# တ္မြ [1010 1039 1019 103c ]x
တ္က်ေ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 515 0 515 တ္က်ေ	# တ္က်ေ [1010 1039 1000 103a 1031 ]x
ဥ္တြ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 516 0 516 ဥ္တြ	# ဥ္တြ [1025 1039 1010 103c ]x
င္လေ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 517 0 517 င္လေ	# င္လေ [1004 1039 101c 1031 ]x
ည္ဖြ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 518 0 518 ည္ဖြ	# ည္ဖြ [100a 1039 1016 103c ]x

Shreeshrii · 2019-03-08T16:29:52Z

Use NORM_MODE="3" with the unicharset extractor command.

herzcthu · 2019-03-12T16:07:19Z

I tried norm_mode 2 and 3. Both have missing vowels and medials.
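A quick way to see exactly which vowels and medials are missing is to compare the Myanmar codepoints in the training text with those that appear in the generated unicharset. The Python sketch below assumes the usual unicharset layout (first line is the entry count, and each following line starts with the symbol, followed by a space and its properties). The helper names are made up for illustration.

```python
def unicharset_symbols(unicharset_text):
    """Return the symbol (first field) of every entry in a unicharset file."""
    lines = unicharset_text.splitlines()
    # Skip the first line, which holds the number of entries.
    return [ln.split(" ", 1)[0] for ln in lines[1:] if ln.strip()]


def uncovered_codepoints(corpus, symbols):
    """List Myanmar-block characters used in corpus but absent from all symbols."""
    covered = set("".join(symbols))
    return sorted(
        {ch for ch in corpus if "\u1000" <= ch <= "\u109F" and ch not in covered}
    )


if __name__ == "__main__":
    # Tiny made-up unicharset and corpus, for illustration only.
    sample = "3\nNULL 0 Common 0\n\u1019 1 Myanmar 1\n\u102C 1 Myanmar 2\n"
    missing = uncovered_codepoints("\u1019\u103C\u102C", unicharset_symbols(sample))
    print([f"U+{ord(ch):04X}" for ch in missing])  # ['U+103C']
```

Running this against the real training text and each unicharset produced by the different norm_mode values would show whether the missing characters are dropped everywhere or only in some modes.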

n-92 · 2020-04-07T18:32:31Z

Hi,

Any further progress?

GmGniap · 2020-09-06T01:27:18Z

Please mention me or let me know if you need help checking or fixing Burmese datasets; I'd be glad to be part of it. I have some experience in Python and TypeScript. Cheers to everyone helping to improve Myanmar language support in machines!

glxwine · 2024-06-19T09:30:27Z

I am new to Tesseract. I recently tried it with the Myanmar language, and the results are still far from perfect. I searched for information on training datasets and found this thread. However, it seems to be very old, with no recent updates. I am not familiar with how to train on the datasets, but I know the language. Is there anything we can do to improve Myanmar language support? I would also like to understand how the training is done.

stweil · 2024-06-21T11:27:26Z

The training requires training data = lots of line images (*.png) with corresponding transcription (*.gt.txt). The original training used generated (artificial) line images, but meanwhile newer trainings for other scripts are often based on real line images from scanned books or newspapers. It's also possible to use a mix of artificial and real line images. You need as many lines as possible, and the text must cover all relevant glyphs (characters).

With enough lines for training, you can use tesstrain for the training.

Examples of training data for Latin script: https://code.bib.uni-mannheim.de/ocr-d/GT4HistOCR/src/branch/master/dta19/1827-heine_lieder.

Examples of training steps: https://github.com/UB-Mannheim/tesstrain/wiki/.

Make sure to document your training process and to publish your training data if you want to submit the result for the inclusion in the tesseract-ocr repositories.
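Before starting a training run, it can help to check that every line image has a matching transcription. The Python sketch below assumes the naming described above, where `name.png` is paired with `name.gt.txt` in the same directory. The function name `find_unpaired` is made up for this example.

```python
from pathlib import Path


def find_unpaired(data_dir):
    """Return (images without a .gt.txt, transcriptions without a .png)."""
    data_dir = Path(data_dir)
    images = {p.name[: -len(".png")] for p in data_dir.glob("*.png")}
    texts = {p.name[: -len(".gt.txt")] for p in data_dir.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)


if __name__ == "__main__":
    # Checks the current directory; pass the ground-truth folder instead.
    missing_text, missing_image = find_unpaired(".")
    print("images without transcription:", missing_text)
    print("transcriptions without image:", missing_image)
```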

glxwine · 2024-06-24T13:23:42Z

Thank you for providing the steps related to training. I think it will take me a lot of trial and error to understand them. For the moment, I am more focused on providing the training data (.png and .gt.txt). I know that even that will require a lot of pairs. However, I am willing to do more if there is anyone who can use them for training.

I believe that Myanmar script (as an image) is not very complicated (unlike the German script example). Myanmar words generally keep the same shape, except that the size may increase or decrease proportionately. If the basic shapes can be identified, the result will improve. That is what I think; apologies if what I said is too simple. What I mean is that Myanmar letter shapes are quite consistent and different styles are rarely used.

I am also willing to help with the .png and .gt.txt files if I am given more detailed requirements for providing them.

Thanks and best regards,


Shreeshrii mentioned this issue Feb 4, 2017

Myanmar Resources #46

Closed

Shreeshrii mentioned this issue Mar 30, 2017

Q&A: Indic - length of the compressed codes tesseract-ocr/tesseract#654

Open

