Would like to help for Burmese/Myanmar language training? #13

Open
herzcthu opened this issue Jul 3, 2015 · 56 comments

@herzcthu

herzcthu commented Jul 3, 2015

Hello,
I would like to help. I've already cloned all the repositories. How do I start?

@zdenop
Contributor

zdenop commented Jul 3, 2015

What issue is there with Burmese/Myanmar language?

@herzcthu
Author

herzcthu commented Jul 4, 2015

We have two types of Unicode fonts: non-standard and standard. When I check the langdata files for Burmese, most words are incorrect. I guess you generated them from a mix of non-standard and standard Unicode content. When I try to scan an image of Burmese text written in the Padauk font, the output is not readable.
I would like to know the method you used to generate the Burmese training files. Where did you get the original data? I can check whether it is standard Unicode content or not.

@minthanthtoo

I think the real issue is not only about standard versus non-standard Unicode, but also the wrong method of extracting data from the source. I mean the source data needs to be segmented correctly to get correct single words.
Myanmar language users do not care much about adding a space character between words; if you assume that all characters between two spaces form a word, two or more words are falsely perceived as a single word. I found that most word lists here, especially the bi-grams, hold Myanmar phrases that are too long. That makes the word lists unusable and the results of applying them totally unpredictable.
So I think you need to extract data from a source using a dictionary-lookup approach. Of course, you need to build your own word list manually or use those made by others.
Also, Myanmar is a syllable-based language; that is, one or more Myanmar letters combine to form a syllable, and one or more syllables join to form a word. So it is advisable to detect syllables first, which should give a large performance improvement in dictionary lookup.
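
The dictionary-lookup approach described above can be sketched as greedy longest-match segmentation. This is only an illustration under stated assumptions: the function name `segment_words` and the toy lexicon are hypothetical, the real word list would come from a vetted source, and the fallback of emitting a single code point when nothing matches is a simplification (a real tool would fall back to whole syllables).

```python
def segment_words(text, lexicon, max_len=None):
    """Greedy longest-match segmentation of unspaced text against a word list."""
    if max_len is None:
        # Longest lexicon entry bounds how far ahead we need to look.
        max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until the lexicon matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # No entry matches here: emit one code point and move on.
            words.append(text[i])
            i += 1
    return words


if __name__ == "__main__":
    # Hypothetical two-entry lexicon, for illustration only.
    toy_lexicon = {"ပညာရေး", "စနစ်"}
    print(segment_words("ပညာရေးစနစ်", toy_lexicon))
```

Greedy matching is the simplest dictionary method; it can pick the wrong split when a long entry overlaps the start of the next word, which is one reason the quality of the word list matters so much.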

@Shreeshrii
Contributor

@herzcthu @minthanthtoo

Please add some good sources of standard unicode fonts and sample texts and word frequency lists to #46

@herzcthu
Author

https://my.wikipedia.org/
All contents on wikipedia are in standard unicode font.

@nengine

nengine commented Feb 13, 2017

@zdenop The issue is with the training data itself. The person who prepared the data does not know the Myanmar language. The majority of the training data has misspellings and is mixed with a hacked version of Myanmar Unicode, as said by @herzcthu. You can imagine rice and spaghetti mixed in a bowl. Also, it is not segmented properly, as @minthanthtoo pointed out. Any suggestions on how to prepare training data?

@Shreeshrii
Contributor

Please see Ray's comment at
tesseract-ocr/tesseract#654 (comment)

about how the training data is being built for the 4.0 LSTM training. I don't think they are using the training_text file in langdata.

@nengine

nengine commented Mar 6, 2017

Thanks @Shreeshrii. Would ./tesstrain.sh automatically create .tif/box pairs from the langdata directory for 4.0 LSTM training?

@Shreeshrii
Contributor

Yes. tesstrain.sh creates tif/box pairs that can be used for LSTM training. Please see the wiki pages for details. You need a large amount of training data for good training. See Ray's comments about the LSTM training process.

@Shreeshrii
Contributor

tesseract-ocr/tesseract#654

will add the code to the github repo in due course, so experts/native speakers can offer suggestions/fixes to make them better. Myanmar in particular needs improvement, as the www data is littered with dotted circles, and the unicode book does not adequately describe the syntax for a well-formed grapheme in Myanmar (or any other language for that matter).
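
As a rough illustration of screening web text for the dotted circles mentioned above, the sketch below drops any line that contains U+25CC (DOTTED CIRCLE). It is an assumption about one possible cleanup step, not the cleanup code referred to in the quoted comment; the function name is hypothetical.

```python
DOTTED_CIRCLE = "\u25cc"  # placeholder glyph that appears when a mark has no valid base


def drop_dotted_circle_lines(lines):
    """Return (kept_lines, dropped_count), discarding lines that contain U+25CC."""
    kept = []
    dropped = 0
    for line in lines:
        if DOTTED_CIRCLE in line:
            dropped += 1
        else:
            kept.append(line)
    return kept, dropped
```

Reporting the dropped count makes it easy to see how much of a scraped corpus is affected before deciding whether to discard or repair those lines.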

@Shreeshrii
Contributor

Shreeshrii commented Mar 30, 2017

copied from #46

@herzcthu commented

Myanmar wordlists
https://github.com/kanaung/wordlists


https://github.com/kanyawtech/myanmar-karen-word-lists/blob/master/burmese-word-list.txt?raw=true

Is this a good wordlist in standard Unicode for Myanmar?

@nengine

nengine commented Mar 30, 2017

These are the most common words in Myanmar, but it is not a complete list. The definition of a word itself is tricky in the Myanmar language because there are many ways syllables can be combined to form a word. I am not sure how Tesseract training works, but it may be better to train on syllables instead of words (clusters of syllables). That is to say, each syllable must first be detected and then classified. Classifying an entire word may be too difficult, though I may not be fully aware of Tesseract's capabilities.

@amitdo

amitdo commented Mar 30, 2017

Manually Constructed Context-Free Grammar For Myanmar Syllable Structure
http://www.aclweb.org/anthology/E12-3004

@Shreeshrii
Contributor

@theraysmith

I used a few words from the Burmese wordlist and the landing page of Wikipedia as a small training sample to test Myanmar. Both of these are supposed to be in standard Unicode for Myanmar.

The training text and generated unicharset are attached. I got a number of errors while building the unicharset. Maybe the Myanmar.unicharset in langdata needs to be updated?


=== Phase UP: Generating unicharset and unichar properties files ===
[Fri Mar 31 16:07:02 DST 2017] /usr/local/bin/unicharset_extractor -D /tmp/tmp.OzCvDLSWBp/mya/ /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text_Bold.exp0.box /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text.exp0.box
Extracting unicharset from /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text_Bold.exp0.box
Extracting unicharset from /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text.exp0.box
Wrote unicharset file /tmp/tmp.OzCvDLSWBp/mya//unicharset.
[Fri Mar 31 16:07:05 DST 2017] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset -O /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset -X /tmp/tmp.OzCvDLSWBp/mya/mya.xheights --script_dir=../langdata
Loaded unicharset of size 217 from file /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset
Setting unichar properties
Other case È of è is not in unicharset
Other case Ë of ë is not in unicharset
Warning: properties incomplete for index 4 = ယ်
Warning: properties incomplete for index 5 = လ်
Warning: properties incomplete for index 8 = မ်
Warning: properties incomplete for index 10 = င်
Warning: properties incomplete for index 16 = မှ
Warning: properties incomplete for index 22 = ရှ
Warning: properties incomplete for index 28 = ဖွဲ့
Warning: properties incomplete for index 30 = ည်
Warning: properties incomplete for index 36 = ပ်
Warning: properties incomplete for index 37 = ဖြ
Warning: properties incomplete for index 38 = င့်
Warning: properties incomplete for index 41 = က်
Warning: properties incomplete for index 42 = နှာ
Warning: properties incomplete for index 43 = ည်း
Warning: properties incomplete for index 44 = တ်
Warning: properties incomplete for index 45 = မှု
Warning: properties incomplete for index 47 = မ်း
Warning: properties incomplete for index 50 = ခြ
Warning: properties incomplete for index 51 = င်း
Warning: properties incomplete for index 52 = ကြော
Warning: properties incomplete for index 53 = နှို
Warning: properties incomplete for index 54 = ချွ
Warning: properties incomplete for index 63 = ပွဲ
Warning: properties incomplete for index 64 = တွေ
Warning: properties incomplete for index 65 = မှာ
Warning: properties incomplete for index 66 = ဆွေး
Warning: properties incomplete for index 67 = နွေး
Warning: properties incomplete for index 73 = ထွေ
Warning: properties incomplete for index 78 = မြ
Warning: properties incomplete for index 79 = စ်
Warning: properties incomplete for index 80 = မြို့
Warning: properties incomplete for index 83 = န်
Warning: properties incomplete for index 86 = ကွ
Warning: properties incomplete for index 89 = သွ
Warning: properties incomplete for index 92 = ဖ်
Warning: properties incomplete for index 96 = ခြေ
Warning: properties incomplete for index 100 = မျှ
Warning: properties incomplete for index 101 = ဂြို
Warning: properties incomplete for index 102 = ဟ်
Warning: properties incomplete for index 103 = တွ
Warning: properties incomplete for index 110 = ရှု
Warning: properties incomplete for index 119 = ညွှ
Warning: properties incomplete for index 120 = န်း
Warning: properties incomplete for index 123 = ကြ
Warning: properties incomplete for index 124 = ည့်
Warning: properties incomplete for index 125 = နှ
Warning: properties incomplete for index 126 = ထွ
Warning: properties incomplete for index 130 = ရှိ
Warning: properties incomplete for index 132 = ကြို
Warning: properties incomplete for index 140 = ဉ်
Warning: properties incomplete for index 150 = လှ
Warning: properties incomplete for index 151 = သွား
Warning: properties incomplete for index 153 = ထွာ
Warning: properties incomplete for index 154 = ထွား
Warning: properties incomplete for index 157 = ဖွံ့
Warning: properties incomplete for index 158 = မွ
Warning: properties incomplete for index 159 = လျော်
Warning: properties incomplete for index 162 = ပြော
Warning: properties incomplete for index 163 = ထွေး
Warning: properties incomplete for index 164 = ယှ
Warning: properties incomplete for index 168 = ဘွား
Warning: properties incomplete for index 179 = လွ
Warning: properties incomplete for index 182 = န့်
Warning: properties incomplete for index 189 = စွဲ
Warning: properties incomplete for index 192 = ပြီး
Warning: properties incomplete for index 197 = မြေ
Warning: properties incomplete for index 202 = ကွာ
Warning: properties incomplete for index 210 = ရှာ
Warning: properties incomplete for index 211 = ဖွေ
Warning: properties incomplete for index 212 = တွေ့
Warning: properties incomplete for index 214 = ပြ
Warning: properties incomplete for index 215 = ကြာ
Writing unicharset to file /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset

mya.Myanmar_Text.exp0.txt
mya.Myanmar_Text_Bold.exp0.txt
mya.unicharset.txt

@Shreeshrii
Contributor

@herzcthu @nengine @minthanthtoo

Please take a look at https://github.com/tesseract-ocr/langdata/blob/master/Myanmar.unicharset
in light of the above warning messages. Do you notice any pattern for the errors?

Tesseract does train on syllables (for Indic languages) AFAIK. Please see https://github.com/tesseract-ocr/langdata/files/885327/mya.unicharset.txt generated from the two training files - all listed in the message above.

@Shreeshrii
Contributor

@theraysmith do zwj and zwnj also have to be part of unicharset?

also see http://archive.mmgeeks.com/index.php?p=/discussion/379/zwnj-and-zwj

@amitdo

amitdo commented Mar 31, 2017

@Shreeshrii
Contributor

Syllabification, Normalization and Lexicographic Ordering
of Myanmar Texts using Formal Approaches

http://ir.nagaokaut.ac.jp/dspace/bitstream/10649/729/1/k709.pdf

@nengine

nengine commented Mar 31, 2017

I do not see a consistent pattern.

  1. Warning: properties incomplete for index 4 = ယ် . ယ် by itself does not have any meaning, but when it is combined with ဘ, which becomes ဘယ်, it makes sense.

  2. Warning: properties incomplete for index 16 = မှ . မှ by itself does make sense and has a meaning, so I am not sure why it is giving a warning.

Myanmar.unicharset clearly does not include the syllables shown in the warnings, only consonants, vowels, etc.

Is it supposed to include all syllable combinations in Myanmar.unicharset? How does it work for Telugu, for example?

@Shreeshrii
Contributor

I don't think it is supposed to include all syllable combinations in Myanmar.unicharset but it should have all vowels, consonants, vowel signs.

I see three ranges for Myanmar: the first seems to be there in the unicharset, part of the second, and none of the third.

Can you please check whether all of these are required?

http://www.alanwood.net/unicode/myanmar.html

http://www.alanwood.net/unicode/myanmar-extended-a.html

http://www.alanwood.net/unicode/myanmar-extended-b.html

@nengine

nengine commented Mar 31, 2017

There are 8 major ethnic groups in Myanmar, so I believe Extended-A and Extended-B were added for that reason. For completeness I think they should be added, but you would rarely see them on the web. The Unicode range 1000-104F is already good.
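
To check how much of a given corpus actually falls outside 1000-104F, a small counter like the one below could be run over the training text. It is a sketch: the function name and bucket labels are made up for illustration, and the block boundaries used are the published Unicode ranges (Myanmar U+1000 to U+109F, Extended-A U+AA60 to U+AA7F, Extended-B U+A9E0 to U+A9FF).

```python
from collections import Counter


def myanmar_block_counts(text):
    """Count code points per Myanmar sub-range; other characters are ignored."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        if 0x1000 <= cp <= 0x104F:
            counts["1000-104F"] += 1      # first half of the base block
        elif 0x1050 <= cp <= 0x109F:
            counts["1050-109F"] += 1      # second half of the base block
        elif 0xAA60 <= cp <= 0xAA7F:
            counts["extended-A"] += 1
        elif 0xA9E0 <= cp <= 0xA9FF:
            counts["extended-B"] += 1
    return counts
```

If the extended buckets come out at or near zero for the corpus, that supports training on the smaller range first.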

@Shreeshrii
Contributor

Shreeshrii commented Mar 31, 2017 via email

@nengine

nengine commented Mar 31, 2017 via email

@theraysmith
Contributor

theraysmith commented Apr 14, 2017 via email

@herzcthu
Author

I've checked characters in Myanmar.unicharset file. All characters seem correct.

@Shreeshrii
Contributor

Shreeshrii commented Jul 14, 2017

Please see tesseract-ocr/tesseract#995 (comment)

When I have committed the new corpus cleanup code, it would be useful to
have any experts in any of the following scripts review the code and make
comments:
Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada,
Malayalam, Sinhala, Thai, Myanmar, Khmer.
There are script-specific cleanup rules in there.
Since I plan to commit new copies of the training data (unicharsets,
wordlists, training text etc) then at that point they will match

For instance, there is a big table in the unicode standard for Myanmar, (
http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf) but it doesn't cover
any of the extension Myanmar characters, and isn't explicit about whether
the table represents a specific valid order or not. The existence of a lot
of legacy Myanmar text on the web that is designed for non-compliant fonts
doesn't help make it easier to determine whether the filter is correct.

@theraysmith
Contributor

theraysmith commented Jul 14, 2017 via email

@Shreeshrii
Contributor

@herzcthu @nengine @minthanthtoo

Please test with the new traineddata in tessdata/best directory and provide feedback.

@herzcthu
Author

herzcthu commented Aug 9, 2017

I'm testing the new traineddata. It has improved a lot. Almost 98% correct. I will test in more detail and provide detailed feedback later.

@nengine

nengine commented Aug 9, 2017

I would like to test it, but I am not sure how to do it. I have Windows 10 installed. Could you please point me to the documentation?

@Shreeshrii
Contributor

Shreeshrii commented Aug 10, 2017 via email

@herzcthu
Author

both

I've attached the first screenshot from my test.
The upper part is the image I tested and the lower part is the OCR output text.
Words between two adjacent points of the same color are missing or incorrect.
If you need a code point comparison between the source image and the output text, I can provide it later.
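
A code point comparison of this kind could be produced with Python's standard `difflib`. The sketch below assumes the reference transcription and the OCR output are both available as strings; the function names are hypothetical.

```python
import difflib


def codepoints(text):
    """Render each character as a U+XXXX label."""
    return [f"U+{ord(ch):04X}" for ch in text]


def codepoint_diff(expected, actual):
    """List (operation, expected_codepoints, actual_codepoints) for every difference."""
    a = codepoints(expected)
    b = codepoints(actual)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Comparing at the code point level rather than by eye helps here because visually identical Myanmar text can be encoded with different sequences.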

@Shreeshrii
Contributor

Shreeshrii commented Aug 10, 2017 via email

@kyawswa

kyawswa commented Oct 12, 2017

Hello, I would like to share what I found in Myanmar training data.

I used tesseract version 3.04.

I think it still needs a lot of improvement. First, I would like to share my test results. I tested with images in the Myanmar language: ocr_sample_1.png and ocr_sample_2.png.

The test result for the ocr_sample_1 image is below. I marked the differences with red points.
image_file
ocr_sample_1

Result
screenshot from ocr_sample_1

The result for the second image, ocr_sample_2, is below. Its result is completely wrong. The text means "how are you" in English.

Image_file
ocr_sample_2

Result
screenshot from ocr_sample_2

Then I downloaded the Myanmar langdata from GitHub (https://github.com/tesseract-ocr/langdata). I found 7 files. After checking those files, I found that most of the contents are incorrect or misspelled. I would like to show a few incorrect entries from one of those files, mya.training_text.
For example,
screenshot from 2017-10-12 21-01-05

#first arrow head line
It should be "ရုတ်ရုတ်သဲသဲ".

#Second arrow head line
It should be "သစ်တောများကုန်".

#third arrow head line
It should be "ပညာရေးစနစ်", and so on.

So I would like to contribute corrections for these 7 files. I would also like to ask the following questions.
- Is the exported mya.traineddata based on those files?
- How can I know which file is used for what? E.g., what is the mya.punc file?
- Where did you get the data?
- Is there any format or rule for putting data into those files?

Could you please explain those files to me?
I am also willing to help improve Myanmar language OCR.

Thank you for your contribution.

@Shreeshrii
Contributor

Shreeshrii commented Oct 12, 2017 via email

@kyawswa

kyawswa commented Oct 13, 2017

Yes, I used the UB Mannheim build with tesseract 4.0.0-alpha.20170804. I tested with the following image files. The test results are below.

ocr_sample_1.png
ocr_sample_1

Result

%%%%%
©05080×05
5082:40:82! 0=2405005$2³050

ocr_sample_2.png
ocr_sample_2

Result

ပဵနႚတ္ဂဵကာဧ်တ္အီးလာသီူး

Thanks.

@Shreeshrii
Contributor

Shreeshrii commented Oct 13, 2017 via email

@kyawswa

kyawswa commented Oct 13, 2017

Yes, it works perfectly with tessdata_best. But after checking the wordlist, I found many misspellings and incorrect entries. I have pointed out the most obvious misspellings. Please see the following attachment.

ss_21

Most of the data from the following link is not included in this tessdata_best wordlist.
https://github.com/kanaung/wordlists

I would like to know where you got that data.

How can I contribute to fixing the incorrect data?
Thanks.

@pndaza

pndaza commented Apr 2, 2018

The Myanmar traineddata 4.0 does not recognize the following characters:
၊(u104a)
။(u104b)
၌(u104c)
၍(u104d)
၎(u104e)
၏(u104f)

I unpacked the traineddata and checked the unicharset.
I did not find these characters in the unicharset.

Also, unicharset_extractor does not produce these characters.
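
The check described above can be scripted. The sketch below assumes the unicharset is a text file whose first line is the entry count and whose remaining lines each start with the symbol followed by a space and its properties, as in the entries quoted elsewhere in this thread. The function names are hypothetical.

```python
def unicharset_symbols(lines):
    """Collect the leading symbol of every entry line in a unicharset."""
    symbols = set()
    for line in lines[1:]:  # assumed: the first line holds the entry count
        line = line.rstrip("\n")
        if line.strip():
            symbols.add(line.split(" ", 1)[0])
    return symbols


def missing_codepoints(lines, first=0x104A, last=0x104F):
    """Return U+XXXX labels in [first, last] that occur in no unicharset symbol."""
    covered = {ch for symbol in unicharset_symbols(lines) for ch in symbol}
    return [
        f"U+{cp:04X}"
        for cp in range(first, last + 1)
        if chr(cp) not in covered
    ]
```

With a file on disk, the lines can be read with `open(path, encoding="utf-8").read().splitlines()` and passed to `missing_codepoints`; the default range covers the six characters listed in this comment.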

@kpfoley

kpfoley commented Dec 12, 2018

I'm just reading through this thread and have a few pointers, some maybe a little repetitive - hope this helps!

  1. Unicode range: just about everything in contemporary use is in the 1000-104F range. The second half of the base range is for special characters needed for writing a number of other languages spoken in Myanmar but also for some Pali words that might be needed, for example, for religious and historical texts. I'm not sure whether users in Myanmar would rather see a model that supports the full range or a more compact model trained on the labels in the first half of the range, which are MUCH more frequently used. In any case training on the 1000-104F range will probably cover > 99.9% of the data (just a guess).

  2. Word lists - the best word list available to researchers AFAIK is a 133k word list maintained by the Myanmar Language Commission (MLC). This is basically just a list of all the words from the large official dictionary of the language. I don't think the word list is publicly available, and it doesn't have any proper nouns, but it would be a good resource if it could be made available for this project. The kanaung word list linked above by @herzcthu is much better than the mya wordlist and I think it is based on an old software release of a "correct spelling" word list and has since incorporated words and place names from other sources like the postal system. I think there are a good deal of non-words and very rare words in that list, though, which maybe limits its usefulness.

  3. Words and spacing - as mentioned above, Burmese / Myanmar doesn't use spaces between every word in its writing system. Spaces are thrown in as needed for typography and usually between multi-word phrases. So the best options for a language model are probably either a character-based language model or a model that incorporates word segmentation (prediction of spaces) into the pipeline. I'm not sure if either of these are possible with Tesseract 4.0. As somebody already mentioned above, the lack of spacing is probably why there are >500,000 words in the mya.wordlist file. As of right now I think the state of the art in Burmese word segmentation (breaking non-spaced continuous text into words) is around 99 percent accuracy -- it's not perfect but it's pretty good.

  4. Syllable segmentation, such as Ye Kyaw Thu's script here https://github.com/ye-kyaw-thu/myPOS/tree/master/corpus-draft-ver-1.0, might also be simpler and more effective than a character-level language model in the absence of word segmentation. A few lines of regex can capture the boundaries between syllables, in which case the language model approach can be similar to what you might use for Chinese (sequences of syllables instead of sequences of words).

  5. Unicode and Zawgyi - the choice between Unicode and Zawgyi is controversial in Myanmar, and Zawgyi is much more popular, with Unicode maybe making some inroads. The problem with Zawgyi, in addition to being non-standard, is that it takes up random spaces in the shared Myanmar unicode range, including spaces for other languages spoken in Myanmar, so it breaks not just Burmese but also the entire extended Myanmar unicode range. It's also frustrating because it used to be difficult to cleanly convert from Zawgyi to Unicode and back, and it still doesn't work perfectly because Zawgyi hides many typing errors by superimposing the same typed letter without advancing the cursor. For this reason there is a lot of corrupted Burmese language text data on the web that may have been converted to unicode at some point (or could be converted) but it's difficult to catch all of the errors left over from the original typed Zawgyi input. Wikipedia has a good explainer here on the two encoding standards: https://my.wikipedia.org/wiki/Wikipedia:Font#Why_not_Zawgyi? The main thing to watch out for here is not to accidentally feed Zawgyi text into the training data, because with so many overlapping codepoints it could wreck the accuracy of the model.
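
To illustrate the "few lines of regex" idea from point 4, here is a deliberately simplified sketch. It is not the linked script: it only breaks before a consonant (U+1000 to U+1021) that is not stacked under a preceding virama (U+1039) and is not itself followed by asat (U+103A) or virama. Independent vowels, digits, punctuation and non-Myanmar text are not treated as boundaries, so they stay attached to the neighbouring syllable. The function name is hypothetical, and the code assumes standard Unicode input, not Zawgyi.

```python
import re

# Zero-width boundary: the next character is a consonant that starts a new
# syllable, i.e. it is not a stacked (subjoined) consonant and not a final
# consonant carrying asat or virama.
_SYLLABLE_BOUNDARY = re.compile(
    "(?<!\u1039)(?=[\u1000-\u1021](?![\u103a\u1039]))"
)


def split_syllables(text):
    """Split standard-Unicode Myanmar text into rough orthographic syllables."""
    # Splitting on an empty match needs Python 3.7+; drop the empty leading piece.
    return [piece for piece in _SYLLABLE_BOUNDARY.split(text) if piece]


if __name__ == "__main__":
    print(split_syllables("မြန်မာ"))
```

The resulting syllable sequences could then feed a syllable-level language model or speed up dictionary lookup, as suggested earlier in the thread.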

@Shreeshrii
Contributor

Thank you for the detailed notes. Please review the source training data in langdata_lstm repo also.

@Shreeshrii
Contributor

Please test the traineddata at https://github.com/Shreeshrii/tessdata_shreetest/blob/master/mya430000.traineddata

and let me know whether it is an improvement over the existing traineddata files.

@herzcthu
Author

herzcthu commented Mar 4, 2019

Hi Shreeshrii,
I've tested your traineddata. It is a small improvement over the existing traineddata in tesseract 4.0 beta.
In particular, it detects punctuation and non-Burmese characters better.

BTW, I'm trying to train it myself. I'm generating lots of box and tif files for only one font. Is it a good idea to have many files for a single font? Or should I make only one box file and one tif file?
Currently I have more than 1000 files.

thanks and regards,
Sithu

@Shreeshrii
Contributor

The amount of training data you need depends on the type of training that you are planning to do. eg. from scratch, replace a layer, plus minus, etc.

I think multiple files for single font may be ok. How are you generating these files?

You should try to keep approximately the same number of lines in each file so that all samples are used in a uniform way for training.
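
One way to keep the line count per file roughly uniform is to chunk the corpus before rendering it. The sketch below is only an illustration; the function name and the chunk size are arbitrary, and the last chunk may be shorter than the rest.

```python
def chunk_lines(lines, lines_per_file):
    """Split a list of text lines into consecutive chunks of equal size."""
    if lines_per_file < 1:
        raise ValueError("lines_per_file must be at least 1")
    return [
        lines[start:start + lines_per_file]
        for start in range(0, len(lines), lines_per_file)
    ]
```

Each chunk can then be written to its own text file and rendered separately, so every generated file contributes about the same number of samples.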

@Shreeshrii
Contributor

I had used
'Myanmar Khyay'
'Myanmar Sans Pro'
'Myanmar Text'
'Noto Sans Myanmar'

Which one is a more representative font out of these for training and testing?

@herzcthu
Author

herzcthu commented Mar 5, 2019

I took one paragraph from Wikipedia and made screenshots with all the fonts you mentioned.
Noto Sans Myanmar gave the best result.
There is a new font that will be used in official documents. It is called Pyidaungsu.
You can download it here: https://www.unicode.today/fonts-download/

I'm creating box and tif files using the text2image binary from the training tools.
I collected 1 million Unicode text lines from Wikipedia and 3 well-known news websites, and I am creating box files from that content.

@Shreeshrii
Contributor

@herzcthu Thanks for the info about the new font.

If you do 'replace layer' type of training, you can get by with fewer lines.

Keep posting about your progress with training.

@herzcthu
Author

herzcthu commented Mar 8, 2019

I'm stuck at the unicharset extractor.
I get a unicharset file, but when I open it, I see some unusual combinations of characters that cannot exist in Burmese script. I wonder if this kind of junk can affect training.
I've attached the unicharset file I got: output_unicharset.txt

Here are some samples that are not usual:

တ္မြ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 514 0 514 တ္မြ	# တ္မြ [1010 1039 1019 103c ]x
တ္က်ေ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 515 0 515 တ္က်ေ	# တ္က်ေ [1010 1039 1000 103a 1031 ]x
ဥ္တြ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 516 0 516 ဥ္တြ	# ဥ္တြ [1025 1039 1010 103c ]x
င္လေ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 517 0 517 င္လေ	# င္လေ [1004 1039 101c 1031 ]x
ည္ဖြ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 518 0 518 ည္ဖြ	# ည္ဖြ [100a 1039 1016 103c ]x

@Shreeshrii
Contributor

Use NORM_MODE="3" with unicharset extractor command.

@herzcthu
Author

I tried norm_mode 2 and 3. Both have missing vowels and medials.

@n-92

n-92 commented Apr 7, 2020

Hi,

Any further progress?

@GmGniap

GmGniap commented Sep 6, 2020

Please mention me or let me know if you need help checking or fixing Burmese datasets; I would gladly be part of it. I have some experience in Python and TypeScript. Cheers to everyone helping to improve the Myanmar language in machines!

@tesseract-ocr tesseract-ocr deleted a comment from bykovman Apr 26, 2021
@glxwine

glxwine commented Jun 19, 2024

I am new to Tesseract. Recently I tried the Myanmar language. It is still not perfect. I searched for how the data set was trained and found this thread. However, it seems to be very old with no recent updates. I am not familiar with how to train the data sets, but I know the language. Is there anything we can do to improve the Myanmar language? I also wish to understand how the training is done.

@stweil
Member

stweil commented Jun 21, 2024

The training requires training data = lots of line images (*.png) with corresponding transcription (*.gt.txt). The original training used generated (artificial) line images, but meanwhile newer trainings for other scripts are often based on real line images from scanned books or newspapers. It's also possible to use a mix of artificial and real line images. You need as many lines as possible, and the text must cover all relevant glyphs (characters).

With enough lines for training, you can use tesstrain for the training.

Examples of training data for Latin script: https://code.bib.uni-mannheim.de/ocr-d/GT4HistOCR/src/branch/master/dta19/1827-heine_lieder.

Examples of training steps: https://github.com/UB-Mannheim/tesstrain/wiki/.

Make sure to document your training process and to publish your training data if you want to submit the result for the inclusion in the tesseract-ocr repositories.
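
Before starting a training run, it may help to verify that every line image has a transcription and to see which characters the ground truth covers. The sketch below assumes the `*.png` / `*.gt.txt` naming described above, with both files of a pair in the same directory; the function name is hypothetical.

```python
from pathlib import Path


def check_ground_truth(directory):
    """Return (images_without_text, texts_without_image, set_of_characters)."""
    root = Path(directory)
    # Strip the suffixes so "line1.png" and "line1.gt.txt" share the key "line1".
    images = {p.name[: -len(".png")] for p in root.glob("*.png")}
    texts = {p.name[: -len(".gt.txt")]: p for p in root.glob("*.gt.txt")}
    missing_text = sorted(images - set(texts))
    missing_image = sorted(set(texts) - images)
    charset = set()
    for path in texts.values():
        charset.update(path.read_text(encoding="utf-8").strip())
    return missing_text, missing_image, charset
```

Comparing the returned character set against the list of glyphs the model should support shows quickly whether the text "covers all relevant glyphs", for example the punctuation marks U+104A to U+104F reported missing earlier in this thread.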

@glxwine

glxwine commented Jun 24, 2024 via email
