An Crúbadán edited by Scannell, Kevin
is licensed under a Creative Commons Attribution 4.0 International License .
The zip files linked from the above pages contain word lists as well as lists of source URLs. The data was gathered from large quantities of text freely available on the web, for building corpora for languages with small numbers of speakers and/or limited computational resources.
@theraysmith commented 2 days ago
Update: after going back to the web to get fresh data, I believe that my corpus text is now good for:
chr
dzo
iku
snd
syr
tgk
tir
I have put a lot of time into cleaners/filters for languages that use 'virama' characters.
I am not convinced that they are perfect, but I will add the code to the GitHub repo in due course, so that experts and native speakers can suggest fixes to make them better.
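For readers unfamiliar with what a virama filter might look like, here is a minimal sketch in Python. It is not the cleaner code mentioned above, which has not been published in this thread. The function names (`is_virama`, `has_suspicious_virama`, `clean_lines`) are hypothetical, and the rule itself is an assumption: a virama (any character with Unicode canonical combining class 9) is treated as suspicious unless it directly follows a letter or a nukta. Real scripts have more exceptions than this, so treat it as a starting point for discussion only.

```python
import unicodedata

VIRAMA_CCC = 9  # Unicode canonical combining class "Virama"
NUKTA_CCC = 7   # Unicode canonical combining class "Nukta"


def is_virama(ch):
    """Return True if ch has the Virama combining class."""
    return unicodedata.combining(ch) == VIRAMA_CCC


def has_suspicious_virama(line):
    """Assumed rule: a virama must directly follow a letter or a nukta.

    Flags a virama at the start of the line, after another virama,
    after a vowel sign, or after punctuation/whitespace.
    """
    prev = None
    for ch in line:
        if is_virama(ch):
            if prev is None:
                return True
            follows_letter = unicodedata.category(prev).startswith("L")
            follows_nukta = unicodedata.combining(prev) == NUKTA_CCC
            if not (follows_letter or follows_nukta):
                return True
        prev = ch
    return False


def clean_lines(lines):
    """Keep only the lines that pass the virama check."""
    return [line for line in lines if not has_suspicious_virama(line)]
```

Under this rule, a well-formed conjunct such as `"\u0915\u094D\u0937"` (KA, VIRAMA, SSA) is kept, while a line that starts with a virama or contains two in a row is dropped.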
In response to tesseract-ocr/tesseract#654 (comment),
@theraysmith
chr - Cherokee - http://crubadan.org/languages/chr
Cherokee Unicode Fonts
http://www.cherokee.org/AboutTheNation/Language/CherokeeFont.aspx
http://www.languagegeek.com/font/fontdownload.html