cherokee resources #44

Shreeshrii · 2017-02-04T13:12:50Z

In response to tesseract-ocr/tesseract#654 (comment),

An Crúbadán, edited by Kevin Scannell, is licensed under a Creative Commons Attribution 4.0 International License.

The zip files linked from the language page below contain word lists as well as the lists of URLs the text was collected from. The project builds corpora for languages with small numbers of speakers and/or limited computational resources, using the vast quantities of text freely available on the web.

chr - Cherokee - http://crubadan.org/languages/chr
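
For anyone who wants to turn a downloaded zip into a clean Cherokee word list, here is a minimal sketch. It is not taken from the Crúbadán or tesseract tooling. The function names, the `.txt` member suffix, and the line layout (a word first, optionally followed by whitespace-separated columns such as a count) are all assumptions, so check them against the actual archive contents. The only fixed facts used are the Unicode ranges for the Cherokee block (U+13A0 to U+13FF) and Cherokee Supplement (U+AB70 to U+ABBF).

```python
import zipfile

# Unicode ranges: Cherokee (U+13A0-U+13FF) and Cherokee Supplement (U+AB70-U+ABBF).
CHEROKEE_RANGES = ((0x13A0, 0x13FF), (0xAB70, 0xABBF))


def is_cherokee_word(token):
    """Return True if token is non-empty and every character lies in a Cherokee block."""
    return bool(token) and all(
        any(lo <= ord(ch) <= hi for lo, hi in CHEROKEE_RANGES) for ch in token
    )


def cherokee_words_from_zip(zip_source, member_suffix=".txt"):
    """Collect Cherokee-only first tokens from the text members of a zip archive.

    Assumptions (hypothetical, not confirmed by the source): word lists are
    UTF-8 members whose names end in `member_suffix`, and each line starts
    with a word, optionally followed by whitespace-separated extra columns.
    """
    words = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.endswith(member_suffix):
                continue  # skip members that do not look like word lists
            text = archive.read(name).decode("utf-8", errors="replace")
            for line in text.splitlines():
                parts = line.split()
                if parts and is_cherokee_word(parts[0]):
                    words.append(parts[0])
    return words
```

`zip_source` can be a path to the downloaded archive or any file-like object. Tokens that mix Cherokee with Latin letters, digits, or punctuation are dropped entirely, which may be too strict for some uses.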

Cherokee Unicode Fonts

http://www.cherokee.org/AboutTheNation/Language/CherokeeFont.aspx
http://www.languagegeek.com/font/fontdownload.html

Shreeshrii · 2017-04-01T02:04:25Z

tesseract-ocr/tesseract#654 (comment)

@theraysmith commented 2 days ago
Update: after going back to the www to get fresh data, I believe that my corpus text is now good for:
chr
dzo
iku
snd
syr
tgk
tir
I have put a lot of time into cleaners/filters for languages that use 'virama' characters.
I am not convinced that they are perfect, but I will add the code to the github repo in due course, so experts/native speakers can offer suggestions/fixes to make them better.

Shreeshrii closed this as completed Apr 1, 2017
