cherokee resources #44

Closed
Shreeshrii opened this issue Feb 4, 2017 · 1 comment

Comments

@Shreeshrii
Contributor

In response to tesseract-ocr/tesseract#654 (comment),

@theraysmith

An Crúbadán, edited by Kevin Scannell, is licensed under a Creative Commons Attribution 4.0 International License.

The zip files linked from the project's language pages contain word lists as well as the lists of URLs the text was gathered from. The project draws on the vast quantities of text freely available on the web to build corpora for languages with small numbers of speakers and/or limited computational resources.

chr - Cherokee - http://crubadan.org/languages/chr
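
If the word lists are to be used as training text, it may help to drop entries that are not written in the Cherokee script. Below is a minimal sketch of such a filter. It assumes a plain-text list with one entry per line, the word first and an optional count after whitespace; that layout is an assumption, not something confirmed from the zip files, and `filter_wordlist` is a hypothetical helper name. The code point ranges are the Unicode Cherokee block and the Cherokee Supplement block.

```python
import unicodedata

# Unicode Cherokee block (U+13A0-U+13FF) and Cherokee Supplement (U+AB70-U+ABBF).
CHEROKEE_RANGES = ((0x13A0, 0x13FF), (0xAB70, 0xABBF))


def is_cherokee_char(ch):
    """Return True if the character falls inside one of the Cherokee blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CHEROKEE_RANGES)


def is_cherokee_word(word):
    """Return True if the word is non-empty and made up only of Cherokee characters."""
    word = unicodedata.normalize("NFC", word.strip())
    return bool(word) and all(is_cherokee_char(ch) for ch in word)


def filter_wordlist(lines):
    """Keep the first field of each line when it is a Cherokee-only word.

    Assumed (hypothetical) layout: "<word> [<count>]", one entry per line.
    """
    kept = []
    for line in lines:
        fields = line.split()
        if fields and is_cherokee_word(fields[0]):
            kept.append(fields[0])
    return kept
```

A strict filter like this also rejects words containing punctuation or Latin letters, which may or may not be desirable depending on how the corpus is going to be used.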

Cherokee Unicode Fonts

http://www.cherokee.org/AboutTheNation/Language/CherokeeFont.aspx
http://www.languagegeek.com/font/fontdownload.html

@Shreeshrii
Contributor Author

tesseract-ocr/tesseract#654 (comment)

@theraysmith commented 2 days ago
Update: after going back to the www to get fresh data, I believe that my corpus text is now good for:
chr
dzo
iku
snd
syr
tgk
tir
I have put a lot of time into cleaners/filters for languages that use 'virama' characters.
I am not convinced that they are perfect, but I will add the code to the github repo in due course, so experts/native speakers can offer suggestions/fixes to make them better.
