List of languages in development #91
For Group 4 you could add Ukrainian, Bulgarian, and maybe Mongolian; although it is not Slavic, it uses Cyrillic script.
Do you plan to only work with human languages? It would be amazing to add a model to recognize mathematical formulas. |
I guess Tamil and Telugu can be added to one group because they belong to a language family called 'Dravidian', meaning they are related in grammar and word order. Two other languages popular in India that belong to that family, Kannada and Malayalam, can also be added to that group (for further info: https://en.m.wikipedia.org/wiki/Dravidian_languages). Moreover, Telugu and Kannada share some letters and words. I will add the alphabet and words of Kannada for the language request.
For Group 4 |
I'd highly recommend supporting the Devanagari script (Wiki: https://en.wikipedia.org/wiki/Devanagari), which is the fourth most widely adopted writing system in the world. Please go through the Wikipedia link to understand its widespread use across many languages, including Sanskrit, Hindi, Marathi, Awadhi, and Haryanvi. I see you have included "Hindi" as a target language, which of course is the most spoken language in the Indian subcontinent. If you could let me know the current word count you have (maybe share the "dict" and "alphabets" directories), I can continue the research and share more details about the language, as it's my first language. Hindi has 47 primary letters (14 vowels and 33 consonants). You can contact me at prakash.upadhyay93@yahoo.com
Can I help with the Persian (Farsi) language? I can supply some popular words and characters.
Can I contribute in any way? I am fluent in Hindi as well as English. I may also be of help with programming: I know Python, C, and Java, and I am good at front-end work with HTML, CSS, and (basic) JavaScript.
I recommend adding Punjabi language which is the 10th most spoken language around the world. |
@edloginova After covering human languages, we can explore math as well. @upadhyayprakash The lists are here: easyocr/character and easyocr/dict. @arashjafari It looks like we already have both words and characters; you can recheck whether everything is all right. @junaidgirkar Sounds good, I'll keep that in mind and may call on you for help.
@rkcosmos For Group 1, could you please add Urdu to that group? Urdu is very similar to Arabic and Persian and I've just submitted the PR for the character list and a dictionary. So it should be ready to go! cc: @rahilwazir |
This might help for Arabic: |
I added the Marathi character and dictionary data set files; please train on them.
@sardasumit did you forget a link for mr_char.txt? |
@rkcosmos It is the same as the Hindi character list.
@rkcosmos |
Hi! Thanks for your work. Some notes about Hebrew: some letters have a final form (the letter changes shape when it is placed at the end of a word), see https://en.wikipedia.org/wiki/Final_form. There are also diacritical signs (https://en.wikipedia.org/wiki/Niqqud) that are used to represent vowels or to distinguish between alternative pronunciations of letters. (Arabic also has final forms, among others, and diacritical signs.) I didn't provide the diacritical signs; I assume it's better to train without them first, since ordinary writing consists of plain letters without diacritics.
I remembered another important thing. Hebrew has a cursive form (https://en.wikipedia.org/wiki/Cursive_Hebrew), and people sometimes mix it with regular writing, even in printed matter. It uses the same letters (chars), but you could say it's another font (e.g. https://opensiddur.org/wp-content/uploads/fonts/display-font-charmap.php?fnt=DorianCLM-Italic). Maybe it's also better not to implement this immediately; I don't know.
@nishad Malayalam and Tamil are both Dravidian but do not use the same script, so I have to build 2 models.
Question for Indian: I'm looking into Hindi char and dict, there are a lot of chars seen in word list but not in char list. Examples are |
@rkcosmos Those are parts of the existing alphabet; when combined, they create a new letter. I think the technical term is grapheme, but I'm not sure. I would like to know whether they render fine or whether something goes wrong like it did with Tamil.
@Vijayabhaskar96 So far, Devanagari doesn't have any problem. They support unicode well. |
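The mismatch discussed above (characters that appear in the word list but not in the character list) can be checked mechanically. The following is a minimal sketch, not part of EasyOCR: the function names `missing_chars` and `describe` are hypothetical, and it assumes the word list and character list have already been read into Python strings. In Devanagari, the missing characters are often combining vowel signs and marks (Unicode categories `Mc` and `Mn`), which is consistent with the "combined letters" explanation.

```python
import unicodedata


def missing_chars(words, char_list):
    """Return the sorted characters used in `words` but absent from `char_list`."""
    known = set(char_list)
    seen = set("".join(words))
    # Whitespace is not a character the recognizer needs to learn.
    return sorted(ch for ch in seen - known if not ch.isspace())


def describe(ch):
    """Code point, Unicode general category, and name of one character."""
    name = unicodedata.name(ch, "UNKNOWN")
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}"


if __name__ == "__main__":
    # Hypothetical inputs: one Hindi word and a character list with base letters only.
    words = ["\u0939\u093f\u0902\u0926\u0940"]  # the word "Hindi" in Devanagari
    char_list = "\u0939\u0926"                  # HA and DA, no vowel signs
    for ch in missing_chars(words, char_list):
        print(describe(ch))
```

Characters reported with category `Mn` or `Mc` are combining marks that attach to a base letter, so they need to be in the character list even though they never appear on their own.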
Another addition about Hebrew, and it's important: some diacritic signs matter, such as geresh and gershayim. Geresh is used with ג ז צ for sounds (j, g, ch) that are not represented in the alphabet, and double geresh (gershayim) is used in widely used abbreviations (kitsur); the most famous is תנ"ך (Tanakh). Sometimes people type ordinary quotation marks or apostrophes instead of geresh or gershayim (e.g. תנ''ך).
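As an illustration of the point above, ASCII quotes typed in place of geresh (U+05F3) and gershayim (U+05F4) could be normalized before building a dictionary. This is only a sketch under stated assumptions: `normalize_geresh` is a hypothetical helper, not an EasyOCR function, and the rule "an apostrophe directly after a Hebrew letter is a geresh" is a simplification that would also catch a closing single quote in real text.

```python
import re

GERESH = "\u05F3"      # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05F4"   # HEBREW PUNCTUATION GERSHAYIM
HEBREW = "\u05D0-\u05EA"  # alef through tav, including final forms

# A double quote or two apostrophes between two Hebrew letters.
_GERSHAYIM_RE = re.compile("(?<=[" + HEBREW + "])(?:''|\")(?=[" + HEBREW + "])")
# A single apostrophe directly after a Hebrew letter (assumption: it is a geresh).
_GERESH_RE = re.compile("(?<=[" + HEBREW + "])'")


def normalize_geresh(text):
    """Replace ASCII stand-ins with the dedicated Hebrew punctuation marks."""
    text = _GERSHAYIM_RE.sub(GERSHAYIM, text)
    return _GERESH_RE.sub(GERESH, text)


if __name__ == "__main__":
    # Tanakh written with an ASCII double quote (tav, nun, quote, final kaf).
    print(normalize_geresh("\u05EA\u05E0\"\u05DA"))
```

Whether to normalize toward the Hebrew marks or toward the ASCII characters is a design choice; either way, the character list and the dictionary should agree on one form.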
@tsaidevin yes. |
Are the model training scripts not available? If somebody wants to train on a new language, how can they contribute to improving the model?
Hi @rkcosmos , really impressive work. |
Can you add Dzongkha? Dzongkha is the national language of Bhutan and is similar to Tibetan. Like Thai, it is written continuously from left to right with no whitespace between words. The following paper discusses next-syllable prediction for Dzongkha: https://doi.org/10.1016/j.jksuci.2021.01.001
Hi @rkcosmos, Thanks in advance. |
Hi @rkcosmos , Can you share the Japanese dataset you used to train? Thanks a lot! |
Hi @rkcosmos. Thank you very much for the efforts you are taking. Is there a plan to include the Indian languages - Gujarati and Oriya ? |
Hi @rkcosmos, do you know if Hebrew will be ready soon? Thanks a lot!
Hello, @rkcosmos thank you for your great job. |
Hi @rkcosmos, thanks so much for all the work you've put in. I've included a PR for the Amharic language, which is spoken by over 60 million people. One potential issue is that Amharic words take a number of prefixes and suffixes to indicate the object, number of items, tense, gender, negation, and so on. Thus, a single verb may inflect in a number of ways that are not all included in the dictionary.
@rkcosmos Question: Why is the Chinese dict in pinyin rather than Chinese characters? In the dict folder, I cannot find a Chinese dict that is not pinyin. How is this mapping achieved? If I want to add some words to the Chinese dict, how do I add training data and dict entries?
@rkcosmos Has the Greek language been updated? I saw someone contributing Greek in the comments.
Does EasyOCR support the Sinhala language?
Hey, thank you for this Repo 🙏 |
I want to use the Farsi language, but I see it is not fine-tuned on the Farsi 5, only on the Arabic 5.
For Group 3 (Devanagari) |
Hey, I saw that the issue about Greek language support is completed, and I can see the two required .txt documents for Greek. However, I cannot find the Greek language code ('gre' in the repo) in the list of supported languages on the website. Is Greek actually supported?
Is the Urdu language not supported by the EasyOCR model?
There is a mistake in the name of Group 1. It should be Persian script. If you search, you will see that Persian is the mother language of the others, and the rest (Arabic, Urdu, and Uyghur) were taken from it.
Please let me know if Amharic or Tigrinya can be added, thanks! @AinazRafiei |
@nmermigas looking for Greek as well. Could you find a way to "train" EasyOCR for it? Or is it something that the developer team must train? |
I am also looking for Greek. |
Gujarati language please |
Any update on whether EasyOCR will be adding support for Hebrew, and if so, when? Thank you. |
I will update/edit this issue to track the development process of new languages. The current list is:
Group 1 (Arabic script)
Group 2 (Latin script)
Group 3 (Devanagari)
Group 4 (Cyrillic script)
Group 5
Group 6 (Languages that don't share characters with others)
Group 7 (Improvement and possible extra models)
Guideline for new language request
To request support for a new language, I need you to send a PR with the following 2 files
If your language has unique elements (such as 1. Arabic: characters change form when attached to each other, and text is written from right to left; 2. Thai: some characters sit above the line and some below), please educate me to the best of your ability and/or give useful links. It is important to take care of the details to achieve a system that really works.
Lastly, please understand that my priority has to go to popular languages or sets of languages that share most of their characters (also tell me if your language shares a lot of characters with others). It takes me at least a week to work on a new model, so you may have to wait a while for a new model to be released.