Hiraganas does not take context into account (like numbers) #12
Comments
I see...
I agree with you, handling special cases one by one might be hard. I'm doing it right now for numbers and time because I need it for a web app that helps me learn Japanese, and I'll see whether I continue this way or not.
Sure, if it goes well in your project please let me know and send me a pull request if possible. Handling cases one by one is better than handling nothing, thanks for the advice.
I think you could use JMDICT: go through each entry in that dictionary and check whether the reading generated by NMeCab matches the one in the dictionary. If it does not, add the entry to a list that you can use to identify the special cases. It would be faster than doing it manually one by one.
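The comparison suggested above could look roughly like the sketch below. It is written in Python for brevity, not against the Kawazu or NMeCab APIs: `generate_reading` is a hypothetical callable standing in for the analyzer, and the two sample entries are hand-written stand-ins for dictionary data. It assumes the analyzer may return katakana, so readings are normalized to hiragana before comparing, and it tests membership because a dictionary entry can list several valid readings.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def to_hiragana(text: str) -> str:
    """Map katakana (U+30A1..U+30F6) to hiragana; leave other characters alone."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch for ch in text
    )


def find_reading_mismatches(
    entries: Iterable[Tuple[str, List[str]]],
    generate_reading: Callable[[str], str],
) -> Dict[str, Dict[str, object]]:
    """Collect entries whose generated reading is not among the expected ones."""
    mismatches: Dict[str, Dict[str, object]] = {}
    for surface, expected_readings in entries:
        generated = to_hiragana(generate_reading(surface))
        if generated not in expected_readings:
            mismatches[surface] = {
                "generated": generated,
                "expected": expected_readings,
            }
    return mismatches


# Hand-written sample standing in for dictionary entries (assumption).
sample_entries = [
    ("三百", ["さんびゃく"]),
    ("日本語", ["にほんご"]),
]

# Stand-in for the analyzer: a fixed table, with katakana output assumed.
fake_analyzer_output = {"三百": "サンヒャク", "日本語": "ニホンゴ"}

mismatches = find_reading_mismatches(
    sample_entries, fake_analyzer_output.__getitem__
)
```

Only `三百` ends up in `mismatches` here, since the stand-in reading for `日本語` matches after normalization.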
Well, I downloaded the Wacton Desu NuGet package (a .NET port of JMDICT) to run some tests, and I just realized that the issue seems to appear only with kanji that represent numbers. In a regular sentence, words are correctly separated from each other. For example: 日本語を勉強します --> [日本語] [を] [勉強] [し] [ます]
Yeah, I think just fixing numbers would be enough.
Hmm, so maybe the root cause is the way NMeCab parses sentences? Maybe there is a way to make it group numbers and counters correctly. I'll investigate in that direction.
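One possible direction for that grouping is a post-processing pass over the token list that joins consecutive numeral tokens before readings are assigned. The sketch below is in Python and is not NMeCab code: it assumes the tokenizer returns plain strings and splits a number such as 三百 into 三 and 百, and the token list in the usage line is made up for illustration.

```python
from typing import List

# Kanji treated as numerals for this sketch (assumption: not an exhaustive set).
KANJI_NUMERALS = set("〇一二三四五六七八九十百千万億兆")


def merge_numeral_tokens(tokens: List[str]) -> List[str]:
    """Join runs of adjacent numeral-only tokens into a single token."""
    merged: List[str] = []
    prev_was_numeral = False
    for token in tokens:
        is_numeral = bool(token) and all(ch in KANJI_NUMERALS for ch in token)
        if is_numeral and prev_was_numeral:
            merged[-1] += token  # extend the current run of numerals
        else:
            merged.append(token)
        prev_was_numeral = is_numeral
    return merged


# Hypothetical tokenizer output for 三百円を払う.
grouped = merge_numeral_tokens(["三", "百", "円", "を", "払う"])
```

Once 三百 is a single token, it can be looked up in a table of irregular readings instead of being read one character at a time.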
I think the actual problem comes from IpaDic, the dictionary NMeCab uses to parse sentences.
I think I found a solution to deal with that issue. Maybe not the best one, but it at least seems to work.
I don't think integrating the Wacton library is a good idea because it uses a lot of RAM (about 460MB for the Japanese entries). It's better to run this test in a separate project, collect all the cases where the reading is wrong, save them to a JSON or XML file, and then use that file to check for wrong readings in Kawazu.
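A minimal sketch of that offline workflow, in Python instead of C# and with hypothetical names throughout (`reading_overrides.json`, `save_overrides`, `load_overrides`, `corrected_reading` are not part of Kawazu): the heavy dictionary comparison runs once elsewhere, and only the small override file is loaded at conversion time.

```python
import json
import os
import tempfile
from typing import Dict


def save_overrides(path: str, overrides: Dict[str, str]) -> None:
    """Write the surface -> correct reading map as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(overrides, handle, ensure_ascii=False, indent=2)


def load_overrides(path: str) -> Dict[str, str]:
    """Read the override map back from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def corrected_reading(surface: str, generated: str, overrides: Dict[str, str]) -> str:
    """Prefer the stored reading when one exists, else keep the generated one."""
    return overrides.get(surface, generated)


# Offline step: store the known-bad case from this issue (temporary path for the demo).
overrides_path = os.path.join(tempfile.mkdtemp(), "reading_overrides.json")
save_overrides(overrides_path, {"三百": "さんびゃく"})

# Conversion step: load the small file and apply it.
overrides = load_overrides(overrides_path)
fixed = corrected_reading("三百", "さんひゃく", overrides)
```

The trade-off is that the override file has to be regenerated whenever the analyzer dictionary changes.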
I tried running the test to collect all the cases, but I ended up with 58109 wrong readings, which seems abnormally large to me.
Hi,
Not sure if this is a Kawazu or LibNMeCab issue, but the conversion of kanji ignores irregular readings.
For example, converting 三百 (300) outputs さんひゃく (sanhyaku), but the correct reading is さんびゃく (sanbyaku).
The same applies to 600 (ろっぴゃく, roppyaku) and 800 (はっぴゃく, happyaku).
Currently working on a workaround on my fork: lasyan3@156bf7e
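For illustration, a number-specific workaround can be as small as a lookup table consulted before the regular reading is built. The sketch below is in Python and is not the code from the fork mentioned above. It only handles 百 and two-character hundreds, and it relies on the fact that in standard Japanese the hundreds with a sound change are 三百, 六百, and 八百. The function and table names are hypothetical.

```python
# Regular digit readings as used before 百 (一百 is not used, so 一 is omitted).
DIGIT_READINGS = {
    "二": "に", "三": "さん", "四": "よん", "五": "ご",
    "六": "ろく", "七": "なな", "八": "はち", "九": "きゅう",
}

# Hundreds whose reading changes through sound shifts.
IRREGULAR_HUNDREDS = {
    "三百": "さんびゃく",
    "六百": "ろっぴゃく",
    "八百": "はっぴゃく",
}


def read_hundreds(kanji: str) -> str:
    """Return the hiragana reading of 百 or of a two-character hundred."""
    if kanji == "百":
        return "ひゃく"
    if kanji in IRREGULAR_HUNDREDS:
        return IRREGULAR_HUNDREDS[kanji]
    if len(kanji) != 2 or kanji[1] != "百" or kanji[0] not in DIGIT_READINGS:
        raise ValueError(f"not a supported hundred: {kanji!r}")
    return DIGIT_READINGS[kanji[0]] + "ひゃく"


reading_300 = read_hundreds("三百")
reading_200 = read_hundreds("二百")
```

The same table-first pattern extends to other irregular number readings, such as thousands and counters, at the cost of maintaining each table by hand.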