Breaking down infant speech #22

Open
vietqhoang opened this issue Feb 10, 2015 · 1 comment

Comments

@vietqhoang
Contributor

Not sure if this is within the scope of ve but here it goes...

Using:

  • Ruby 2.1.5
  • ve 0.0.3

Case 1

Actual:

string = 'しょれでびしょびしょになったー'

words  = Ve.in(:ja).words(string).map(&:word)
 => ["し", "ょれでびしょびしょになった", "ー"] 

Expected:

 => ["しょれで", "びしょびしょ", "になったー"]

Case 2

Actual:

string = 'じゃさしいもん'

words  = Ve.in(:ja).words(string).map(&:word)
 => ["じゃ", "さしい", "もん"] 

Expected:

 => ["じゃさしい", "もん"] 
@Kimtaro
Owner

Kimtaro commented Feb 13, 2015

I think the main issue is that the dictionary mecab is using doesn't have many kana-only words, so it doesn't know what to do with long strings of kana.

But you can add words to a custom dictionary and have mecab use that in addition to the main dictionary. I do this in beta Jisho to support words from JMdict and Wikipedia.

So if you added しょれ and じゃさしい as words to the dictionary it might be able to understand these sentences. I say might because I haven't tried this with kana only words and sentences.

The mecab site has a page on adding words: http://mecab.googlecode.com/svn/trunk/mecab/doc/dic.html and Ve allows you to pass command line options so you can tell mecab to start with the dictionary loaded.
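For illustration, here is a rough sketch of what preparing such a user dictionary could look like from Ruby. It only builds the CSV rows and the command lines; it does not call mecab or Ve. The helper names, the cost value, the 名詞,一般 part of speech, and the paths are all placeholders I made up for the example. The left and right context IDs are left empty on the assumption that mecab-dict-index fills them in automatically, which the mecab documentation describes for recent versions. How the resulting `-u` flag gets handed to Ve depends on Ve's configuration and is not shown.

```ruby
# Hypothetical helper: one row in the IPADIC user-dictionary CSV layout:
# surface,left-id,right-id,cost,pos,pos1,pos2,pos3,conj-type,conj-form,base,reading,pronunciation
# The PoS must be one that exists in the main dictionary; 名詞,一般 is only a placeholder here.
def user_dic_row(surface, reading_katakana, cost: 5000)
  [surface, nil, nil, cost, '名詞', '一般', '*', '*', '*', '*',
   surface, reading_katakana, reading_katakana].join(',')
end

# Hypothetical helper: argument list for compiling the CSV into a binary user dictionary.
def dict_index_command(system_dic_dir:, user_dic:, csv:)
  ['mecab-dict-index', '-d', system_dic_dir, '-u', user_dic,
   '-f', 'utf-8', '-t', 'utf-8', csv]
end

# Hypothetical helper: the option string mecab needs to load the user dictionary.
def mecab_userdic_flag(user_dic)
  "-u #{user_dic}"
end

rows = [
  user_dic_row('しょれ', 'ショレ'),
  user_dic_row('じゃさしい', 'ジャサシイ')
]
puts rows

# Placeholder paths; adjust to wherever the system dictionary actually lives.
puts dict_index_command(system_dic_dir: '/usr/local/lib/mecab/dic/ipadic',
                        user_dic: 'infant.dic',
                        csv: 'infant.csv').join(' ')
puts mecab_userdic_flag('infant.dic')
```

Whether mecab then segments the sentences in the issue as expected would still need to be tried, since the cost and PoS chosen for each entry affect which path it picks.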

There are a few quirks to be aware of though. For example, you can't modify the dictionary of a running mecab, so you have to build the dictionary under a different filename each time. I use the suffixes A and B. You must also specify a PoS that exists in the main dictionary you are building from.
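The A/B filename trick can be sketched like this. The helper name and the `user_A.dic` / `user_B.dic` naming are made up for the example and are not taken from the Jisho code: the idea is only that each rebuild targets whichever file the running mecab does not currently have loaded.

```ruby
# Hypothetical helper: given the user dictionary file the running mecab has loaded
# (or nil if none), return the other of two alternating filenames to build into.
def next_user_dic(current, base: 'user')
  a = "#{base}_A.dic"
  b = "#{base}_B.dic"
  current == a ? b : a
end

loaded = nil
3.times do
  target = next_user_dic(loaded)
  puts "build into #{target}, then restart mecab with -u #{target}"
  loaded = target
end
```

After each build, mecab has to be started again with the new file, since the running process keeps using the dictionary it was launched with.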

I guess ideally I should clean up the code I have around this and release it, but it'd take a while probably :/

Let me know if you have any questions about this!

2 participants