Hi,

first of all, thank you for this wonderful work! Given that MeCab by itself does a fairly mediocre job of splitting text into actual words, I'd been wondering how sites like jisho.org do their sentence splitting, and I eventually landed here 👋
Since I needed this tech for an upcoming desktop app for Japanese learners (which includes OCR, sentence splitting, dictionary lookups and more), I took it upon myself to port Ve's ipadic parser to Rust: https://github.com/jannisbecker/ve-rs. So far it seems to work great, right down to having the same reported bugs as Ve 😄
I'm still fairly new to Rust, so this was a pleasant learning experience as well. Here are some notes from diving into Ve's sentence post-processing code, and the things I changed or wondered about while porting it:
Rust requires strictly defined data structures, so I turned the POS variants, Grammar variants etc. into enum definitions. The first thing I noticed was that Ve's codebase treats all POS levels the same way: pos, pos2, pos3, pos4, inflection_type and inflection_form can in theory each hold any POS variant, which I assume doesn't actually happen in practice with MeCab (say, pos2 or even pos4 being classified as Meishi). I played around with dividing the variants into separate enums for POS1-4, InflectionType and InflectionForm, but I lacked knowledge of exactly which variants each field can contain, so I rolled the change back.
If MeCab provided an exact list of which field can contain which variant, it might be worth splitting them up in code as well, for clarity and ease of development.
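To make the idea concrete, here's roughly the shape of both approaches. This is a sketch only; the variant and type names are made up for illustration and aren't ve-rs's actual API:

```rust
// Sketch only: names are invented for illustration, not ve-rs's real types.

// What I have now: one shared enum for every POS-like field, mirroring Ve.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PartOfSpeech {
    Meishi,  // 名詞 (noun)
    Doushi,  // 動詞 (verb)
    Joshi,   // 助詞 (particle)
    // ... the remaining IPADIC categories
    Unknown,
}

struct TokenPos {
    pos: PartOfSpeech,
    pos2: PartOfSpeech,
    pos3: PartOfSpeech,
    pos4: PartOfSpeech,
    inflection_type: PartOfSpeech,
    inflection_form: PartOfSpeech,
}

// What I tried and rolled back: one enum per field, so impossible states
// (say, pos4 being Meishi) can't be represented at all. Doing this properly
// would need an exact list of what each IPADIC column can contain.
enum Pos2 {
    Ippan,       // 一般 (general)
    KoyuuMeishi, // 固有名詞 (proper noun)
    // ...
}
```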
Of course, understanding the whole set of implemented rules that turn tokens into words wasn't really possible, particularly since I'm not versed in MeCab's classification scheme, but beyond that the code made a lot of sense (things like eating up the next token or merging with previous tokens, merging fields, altering the resulting POS based on rules, etc.). All in all it was a very smooth porting process, even though I'd never seen a line of Ruby before 👍
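For flavour, this is a drastically simplified sketch of the merge pattern as I understood it, with toy types and a single invented example rule; it's not Ve's actual rule set or my port's code:

```rust
// Toy types and one example rule, just to show the shape of the pattern.
#[derive(Debug, Clone)]
struct Token {
    surface: String,
    pos: Pos,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Pos {
    Doushi,   // 動詞 (verb)
    Jodoushi, // 助動詞 (auxiliary verb)
    Other,
}

#[derive(Debug)]
struct Word {
    surface: String,
    pos: Pos,
}

fn tokens_to_words(tokens: &[Token]) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    for token in tokens {
        // Example rule: an auxiliary verb glues onto the preceding verb,
        // and the merged word keeps the verb's POS.
        let attach = token.pos == Pos::Jodoushi
            && words.last().map_or(false, |w| w.pos == Pos::Doushi);
        if attach {
            if let Some(prev) = words.last_mut() {
                prev.surface.push_str(&token.surface);
                continue;
            }
        }
        words.push(Word {
            surface: token.surface.clone(),
            pos: token.pos,
        });
    }
    words
}
```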
I noticed that with my tokenizer, using a normal IPADIC 2.7.0 dictionary, there are rare cases where the feature string of a token splits into only 6 features instead of 9, leaving out lemma, reading and hatsuon entirely instead of marking them with an asterisk. One token where this happened was ハハ. It might just be a bug in the tokenizer I used, but I had to account for it when destructuring the feature string.
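This is roughly how I ended up guarding against it. A sketch, not my port's exact code: the field names follow Ve's output, and treating the last three features as optional is my workaround, not documented MeCab behaviour:

```rust
// Sketch of defensively destructuring the comma-separated feature string.
struct Features {
    pos: String,
    pos2: String,
    pos3: String,
    pos4: String,
    inflection_type: String,
    inflection_form: String,
    lemma: Option<String>, // missing entirely on tokens like ハハ
    reading: Option<String>,
    hatsuon: Option<String>,
}

fn parse_features(feature_csv: &str) -> Option<Features> {
    let f: Vec<&str> = feature_csv.split(',').collect();
    // IPADIC normally yields 9 fields, but some entries come back with
    // only the first 6, so treat the tail as optional.
    if f.len() < 6 {
        return None;
    }
    let opt = |i: usize| f.get(i).map(|s| s.to_string());
    Some(Features {
        pos: f[0].to_string(),
        pos2: f[1].to_string(),
        pos3: f[2].to_string(),
        pos4: f[3].to_string(),
        inflection_type: f[4].to_string(),
        inflection_form: f[5].to_string(),
        lemma: opt(6),
        reading: opt(7),
        hatsuon: opt(8),
    })
}
```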
While it isn't necessary for my own project, I might take it upon myself to port Ve's other parsers (and its general structure) as well, to provide feature parity. For anyone looking for MeCab+IPADIC sentence splitting in Rust right now, though, it's usable as is.
Thank you for the kind words, I'm glad that the code was easy to comprehend 😊
I just tested ハハ in MeCab with IPADIC and got the same result, without lemma and hatsuon. Ruby Ve handles this by just returning nil for those properties, but it's something I should think about handling better.
Unless you need it for your own purposes, I wouldn't stress about porting the rest of Ve. It's really only the MeCab+IPADIC part that anyone uses 😄