Not getting the same results as Kuromoji java #23
Comments
It appears to be a preference issue: the tokenizer matches both 協 and 会 as 接尾 (suffix) before the whole word. In #16, it matches 人名 (name) for 研 and 究 before the whole word. Perhaps the matching algorithm needs to favor longer tokens before splitting into finer matches.
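To illustrate why a word can end up split, here is a toy sketch of minimum-cost segmentation. It is not kuromoji.js code: the `segment` function and every cost value are hypothetical, and it ignores the connection costs between adjacent tokens that a real Viterbi-based tokenizer also adds. It only shows that the cheapest path wins, so a long entry loses whenever its parts are cheap enough.

```javascript
// Toy minimum-cost segmentation (a word-cost-only simplification of
// Viterbi lattice search). All costs are invented for illustration;
// they are NOT taken from the IPADIC dictionary.
function segment(text, dict) {
  // best[i] holds the cheapest known segmentation of text.slice(0, i)
  const best = new Array(text.length + 1).fill(null);
  best[0] = { cost: 0, tokens: [] };
  for (let start = 0; start < text.length; start++) {
    if (best[start] === null) continue; // position not reachable
    for (let end = start + 1; end <= text.length; end++) {
      const surface = text.slice(start, end);
      if (!dict.has(surface)) continue; // no dictionary entry for this span
      const cost = best[start].cost + dict.get(surface);
      if (best[end] === null || cost < best[end].cost) {
        best[end] = { cost, tokens: best[start].tokens.concat(surface) };
      }
    }
  }
  return best[text.length]; // null when no segmentation exists
}

// Hypothetical cost tables: lower cost means more preferred.
const balanced = new Map([["協会", 5], ["協", 4], ["会", 4]]);
const skewed = new Map([["協会", 5], ["協", 2], ["会", 2]]);

console.log(segment("協会", balanced).tokens); // whole word: 5 beats 4 + 4
console.log(segment("協会", skewed).tokens);   // split: 2 + 2 beats 5
```

With the `skewed` table the two single-character entries together cost less than the compound, which is the kind of outcome described above. Favoring longer tokens amounts to making sure the compound's total path cost stays lower.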
@Citronelol I released a fixed version, 0.1.2, and deployed the demo site: https://takuyaa.github.io/kuromoji.js/demo/tokenize.html
Thanks a lot!
This was referenced Apr 24, 2021
Hi,
I was trying to tokenize the following sentence:
第1条 この法人は、一般社団法人国際銀行協会(以下「本協会」という。)と称し、英文では、 International Bankers Association of Japanと記載する。
and the results differ between the Java version of Kuromoji (with the IPADIC dictionary) and the tokenizer provided by kuromoji.js. In particular, the sequence 協会 is split by kuromoji.js.
I saw a closed issue (#16) stating that this could be due to the Viterbi version of the tokenizer. Is there a way to disable it?
Many thanks in advance,
Best