Tokenization of Japanese text with disabled default features #229
So far, it's not possible to split CJK characters the way you want.

Can I work on this, @ManyTheFish?

Hello @XshubhamX, thanks for your interest in this project 🔥 You are definitely more than welcome to open a PR for this! For your information, we prefer not to assign people to our issues, because sometimes people ask to be assigned and never come back, which discourages volunteer contributors from opening a PR to fix the issue. We are looking forward to reviewing your PR 😊
Hi!
We are trying to integrate Charabia here: qdrant/qdrant#2260
Our big concern is binary size, which is why we are trying to use it with the dictionaries for Japanese, Korean, and Chinese disabled.
Version 7.2 seemed to split text per character by default in this case:
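For context, disabling the dictionaries would be done through Cargo features, assuming Charabia gates its language-specific support behind them; the exact feature names and version below are illustrative, not taken from the crate's actual `Cargo.toml`:

```toml
# Hypothetical dependency entry: opt out of Charabia's default features
# (which would otherwise pull in the CJK dictionaries) to reduce binary size.
# Version and feature names are illustrative only.
[dependencies]
charabia = { version = "*", default-features = false }
```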
本日の日付は
->["本", "日", "の", "日", "付", "は"]
which was fine for our purposes. The new version, however, no longer does that:
本日の日付は
->["本日の日付は"]
I wonder whether this is an intended behavior change, and whether it is possible to configure the segmenter to behave the way it did before?
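For reference, the old per-character behavior we relied on can be sketched in plain Rust; `split_per_char` is a hypothetical helper name, not part of Charabia's API:

```rust
// Minimal sketch of the 7.2-era fallback we observed: when no dictionary
// is available, split the input into one token per Unicode scalar value.
fn split_per_char(text: &str) -> Vec<String> {
    text.chars().map(|c| c.to_string()).collect()
}

fn main() {
    let tokens = split_per_char("本日の日付は");
    // → ["本", "日", "の", "日", "付", "は"]
    assert_eq!(tokens, vec!["本", "日", "の", "日", "付", "は"]);
    println!("{:?}", tokens);
}
```

Note that `chars()` iterates Unicode scalar values, not grapheme clusters, which is sufficient for CJK ideographs like the ones above.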