Tokenization of Japanese text with disabled default features #229
So far, it's not possible to split CJK characters the way you want.

Can I work on this, @ManyTheFish?

Hello @XshubhamX, thanks for your interest in this project 🔥 You are definitely more than welcome to open a PR for this! For your information, we prefer not to assign people to our issues, because sometimes people ask to be assigned and never come back, which discourages volunteer contributors from opening a PR to fix the issue. We are looking forward to reviewing your PR 😊
Hi!
We are trying to integrate Charabia here: qdrant/qdrant#2260
Our big concern is binary size, which is why we are trying to use it with the dictionaries for Japanese, Korean, and Chinese disabled.
Version 7.2 seemed to split text per character by default in this case:
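For context, disabling the dictionaries would be done through Cargo features, assuming Charabia gates its language-specific support behind them; the exact feature names and version below are illustrative, not taken from the crate's actual `Cargo.toml`:

```toml
# Hypothetical dependency entry: opt out of Charabia's default features
# (which would otherwise pull in the CJK dictionaries) to reduce binary size.
# Version and feature names are illustrative only.
[dependencies]
charabia = { version = "*", default-features = false }
```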
本日の日付は
->["本", "日", "の", "日", "付", "は"]
which was fine for our purposes. The new version, however, no longer does that:
本日の日付は
->["本日の日付は"]
I wonder whether this is an intended behavior change, and whether it is possible to configure the segmenter to behave the way it did before?
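For reference, the old per-character behavior we relied on can be sketched in plain Rust; `split_per_char` is a hypothetical helper name, not part of Charabia's API:

```rust
// Minimal sketch of the 7.2-era fallback we observed: when no dictionary
// is available, split the input into one token per Unicode scalar value.
fn split_per_char(text: &str) -> Vec<String> {
    text.chars().map(|c| c.to_string()).collect()
}

fn main() {
    let tokens = split_per_char("本日の日付は");
    // → ["本", "日", "の", "日", "付", "は"]
    assert_eq!(tokens, vec!["本", "日", "の", "日", "付", "は"]);
    println!("{:?}", tokens);
}
```

Note that `chars()` iterates Unicode scalar values, not grapheme clusters, which is sufficient for CJK ideographs like the ones above.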