Remove hardcoded segmenter data from datagen #3003
@aethanyc, could you take a look at these files and list where you think the appropriate data sources would be?
The LSTM models are from https://github.com/unicode-org/lstm_word_segmentation/tree/develop/Models; we can download them if they are packed somewhere. The dictionary TOML files are converted via a command noted in the comment at the beginning of each TOML file. For example, the CJ dictionary:
Maybe the conversion and packing can be part of the ICU4C release process so that we can download them from somewhere? The UAX #14 rules are implemented by ourselves (cc @makotokato to double-check my knowledge).
Correct. See #2519 for the char16trie data generation.
@robertbastian @sffc Per my comment above, it would be nice to have the LSTM and dictionary data downloaded or generated from somewhere, so this issue seems like a P3 or P4 to me. Does it have to be P1 to block the release?
I guess my main concern is that this involves adding additional sources to datagen, so it would be best to have those in place before people start using datagen for the segmenter.
cc @eggrobin: this is the issue, which has multiple parts:
Discussions:
Discussion: we could upstream the data files into CLDR, but we need to make sure they are easily maintainable.
@robertbastian to drive the relationship with CLDR to get these files upstreamed.
So there are segmentation files in CLDR already, which are generated from https://github.com/unicode-org/unicodetools/blob/main/unicodetools/src/main/resources/org/unicode/tools/SegmenterDefault.txt (this is a lot closer to the spec and might become part of the spec in the future). Given that the data is already there, I think it's unlikely that our (more processed) TOML versions will be accepted. I've added parsing for these CLDR files in #3440, but I will need @makotokato's help to generate our representation from them.
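For context, UAX-style segmentation rules pair a left and right context around a break symbol: "÷" allows a break at that position and "×" forbids one. A minimal, hypothetical sketch of splitting such rule lines (this is not the actual #3440 parser; the rule syntax and the example strings below are assumptions based on UAX #14 rule notation):

```python
# Hypothetical sketch: split a UAX-style rule line into its parts.
# "÷" = break allowed at this position, "×" = break forbidden.
# Not the real ICU4X parser; rule syntax assumed from UAX #14 notation.

def parse_rule(line):
    """Return (left context, allows_break, right context), or None."""
    line = line.strip()
    for symbol, allows_break in (("÷", True), ("×", False)):
        if symbol in line:
            left, _, right = line.partition(symbol)
            return left.strip(), allows_break, right.strip()
    return None  # line contains no break/no-break symbol

# Example rule strings in UAX #14-style notation (illustrative only):
print(parse_rule("CR × LF"))          # no break between CR and LF
print(parse_rule("ZW SP* ÷"))         # break after ZW plus any spaces
```

The real files also carry variable definitions and rule numbers, so an actual parser would need more than this, but the break/no-break split is the core of the notation.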
Closing this for now, as the LSTM and dictionary data are done, and the rules are a bigger Unicode-wide undertaking for Q3.
Follow-up: #3457 |
There are many data files located here:
https://github.com/unicode-org/icu4x/tree/main/provider/datagen/data/segmenter
Is this the best place for the source of truth, or can we source them from elsewhere?