Remove hardcoded segmenter data from datagen #3003

robertbastian · 2023-01-18T21:52:53Z

There are many data files located here:

https://github.com/unicode-org/icu4x/tree/main/provider/datagen/data/segmenter

Is this the best place for the source of truth, or can we source them from elsewhere?

sffc · 2023-02-08T01:14:42Z

@aethanyc --- could you take a look at these files and list where you think appropriate data sources would be?

aethanyc · 2023-02-09T23:55:55Z

LSTM models are from https://github.com/unicode-org/lstm_word_segmentation/tree/develop/Models. We can download them if they are packaged in the lstm_word_segmentation repository. Note: they are currently available only in the develop branch, not in the main branch.

Dictionary toml files are converted via the following command. (See the comment at the beginning of each toml file.) For example, the CJ dictionary:

# This data is created by the following using ICU4C tools
# LD_LIBRARY_PATH=lib bin/gendict --uchars data/brkitr/dictionaries/cjdic.txt tmp.bin
# dd if=tmp.bin of=cjdict.dict bs=1 skip=64

Maybe the conversion and packing can be part of the ICU4C release process so that we can download it somewhere?
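For readers without `dd` at hand, the second step of the command above can be sketched in Python. This is only an illustration of what `dd if=tmp.bin of=cjdict.dict bs=1 skip=64` does, namely copying the file while dropping its first 64 bytes. The function name `strip_header` is hypothetical, and the thread does not say what those 64 bytes contain, so the sketch treats them as an opaque prefix.

```python
HEADER_LEN = 64  # number of bytes dropped by `dd ... bs=1 skip=64`


def strip_header(src_path, dst_path, header_len=HEADER_LEN):
    """Copy src_path to dst_path without its first header_len bytes.

    Hypothetical helper, not part of ICU4C or ICU4X. Unlike dd, it
    raises if the input is shorter than the prefix being skipped.
    """
    with open(src_path, "rb") as src:
        data = src.read()
    if len(data) < header_len:
        raise ValueError(f"{src_path} is shorter than {header_len} bytes")
    with open(dst_path, "wb") as dst:
        dst.write(data[header_len:])
    # Return the number of payload bytes written.
    return len(data) - header_len
```

Calling `strip_header("tmp.bin", "cjdict.dict")` on the `gendict` output would then correspond to the `dd` line quoted above.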

UAX14 rules are implemented ourselves in line.toml, and UAX29 rules in grapheme.toml, sentence.toml, and word.toml. These toml files are written by hand, not derived from other files. They are the source of truth, and should live in ICU4X.

cc @makotokato to double check my knowledge.

makotokato · 2023-02-10T07:06:42Z

Correct. #2519 for char16trie data generation.

aethanyc · 2023-02-22T01:22:59Z

@robertbastian @sffc Per my comment above, it would be nice to have the LSTM and dictionary data downloaded or generated from somewhere, so this issue seems like a P3 or P4 to me. Does it have to be a P1 that blocks the release?

sffc · 2023-02-22T01:37:41Z

I guess my main concern is that this involves adding additional sources to datagen, so it would be best to have those in place when people start using datagen for segmenter.

Manishearth · 2023-04-25T16:11:49Z

cc @eggrobin this is the issue, which has multiple parts:

We hardcode rule-based data (manually written)
We hardcode dictionary data (generated from icu4c, probably should be in icuexportdata)
We hardcode LSTM data (comes from a different repo)

robertbastian · 2023-04-26T08:36:05Z

ICU-22373 Export segmentation dictionaries icu#2441 to add dictionary data to icuexportdata
Stabilise Khmer and Lao models lstm_word_segmentation#21 to stabilise Khmer and Lao models in the LSTM repo

sffc · 2023-04-27T15:48:15Z

Discussions:

For the manually written rule tables, we should consider upstreaming them. They are basically machine-readable versions of UAX 14/29. CC @FrankYFTang for thoughts on that.
For the dictionary data, it makes sense to add it to icuexportdata. We will probably keep the hardcoded dictionary tables in datagen until 2.0 to avoid breakages.
For the LSTM data, make a release on the LSTM repo with the JSON files. This is forwards compatible except for the no_default_features mode of icu4x-datagen. Probably keep the LSTM data in datagen too, until 2.0.

robertbastian · 2023-05-10T13:41:25Z

With #3396 and #3399 dictionary and LSTM sources are now controlled by the client. However, we will keep the hardcoded fallback data around until 2.0.
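The "client-controlled source with a hardcoded fallback" arrangement described above can be sketched roughly as follows. This is not the datagen API: `SegmenterSources`, `resolve`, and the `builtin:` placeholder strings are all hypothetical names chosen for illustration, and the real implementation in #3396 and #3399 may look quite different.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for the data that ships inside the tool.
BUILTIN_FALLBACK = {
    "dictionary": "builtin:dictionary",
    "lstm": "builtin:lstm",
}


@dataclass
class SegmenterSources:
    """Hypothetical holder for client-supplied source locations."""

    dictionary: Optional[str] = None
    lstm: Optional[str] = None

    def resolve(self, kind: str) -> str:
        """Return the client's source for `kind`, else the built-in one."""
        if kind not in BUILTIN_FALLBACK:
            raise KeyError(f"unknown segmenter data kind: {kind}")
        supplied = getattr(self, kind)
        return supplied if supplied is not None else BUILTIN_FALLBACK[kind]
```

With this shape, removing the fallback later (the 2.0 plan mentioned above) would amount to deleting the built-in table and making the client-supplied fields mandatory.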

What's left are the handwritten rule tables.

sffc · 2023-05-11T17:32:52Z

Discussion: we could upstream the data files into CLDR but we need to make sure they are easily maintainable.

sffc · 2023-05-11T18:12:33Z

@robertbastian to drive the relationship with CLDR to get these files upstreamed.

robertbastian · 2023-05-17T13:53:04Z

So there are segmentation files in CLDR already, which are generated from https://github.com/unicode-org/unicodetools/blob/main/unicodetools/src/main/resources/org/unicode/tools/SegmenterDefault.txt (this is a lot closer to the spec and might become part of the spec in the future). Given that the data is already there, I think it's unlikely that our (more processed) TOML versions will get accepted.

I've added parsing for these CLDR files in #3440, but I will need @makotokato's help to generate our representation from those.

robertbastian · 2023-05-25T17:17:46Z

Closing this for now, as the LSTM and dictionary data are done and the rules are a bigger Unicode-wide undertaking for Q3.

sffc · 2023-05-25T21:30:05Z

Follow-up: #3457

robertbastian added the labels C-segmentation (Component: Segmentation), S-medium (Size: Less than a week, larger bug fix or enhancement), and A-data (Area: Data coverage or quality) Jan 18, 2023

robertbastian mentioned this issue Jan 18, 2023

Move Segmenter to Components #2259

Closed

22 tasks

sffc added this to the 1.2 Blocking ⟨P1⟩ milestone Jan 19, 2023

sffc assigned younies Jan 19, 2023

sffc assigned aethanyc Feb 8, 2023

sffc added the discuss (Discuss at a future ICU4X-SC meeting) label Mar 23, 2023

sffc unassigned aethanyc Mar 23, 2023

sffc modified the milestones: 1.2 Blocking ⟨P1⟩, 1.x Priority ⟨P2⟩, 1.3 Blocking ⟨P1⟩ Mar 23, 2023

sffc assigned robertbastian Mar 23, 2023

robertbastian mentioned this issue Apr 25, 2023

Datagen depends on compile-time paths at runtime, and panics if it can't find them #3391

Closed

robertbastian mentioned this issue Apr 26, 2023

Changing datagen builtin data structure to match external sources #3395

Merged

sffc mentioned this issue Apr 27, 2023

2.0 cleanup list #2856

Open

37 tasks

robertbastian mentioned this issue May 2, 2023

Adding LSTM data source #3399

Merged

robertbastian unassigned younies and robertbastian May 10, 2023

robertbastian self-assigned this May 10, 2023

robertbastian added the discuss-priority (Discuss at the next ICU4X meeting) label and removed the discuss (Discuss at a future ICU4X-SC meeting) label May 10, 2023

robertbastian assigned makotokato May 17, 2023

sffc removed the discuss-priority (Discuss at the next ICU4X meeting) label May 25, 2023

robertbastian closed this as completed May 25, 2023

robertbastian mentioned this issue Jun 22, 2023

Using dictionaries from exportdata #3396

Merged
