Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support LSTM segmentation data in DataProvider #905

Closed
makotokato opened this issue Jul 27, 2021 · 6 comments · Fixed by #2169
Closed

Support LSTM segmentation data in DataProvider #905

makotokato opened this issue Jul 27, 2021 · 6 comments · Fixed by #2169
Assignees
Labels
C-segmentation Component: Segmentation help wanted Issue needs an assignee S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality

Comments

@makotokato
Copy link
Member

ICU4C and ICU4J use resource bundle format (*.res) that is binary blob. JSON files for LSTM data may be big size, so I would like to use anything binary format for it. Actually ICU4J's LSTM implementation can use *.res, so it is better if we can read it on ICU4X.

@zbraniecki zbraniecki added the discuss-priority Discuss at the next ICU4X meeting label Jul 27, 2021
@zbraniecki
Copy link
Member

@aethanyc can you confirm if this is a blocker or just performance nice-to-have for Segmenter?

@sffc
Copy link
Member

sffc commented Jul 27, 2021

We are using Postcard in ICU4X as an analog to *.res in ICU4C. You should use the DataProvider infrastructure to serve your data, and let ICU4X take care of the rest for you. For example, if you find yourself reading runtime data directly from *.json files, you're probably doing something wrong.

@sffc
Copy link
Member

sffc commented Jul 27, 2021

To be more specific: if you need to store an array of u16s, for example, you should have a data struct such as

#[icu_provider::data_struct]
#[derive(Debug, PartialEq, Clone, serde::Serialize, serde::Deserialize)]
pub struct LstmData<'data> {
    demo: ZeroVec<'data, u16>,
}

ICU4X is able to serialize that with Postcard to an optimized binary data file. You need to implement a "source" data provider to enable that piece of ICU4X's infrastructure to run.

@aethanyc
Copy link
Contributor

lstm.rs currently consumes the Thai training model using json, so this definitely blocking segmenter from being production ready.

I feel this issue is more like unicode-org/lstm_word_segmentation#16. That is, for the existing Thai and Burmese training data, do we have a binary data format that can be easily plugin into icu4x?

@sffc
Copy link
Member

sffc commented Jul 29, 2021

Example of making a new transformer: #885

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Jul 29, 2021
@sffc sffc changed the title Support ResourceBundle format on DataProvider Support LSTM segmentation data in DataProvider Jul 29, 2021
@sffc sffc added C-segmentation Component: Segmentation S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality labels Jul 29, 2021
@sffc sffc added this to the ICU4X 0.4 milestone Jul 29, 2021
@sffc sffc modified the milestones: ICU4X 0.4, ICU4X 0.5 Oct 21, 2021
@aethanyc aethanyc pinned this issue Jan 20, 2022
@aethanyc aethanyc unpinned this issue Jan 20, 2022
@aethanyc
Copy link
Contributor

This is maybe related to #1426.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-segmentation Component: Segmentation help wanted Issue needs an assignee S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants