Support LSTM segmentation data in DataProvider #905
Comments
@aethanyc can you confirm whether this is a blocker or just a performance nice-to-have for Segmenter?
We are using Postcard in ICU4X as an analog to
To be more specific: if you need to store an array of

    use zerovec::ZeroVec;

    #[icu_provider::data_struct]
    #[derive(Debug, PartialEq, Clone, serde::Serialize, serde::Deserialize)]
    pub struct LstmData<'data> {
        // Borrow from the input buffer so deserialization can be zero-copy.
        #[serde(borrow)]
        demo: ZeroVec<'data, u16>,
    }

ICU4X is able to serialize that with Postcard to an optimized binary data file. You need to implement a "source" data provider to enable that piece of ICU4X's infrastructure to run.
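As a rough illustration of why a `ZeroVec<'data, u16>` field suits a binary data file, the sketch below mimics the underlying idea using only the standard library: the values stay in the buffer as little-endian bytes and are decoded on access, so loading does not require copying them into a new allocation. `U16View` and `to_le_bytes` are hypothetical names invented for this example; they are not part of the `zerovec`, `postcard`, or `icu_provider` APIs, and the real crates handle framing and validation differently.

```rust
/// A borrowed view over little-endian `u16` values stored as raw bytes.
/// Hypothetical type: it only illustrates the idea behind `ZeroVec<'data, u16>`.
pub struct U16View<'data> {
    bytes: &'data [u8],
}

impl<'data> U16View<'data> {
    /// Wraps a byte slice without copying it; returns `None` if the length is odd.
    pub fn from_bytes(bytes: &'data [u8]) -> Option<Self> {
        if bytes.len() % 2 == 0 {
            Some(Self { bytes })
        } else {
            None
        }
    }

    /// Number of `u16` elements in the view.
    pub fn len(&self) -> usize {
        self.bytes.len() / 2
    }

    /// Whether the view holds no elements.
    pub fn is_empty(&self) -> bool {
        self.bytes.is_empty()
    }

    /// Decodes the element at `index` on demand; `None` if out of range.
    pub fn get(&self, index: usize) -> Option<u16> {
        let start = index.checked_mul(2)?;
        let end = start.checked_add(2)?;
        let pair = self.bytes.get(start..end)?;
        Some(u16::from_le_bytes([pair[0], pair[1]]))
    }
}

/// Serializes values as consecutive little-endian byte pairs.
pub fn to_le_bytes(values: &[u16]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}
```

A caller would write the output of `to_le_bytes` to a file once, then at load time wrap the file's bytes with `U16View::from_bytes` and read elements through `get`.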
I feel this issue is more like unicode-org/lstm_word_segmentation#16. That is, for the existing Thai and Burmese training data, do we have a binary data format that can be easily plugged into ICU4X?
Example of making a new transformer: #885
This may be related to #1426.
ICU4C and ICU4J use the resource bundle format (*.res), which is a binary blob. JSON files for LSTM data may be large, so I would like to use a binary format for them. In fact, ICU4J's LSTM implementation can already use *.res, so it would be better if ICU4X could read it as well.
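To make the size concern concrete, here is a minimal sketch comparing a JSON-style text encoding with a packed binary encoding of the same values. It assumes the LSTM data consists mostly of `f32` weights, which the thread does not state, and the function names are hypothetical. The binary layout shown is a plain little-endian packing for illustration only; it is not the `*.res` resource bundle format and not Postcard's wire format.

```rust
/// Renders weights as a JSON-style array of decimal numbers.
pub fn encode_as_json_text(weights: &[f32]) -> String {
    let items: Vec<String> = weights.iter().map(|w| w.to_string()).collect();
    format!("[{}]", items.join(","))
}

/// Packs each weight as 4 little-endian bytes.
pub fn encode_as_binary(weights: &[f32]) -> Vec<u8> {
    weights.iter().flat_map(|w| w.to_le_bytes()).collect()
}

/// Unpacks weights written by `encode_as_binary`.
/// Returns `None` if the input length is not a multiple of 4.
pub fn decode_binary(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```

The binary form always costs exactly 4 bytes per weight, while the text form costs one byte per printed character plus separators, which is typically more for weights with many significant digits. The binary form also round-trips exactly without any number parsing.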