Data sources #5
Comments
Also, we'd need to design compatibility policies - which version of a crate supports which range of CLDR/Unicode data.
Let's talk about data sources here, and data compatibility across versions in #4
I did some experimentation around this among the crates I maintain, with mixed results.
My current thinking is that we want to have some sort of
This way, we'd avoid
I'd like such a manager to handle at least two types of resources: raw, either XML or JSON (leaning toward JSON), and "optimized", some binary format that we'll come up with that will make reading into memory instantaneous. I imagine such a manager holding data corresponding to the current CLDR resource split and being able to fetch
Then, all crates would by default have a constructor like this:
    pub fn new_with_manager(..., cldr: &CLDRDataManager) -> Self;
and optionally a set of
So, no matter if you have data baked in or not, you can always construct with the manager, but if you have data baked in, you can also construct without it, and the methods that require CLDR data will be available. WDYT?
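To make the manager idea concrete, here is a minimal sketch, not a proposed API. Everything except the `new_with_manager` name and the `CLDRDataManager` type name is an assumption made for illustration: the `Resource` enum, the `(name, locale)` key, the `insert`/`get` methods, and the `PluralRules` consumer. It also returns `Option<Self>` instead of `Self` so that a missing resource is visible to the caller.

```rust
use std::collections::HashMap;

/// Hypothetical resource kinds: raw JSON text, or a pre-built binary blob.
#[derive(Debug, Clone, PartialEq)]
pub enum Resource {
    RawJson(String),
    Optimized(Vec<u8>),
}

/// Hypothetical manager holding resources keyed by (resource name, locale).
#[derive(Default)]
pub struct CLDRDataManager {
    resources: HashMap<(String, String), Resource>,
}

impl CLDRDataManager {
    pub fn new() -> Self {
        Self::default()
    }

    /// Register a resource; a real manager might load it lazily instead.
    pub fn insert(&mut self, name: &str, locale: &str, resource: Resource) {
        self.resources
            .insert((name.to_string(), locale.to_string()), resource);
    }

    /// Look up a resource by name and locale.
    pub fn get(&self, name: &str, locale: &str) -> Option<&Resource> {
        self.resources.get(&(name.to_string(), locale.to_string()))
    }
}

/// Hypothetical consumer type, standing in for any crate that needs CLDR data.
pub struct PluralRules {
    locale: String,
    data: Resource,
}

impl PluralRules {
    /// Construct from the manager; returns `None` if the data is not there.
    pub fn new_with_manager(locale: &str, cldr: &CLDRDataManager) -> Option<Self> {
        let data = cldr.get("plurals", locale)?.clone();
        Some(Self {
            locale: locale.to_string(),
            data,
        })
    }

    pub fn locale(&self) -> &str {
        &self.locale
    }

    pub fn data(&self) -> &Resource {
        &self.data
    }
}
```

The consumer copies what it needs out of the manager here, which keeps lifetimes simple; borrowing from the manager would avoid the clone at the cost of tying the two together.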
Hmm.
The "data manager" (I use the term "data provider") should, I think, expose a stable, versioned schema of the data to the consumer (like unic-locale). That schema should be set up such that the consumer needs to do minimal processing in order to use it in code. Now, the data provider could get its data from a flat file, a service, etc. It could support reading either CLDR JSON or a file with prebuilt data structures for better efficiency. However, that should be an internal detail of the data provider and not be exposed to the data consumer.
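As a rough illustration of that separation, the sketch below has the consumer depend only on a trait and a versioned struct, while the backend stays hidden. All names here (`DataProvider`, `DecimalSymbolsV1`, `DataError`, `StaticProvider`, `format_decimal`) are hypothetical, and the two-locale table is just sample data.

```rust
/// Hypothetical versioned schema: the `V1` suffix lets a later `V2`
/// coexist without breaking existing consumers.
#[derive(Debug, Clone, PartialEq)]
pub struct DecimalSymbolsV1 {
    pub decimal_separator: char,
    pub grouping_separator: char,
}

#[derive(Debug, PartialEq)]
pub enum DataError {
    UnsupportedLocale(String),
}

/// The only surface a consumer sees. Whether the data comes from CLDR JSON,
/// a prebuilt binary file, or a service is up to the implementer.
pub trait DataProvider {
    fn decimal_symbols_v1(&self, locale: &str) -> Result<DecimalSymbolsV1, DataError>;
}

/// One possible backend: a tiny table compiled into the binary.
pub struct StaticProvider;

impl DataProvider for StaticProvider {
    fn decimal_symbols_v1(&self, locale: &str) -> Result<DecimalSymbolsV1, DataError> {
        match locale {
            "en" => Ok(DecimalSymbolsV1 {
                decimal_separator: '.',
                grouping_separator: ',',
            }),
            "de" => Ok(DecimalSymbolsV1 {
                decimal_separator: ',',
                grouping_separator: '.',
            }),
            other => Err(DataError::UnsupportedLocale(other.to_string())),
        }
    }
}

/// A consumer that needs almost no processing: it reads the schema and formats.
pub fn format_decimal(
    provider: &dyn DataProvider,
    locale: &str,
    integer: u32,
    fraction: u32,
) -> Result<String, DataError> {
    let symbols = provider.decimal_symbols_v1(locale)?;
    Ok(format!("{}{}{}", integer, symbols.decimal_separator, fraction))
}
```

Swapping `StaticProvider` for a file-backed or network-backed implementation would not change `format_decimal` at all, which is the point of keeping the source internal to the provider.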
I haven't found a great API solution. On the JavaScript side, my favorite is something along the lines of:
    let dataProvider = // ...
    let formatter = await Intl.Async(dataProvider).NumberFormat();
That only works if you're building a global namespace, like we are with ECMA-402. An alternative, for Rust, could be to have a trait like:
    trait LocaleDataConsumer {
        fn with_data_provider(&mut self, data_provider: &omnicu::DataProvider) -> &Self;
    }
Most likely, the implementer of this trait would essentially rebuild the object using data from the provider, and not keep a long-lived reference to the provider. If a data provider never gets provided, then the implementer can return errors (or panic) if methods requiring data are used. An alternative, which may be easier but may be considered bad practice, would be to set a global data provider.
    // locale.rs
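    static mut DATA_PROVIDER: Option<omnicu::DataProvider> = None;
    impl Locale {
        // Takes the provider by value so it can be stored in the static.
        pub fn set_global_data_provider(data_provider: omnicu::DataProvider) {
            // Writing to a `static mut` requires `unsafe`; callers must avoid data races.
            unsafe { DATA_PROVIDER = Some(data_provider); }
        }
    }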
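For comparison, here is a self-contained sketch of both ideas under stated assumptions. The local `DataProvider` struct is a hypothetical stand-in for `omnicu::DataProvider`, `Locale::script` and `MissingData` are invented for the example, and the global variant uses `std::sync::OnceLock` (a swapped-in technique, not the `static mut` from the snippet above) so that no `unsafe` is needed.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for `omnicu::DataProvider`.
#[derive(Debug, Clone)]
pub struct DataProvider {
    pub default_script: String,
}

/// Error returned when a data-dependent method is called without data.
#[derive(Debug, PartialEq)]
pub struct MissingData;

pub trait LocaleDataConsumer {
    fn with_data_provider(&mut self, data_provider: &DataProvider) -> &Self;
}

pub struct Locale {
    pub language: String,
    // Filled in from a provider; `None` until one is supplied.
    script: Option<String>,
}

impl Locale {
    pub fn new(language: &str) -> Self {
        Self {
            language: language.to_string(),
            script: None,
        }
    }

    /// Needs data: returns an error rather than panicking when none was provided.
    pub fn script(&self) -> Result<&str, MissingData> {
        self.script.as_deref().ok_or(MissingData)
    }
}

impl LocaleDataConsumer for Locale {
    /// Copies what it needs, so no long-lived reference to the provider is kept.
    fn with_data_provider(&mut self, data_provider: &DataProvider) -> &Self {
        self.script = Some(data_provider.default_script.clone());
        self
    }
}

/// Global variant: can be set once, then read from anywhere.
static GLOBAL_PROVIDER: OnceLock<DataProvider> = OnceLock::new();

/// Returns the provider back as `Err` if a global one was already set.
pub fn set_global_data_provider(provider: DataProvider) -> Result<(), DataProvider> {
    GLOBAL_PROVIDER.set(provider)
}

pub fn global_data_provider() -> Option<&'static DataProvider> {
    GLOBAL_PROVIDER.get()
}
```

The set-once restriction is a deliberate trade-off in this sketch: it avoids data races, but it also means the global provider cannot be replaced later, which may or may not fit the eventual design.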
Yeah, a compile-time flag to turn baked-in data on or off sounds like a good thing. Not sure which should be the default. At the end of the day, the data provider will take time to design properly and should be aligned with tc39/ecma402#210 and the needs of Omni ICU; we shouldn't jump into something before we think it all the way through.
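One way such a flag could look, as a sketch only: a Cargo feature gates both the baked-in table and the data-free constructor. The feature name `baked-data` and the types `PluralData` and `PluralRules` are assumptions for the example; the feature would also need to be declared in the crate's Cargo.toml.

```rust
/// Hypothetical plural data for one locale.
pub struct PluralData {
    pub categories: &'static [&'static str],
}

/// Compiled in only when the hypothetical `baked-data` feature is enabled.
#[cfg(feature = "baked-data")]
static BAKED_EN: PluralData = PluralData {
    categories: &["one", "other"],
};

pub struct PluralRules<'a> {
    data: &'a PluralData,
}

impl<'a> PluralRules<'a> {
    /// Always available: the caller supplies the data.
    pub fn new_with_data(data: &'a PluralData) -> Self {
        Self { data }
    }

    pub fn categories(&self) -> &[&'static str] {
        self.data.categories
    }
}

/// Only exists when data is baked in, so calling it without the feature
/// is a compile error instead of a runtime failure.
#[cfg(feature = "baked-data")]
impl PluralRules<'static> {
    pub fn new() -> Self {
        Self { data: &BAKED_EN }
    }
}

/// Lets callers check at runtime how the crate was built.
pub fn has_baked_data() -> bool {
    cfg!(feature = "baked-data")
}
```

Which side of the flag is the default stays an open question; the sketch works the same way whether `baked-data` is listed under `default` features or not.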
The doc data-pipeline.md in the ICU4X repo covers these issues well. The issue of a global data provider is being discussed in unicode-org/icu4x#150
The three crates I maintain - intl_pluralrules, unic-langid and unic-locale - currently all store their data in source-code tables generated as a separate step.
This provides very good performance, but is inflexible and may lead to data duplication.
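For readers unfamiliar with the pattern, the following is a hedged sketch of what a generated source-code table can look like; it is not the actual output of any of these crates. The table name, the `maximize` helper, and the three sample entries are assumptions for illustration.

```rust
// Hypothetical output of a generator step: a table sorted by language
// subtag so it can be binary-searched without any runtime parsing.
static LIKELY_SUBTAGS: &[(&str, &str)] = &[
    ("de", "de-Latn-DE"),
    ("en", "en-Latn-US"),
    ("ja", "ja-Jpan-JP"),
];

/// Look up the maximized form for a language subtag.
pub fn maximize(language: &str) -> Option<&'static str> {
    LIKELY_SUBTAGS
        .binary_search_by_key(&language, |&(key, _)| key)
        .ok()
        .map(|index| LIKELY_SUBTAGS[index].1)
}
```

The lookup is fast because the data lives in the binary's static memory, but updating it means regenerating and recompiling, and two crates that each bake in the same table will both carry a copy.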
It would be good for all crates around Unicode in Rust to have cohesive data loading models. I suggest two:
We'd need to design such a format, write tooling for building such resource files, and write an API for loading them into memory.