This repository has been archived by the owner on Mar 29, 2024. It is now read-only.

Data sources #5

Closed
zbraniecki opened this issue Oct 21, 2019 · 5 comments

Comments

@zbraniecki
Member

The three crates I maintain (intl_pluralrules, unic-langid, and unic-locale) currently all store their data in source-code tables that are generated as a separate step.

This provides very good performance, but it is inflexible and may lead to data duplication.

It would be good for all crates around Unicode in Rust to have cohesive data loading models. I suggest two:

  • baked in data as an optional feature
  • data loaded from optimized resource files

We'd need to design such a format, write tooling for building the resource files, and design an API for loading them into memory.

@zbraniecki
Member Author

Also, we'd need to design compatibility policies: which version of a crate supports which range of CLDR/Unicode data.

@sffc
Member

sffc commented Oct 21, 2019

Let's talk about data sources here, and data compatibility across versions in #4

@zbraniecki
Member Author

I did some experimentation around this among the crates I maintain, with mixed results.

My current thinking is that we want some sort of CLDRDataManager struct that would facilitate loading resources and maintaining an in-memory cache of them.

This way, we'd avoid Locale and PluralRules each fetching the same JSON/XML/XXX file from disk to read the CLDR data.

I'd like such a manager to handle at least two types of resources: raw, either XML or JSON (leaning toward JSON), and "optimized", some binary format that we'll come up with that will make reading into memory instantaneous.

I imagine such a manager holding data corresponding to the current CLDR resource split, able to fetch the likelySubtags resource or the layout resource for pl and parse it into a basic struct (maintained by serde or some other JSON parser for JSON CLDR).
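
As a rough illustration of the caching idea, here is a minimal sketch that loads each resource at most once and hands out shared copies afterwards. Everything in it is an assumption made for illustration: the `CldrDataManager` shape, the loader closure standing in for disk access, and the use of raw strings instead of parsed structs.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical manager: loads each CLDR resource once and caches it in memory.
/// The loader closure stands in for reading a JSON or binary file from disk.
pub struct CldrDataManager<F>
where
    F: Fn(&str) -> Option<String>,
{
    loader: F,
    cache: HashMap<String, Rc<str>>,
    loads: usize,
}

impl<F> CldrDataManager<F>
where
    F: Fn(&str) -> Option<String>,
{
    pub fn new(loader: F) -> Self {
        Self { loader, cache: HashMap::new(), loads: 0 }
    }

    /// Returns the resource for `key`, calling the loader only on a cache miss.
    pub fn get(&mut self, key: &str) -> Option<Rc<str>> {
        if let Some(hit) = self.cache.get(key) {
            return Some(Rc::clone(hit));
        }
        let raw = (self.loader)(key)?;
        self.loads += 1;
        let shared: Rc<str> = Rc::from(raw);
        self.cache.insert(key.to_string(), Rc::clone(&shared));
        Some(shared)
    }

    /// Number of times the loader actually produced data (cache misses that succeeded).
    pub fn load_count(&self) -> usize {
        self.loads
    }
}
```

With this shape, two consumers asking for `likelySubtags` would share one loaded copy rather than each reading the file themselves.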

Then, all crates would by default have a constructor like this:

pub fn new_with_manager(..., cldr: &CLDRDataManager) -> Self;

and optionally a set of features which would allow them to use baked-in data, such as we have right now.

So, no matter if you have data baked in or not, you can always construct with the manager, but if you have data baked in, you can also construct without it, and the methods that require CLDR data will be available.
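
A minimal sketch of how the two construction paths could coexist, assuming a Cargo feature named `baked-data` (a hypothetical name) and a toy manager that only knows plural category names. The constructor here returns `Option<Self>` rather than `Self` so that a missing locale can be reported; that is a choice made for this sketch, not something settled above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the manager; maps a locale to its plural category names.
pub struct CldrDataManager {
    plural_categories: HashMap<String, Vec<String>>,
}

impl CldrDataManager {
    pub fn new() -> Self {
        Self { plural_categories: HashMap::new() }
    }

    pub fn insert_plural_categories(&mut self, locale: &str, categories: &[&str]) {
        let owned = categories.iter().map(|c| c.to_string()).collect();
        self.plural_categories.insert(locale.to_string(), owned);
    }
}

pub struct PluralRules {
    categories: Vec<String>,
}

impl PluralRules {
    /// Always available, whether or not data is baked in.
    pub fn new_with_manager(locale: &str, cldr: &CldrDataManager) -> Option<Self> {
        let categories = cldr.plural_categories.get(locale)?.clone();
        Some(Self { categories })
    }

    /// Only compiled when the (hypothetical) `baked-data` feature is enabled.
    #[cfg(feature = "baked-data")]
    pub fn new(locale: &str) -> Option<Self> {
        BAKED
            .iter()
            .find(|(l, _)| *l == locale)
            .map(|(_, cats)| Self { categories: cats.iter().map(|c| c.to_string()).collect() })
    }

    pub fn categories(&self) -> &[String] {
        &self.categories
    }
}

/// Generated table that would be produced by a separate build step.
#[cfg(feature = "baked-data")]
static BAKED: &[(&str, &[&str])] = &[("en", &["one", "other"])];
```

Code that must work in both configurations would call `new_with_manager`; code that opts into the feature could also call `new` without a manager.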

WDYT?

@sffc
Member

sffc commented Nov 20, 2019

Hmm.

I'd like such a manager to handle at least two types of resources: raw, either XML or JSON (leaning toward JSON), and "optimized", some binary format that we'll come up with that will make reading into memory instantaneous.

The "data manager" (I use the term "data provider") should, I think, expose a stable, versioned schema of the data to the consumer (like unic-locale). That schema should be set up such that the consumer needs to do minimal processing in order to use it in code.

Now, the data provider could get its data from a flatfile, a service, etc. It could support reading either CLDR JSON or a file with prebuilt data structures for better efficiency. However, that should be an internal of the data provider and not be exposed to the data consumer.
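
To make the separation concrete, here is a minimal sketch in which the consumer sees only a versioned struct and a trait, while two providers obtain the data differently. All names (`DecimalSymbolsV1`, `StaticProvider`, `TextProvider`) and the toy `locale=<decimal><grouping>` line format are assumptions for illustration; the text format merely stands in for something like CLDR JSON.

```rust
/// Hypothetical versioned schema: the only thing the consumer depends on.
#[derive(Debug, Clone, PartialEq)]
pub struct DecimalSymbolsV1 {
    pub decimal_separator: char,
    pub grouping_separator: char,
}

/// The provider interface; where the data comes from is an internal detail.
pub trait DataProvider {
    fn decimal_symbols(&self, locale: &str) -> Option<DecimalSymbolsV1>;
}

/// Provider backed by prebuilt data structures.
pub struct StaticProvider;

impl DataProvider for StaticProvider {
    fn decimal_symbols(&self, locale: &str) -> Option<DecimalSymbolsV1> {
        match locale {
            "en" => Some(DecimalSymbolsV1 { decimal_separator: '.', grouping_separator: ',' }),
            "de" => Some(DecimalSymbolsV1 { decimal_separator: ',', grouping_separator: '.' }),
            _ => None,
        }
    }
}

/// Provider that parses a raw text source on request (toy format, one locale per line).
pub struct TextProvider {
    raw: String,
}

impl TextProvider {
    pub fn new(raw: &str) -> Self {
        Self { raw: raw.to_string() }
    }
}

impl DataProvider for TextProvider {
    fn decimal_symbols(&self, locale: &str) -> Option<DecimalSymbolsV1> {
        self.raw.lines().find_map(|line| {
            let (loc, rest) = line.split_once('=')?;
            if loc.trim() != locale {
                return None;
            }
            let mut chars = rest.chars().filter(|c| !c.is_whitespace());
            Some(DecimalSymbolsV1 {
                decimal_separator: chars.next()?,
                grouping_separator: chars.next()?,
            })
        })
    }
}

/// A consumer: it needs almost no processing and never learns the data's origin.
pub fn format_fraction(provider: &dyn DataProvider, locale: &str, int: u32, frac: u32) -> Option<String> {
    let symbols = provider.decimal_symbols(locale)?;
    Some(format!("{}{}{}", int, symbols.decimal_separator, frac))
}
```

Swapping `StaticProvider` for `TextProvider` changes nothing on the consumer side, which is the property the schema is meant to guarantee.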

Then, all crates would by default have a constructor like this:

pub fn new_with_manager(..., cldr: &CLDRDataManager) -> Self;

I haven't found a great API solution. On the JavaScript side, my favorite is something along the lines of,

let dataProvider = // ...
let formatter = await Intl.Async(dataProvider).NumberFormat();

That only works if you're building a global namespace, like we are with ECMA-402. An alternative, for Rust, could be to have a trait like,

trait LocaleDataConsumer {
    // Rust parameters are written `name: Type`.
    fn with_data_provider(&mut self, data_provider: &omnicu::DataProvider) -> &Self;
}

Most likely, the implementer of this trait would essentially rebuild the object using data from the provider, and not keep a long-lived reference to the provider. If a data provider never gets provided, then the implementer can return errors (or panic) if methods requiring data are used.
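
A minimal sketch of such an implementer, assuming a toy `DataProvider` that holds likely-subtags pairs (a stand-in for the hypothetical `omnicu::DataProvider`). The `Locale` copies what it needs out of the provider and keeps no reference to it; data-dependent methods return an error until data has been supplied. The `DataError` type and `maximize` method are illustrative names only.

```rust
/// Toy stand-in for the provider; holds (from, to) likely-subtags pairs.
pub struct DataProvider {
    likely_subtags: Vec<(String, String)>,
}

impl DataProvider {
    pub fn new(pairs: &[(&str, &str)]) -> Self {
        Self {
            likely_subtags: pairs.iter().map(|(a, b)| (a.to_string(), b.to_string())).collect(),
        }
    }
}

pub trait LocaleDataConsumer {
    fn with_data_provider(&mut self, data_provider: &DataProvider) -> &Self;
}

#[derive(Debug, PartialEq)]
pub enum DataError {
    /// No data was provided, or the provider had no entry for this locale.
    MissingData,
}

pub struct Locale {
    tag: String,
    likely: Option<String>,
}

impl Locale {
    pub fn new(tag: &str) -> Self {
        Self { tag: tag.to_string(), likely: None }
    }

    /// Requires data; reports an error instead of panicking when none is present.
    pub fn maximize(&self) -> Result<&str, DataError> {
        self.likely.as_deref().ok_or(DataError::MissingData)
    }
}

impl LocaleDataConsumer for Locale {
    fn with_data_provider(&mut self, data_provider: &DataProvider) -> &Self {
        // Copy the needed entry so no long-lived reference to the provider remains.
        self.likely = data_provider
            .likely_subtags
            .iter()
            .find(|(from, _)| *from == self.tag)
            .map(|(_, to)| to.clone());
        self
    }
}
```

Because the data is copied, the provider can be dropped right after the call and the `Locale` keeps working.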

An alternative, which may be easier but may be considered bad practice, would be to set a global data provider.

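// locale.rs
static mut DATA_PROVIDER: Option<omnicu::DataProvider> = None;

impl Locale {
  // Takes the provider by value so that it can be stored in the static.
  pub fn set_global_data_provider(data_provider: omnicu::DataProvider) {
    // Writing to a `static mut` is unsafe; callers must rule out concurrent access.
    unsafe {
      DATA_PROVIDER = Some(data_provider);
    }
  }
}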
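
For comparison, a global can also be expressed without `static mut` by using the standard library's `std::sync::OnceLock`. This is a minimal sketch with a toy `DataProvider` struct as a stand-in; note that it differs from the snippet above in that the provider can be set only once.

```rust
use std::sync::OnceLock;

/// Toy stand-in for the provider type.
#[derive(Debug)]
pub struct DataProvider {
    pub name: String,
}

// Thread-safe, write-once global; no `unsafe` needed.
static GLOBAL_PROVIDER: OnceLock<DataProvider> = OnceLock::new();

/// Installs the global provider. Returns the rejected provider if one was already set.
pub fn set_global_data_provider(provider: DataProvider) -> Result<(), DataProvider> {
    GLOBAL_PROVIDER.set(provider)
}

/// Returns the global provider, or `None` if it has not been set yet.
pub fn global_data_provider() -> Option<&'static DataProvider> {
    GLOBAL_PROVIDER.get()
}
```

The write-once restriction avoids data races, but it does not address the broader concern that hidden global state may be considered bad practice.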

and optionally a set of features which would allow them to use baked-in data, such as we have right now.

Yeah, a compile-time flag to turn baked-in data on or off sounds like a good thing. Not sure which should be the default.

At the end of the day, the data provider will take time to design properly and should be aligned with tc39/ecma402#210 and the needs of Omni ICU; we shouldn't jump into something before we think it all the way through.

This was referenced Jan 10, 2020
@sffc
Member

sffc commented Jun 30, 2020

The doc data-pipeline.md in the ICU4X repo covers these issues well.

The issue of a global data provider is being discussed in unicode-org/icu4x#150

@sffc closed this as completed Jun 30, 2020