Implement locale fallbacking in data loading #1109
Comments
I wrote a design doc for this: https://docs.google.com/document/d/1Mp7EUyl-sFh_HZYgyeVwj88vJGpCBIWxzlCwGgLCDwM/edit#
There are still some open algorithm questions, which are mostly minor, but there is one big one that I still need to answer. There are three ways to represent languages in terms of resource option strings:
Consider the following examples and how the languages get represented in each:
Notes:
Each of the three options has downsides:
To be clear, all three options work if the vertical fallback adapter is used; however, the behavior differs for clients who choose to leave out vertical fallback for performance reasons (e.g. all locales hard-coded at compile time). I was going to implement Option 2, but I discovered that English is a multi-script language, meaning that English goes to the slow path. Therefore, we should reconsider Options 1 and 3, or consider going with Option 2 but with a hand-curated list of multi-script languages since it seems unlikely that CLDR will be adding "en-Shaw" data. Thoughts?
I thought of an Option 4 that should solve this; I'll post an update to the design doc.
I mentioned this in the document, but my impression is that folks avoiding fallback would be of two kinds: A. Folks on resource-constrained environments where they do not want to pay the cost. For A, we can perhaps make sure fallbacking is the most convenient (or reasonably convenient) option when it comes to using the library. For B, perhaps we can provide a way to pre-fallback during datagen? These environments almost always target a more closed set of locales, and they could in theory list the set of potential request locales during datagen, and we could generate simpler fallback, generate a report as to what locales should be used, or store the keys with fallback "unapplied" (e.g. if you know that the app is only using Basically, we can offer tools to mitigate the problems for both. They may not be sufficient; I'm not entirely happy with the tradeoff.
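To make the "pre-fallback during datagen" idea concrete, here is a minimal sketch of the report such a tool could produce. Everything in it is an assumption for illustration: `pre_resolve` is a hypothetical helper, not an ICU4X API, and the fallback it applies is plain right-to-left subtag truncation with no script or parent-locale handling.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical datagen-time helper: for each locale the app may request,
/// find the data locale that would serve it, so that no fallback code has
/// to run at runtime.
///
/// Simplification: fallback here just drops the last subtag until a match
/// is found; anything that never matches is reported as "und" (root).
fn pre_resolve(requests: &[&str], available: &BTreeSet<&str>) -> BTreeMap<String, String> {
    let mut report = BTreeMap::new();
    for &req in requests {
        let mut candidate = req;
        let resolved = loop {
            if available.contains(candidate) {
                break candidate;
            }
            match candidate.rfind('-') {
                // Drop the last subtag and try again.
                Some(idx) => candidate = &candidate[..idx],
                // Nothing left to drop: fall back to root.
                None => break "und",
            }
        };
        report.insert(req.to_string(), resolved.to_string());
    }
    report
}

fn main() {
    let available: BTreeSet<&str> = ["en", "en-GB", "und"].iter().copied().collect();
    let report = pre_resolve(&["en-US", "en-GB", "fr-CA"], &available);
    for (request, data_locale) in &report {
        println!("{request} is served by {data_locale}");
    }
}
```

A client that hard-codes its locales could use a report like this to request the resolved data locale directly and leave out the runtime fallback adapter.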
You could also make this work with strategy 1 where you can codegen a simpler, faster fallback data provider that does the mapping for you. We could even give users a choice: We could say that it is up to the user to pick one of these two options:
Unsure if this is clean, though.
The situation I'm trying to strive for is a data file that supports both pre-resolved locales and vertical fallback locales at the same time. For example, a data pack can say "I support fully-resolved locales X, Y, and Z out of the box, and more if you enable vertical fallback". I think it's bad if "zh-HK" is one of the fully-resolved locales but implies "zh-Hans-HK" instead of "zh-Hant-HK". I updated the design doc with a solution that I believe solves the problem. Since we need to support parent locales anyway, we should just go with the ICU/CLDR standard of adding in the script during fallback resolution. We still have normalization, but we leave this part to fallback instead of normalization. I made this the default choice in the design doc. An additional advantage of that solution is that it does not require mapping CLDR locales to ICU4X locales; we can use the CLDR locales directly.
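As a rough illustration of "adding in the script during fallback resolution", the sketch below builds a fallback chain in which "zh-HK" passes through "zh-Hant-HK" and "zh-Hant" rather than "zh". It is not the design doc's algorithm: `likely_script`, `parse`, and `fallback_chain` are hypothetical names, the likely-script table is a tiny hand-written stand-in for CLDR likely-subtags data, and variants and extension keywords are ignored.

```rust
/// Hypothetical stand-in for CLDR likely-subtags data (illustrative entries only).
fn likely_script(lang: &str, region: Option<&str>) -> Option<&'static str> {
    match (lang, region) {
        ("zh", Some("HK" | "MO" | "TW")) => Some("Hant"),
        ("zh", _) => Some("Hans"),
        ("sr", _) => Some("Cyrl"),
        ("en", _) => Some("Latn"),
        _ => None,
    }
}

/// Splits "lang[-Script][-REGION]" into parts; other subtags are ignored.
fn parse(locale: &str) -> (&str, Option<&str>, Option<&str>) {
    let mut parts = locale.split('-');
    let lang = parts.next().unwrap_or("und");
    let (mut script, mut region) = (None, None);
    for p in parts {
        if p.len() == 4 && script.is_none() && region.is_none() {
            script = Some(p);
        } else if (p.len() == 2 || p.len() == 3) && region.is_none() {
            region = Some(p);
        }
    }
    (lang, script, region)
}

/// Appends `s` unless it is already in the chain.
fn push_unique(chain: &mut Vec<String>, s: String) {
    if !chain.contains(&s) {
        chain.push(s);
    }
}

/// Builds the fallback chain, inserting the script when it differs from the
/// language's default script (assumption: such locales skip the bare
/// language and go straight to root, as CLDR parent locales do for zh-Hant).
fn fallback_chain(locale: &str) -> Vec<String> {
    let (lang, script, region) = parse(locale);
    let default_script = likely_script(lang, None);
    // Only fill in the script when the request did not specify one.
    let script = script.or_else(|| likely_script(lang, region));

    let mut chain = vec![locale.to_string()];
    match script {
        Some(s) if script != default_script => {
            if let Some(r) = region {
                push_unique(&mut chain, format!("{lang}-{s}-{r}"));
            }
            push_unique(&mut chain, format!("{lang}-{s}"));
        }
        _ => push_unique(&mut chain, lang.to_string()),
    }
    push_unique(&mut chain, "und".to_string());
    chain
}

fn main() {
    for locale in ["zh-HK", "zh-CN", "sr-Latn-RS", "en"] {
        println!("{locale}: {:?}", fallback_chain(locale));
    }
}
```

With this shape, a data pack can ship "zh-HK" as a fully-resolved locale without it ever being read as Simplified Chinese, because the script is decided at fallback time and not baked into normalization.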
That makes sense. I think your approach is quite reasonable and consistent.
For associating the fallback configuration with the key, I think it should become a property of ResourceKey. This means that it shows up in the resource key definition in code. Right now we mostly rely on the
#[icu_provider::data_struct(
CardinalV1Marker = "plurals/cardinal@1",
OrdinalV1Marker = "plurals/ordinal@1"
)]
What do we start writing instead that is valid Rust syntax? Some options I was considering:
// Option 1: Add an extra nesting layer so that we can accept more arguments
#[icu_provider::data_struct(
marker(CardinalV1Marker, path = "plurals/cardinal@1"),
marker(OrdinalV1Marker, path = "plurals/ordinal@1", fallback_strategy = "region")
)]
// Option 2: Keep the current syntax but append extra info to the string
#[icu_provider::data_struct(
CardinalV1Marker = "plurals/cardinal@1",
OrdinalV1Marker = "plurals/ordinal@1[fallback_strategy=region]"
)]
I'm strongly leaning toward option 1 even though it is slightly more verbose. The data struct definitions are already extremely verbose. So, let's bikeshed option 1:
// Option 1a: more verbose
#[icu_provider::data_struct(
marker(CardinalV1Marker, path = "plurals/cardinal@1", fallback_strategy = "region", extension_keyword = "nu"),
)]
// Option 1b: less verbose
#[icu_provider::data_struct(
marker(CardinalV1Marker, path = "plurals/cardinal@1", fallback_by = "region", extension_kw = "nu"),
)]
// Option 1c: condensed
#[icu_provider::data_struct(
marker(CardinalV1Marker, p="plurals/cardinal@1", fb="region", kw="nu"),
)]
// Option 1d: no key on the path with less verbose options
#[icu_provider::data_struct(
marker(CardinalV1Marker, "plurals/cardinal@1", fallback_by = "region", extension_kw = "nu"),
)]
Any objections to 1d?
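For reference, here is one way the "fallback configuration as a property of ResourceKey" idea could look once the macro arguments from 1d are expanded. This is only a sketch under assumptions: the `ResourceKey` below is a simplified stand-in and not the real `icu_provider` type, and `FallbackBy`, `FallbackConfig`, and the constant names are hypothetical.

```rust
/// Hypothetical fallback strategies; "region" is from the thread, the
/// `Language` default is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FallbackBy {
    Language,
    Region,
}

/// Hypothetical per-key fallback configuration mirroring the 1d arguments
/// `fallback_by = "..."` and `extension_kw = "..."`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct FallbackConfig {
    fallback_by: FallbackBy,
    extension_kw: Option<&'static str>,
}

impl FallbackConfig {
    /// What a key would get if the macro call gave only a path.
    const DEFAULT: FallbackConfig = FallbackConfig {
        fallback_by: FallbackBy::Language,
        extension_kw: None,
    };
}

/// Simplified stand-in for a resource key that carries its fallback config.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ResourceKey {
    path: &'static str,
    fallback: FallbackConfig,
}

impl ResourceKey {
    /// `const fn` so that keys can be built at compile time.
    const fn new(path: &'static str, fallback: FallbackConfig) -> Self {
        ResourceKey { path, fallback }
    }
}

// Roughly what
// `marker(CardinalV1Marker, "plurals/cardinal@1", fallback_by = "region", extension_kw = "nu")`
// might expand to under these assumptions.
const CARDINAL_V1: ResourceKey = ResourceKey::new(
    "plurals/cardinal@1",
    FallbackConfig {
        fallback_by: FallbackBy::Region,
        extension_kw: Some("nu"),
    },
);

// A key declared with only a path keeps the default configuration.
const ORDINAL_V1: ResourceKey = ResourceKey::new("plurals/ordinal@1", FallbackConfig::DEFAULT);

fn main() {
    for key in [&CARDINAL_V1, &ORDINAL_V1] {
        println!("{} -> {:?}", key.path, key.fallback);
    }
}
```

Keeping the configuration in a `const` value means a fallback adapter could read it from the key without any extra lookup, which is the main appeal of tying it to the key definition.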
1d seems fine.
We have discussed the design around locale fallbacking in #173. This issue is to track the implementation of locale fallbacking in the data provider pipeline. An implementation-focused design doc will likely be needed.
Related issues: