Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement locale fallbacking in data loading #1109

Closed
sffc opened this issue Sep 26, 2021 · 9 comments · Fixed by #2206
Closed

Implement locale fallbacking in data loading #1109

sffc opened this issue Sep 26, 2021 · 9 comments · Fixed by #2206
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-epic Size: Major project (create smaller child issues) T-core Type: Required functionality

Comments

@sffc
Copy link
Member

sffc commented Sep 26, 2021

We have discussed the design around locale fallbacking in #173. This issue is to track the implementation of locale fallbacking in the data provider pipeline. An implementation-focused design doc will likely be needed.

Related issues:

@sffc sffc added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters S-large Size: A few weeks (larger feature, major refactoring) labels Sep 26, 2021
@sffc sffc added this to the ICU4X 0.5 milestone Sep 26, 2021
@sffc sffc self-assigned this Sep 26, 2021
@sffc sffc added S-epic Size: Major project (create smaller child issues) and removed S-large Size: A few weeks (larger feature, major refactoring) labels Sep 26, 2021
@sffc sffc modified the milestones: ICU4X 0.5, ICU4X 0.6 Jan 9, 2022
@sffc
Copy link
Member Author

sffc commented Jan 9, 2022

This is not going to make 0.5 because of outstanding design issues.

Soft-blocked on #1488, pending a resolution to #1462.

@sffc
Copy link
Member Author

sffc commented Jun 14, 2022

I wrote a design doc for this:

https://docs.google.com/document/d/1Mp7EUyl-sFh_HZYgyeVwj88vJGpCBIWxzlCwGgLCDwM/edit#

There are still some open algorithm questions, which are mostly minor, but there is one big one that I still need to answer. There are three ways to represent languages in terms of resource option strings:

  1. Always language+script
  2. Language+script for multi-script languages; otherwise, only language
  3. Only language for the default script; otherwise, language+script

Consider the following examples and how the languages get represented in each:

Language Option 1 Option 2 Option 3
Italian it-Latn it it
Cyrillic Serbian sr-Cyrl sr-Cyrl sr
Latin Serbian sr-Latn sr-Latn sr-Latn
Simplified Chinese zh-Hans zh-Hans zh
Traditional Chinese zh-Hant zh-Hant zh-Hant

Notes:

  • Italian is a single-script language in CLDR
  • Default script in Serbian is Cyrillic
  • Default script in Chinese is Simplified

Each of the three options has downsides:

  1. With Option 1, we need to carry more data, and lookups become slower for the common case because we need to read the script a lot more often
  2. With Option 2, languages can change from single-script to multi-script in arbitrary CLDR releases, resulting in unstable behavior across CLDR releases for users who load data directly without using vertical fallback
  3. With Option 3, we get surprising behavior in regions that use a non-default script unless vertical fallback is used; for example, "sr-ME" loads Cyrillic (should be Latin), and "zh-TW" loads Simplified (should be Traditional)

To be clear, all three options work if the vertical fallback adapter is used; however, the behavior differs for clients who choose to leave out vertical fallback for performance reasons (e.g. all locales hard-coded at compile time).

I was going to implement Option 2, but I discovered that English is a multi-script language, meaning that English goes to the slow path. Therefore, we should reconsider Options 1 and 3, or consider going with Option 2 but with a hand-curated list of multi-script languages since it seems unlikely that CLDR will be adding "en-Shaw" data.

Thoughts?

@sffc
Copy link
Member Author

sffc commented Jun 14, 2022

I thought of an Option 4 that should solve this; I'll post an update to the design doc.

@Manishearth
Copy link
Member

Manishearth commented Jun 14, 2022

for users who load data directly without using vertical fallback

I mentioned this in the document but my impression of folks avoiding fallback would be of two kinds:

A. Folks on resource constrained environments where they do not want to pay the cost
B. Folks (especially rustaceans stumbling across a "rust i18n lib" without full context) who are trying to use ICU4X for something small and want it to be simple.

For A. we can perhaps make sure fallbacking is the most convenient (or reasonably conveneint) option when it comes to using the library.

For B, perhaps we can provide a way to pre-fallback during datagen? These environments almost always are targeting a more closed set of locales, and they could in theory list the set of potential request locales during datagen and we could generate simpler fallback, generate a report as to what locales should be used, or store the keys with fallback "unapplied" (e.g. if you know that the app is only using hi you can just store the maximized hi-Deva-IN key as hi instead)

Basically, we can offer tools to mitigate the problems for both. They may not be sufficient, I'm not entirely happy with the tradeoff.

@Manishearth
Copy link
Member

You could also make this work with strategy 1 where you can codegen a simpler, faster fallback data provider that does the mapping for you.

We could even give users a choice: We could say that it is up to the user to pick one of these two options:

  • either run datagen without a list of potential input locales AND use vertical fallback (option 1)
    • if you know all your input locales you can also generate a faster vertical fallback function as rust code
  • or run datagen with a list of potential input locales and opt out of vertical fallback, instead getting keys stored as what your input will be

unsure if this is clean thoough

@sffc
Copy link
Member Author

sffc commented Jun 14, 2022

The situation I'm trying to strive for is a data file that supports both pre-resolved locales and vertical fallback locales at the same time. For example, a data pack can say "I support fully-resolved locales X, Y, and Z out of the box, and more if you enable vertical fallback". I think it's bad if "zh-HK" is one of the fully-resolved locales but implies "zh-Hans-HK" instead of "zh-Hant-HK".

I updated the design doc with a solution that I believe solves the problem. Since we need to support parent locales anyway, we should just go with the ICU/CLDR standard of adding in the script during fallback resolution. We still have normalization, but we leave this part to fallback instead of normalization. I made this the default choice in the design doc.

An additional advantage of that solution is that it does not require mapping CLDR locales to ICU4X locales; we can use the CLDR locales directly.

@Manishearth
Copy link
Member

That makes sense. I think your approach is quite reasonable and consistent.

@sffc
Copy link
Member Author

sffc commented Jun 25, 2022

For associating the fallback configuration with the key, I think it should become a property of ResourceKey. This means that it shows up in the resource key definition in code.

Right now we mostly rely on the data_struct macro, which looks like

#[icu_provider::data_struct(
    CardinalV1Marker = "plurals/cardinal@1",
    OrdinalV1Marker = "plurals/ordinal@1"
)]

What do we start writing instead that is valid Rust syntax? Some options I was considering:

// Option 1: Add an extra nesting layer so that we can accept more arguments
#[icu_provider::data_struct(
    marker(CardinalV1Marker, path = "plurals/cardinal@1"),
    marker(OrdinalV1Marker, path = "plurals/ordinal@1", fallback_strategy = "region")
)]

// Option 2: Keep the current syntax but append extra info to the string
#[icu_provider::data_struct(
    CardinalV1Marker = "plurals/cardinal@1",
    OrdinalV1Marker = "plurals/ordinal@1[fallback_strategy=region]"
)]

I'm strongly leaning toward option 1 even though it is slightly more verbose. The data struct definitions are already extremely verbose. So, let's bikeshed option 1:

// Option 1a: more verbose
#[icu_provider::data_struct(
    marker(CardinalV1Marker, path = "plurals/cardinal@1", fallback_strategy = "region", extension_keyword = "nu"),
)]

// Option 1b: less verbose
#[icu_provider::data_struct(
    marker(CardinalV1Marker, path = "plurals/cardinal@1", fallback_by = "region", extension_kw = "nu"),
)]

// Option 1c: condensed
#[icu_provider::data_struct(
    marker(CardinalV1Marker, p="plurals/cardinal@1", fb="region", kw="nu"),
)]

// Option 1d: no key on the path with less verbose options
#[icu_provider::data_struct(
    marker(CardinalV1Marker, "plurals/cardinal@1", fallback_by = "region", extension_kw = "nu"),
)]

Any objections to 1d?

@Manishearth
Copy link
Member

1d seems fine

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-epic Size: Major project (create smaller child issues) T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants