-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add icu_preferences util #1833
Add icu_preferences util #1833
Conversation
utils/preferences/tests/dtf.rs
Outdated
} | ||
|
||
impl DateTimeFormat { | ||
pub fn new(locale: &Locale, prefs: &DTFPreferences) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussion: If we add the preferences bag as a separate argument, we end up with every formatter constructor taking 4 arguments: locale, provider, options, and preferences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should. I'm ok with preferences being part of options. This is just an example.
utils/preferences/tests/dtf.rs
Outdated
} | ||
|
||
impl DateTimeFormat { | ||
pub fn new(locale: &Locale, prefs: &DTFPreferences) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought: I want to make the Locale
argument be a trait (see #1662). Should we consider making that trait also encompass the preferences bag? For example, plain Locale
can implement the trait, and so can DTFPreferences
, which would wrap the Locale
and add extra overrides.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that preferences shouldn't be folded into Locale
. Preferences can be resolved without locale, and I am planning to make the resolver/constructor also accept None
for Locale
and just resolve between Preferences and ResolvedPreferences.
@Manishearth do you have any suggestion on how to approach handling of |
Yes, this is correct, and the plan is that the "regular" DateTimeFormat type will handle AnyCalendars, and what we currently call DateTimeFormat will instead be called TypedDateTimeFormat or something |
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
@sffc, @Manishearth - requesting early feedback on the discussed model. I'm introducing generic I focused only on architecture and DX so far and I think it starts looking quite nice. One hiccup I found so far architecturally is that if I pass The other imperfection I see is that I haven't looked at all at performance/memory of this approach. I'd appreciate your early feedback on that as well since I'm trying to avoid switching to that mode until DX is complete. |
I think this looks pretty promising! |
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
@sffc - updated based on our HLD. Unfortunately, we didn't capture notes so I was working from memory. Please let me know how it looks to you now and if there's something I missed. |
I have some scattered thoughts. But before I get into detail on those, I want to step back and look at what the resulting DTF constructor looks like. I'd like to start there and work backwards. I see several paths, all of which have pros and cons:
Options 5 and 6 are more aligned with my mental model coming in. It is also what ECMA-402 is doing. Inside the constructor, we new Intl.DateTimeFormat("en").resolvedOptions().calendar
// 'gregory'
new Intl.DateTimeFormat("en-u-ca-hebrew").resolvedOptions().calendar
// 'hebrew'
new Intl.DateTimeFormat("en-u-ca-hebrew", { calendar: "japanese" }).resolvedOptions().calendar
// 'japanese' My understanding is that @zbraniecki's main concern with using a locale to carry preferences is that we may want to scale to preferences that don't fit into BCP-47. My response to that is that Note: |
This is not what I propose here. I propose just it is intentional, as:
Why would we do that? I did it before and the problem is that the trait is mainly useful to have an implicit
What would it give us? This makes Locale a superset of Preferences, which it is not.
How is |
Where do non-preference options fit in, like date style, components bag, etc? (This is what I meant by "options bag")
We discussed this as an option in our meeting and the idea was so that foreign types could implement the trait and be compatible with ICU4X, but I think it's infeasible if every formatter has a separate trait, so I agree; we can dismiss this option.
There would be a single
Yes, a single struct for all components. I would like to understand more about your concern of scaling that solution. As far as I know, your only concern was the ability to support non-BCP-47 values. I think is a tractable problem that can be supported in a single struct: in the future, we add more fields (public or private) to hold more structured data. Please note that the data provider framework wants everything in a single standard format (be it Locale or DataLocale or otherwise). It wants to work the same way across components. |
Note: my envisioned data model for DataLocale is a Language Identifier + map from Unicode Extension Key to Unicode Extension Value. (This is exactly what ResourceOptions is currently.) |
They are part of
It is not my only one. My concern is that operating on a union of all possible fields that any component may need is going to lead to awkward DX and limit DCE. It will also mean that such type has to keep getting extended for every option that any component needs and that requires revision of
We discussed that maybe in the future we'll want to add such "union" type, but I'd prefer not to and I'd prefer not to start with it. The type needed by DP is suboptimal for internal use by components - in particular, an open ended key-value list can contain invalid values. One of the benefits of the design I'm providing here is that the validation of the values is enforced when working with I understand that for DP that is not the case, and I recognize that we will want to add such |
OK, so this model essentially removes the locale as an explicit argument? Have we considered the implications of this on people knowing how and where to pass in the locale? I think it's a good design choice on the part of both ICU4C and ECMA-402 to basically always require a standalone locale argument.
I don't see how this affects either DX or DCE.
It's a map. More items can be added to the map without requiring code revision, so long as they conform to BCP-47.
That's a nice benefit, I agree. I also think that this case is fine to stick in an error variant if the values are invalid.
Understand that whatever type we add here needs to have a conversion into the DP type right off the bat. I see this as a primary motivator of the signature of formatter constructors. |
Correct.
I'm not convinced anymore, hence this design. My take is that Locale input from the perspective of formatters is fallible. In the API I'm proposing this is moved outside of the constructor giving developers ability to decide how to handle failures and allowing us to provide more nuanced methods like I'm growing belief that new Intl.DateTimeFormat("en-US", {
region: "CA",
});
What is a programmatic preference in your mental model? Is I argue that they are all user preferences, just provided in two ways (preferences bag vs locale) and merged. It is also lossless one way - you can create preferences bag from locale which is a fallible operation, but the other way may be lossy, but infallible. I agree that it is nice to allow for both - locale and preferences bag - to be provided, but I don't think they need to be provided to the constructor. It is sufficient that they can be merged prior to construction. |
Thanks for explaining your mental model more. I think it is a valid model, and a perspective to consider, but I'm not yet convinced that it is superior to the proven traditional model, and I'm hesitant to use ICU4X as a guinea pig to validate it.
To me, this seems like a solution without a real-world problem:
Yes, but I don't see how this particular belief translates into the structured all-in-one preferences bag proposal.
In my mental model, the bag of preferences comes from the operating system in the form of a BCP-47 string, a well-understood, extremely popular serialization format. It may contain However, So,
A major reason why I still think it's wise to keep the locale in the constructor signature is that it reminds the developer where they need to inject the locale. If they just provide a bag of options, they are at high risk of specifying date style and time style, for instance, and unintentionally leaving language, region, etc. at their default values. This problem could be partially mitigated by forcing Making it clear where developers need to put their locale parameter is, in my mind, the most basic tenant of i18n library design. |
Architecture of an API should be designed along the axis of fallible/infallible and lossy/lossless. Locale as is is fallible and lossy compared to Preferences. I don't understand your argument of "solution without a problem" - we have a "problem" - we need to design API to encode information and provide DX. I provide a solution that I see as superior to what has been done, reflecting the ICU4X API composition and easily usable from ECMA-402 APIs.
I think your mental model is incomplete here. "preferences bag" does not come from OS. It can come from developer choice, or from user preferences stored in the application. Locale can come from OS, or from third-party system or can be stored as a user preference in the app as well. Both can have multiple sources and the The API I'm proposing allows for composition of those models including one you evaluate, but also allowing for others including ones where developer has preferences, and user has preferences, and parent provides preferences and allows for chained multi-level combination of those in any order that the developer prefers allowing them to decide how to handle fallibility and lossiness. The model used by ECMA-402 does not allow for any of that - it forces one architecture, forces particular failure resolution and doesn't allow for composition.
That's a good point. I'll think about how to incorporate that into the DX. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a lot of nitpicks, which I won't post yet since I'm not sold on the architecture, but I thought I'd post one to start.
utils/preferences/src/preferences.rs
Outdated
fn region(&self) -> Option<&Region> { | ||
None | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: There is more than one concept of a region. There is the region in the sense of a localized version of a language, and there is a region in the sense of setting default preferences for things like currencies and measurement units. This distinction is handled automatically in BCP-47, but if you are forcing a locale into the Preferences mold, you should support both types of regions so that we don't lose information across the conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ this question about language region versus preference region remains unresolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you refer to region subtag vs rg
unicode extension.
TR35 calls the former region
and the latter region_override
with definition:
A Region Override specifies an alternate region to use for obtaining certain region-specific default values (those specified by the element), instead of using the region specified by the unicode_region_subtag in the Unicode Language Identifier (or inferred from the unicode_language_subtag).
I think we should follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need both types of region for data loading. For example, a British person living in the US who wants to make a MeasureFormat may want British translations ("metre") and American preferences ("miles per gallon") in the same constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed that DP may want to have access to more than just LID. In that case we will pass Preferences
(or DataLocale::from(Preferences)
to DP
rather than Preferences::LID
. This will allow DP to reason about region
and region_override
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does the other type of region get stored? Does this basically mean that every preferences object has a region_override field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every Preference
that has rg
as a relevant extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I'm not sure if I like codifying whether or not rg
is a relevant extension in each and every Preference
struct; I see that more as a data provider decision. CLDR could decide to change a certain piece of data from being language-based to region-based in a future edition; for example, currency symbols are currently being discussed as a possible candidate for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I'm not sure if I like codifying whether or not rg is a relevant extension in each and every Preference struct; I see that more as a data provider decision.
I disagree. It is a potential Preferences field and each struct should decide if it is. I can see it being converted to Option
in DataLocale
where preferences that have relevant and present will use Some
and everything else will result in None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, the set of what extension keys are relevant is dependent on the data keys reachable from a constructor. That set is changeable, so we should definitely at least make the preferences struct be non_exhaustive
Hmm, I wonder if there's a way to derive Preferences from the ResourceProvider bound?
814d03b
to
847296d
Compare
@sffc ready for your review. Can you mainly review the tests to verify that the architecture fits your expectations? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to land; I have only small issues at this point. The shape looks good.
pub list: &'static [DateTimeFormatResolvedPreferences], | ||
} | ||
|
||
const DEFAULT_PREFS: DefaultPrefs = DefaultPrefs { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought: I think most components won't want to use this exact pattern for DefaultPrefs; they will likely want to load their data first and then populate the ResolvedPreferences based on that. Consider adding a comment saying that this is just a test, not the way we expect all clients to write DefaultPrefs.
$( | ||
_ if *s == unicode::value!($key) => $name::$variant $(($v2::try_from(s).ok()))?, | ||
$( | ||
$(_ if *s == unicode::value!($key, $subk) => $name::$variant(Some($v2::$subv))),*, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This parses the unicode::Value each time over the match statement. It is more efficient to use writeable_cmp_bytes
so you don't need to parse anything.
($value:literal, $value2:literal) => {{ | ||
let v: &str = concat!($value, "-", $value2); | ||
let R: $crate::extensions::unicode::Value = | ||
match $crate::extensions::unicode::Value::try_from_bytes(v.as_bytes()) { | ||
Ok(r) => r, | ||
_ => panic!(concat!("Invalid Unicode extension value: ", $value)), | ||
}; | ||
R | ||
}}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This branch of the macro does runtime parsing and can't be used in a const. It doesn't do anything you can't do in userland; users should just use concat!($value, "-", $value2).parse::<unicode::Value>()
.
Custom(unicode::Value) | ||
} | ||
|
||
impl TryFrom<&unicode::Value> for $name { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: unknown values go to $name::Custom
. What is the error case?
//XXX: Perf | ||
let s = input.to_string(); | ||
let v = TinyAsciiStr::from_str(&s); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: use the doc-hidden fn input.as_single_subtag()
; you can consider making it public if you like
"rgsa" => Rgsa | ||
}); | ||
|
||
enum_keyword!(Calendar { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Observation: this list is missing some aliases. Some of the others might be, too, but I haven't checked. We should do a thorough review of all these enums before 2.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this enum still doesn't include aliases: is that intentional?)
23696e2
to
79e8ba9
Compare
9063db8
to
573bb04
Compare
I think this is the final shape before I start carving out pieces for landing. Changes:
@sffc - can you take a last look at this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Architecture still looks good, but there are details. Might be nice to make smaller PRs but I'm also okay landing the whole thing and working on it in tree.
/// ``` | ||
/// use icu::locid::extensions::Extensions; | ||
/// | ||
/// Extensions::try_from_bytes(b"u-hc-h12").unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we maybe didn't have this in part because it's unclear whether it should include the leading -
or not, as in -u-hc-h12
. I think the ToString impl adds the -
but I'm not 100% sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to become opinionated, but I'm aligned that parse should be more lenient (maybe support both) and serialization should be more strict.
crate::enum_keyword!(Collation { | ||
"standard" => Standard, | ||
"search" => Search, | ||
"phonetic" => Phonetic, | ||
"pinyin" => Pinyin, | ||
"searchjl" => Searchjl | ||
}, "co"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: This is far from a comprehensive list of collations. Also, I think this should probably be a struct (open enum) instead of a closed enum. @hsivonen might have thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "search" doesn't belong here, since it's for a different mode of usage. It makes sense for the user to have a preference between "stroke" and "zhuyin". It does not make sense for the user to express a preference between "stroke" and "search".
I'm not quite sure what the exact semantics of "search" vs. "searchjl" are, but 1) we don't have a search API at this point anyway and 2) I believe it's a contextual application design decision whether a given search operation with the Korean UI locale should use "search" or "searchjl".
So let's not put either "search" or "searchjl" here.
That isn't enough to decide what to put here.
As I understand it, the main use case for this is deciding whether for Traditional Chinese you want zhuyin
instead of the default-for-zh-Hant
stroke
. stroke
makes sense for any of the varieties of Chinese (Mandarin, Cantonese, etc.), but zhuyin
is Mandarin-focused. (I don't know if using zhuyin
sorting for e.g. Hakka is nonetheless a thing that user do.)
From theorizing in the reverse, one might except it to make sense for a user to override the default Mandarin-focused pinyin
collation order of zh-Hans
to stroke
when not using Mandarin, but I don't know what the actual user practice is.
Other than that choosing between Mandarin-phonetic (pinyin
and zhuyin
) vs. stroke
for zh
, it's not really clear how strong the use cases are. compat
for Arabic looks a lot like something that's about ICU change management from long ago (as were big5han
and gb2312
). Is standard
vs. dict
for Sinhala something that makes sense as a user preference? Is standard
vs. phonebk
for German something that makes sense as a user preference? Theoretically, standard
vs. trad
for Spanish looks potentially something that could make sense as user preference, but what's the level of demand, really? standard
vs. trad
for Finnish and Swedish is being removed. I have no idea of the user demand for standard
vs. trad
for Bangla.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the warnings I put on MDN for an idea of why this shouldn't just be "everything the LDML spec has" if the purpose is to model user preferences: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Locale/getCollations (When I wrote that, I thought the und
locale worked in Intl.Collator
and didn't realize if was an illusion of testing with en-US
browser UI locale…)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perfect, I appreciate your feedback and I'm glad to discuss it in context of introducing particular keys. For now I wanted to verify that the architecture works for us. We'll discuss this key in a PR for "co".
"rgsa" => Rgsa | ||
}); | ||
|
||
enum_keyword!(Calendar { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this enum still doesn't include aliases: is that intentional?)
|
||
#[inline] | ||
fn writeable_length_hint(&self) -> writeable::LengthHint { | ||
todo!() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember to find and replace any todo!()
s before merging
|
||
struct_keyword!( | ||
RegionOverride, | ||
"ro", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is ro
? Did you mean rg
?
use tinystr::TinyAsciiStr; | ||
|
||
struct_keyword!( | ||
Timezone, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timezone, | |
TimeZoneId, |
|
||
use crate::enum_keyword; | ||
|
||
enum_keyword!(Variant { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably be an open enum (struct), right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we'll discuss it in the PR specific for this key, but my logic here is that we should narrow down the keys to only "known" values - e.g. values we actually may want to act on in the code.
In other words - the fact that "va" may have any subtag is not meaningful for this crate. If we have nowhere in ICU4X where we operate on the particular value of "va" then it's ok to skip "va" - it's meaningless from the perspective of icu_preferences
.
If in the next version of ICU4X we will start having logic specific to va=foo
then we will in the next version add Foo
as a variant of this enum.
The only open question for me is if we'd like to support Custom
which might be useful for non-ICU4X users of icu_preferences
- for example some crate like Fluent may want to use icu_preferences
and have logic that uses a variant that we don't have logic for. But the way I wrote the architecture allows us to add it in a backward compatible way incrementally.
prefs.merge_locale(loc); | ||
let dtf = DateTimeFormat::new(prefs, Default::default()); | ||
assert_eq!(dtf.format(0), String::from("Monday, June 23rd 2022, 24:13")); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ I don't think we decided for sure on the borrowing and ownership considerations.
use crate::parser::ParserError; | ||
use crate::subtags::Region; | ||
|
||
impl_tinystr_subtag!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought: I'm not completely sure about whether this is the right place to put this subdivision logic. It does largely make sense in icu_locid, but as you pointed out, we don't currently return any values from icu_locid that make use of this type. I also don't necessarily think we should continue making the icu_locid APIs more complex. I think I could get behind a function on extensions::Unicode
that returned a typesafe subdivision or region override.
First patch from the #1833 patchset. It moves the `Subtag` to `locid::subtags::Subtag` - this is a generic subtag (2..=8, separate from private::Subtag which is 1..=8) which will be used to iterate over multi-part subtags like `extensions::unicode::Value` to retrieve specific items without relying on the underlying `TinyStr` storage. It's a breaking change because we are removing `extensions::other::Subtag`. I assume this is ok as we're working on 2.0 so I don't need to add alias and deprecate it, but please confirm.
Second part of #1833. This one cleans up from/to str for locid extensions. This prepares ground for reuse of Subtag in Value API which will be introduced in the next PR.
This is part of #1833 switching Value API to use Subtag.
Fixes #419.
This is an early WIP that finally captures what I had in mind for #419. It also creates a foundation for #594 (the data will have to come from DataProvider).