-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Implements ICU version of getDefaultLocale, fixes up ICU extension handling #3820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dilijev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed at jackhorton#1
| locale = callInstanceFunc(StringInstanceReplace, locale, match[2], ""); | ||
| } | ||
| } else { | ||
| // Windows' getDefaultLocale() will return a weird RFC4646 langtag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol -- "weird" is a funny way of saying something from an outdated standard, but yeah
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is a quirk of WinGlob not Windows per se, but the program logic makes that clear
|
Tell me more about the fix to |
|
Sorry, it was actually an issue in resolveLocaleHelper that only surfaced in supportedLocalesOf -- it was the change to ICU appending only "-" to the string instead of "-u-" because the getExtensions JS fallback returned ["u", "co", "phonebk"] for "de-u-co-phonebk", whereas I think the WinGlob native implementation returns just ["co", "phonebk"]. |
46112c9 to
7a66537
Compare
|
@jackhorton I think maybe the WinGlob platform implementation of getExtensions actually returns an array of "key-value" sequences like "co-phonebk" and ignores the singletons like "-u" entirely. I noticed this issue as well and added a comment to the code in my PR. Maybe handling it is now effectively fixed in your change -- we can clean up later. |
| return platform.builtInRegexMatch(GetDefaultLocale(), /([^_]*).*/)[1]; | ||
| if (isPlatformUsingICU) { | ||
| const def = platform.getDefaultLocale(); | ||
| const match = platform.builtInRegexMatch(def, LANG_TAG_EXT_RE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LANG_TAG_EXT_RE) [](start = 58, length = 16)
Ensure this regex is not undefined first. If it is undefined, run the initializer function function. This pattern is used elsewhere.
e.g.:
if (!LANG_TAG_RE) {
InitializeLangTagREs();
}
dilijev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
![]()
dilijev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First resolve that initialization issue and regenerate the bytecode and ensure that the IntlICU build passes all tests except for Intl.
429a3ec to
d779273
Compare
dilijev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
boingoing
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| if (isPlatformUsingICU) { | ||
| // ICU's getDefaultLocale() will return a valid BCP-47/RFC 5646 langtag | ||
| locale = GetDefaultLocale(); | ||
| const match = platform.builtInRegexMatch(locale, /-u(-[^\-][^\-]?-[^\-]+)*-co-([^\-]+).*/); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(-[^\-][^\-]?-[^\-]+) [](start = 64, length = 21)
nits (TODO in a future update):
You don't use the first capture group so make it non-matching with ?:
Add comments explaining what this is matching for (or rename match to something more descriptive). (Edit: eh, if (match) collation = is pretty clear, nevermind. A comment might still be nice.
| const match = platform.builtInRegexMatch(locale, /-u(-[^\-][^\-]?-[^\-]+)*-co-([^\-]+).*/); | ||
| if (match) { | ||
| collation = match[2]; | ||
| locale = callInstanceFunc(StringInstanceReplace, locale, `-co-${match[2]}`, ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
match[2] [](start = 80, length = 8)
nit (TODO): you just set collation = match[2], so use ${collation} instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, do we intentionally still replace -co-value instead of stripping all subtags? (I'm guessing this mapping possibly involves other subtags?)
nit (TODO): (Come to think of it reverseLocaleAcceptingCollationValues needs a comment or example explaining what the resulting object looks like, because I hate having to use a debugger or reasonable about that code every time I want to answer that question.)
nit (TODO): (Come to think of it, running a bunch of JS code for something that is clearly something we could pre-compute is a bad idea. We should just cache the result directly in the script.)
In reply to: 141544228 [](ancestors = 141544228)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know (and as far as its variable name suggests), localesAcceptingCollationValues is only for collation values. As for whether the other extensions need to be stripped, as of right now, this function is only used as the defaultLocaleFunc for resolveLocales when initializing the Collator (initializing NumberFormat and DateTimeFormat uses strippedDefaultLocale instead). That internally calls platform.getExtensions, which falls back to getExtensionSubtags(locale), which is basically just
locale
.match(LANG_TAG_EXT_RE)[0] // get extensions on the locale string
.split('-') // split them into an array
.filter(x => !!x) // remove empty elementsSo, should we strip all subtags? No, probably not. Also, thinking about it, we probably shouldn't be stripping all subtags for NumberFormat or DateTimeFormat either. Id imagine the original reason for the different functions was that Collator needed the system collation value, so it went through the process of stripping "_collation" and mapping and re-formatting it, while NumberFormat and DateTimeFormat didn't need the collation, so they could blindly throw away the "_collation" part of the string, because they knew they could get -ca and -nu formats later from platform.getExtensions. In ICU, we cant do that right now and possibly don't need to do that ever. Also, since we are caching the results now, the perf cost of calculating the mapped collation value (if we even need to, re: the comment below) is 0 for the Number and Date case.
TL;DR: Stripping the locale for ICU is at best useless and at worst harmful, so I think we should actually ditch the ICU implementation of strippedDefaultLocale regardless
| locale = callInstanceFunc(StringInstanceReplace, locale, `-co-${match[2]}`, ""); | ||
| } | ||
| } else { | ||
| // Windows' getDefaultLocale() will return a weird RFC4646 langtag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weird [](start = 57, length = 5)
nit (TODO): I think we should remove "weird" since we're just editorializing. It simply is an RFC4646 langtag, and weird here makes it sound like something about the tags are not quite RFC4646. Maybe maybe the distinction as: "RFC4646 (not RFC5646/BCP47) langtag"
| const collationMapForLocale = reverseLocaleAcceptingCollationValues[locale]; | ||
| if (collationMapForLocale === undefined) { | ||
| // Assume the system wouldn't give us back a bad collation value | ||
| __mappedDefaultLocale = `${locale}-u-co-${collation}`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
${locale}-u-co-${collation}; [](start = 36, length = 30)
Have you exercised this path with a hypothetical default locale containing -u- ?
For some reason I think we'd be adding a duplicate singleton -u which (at least according to Intl spec) is not allowed in langtags.
https://tc39.github.io/ecma402/#sec-isstructurallyvalidlanguagetag 6.2.2. (bullet 3) "does not include duplicate singleton subtags."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. I think this should be resolved by making __mappedDefaultLocale = `${locale}-co-${collation}`
| var bcpTag = availableBcpTags[collation]; | ||
| if (bcpTag !== undefined) { | ||
| return locale + "-u-co-" + bcpTag; | ||
| const collationMapForLocale = reverseLocaleAcceptingCollationValues[locale]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverseLocaleAcceptingCollationValues [](start = 38, length = 37)
I think the whole point of these mappings is to translate RFC 4646 (or associated standard/database/list) collation values into RFC 5646 (or associated) collation values. See the Intl explainer for the RFC 5646-associated list.
IOW this shouldn't be necessary for ICU because the langtag should already be RFC 5646.
However, since this value is in theory reported by the system, it might still be a problem. I'd like to think that ICU would convert or ignore anything it gets from Windows that isn't valid according to ICU. But I guess we can't assume that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing a lookup in a dictionary once during the lifetime of the program is worth not needing to worry about this, I think. We may want to re-evaluate that decision if we make these dictionaries into a native-calculated, hardcoded thing (then, we could just not have this dictionary at all if ICU is enabled).
| if (mappedCollation !== undefined) { | ||
| __mappedDefaultLocale = `${locale}-u-co-${mappedCollation}`; | ||
| } else { | ||
| __mappedDefaultLocale = `${locale}-u-co-${collation}`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same -- have you exercised these paths? The Intl harness should be helpful here.
| } else { | ||
| resolved = undefined; | ||
| } | ||
| return setPrototype({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit (TODO): add a blank line before the return.
| if (lpLocaleName) | ||
| UErrorCode error = UErrorCode::U_ZERO_ERROR; | ||
| char bcp47[ULOC_FULLNAME_CAPACITY]; | ||
| char defaultLocaleId[ULOC_FULLNAME_CAPACITY]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mused about this in comments in other places -- it's not too expensive to zero-out these arrays = { 0 } and it's safer in case something goes wrong.
Now that we run this function only once on Intl initialization (thanks to caching the returned value in Intl.js), it seems worth it to pay that cost here.
| char defaultLocaleId[ULOC_FULLNAME_CAPACITY]; | ||
|
|
||
| int32_t written = uloc_getName(nullptr, defaultLocaleId, ULOC_FULLNAME_CAPACITY, &error); | ||
| if (U_FAILURE(error) || written <= 0 || written >= ULOC_FULLNAME_CAPACITY) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (U_FAILURE(error) || written <= 0 || written >= ULOC_FULLNAME_CAPACITY) [](start = 7, length = 75)
nit: success branch first for readability (as you did below), unless it's easier to express the failure case in an if conditional, or if you only do something in case of failure (basically if the else is the success path, it hurts readability).
To work around this, personally I will sometimes declare a boolean named (something like) success or failed and then negate if necessary depending on what I actually want to express for the if condition.
In this case, personally I think the success case reads better: (U_SUCCESS(error) && written > 0 && written < CAPACITY)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a closer look at a few places. (Sorry for adding yet another review.)
Also note there's an incoming change from @boingoing that I'm merging from release/1.7 into intl-icu that we'll need to rebase on, and then do a proper bytecode regen (with build) to avoid nasty merge issues later.
d779273 to
da28d6e
Compare
da28d6e to
a045bda
Compare
|
@dotnet-bot test this please |
|
@mmitche is something broken? |
|
@dilijev Jenkins upgrade hit a bunch of github api rate limit issues |
| var reverseLocaleAcceptingCollationValues = (function () { | ||
| var toReturn = setPrototype({}, null); | ||
| // reverses the keys and values in each locale's sub-object in localesAcceptingCollationValues | ||
| // localesAcceptingCollationValues[locale][key] = value -> reverseLocalesAcceptingCollationValues[locale][value] = key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the explainer comment!
e1fbd73 to
dba6a9d
Compare
…Format and node-cc build integration into branch `release/1.7` Merge pull request #3840 from dilijev:intl-icu Includes the following PRs merged to branch `intl-icu` * [MERGE #3809 @dilijev] Intl-ICU: implement Intl.NumberFormat under ICU * [MERGE #3820 @jackhorton] Implements ICU version of getDefaultLocale, fixes up ICU extension handling * [MERGE #3822 @obastemur] xplat: fix the path for i18n * [MERGE #3750 @jackhorton] Expose platform.winglob to Intl.js to detect which i18n lib we are using in the native code.
…: Intl.NumberFormat and node-cc build integration into branch `release/1.7` Merge pull request #3840 from dilijev:intl-icu Includes the following PRs merged to branch `intl-icu` * [MERGE #3809 @dilijev] Intl-ICU: implement Intl.NumberFormat under ICU * [MERGE #3820 @jackhorton] Implements ICU version of getDefaultLocale, fixes up ICU extension handling * [MERGE #3822 @obastemur] xplat: fix the path for i18n * [MERGE #3750 @jackhorton] Expose platform.winglob to Intl.js to detect which i18n lib we are using in the native code.
Closes #3658