Retrieve Canonical TZIDs #2909
Comments
Discussion: we want a many-to-one map from IANA to BCP47, and a one-to-one map from BCP47 to IANA. We must read the CLDR sources to get the mappings to BCP47, but it seems like we should probably favor the TZif sources for the most up-to-date IANA names. |
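As a rough illustration of the two maps described above, here is a minimal sketch. The row shape (one BCP47 ID paired with a list of IANA names), the sample rows, and the helper name `build_maps` are assumptions made for this example, not ICU4X's datagen code.

```python
# Illustrative rows only: each BCP47 ID is paired with its list of IANA names.
# The row shape and the helper name `build_maps` are assumptions for this sketch.
ROWS = {
    "inccu": ["Asia/Calcutta", "Asia/Kolkata"],
    "uschi": ["America/Chicago", "US/Central"],
}


def build_maps(rows, pick_canonical):
    """Return (iana_to_bcp47, bcp47_to_iana) built from BCP47 -> IANA-name rows."""
    iana_to_bcp47 = {}
    bcp47_to_iana = {}
    for bcp47_id, iana_names in rows.items():
        for name in iana_names:
            # Many-to-one: every IANA spelling resolves to the same BCP47 ID.
            iana_to_bcp47[name] = bcp47_id
        # One-to-one: exactly one IANA name is chosen per BCP47 ID.
        bcp47_to_iana[bcp47_id] = pick_canonical(iana_names)
    return iana_to_bcp47, bcp47_to_iana


iana_to_bcp47, bcp47_to_iana = build_maps(ROWS, pick_canonical=lambda names: names[0])
```

Passing the selection rule in as `pick_canonical` keeps the question of which IANA name is "the" canonical one separate from the mechanics of building the maps.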
This situation may be a bit more complicated due to the potential and inevitable out-of-sync-ness between IANA and CLDR. I'd like to bring up a particular example for discussion: a change to IANA between
I'll describe my initial plan for retrieving the canonical TZIDs and how the above change could cause issues given this approach.
Initial Plan
My initial plan was to simply generate the TZif data from the IANA database using the
The full data generation process would be as follows:
The Problem
This method of generating data would work as long as the CLDR BCP47 data and the IANA data are up to date with each other. However, as mentioned above, there is a change between the
Before
After
As such, it appears that the BCP47
This would cause an issue where the ICU4X data pulled from CLDR maps
What should happen is that
Using the above model, this won't happen until the CLDR BCP47 data is updated to match the current IANA data.
Solutions
I would like to discuss some solutions:
|
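One way to make the out-of-sync problem visible is to check whether a name taken from CLDR appears in IANA only as a link to another zone. The sketch below assumes the `Link TARGET LINK-NAME` line shape used in tzdata source files. The sample lines and the `CLDR_CANONICAL` table are illustrative and not copied from any specific release, and the helper names are hypothetical.

```python
# Illustrative lines in the `Link TARGET LINK-NAME` shape used by tzdata sources.
# They are sample data for this sketch, not copied from a specific release.
IANA_LINKS = """\
# renamed zones
Link\tAsia/Kolkata\tAsia/Calcutta
Link\tPacific/Pohnpei\tPacific/Ponape
"""

# Hypothetical canonical names as they might be derived from CLDR BCP47 data.
CLDR_CANONICAL = {
    "inccu": "Asia/Calcutta",
    "fmpni": "Pacific/Ponape",
    "uschi": "America/Chicago",
}


def parse_links(text):
    """Map each link name to its target, ignoring comments and other lines."""
    links = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 3 and fields[0] == "Link":
            links[fields[2]] = fields[1]
    return links


def stale_canonicals(cldr_canonical, links):
    """Return {bcp47_id: (cldr_name, iana_target)} where IANA treats cldr_name as a link."""
    return {
        bcp47_id: (name, links[name])
        for bcp47_id, name in cldr_canonical.items()
        if name in links
    }


stale = stale_canonicals(CLDR_CANONICAL, parse_links(IANA_LINKS))
```

A report like `stale` could be produced at data-generation time to flag which zones would disagree until the CLDR data catches up.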
@sffc I've labeled this as discuss priority, though we won't have another meeting for a while due to various holiday schedules. If you have any thoughts about this async, I'd love to continue this discussion here as well. |
How much more difficult is it to do (1) rather than (2)? I think it's not the end of the world if we do (2) and just wait for CLDR releases in order to get the most up-to-date set of names, if that is the easiest solution. It seems that we could make (1) be a "nice to have" improvement. But, if (1) is easy (less than a few hours' work), then we should do it since I think we all agree it is the better solution. |
Discussion:
|
@sffc pointed me to an encouraging sign: CLDR is investigating how to address the out-of-date alias problem. See CLDR-14453. Note that AFAICT there are only 13 CLDR zones that use an out-of-date canonical identifier. So until CLDR figures out how to address the problem, maintaining a hard-coded list of <20 overrides (assuming the list grows a few per year) seems like a small price to pay to avoid the out-of-sync problem. Besides staying in sync, another reason to favor CLDR data is avoiding IANA's aggressive merging of unrelated time zones. Here are some of the canonicalizations in the latest (2022g) IANA data:
Here's an excerpt from a CLDR-14453 comment explaining the problem in more detail.
CLDR aliases are only synonyms, not merges. This makes CLDR's canonicalization avoid the problems above. So I'd be hesitant to use the IANA source for canonicalization purposes. |
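The hard-coded override list suggested above could look roughly like the following. This is only a sketch: the table contents reuse the two renames mentioned in this issue, and the names `OVERRIDES` and `apply_overrides` are hypothetical.

```python
# Hypothetical override table: CLDR-derived name -> preferred current name.
# Only the two renames mentioned in this issue are listed here.
OVERRIDES = {
    "Asia/Calcutta": "Asia/Kolkata",
    "Pacific/Ponape": "Pacific/Pohnpei",
}


def apply_overrides(bcp47_to_iana, overrides):
    """Replace out-of-date names; names without an override pass through unchanged."""
    return {
        bcp47_id: overrides.get(name, name)
        for bcp47_id, name in bcp47_to_iana.items()
    }


# Illustrative input, as if it had been derived from CLDR BCP47 data.
from_cldr = {"inccu": "Asia/Calcutta", "uschi": "America/Chicago"}
patched = apply_overrides(from_cldr, OVERRIDES)
```

Because overrides only rename within a zone that CLDR already treats as a single identifier, this keeps CLDR's synonym-only behavior and does not pull in IANA's merges.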
Just noting that this issue now has 4 user tags on it, which means a lot of clients need it. We should prioritize implementing this issue in 1.3. Note: I am assuming that two-way conversion between BCP-47 and IANA is in scope of this issue. |
Also, while I'm here: there will likely need to be a way to retrieve the case-normalized variant of a time zone identifier, e.g. |
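A case-normalizing lookup can be sketched by indexing the known identifiers by their lowercased spelling. The identifier list and the function name `normalize_case` are assumptions for illustration, and a real implementation would likely use a more compact data structure.

```python
# Illustrative set of known identifiers; a real table would come from the data source.
KNOWN_IDS = ["Asia/Kolkata", "America/Chicago", "Pacific/Pohnpei"]

# Index each identifier by its lowercased spelling for case-insensitive lookup.
BY_LOWERCASE = {tz_id.lower(): tz_id for tz_id in KNOWN_IDS}


def normalize_case(tz_id):
    """Return the stored casing for tz_id, or None if the identifier is unknown."""
    return BY_LOWERCASE.get(tz_id.lower())
```

For example, `normalize_case("asia/kolkata")` and `normalize_case("ASIA/KOLKATA")` both return the stored spelling `Asia/Kolkata`.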
OK I'm implementing this based on the CLDR data: I need to pick between "first in list" or "last in list". I'm currently using "last in list", which gives me Asia/Kolkata, but it gives me things like US/Central as the canonical ID for America/Chicago. I hope this is fixed up very soon with https://unicode-org.atlassian.net/browse/CLDR-14453
Out of scope of my current PR, but we should make a follow-up issue for this, if we think it is in scope of ICU4X |
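The trade-off between "first in list" and "last in list" can be seen on two rows. In this sketch the alias strings are modeled as space-separated lists in the style of the CLDR BCP47 time zone data. The sample rows are illustrative and the helper name is hypothetical.

```python
# Illustrative alias strings, modeled as space-separated lists of IANA names.
ALIASES = {
    "inccu": "Asia/Calcutta Asia/Kolkata",
    "uschi": "America/Chicago US/Central",
}


def canonical_by_position(alias_attr, use_last):
    """Pick the first or the last name from a space-separated alias list."""
    names = alias_attr.split()
    return names[-1] if use_last else names[0]


first_in_list = {k: canonical_by_position(v, use_last=False) for k, v in ALIASES.items()}
last_in_list = {k: canonical_by_position(v, use_last=True) for k, v in ALIASES.items()}
```

With these rows, "last in list" yields `Asia/Kolkata` but also `US/Central`, while "first in list" yields `America/Chicago` but also `Asia/Calcutta`, so neither position alone gives the preferred name for every zone.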
The new spec in unicode-org/cldr#3105 says:
So I will change my PR to read the first element of the list, and when CLDR 44 is rolled in, we can update datagen to start consuming the new The PR also changes the data so that not every row has an |
For datetime formatting we need to have the canonical versions of the time zone identifiers (TZIDs). The CLDR time zone identifiers never change and are not necessarily canonical, whereas the time zone identifiers from the IANA database are allowed to update and change over time.
Some examples are the change of names from Pacific/Ponape to Pacific/Pohnpei, and Asia/Calcutta to Asia/Kolkata.
We can achieve this in two ways. Once the initial TZDB data provider has landed, we could retrieve these IDs from the data provider; however, we could also retrieve these identifiers from the CLDR BCP47 data, which was added in #606.