Retrieve Canonical TZIDs #2909

nordzilla · 2022-12-20T22:56:31Z

For datetime formatting, we need the canonical versions of the time zone identifiers (TZIDs). The CLDR time zone identifiers never change and are not necessarily canonical, whereas the time zone identifiers in the IANA database are allowed to change over time.

Some examples are the change of names from Pacific/Ponape to Pacific/Pohnpei, and Asia/Calcutta to Asia/Kolkata.

We can achieve this in two ways. Once the initial TZDB data provider has landed, we could retrieve these IDs from the data provider. Alternatively, we could retrieve these identifiers from the CLDR BCP47 data, which was added in #606.


sffc · 2022-12-20T23:23:10Z

Discussion: we want a many-to-one map from IANA to BCP47, and a one-to-one from BCP47 to IANA. We must read the CLDR sources to get the mappings to BCP47, but it seems like we should probably favor the TZIF sources for the most up-to-date IANA names.
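As a rough illustration of the two mappings described above, here is a minimal sketch with a few hand-picked sample entries. The names `IANA_TO_BCP47`, `BCP47_TO_IANA`, and `canonicalize` are hypothetical and do not reflect actual ICU4X data structures.

```python
# Many-to-one: every known IANA name (canonical or alias) maps to one BCP47 ID.
IANA_TO_BCP47 = {
    "Asia/Calcutta": "inccu",
    "Asia/Kolkata": "inccu",
    "America/Chicago": "uschi",
    "US/Central": "uschi",
}

# One-to-one: each BCP47 ID maps back to a single canonical IANA name.
BCP47_TO_IANA = {
    "inccu": "Asia/Kolkata",
    "uschi": "America/Chicago",
}


def canonicalize(iana_id: str) -> str:
    """Resolve any known IANA name to its canonical form by going through BCP47."""
    return BCP47_TO_IANA[IANA_TO_BCP47[iana_id]]
```

With this shape, an alias and its canonical name resolve to the same result because both pass through the same BCP47 ID.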

nordzilla · 2022-12-22T18:28:11Z

This situation may be a bit more complicated due to the potential and inevitable out-of-sync-ness between IANA and CLDR. I'd like to bring up a particular example for discussion: a change to IANA between tzdb-2022e and tzdb-2022f.

I'll describe my initial plan for retrieving the canonical TZIDs and how the above change could cause issues given this approach.

Initial Plan

My initial plan was to simply generate the TZif data from the IANA database using the Makefile option BACKWARD=, which disables backward-compatible name aliases (i.e. the TZif files and TZIDs will include only the current, up-to-date IANA TZIDs).

The full data generation process would be as follows:

1. From CLDR, generate a many-to-one ICU4X mapping from all known TZIDs (canonical and aliased) to BCP47 IDs using the CLDR BCP47 data. (This work is already complete.)
2. Given the mapping generated in step 1, generate from the IANA database a new one-to-one ICU4X mapping from BCP47 IDs to up-to-date IANA IDs.
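The second step could be sketched roughly as follows. This is only an illustration under assumptions: `build_bcp47_to_iana` is a hypothetical helper, the step 1 mapping is a small sample using the IDs mentioned in this thread, and `CURRENT_IANA_IDS` stands in for the set of names that remain when TZDB is built with BACKWARD=.

```python
# Step 1 output (sample): many-to-one mapping derived from the CLDR BCP47 data.
STEP1_IANA_TO_BCP47 = {
    "Pacific/Pohnpei": "fmpni",
    "Pacific/Ponape": "fmpni",
    "Pacific/Guadalcanal": "sbhir",
    "Asia/Calcutta": "inccu",
    "Asia/Kolkata": "inccu",
}

# Assumed stand-in for the names produced by a TZDB build without backward aliases.
CURRENT_IANA_IDS = {"Pacific/Pohnpei", "Pacific/Guadalcanal", "Asia/Kolkata"}


def build_bcp47_to_iana(iana_to_bcp47, current_iana_ids):
    """Step 2: keep only the names IANA still treats as current, keyed by BCP47 ID."""
    result = {}
    for iana_id, bcp47_id in iana_to_bcp47.items():
        if iana_id not in current_iana_ids:
            continue  # a backward-compatible alias, not a current name
        if bcp47_id in result:
            # Two current names for one BCP47 ID would break the one-to-one property.
            raise ValueError(f"{bcp47_id} has more than one current IANA ID")
        result[bcp47_id] = iana_id
    return result
```

Raising on a duplicate is one possible design choice; it surfaces a data inconsistency at datagen time instead of silently picking one name.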

The Problem

This method of generating data would work as long as the CLDR BCP47 data and the IANA data are up to date with each other.

However, as mentioned above, there is a change between the e and f versions of the IANA database that would cause problems in the above model, which is that Pacific/Pohnpei was merged with Pacific/Guadalcanal.

Before
tzdb-2022e

TZID(Pacific/Pohnpei):
    Alias(Pacific/Ponape) 
    BCP47(fmpni)

TZID(Pacific/Guadalcanal):
    BCP47(sbhir)

After
tzdb-2022f

TZID(Pacific/Guadalcanal):
    Alias(Pacific/Pohnpei)
    Alias(Pacific/Ponape)
    BCP47(sbhir)

As such, it appears that the BCP47 ID fmpni will no longer be used. However, the CLDR BCP47 file is not up to date with this change.

This would cause an issue where the ICU4X data pulled from CLDR maps Pacific/Pohnpei and Pacific/Ponape to fmpni, which would have no corresponding IANA time zone ID in the ICU4X data pulled from IANA itself.

What should happen is that Pacific/Pohnpei and Pacific/Ponape should be mapped to sbhir, which is linked to Pacific/Guadalcanal.

Using the above model, this won't happen until the CLDR BCP47 data is updated to match the current IANA data.
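To make the failure mode concrete, here is a small sketch that detects BCP47 IDs left without any current IANA name. The helper `find_orphaned_bcp47` and the constant names are hypothetical; the sample data only mirrors the before/after example above and is not a full dump of either database.

```python
# CLDR-derived mapping that has not yet caught up with tzdb-2022f (sample only).
CLDR_IANA_TO_BCP47 = {
    "Pacific/Pohnpei": "fmpni",
    "Pacific/Ponape": "fmpni",
    "Pacific/Guadalcanal": "sbhir",
}

# Current (non-alias) names in each TZDB version, per the example above.
TZDB_2022E = {"Pacific/Pohnpei", "Pacific/Guadalcanal"}
TZDB_2022F = {"Pacific/Guadalcanal"}


def find_orphaned_bcp47(iana_to_bcp47, current_iana_ids):
    """Return BCP47 IDs for which none of the mapped IANA names is still current."""
    covered = {
        bcp47_id
        for iana_id, bcp47_id in iana_to_bcp47.items()
        if iana_id in current_iana_ids
    }
    return sorted(set(iana_to_bcp47.values()) - covered)
```

Run against the 2022e names, nothing is orphaned; run against the 2022f names, fmpni is reported because both of its IANA names have become aliases of Pacific/Guadalcanal.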

Solutions

I would like to discuss some solutions:

1. Manually parse more of the IANA database ourselves so that we can resolve these discrepancies in the data provider and ensure that the mapping is up to date. This would allow us to solve some issues, such as when two existing TZIDs get merged as in the case described above, but it wouldn't allow us to resolve discrepancies when a new TZID is added to IANA and there is no entry for it at all in CLDR or BCP47.
2. Decide that it is not our responsibility to resolve discrepancies between different versions of IANA and CLDR. Data should be generated with versions that were released around the same time, and when new changes to IANA are made, we just have to wait for the CLDR data to catch up.
3. Other alternatives (open to discussion).

nordzilla · 2022-12-22T19:28:26Z

@sffc I've labeled this as discuss-priority, though we won't have another meeting for a while due to various holiday schedules.

If you have any thoughts about this async, I'd love to continue this discussion on here as well.

sffc · 2022-12-22T22:41:01Z

How much more difficult is it to do (1) rather than (2)?

I think it's not the end of the world if we do (2) and just wait for CLDR releases in order to get the most up-to-date set of names, if that is the easiest solution. It seems that we could make (1) be a "nice to have" improvement. But, if (1) is easy (less than a few hours' work), then we should do it since I think we all agree it is the better solution.

sffc · 2023-01-05T19:09:14Z

Discussion:

@sffc If IANA adds a new time zone, CLDR won't have the BCP-47 alias for it yet. Therefore, we should probably favor CLDR here. We aren't locking ourselves into anything since this is wholly a datagen-time configuration.

justingrant · 2023-02-28T09:26:47Z

The CLDR time-zone identifiers never change and are not necessarily canonical, whereas the time zone identifiers from the IANA database are allowed to update and change over time.

Some examples are the change of names from Pacific/Ponape to Pacific/Pohnpei, and Asia/Calcutta to Asia/Kolkata.

@sffc pointed me to an encouraging sign: CLDR is investigating how to address the out-of-date alias problem. See CLDR-14453.

Note that AFAICT there are only 13 CLDR zones that use an out-of-date canonical identifier. So until CLDR figures out how to address the problem, maintaining a hard-coded list of <20 overrides (assuming the list grows a few per year) seems like a small price to pay to avoid the out-of-sync problem.
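A hard-coded override list of this kind might look like the sketch below. The names `CANONICAL_OVERRIDES` and `apply_override` are hypothetical, and the two entries are just the renames mentioned in this thread, not the actual list of 13 zones.

```python
# Illustrative overrides: CLDR canonical ID -> current IANA name.
# Only the renames mentioned in this thread are listed; the real list would be longer.
CANONICAL_OVERRIDES = {
    "Asia/Calcutta": "Asia/Kolkata",
    "Pacific/Ponape": "Pacific/Pohnpei",
}


def apply_override(cldr_canonical_id: str) -> str:
    """Return the up-to-date name if an override exists, else the CLDR ID unchanged."""
    return CANONICAL_OVERRIDES.get(cldr_canonical_id, cldr_canonical_id)
```

Because the overrides contain only renames (synonyms), applying them would not pull in any of the IANA merges discussed below.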

Besides staying in sync, another reason to favor CLDR data is avoiding IANA's aggressive merging of unrelated time zones. Here are some of the canonicalizations in the latest (2022g) IANA data:

Most Balkan countries => Europe/Belgrade
Sweden, Denmark, and much of central Europe => Europe/Berlin
Atlantic/Reykjavik => Africa/Abidjan (a different continent!)
Various northern-Canada zones => America/Panama (?)
20+ African countries => Africa/Maputo, Africa/Lagos, Africa/Johannesburg, or Africa/Abidjan
Almost every Caribbean country => America/Puerto_Rico

Here's an excerpt from a CLDR-14453 comment explaining the problem in more detail.

I found it helpful to classify Links in IANA TZDB as “synonyms” or “merges”.

Synonyms - these are equivalent zones, like Asia/Calcutta vs. Asia/Kolkata, or PRC vs. Asia/Shanghai. Regardless of how time zones change in the future, these will always be the same.

Merges - these are zones representing different locations that just happen to share the same time zone rules since 1970, but some future change might cause them to diverge.

Synonyms are good, but merges can be problematic for a few reasons:

Data loss - When canonical identifiers are persisted (e.g. timestamps like 2025-01-25T10:00[Africa/Abidjan] for a future meeting in Reykjavik) valuable metadata about that timestamp is lost, leaving it brittle if Côte d'Ivoire or Iceland changes its time zone.

Cultural sensitivity - Time zone identifiers shouldn’t be shown to end-users, but even among technical users it’s likely to cause lots of confusion and frustration (also trolling and bad press) if every Balkan zone redirects to Belgrade, most Central European zones redirect to Berlin, and 20+ African countries merge down to 4 zones.

Backwards compatibility - Only 10% of the merges in IANA TZDB are currently followed by CLDR. (To verify, go here and filter by merges.) If tomorrow suddenly all merges were followed, it’d probably break a lot of apps.

CLDR aliases are only synonyms, not merges. This makes CLDR's canonicalization avoid the problems above.

So I'd be hesitant to use the IANA source for canonicalization purposes.

sffc · 2023-04-22T00:52:25Z

Just noting that this issue now has 4 user tags on it, which means a lot of clients need it. We should prioritize implementing this issue in 1.3.

Note: I am assuming that two-way conversion between BCP-47 and IANA is in scope of this issue.

justingrant · 2023-06-07T06:37:06Z

@sffc - I assume that you didn't mean to close this issue via #3498, right?

justingrant · 2023-06-07T06:40:18Z

Also, while I'm here: there will likely need to be a way to retrieve the case-normalized variant of a time zone identifier, e.g. asia/ulaanbaatar => Asia/Ulaanbaatar. TZDB identifiers are case-insensitive, so any case should be accepted as input, but output should always be case-normalized to match TZDB, even if that output is not canonicalized to the canonical Zone name in TZDB.
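One way to picture case normalization without canonicalization is a lookup table keyed by the lowercased identifier. This is only a sketch: `KNOWN_IDS` is a tiny sample and `case_normalize` is a hypothetical helper, not an ICU4X API.

```python
# Small sample of known identifiers, including an alias (Asia/Calcutta).
KNOWN_IDS = [
    "Asia/Ulaanbaatar",
    "Asia/Kolkata",
    "Asia/Calcutta",
    "America/Chicago",
]

# Index by lowercase form so that input in any case finds the TZDB spelling.
_BY_LOWERCASE = {tzid.lower(): tzid for tzid in KNOWN_IDS}


def case_normalize(tzid: str):
    """Return the TZDB-cased spelling of tzid, or None if the ID is unknown.

    Aliases are returned as-is; no canonicalization is performed.
    """
    return _BY_LOWERCASE.get(tzid.lower())
```

Note that an alias such as Asia/Calcutta comes back in its own normalized spelling, not as Asia/Kolkata, which matches the distinction drawn above.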

sffc · 2023-06-07T07:37:54Z

OK I'm implementing this based on the CLDR data:

https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-bcp47/bcp47/timezone.json

I need to pick between "first in list" or "last in list". I'm currently using "last in list", which gives me Asia/Kolkata, but it gives me things like US/Central as the canonical ID for America/Chicago. I hope this is fixed up very soon with https://unicode-org.atlassian.net/browse/CLDR-14453
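The difference between the two strategies can be seen with a small sample. The sketch below assumes each entry carries a space-separated `_alias` string, as in the cldr-json file linked above; the two sample entries and the helper `pick_canonical` are illustrative only.

```python
# Sample entries shaped like the "tz" section of timezone.json (assumed layout).
SAMPLE_TZ = {
    "inccu": {"_alias": "Asia/Calcutta Asia/Kolkata"},
    "uschi": {"_alias": "America/Chicago US/Central"},
}


def pick_canonical(entry: dict, strategy: str) -> str:
    """Pick one IANA name from the alias list using 'first' or 'last'."""
    names = entry["_alias"].split()
    if strategy == "first":
        return names[0]
    if strategy == "last":
        return names[-1]
    raise ValueError(f"unknown strategy: {strategy}")
```

Neither strategy is right for both sample rows: "last" yields Asia/Kolkata but also US/Central, while "first" yields America/Chicago but also Asia/Calcutta.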

there will likely need to be a way to retrieve the case-normalized variant of a time zone identifier

Out of scope of my current PR, but we should make a follow-up issue for this, if we think it is in scope of ICU4X

sffc · 2023-09-11T05:30:26Z

The new spec in unicode-org/cldr#3105 says:

To maintain the stability of "long" IDs (for those inherited from the tz database), a special rule applied to the alias attribute in the <type> element for "tz" - the first "long" ID is the CLDR canonical "long" time zone ID. In addition to this, iana attribute specifies the preferred ID in the tz database if it's different from the CLDR canonical "long" ID.

So I will change my PR to read the first element of the list, and when CLDR 44 is rolled in, we can update datagen to start consuming the new iana field.

The PR also changes the data so that not every row has an alias field (adding a preferred field instead), so further changes may be needed here in CLDR 44.
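Under the new rule, datagen could resolve names roughly as sketched below. The JSON key names `_alias`, `_iana`, and `_preferred` are assumptions about how the new attributes surface in cldr-json, the `xxold` row is a made-up placeholder for an entry without an alias, and both helper functions are hypothetical.

```python
# Sample entries in the assumed CLDR 44 shape.
SAMPLE_TZ_44 = {
    "inccu": {"_alias": "Asia/Calcutta Asia/Kolkata", "_iana": "Asia/Kolkata"},
    "uschi": {"_alias": "America/Chicago US/Central"},
    "xxold": {"_preferred": "inccu"},  # placeholder row with no alias field
}


def cldr_canonical(entry: dict):
    """First 'long' ID in the alias list, or None if the row has no alias."""
    alias = entry.get("_alias")
    return alias.split()[0] if alias else None


def iana_preferred(entry: dict):
    """The iana attribute when present, otherwise the CLDR canonical ID."""
    return entry.get("_iana") or cldr_canonical(entry)
```

Rows without an alias return None from both helpers here; how datagen should follow the preferred field for such rows is left open, as noted above.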

nordzilla mentioned this issue Dec 20, 2022

Tracking Issue: TimeZones DataProviders #992

Open

2 tasks

nordzilla changed the title ~~Retrieve Canonical TZIDs from CLDR bcp47 data~~ Retrieve Canonical TZIDs Dec 20, 2022

nordzilla added the discuss Discuss at a future ICU4X-SC meeting label Dec 20, 2022

sffc removed the discuss Discuss at a future ICU4X-SC meeting label Dec 20, 2022

nordzilla added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters labels Dec 22, 2022

nordzilla added the discuss-priority Discuss at the next ICU4X meeting label Dec 22, 2022

sffc removed the discuss-priority Discuss at the next ICU4X meeting label Jan 5, 2023

sffc assigned nordzilla Jan 5, 2023

sffc added this to the 1.x Priority ⟨P2⟩ milestone Jan 5, 2023

sffc added the U-fuchsia User: Fuchsia label Jan 25, 2023

sffc mentioned this issue Feb 2, 2023

Support obtaining BCP-47 ids from metazone names #3066

Closed

Manishearth added the U-flutter User: Flutter Engine or Dart SDK label Feb 2, 2023

hsivonen added the U-gecko User: Gecko label Feb 14, 2023

sffc mentioned this issue Feb 27, 2023

Reconsider time zone canonicalization behavior given forking of IANA Time Zone Database tc39/proposal-temporal#2509

Closed

sffc added the U-google User: Google 1st party label Apr 22, 2023

sffc modified the milestones: 1.x Priority ⟨P2⟩, 1.3 Blocking ⟨P1⟩ Apr 22, 2023

sffc unassigned nordzilla May 11, 2023

sffc self-assigned this Jun 7, 2023

sffc mentioned this issue Jun 7, 2023

Re-use UnvalidatedStr impls in NormalizedPropertyNameStr #3498

Merged

sffc closed this as completed in #3498 Jun 7, 2023

sffc mentioned this issue Jun 7, 2023

Add IANA/BCP47 time zone name mappings #3499

Closed

Manishearth reopened this Jun 7, 2023

hsivonen mentioned this issue Aug 30, 2023

Provide list of available named time zones #3970

Closed

sffc closed this as completed in #4024 Sep 20, 2023
