Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add IANA/BCP47 time zone name mappings #3499

Closed
wants to merge 13 commits into from

Conversation

sffc
Copy link
Member

@sffc sffc commented Jun 7, 2023

Fixes #2909

The data structure is based heavily on the property names.

@@ -0,0 +1,599 @@
{
"map": {
"Africa/Abidjan": "ciabj",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This should benefit a bunch from AsciiTrie.

@sffc sffc marked this pull request as ready for review June 7, 2023 06:29
@sffc sffc requested review from robertbastian, Manishearth, nordzilla and a team as code owners June 7, 2023 06:29
@sffc sffc removed request for a team and robertbastian June 7, 2023 06:29
#[derive(PartialEq, Eq)] // VarULE wants these to be byte equality
#[derive(Debug, VarULE)]
#[cfg_attr(feature = "serde", derive(serde::Serialize))]
#[repr(transparent)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: we really should see if we can get someone to make the make_wrapper_varule macro

pub struct Bcp47ToIanaMapV1<'data> {
/// A map from BCP-47 time zone identifiers to IANA time zone identifiers
#[cfg_attr(feature = "serde", serde(borrow))]
pub map: ZeroMap<'data, UnvalidatedTinyAsciiStr<8>, str>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: we could store the str unvalidated and lazily validate as well (GIGO returning None when validation is not possible)

unclear if this is at all worth it

Manishearth
Manishearth previously approved these changes Jun 7, 2023
@@ -21,6 +21,11 @@ use tinystr::TinyAsciiStr;
use zerovec::ule::{AsULE, ULE};
use zerovec::{ZeroMap2d, ZeroSlice, ZeroVec};

pub mod names;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: not sure if we can say this fixes #2909 since this is just the data model, there's no fetcher struct yet

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed the API

@sffc
Copy link
Member Author

sffc commented Jun 7, 2023

This PR took about 4 hours to write, and it gives me the IANA-to-BCP47 mappings in all supported programming languages. Not bad in the grand scheme of things, but it still feels like there is a ton of boilerplate and manual steps that we should be able to automate. The actual business logic is very centralized in just a small handful of functions.

@sffc sffc requested review from Manishearth and removed request for zbraniecki and gregtatum June 7, 2023 07:42
Manishearth
Manishearth previously approved these changes Jun 7, 2023

impl ICU4XIanaToBcp47Mapper {
#[diplomat::rust_link(icu::timezone::IanaToBcp47Mapper::try_new_unstable, FnInStruct)]
pub fn create(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: is it worth exposing the getter directly as well?

@sffc
Copy link
Member Author

sffc commented Jun 8, 2023

Note to self: The CI failure is Checking crate=icu_capi features=[icu_datetime]

}
}

impl NormalizedTimeZoneIdStr {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have Box and & constructors, but don't we need Cow to keep postcard alloc-free while allowing JSON?

I guess because this is only used in a ZV, postcard uses ULE instead of Serialize, but this should still work correctly in case someone uses this in a non-ULE location.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More constructors can be added on an as-needed basis; I wanted to add the ones that are reachable. In fact I should probably double check if there are any I can delete.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/s/constructors/serde impls

)]
#[cfg_attr(feature = "serde", derive(serde::Deserialize))]
#[yoke(prove_covariance_manually)]
pub struct IanaToBcp47MapV1<'data> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having both IanaToBcp and BcpToIana seems wasteful, it's the same data in different order, and it's not small data. How about two ZeroMaps where the value is the index of the key in the other zeromap?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It's not exactly the same data; BCP47-to-IANA only contains canonical IDs, whereas IANA-to-BCP47 includes legacy and alias IANA IDs. Note that one is about 23% smaller than the other.
  2. By far the primary use case is IANA-to-BCP47. Most clients don't need BCP47-to-IANA, but it may be needed to support things like the Temporal API. Everyone else can slice out the other 12 kB.
  3. This follows the same design pattern as property names, which have separate keys for the two directions of mappings
  4. Making it a single data struct would increase data size for the common case (but I'm not sure by how much). Your proposed data model works well with ZeroMap; but I was also thinking that I would like to move these to AsciiTrie which doesn't support random indexing.

Would you like to discuss further?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will AsciiTrie be ready by the time is is released, or will that be V2 anyway?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll experiment with some different data models, including experimental AsciiTrie, and report back the findings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some data size findings:

  • iana2bcp ZeroMap: 14475B (this PR)
  • bcp2iana ZeroMap: 11249B (this PR)
  • Combined iana/bcp with dual ZeroMaps: 15521B (Rob's suggestion)
  • iana2bcp AsciiTrie: 9820B

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a 7% size increase is worth paying for getting the other key for free. AsciiTrie will need a new data marker anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion: ok to go with AsciiTrie as long as it lands in time for 1.3

@sffc sffc requested a review from robertbastian June 12, 2023 23:57
Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@sffc
Copy link
Member Author

sffc commented Jun 28, 2023

Blocked on #2722 landing first.

@sffc
Copy link
Member Author

sffc commented Aug 26, 2023

I'm planning to update this PR soon. I plan to remove the BCP-to-IANA mappings because they need to have a more careful design. CLDR and ECMA are both trying to agree on a standard mapping of canonical IANA aliases. We may end up wanting to return something more expressive such as an annotated set of multiple IANA names per BCP47 ID.

@sffc
Copy link
Member Author

sffc commented Aug 31, 2023

CC @hsivonen

Although this data structure is optimized for IANA-to-BCP47 mapping, it should be able to be used to in effect get the list of available time zones (#3970).

@robertbastian
Copy link
Member

Superseded^

@sffc sffc deleted the timezone-name-lookup branch October 25, 2023 16:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retrieve Canonical TZIDs
3 participants