Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add IANA-to-BCP47 and reverse mappings #4024

Merged
merged 42 commits into from
Sep 20, 2023
Merged

Add IANA-to-BCP47 and reverse mappings #4024

merged 42 commits into from
Sep 20, 2023

Conversation

sffc
Copy link
Member

@sffc sffc commented Sep 11, 2023

Fixes #2909

Depends on #4021
Depends on #4022
Depends on #4023

Replaces #3499

@sffc
Copy link
Member Author

sffc commented Sep 11, 2023

Design Question:

The data store I'm using for IANA-to-BCP47 is

pub struct IanaToBcp47MapV1<'data> {
    /// A map from IANA time zone identifiers to indexes of BCP-47 time zone identifiers.
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub map: ZeroTrie<ZeroVec<'data, u8>>,
    /// A sorted list of BCP-47 time zone identifiers.
    #[cfg_attr(feature = "serde", serde(borrow))]
    // Note: this is 9739B as ZeroVec<TinyStr8> and 9335B as VarZeroVec<str>
    pub bcp47_ids: ZeroVec<'data, TimeZoneBcp47Id>,
}

For BCP47-to-IANA, I could simply do ZeroMap<TimeZoneBcp47Id, String>. However, I could save a bit of space (probably on the order of 1 kB) by instead storing a VarZeroVec<String> where the indices in that map correspond to the indices of bcp47_ids in the other data key.

We have examples of multi-key dependencies, but I don't think they are the type that have invariants between them. In this case, the invariant would be that the lists correspond to each other (essentially the keys of the map are in one DataKey and the values of the map are in a different DataKey).

Thoughts? @robertbastian @Manishearth

@robertbastian
Copy link
Member

Seems brittle.

@Manishearth
Copy link
Member

Not a huge fan of interkey dependencies where a broken dependency is not detectable (in this case 99% of the time a mismatch in versions would lead to buggy data in an undetectable way)

Now, there is a way to solve that problem: we can store a hash of the index value with the map, and if they mismatch we throw an error. If we want to be really fancy we can even have the map store an Option<Indices> and you can configure datagen to emit a full index map when you know there is going to be a discrepancy in the two data keys.

This is in line with previous decisions that datagen configurability should not be used to add or omit data in a user-visible way but can be used for optimizations.

@sffc sffc force-pushed the new-iana branch 3 times, most recently from a290442 to 8ff4151 Compare September 12, 2023 08:00
@sffc
Copy link
Member Author

sffc commented Sep 12, 2023

we can store a hash of the index value with the map

Yeah, or how about this: we take the hash of the whole indices zerovec and store just that in both keys, like this:

struct Foo<'data> {
    bcp47_ids: ZeroVec<'data, TinyStr8>,
    bcp47_ids_hash: u32,
    // ... other data ...
}

struct Bar<'data> {
    bcp47_ids: Option<ZeroVec<'data, TinyStr8>>,
    bcp47_ids_hash: u32,
    // ... other data ...
}

The behavior would be:

  1. If the bcp47_ids_hash are the same in both data structs, use Foo::bcp47_ids
  2. If the bcp47_ids_hash are not the same, use Bar::bcp47_ids if it is Some
  3. If the hashes are not the same and Bar::bcp47_ids is None, return an error from the constructor

Is this a great idea to save a kilobyte or two, or is it overengineering?

@dpulls
Copy link

dpulls bot commented Sep 12, 2023

🎉 All dependencies have been resolved !

@sffc
Copy link
Member Author

sffc commented Sep 13, 2023

Okay, here is what I ended up implementing.

I made a checksum for the ZeroVec that was duplicated between the two keys, but I did not wire it up to be able to be duplicated in the second key. It is just always absent, and if the checksum is inconsistent between the two keys, the constructor fails. I didn't want to fiddle with datagen options for a situation that is unlikely to happen in the real world. If we ever do encounter this, we can add a new key or a V2 of the current key.

The ZeroVec in question is about 3.5 KB which seems about the size where this type of thing could be justifiable.

The data in postcard is 9749B (primary direction) and 7569B (reverse direction) which I'm quite happy with. For comparison, the non-ZeroTrie version was 14475B (primary direction) and 11249B (reverse with the deduplicated ZeroVec which ZeroTrie enabled).

@sffc sffc requested a review from robertbastian September 13, 2023 16:52
@sffc
Copy link
Member Author

sffc commented Sep 13, 2023

Are all the following statements true?

  1. The only blocking issue for this PR is the hash choice
  2. The only hash crates still in contention are twox_hash and crc32fast
  3. We consider both twox_hash and crc32fast to be robust, widely used crates that contain algorithms suitable for a checksum hash

@robertbastian
Copy link
Member

  1. The only blocking issue for this PR is the hash choice

  1. The only hash crates still in contention are twox_hash and crc32fast

I also still think SipHasher is a possible way forward, as we do not need any cryptographic properties. The worst that can happen is GIGO.

  1. We consider both twox_hash and crc32fast to be robust, widely used crates that contain algorithms suitable for a checksum hash

Comment on lines 117 to 121
let mut hasher = twox_hash::XxHash64::with_seed(0);
for bcp47 in bcp47_ids.iter() {
hasher.write(bcp47.0.all_bytes());
}
let checksum2 = hasher.finish();
Copy link
Member

@robertbastian robertbastian Sep 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhm this basically tests that the implementation of as_bytes is the same as all_bytes for each entry. While this is useful to test, it should be a test in zeroslice, not here.

This should test things like:

  • If the order changes the checksums are different
  • The checksum for "abc", "def" is different from the one of "abcd", "ef"
  • The checksum for "abc", "def", "" is different from the one for "abc", "def", ""
  • The checksum for the hardcoded list is equal to a hardcoded checksum (stability)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@robertbastian
Copy link
Member

Discussion:

@sffc sffc requested a review from robertbastian September 14, 2023 22:25
@sffc
Copy link
Member Author

sffc commented Sep 14, 2023

CC @Manishearth to weigh in on the hash choice (current decision is to use std::hash::SipHasher with an #[allow(deprecated)])

fn load(&self, _: DataRequest) -> Result<DataResponse<Bcp47ToIanaMapV1Marker>, DataError> {
let resource: &cldr_serde::time_zones::bcp47_tzid::Resource =
self.cldr()?.bcp47().read_and_parse("timezone.json")?;
// Note: The BTreeMap retains the order of the aliases, which is important for establishing

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the next CLDR release will include a new iana attribute that, if present, overrides the alias order. See unicode-org/cldr#3105.

Does this PR handle that attribute?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, CLDR 44 Alpha is not in scope for the 1.3 release, but this will be a priority for the 1.4 release. Filed #4044

@sffc
Copy link
Member Author

sffc commented Sep 14, 2023

@Manishearth said in #4030 (comment):

(we should not use SipHash, it's deprecated and nobody on the libs team wants people to use it for stuff like this. Since it is deprecated it's really not something that I would consider "maintained", even if it is in the stdlib.)

@sffc
Copy link
Member Author

sffc commented Sep 14, 2023

Based on @Manishearth's comment, I pre-emptively pushed another commit reverting SipHash back to XxHash, pending additional feedback from @robertbastian or @zbraniecki

@robertbastian robertbastian added the discuss Discuss at a future ICU4X-SC meeting label Sep 15, 2023
@sffc
Copy link
Member Author

sffc commented Sep 15, 2023

I think we should rule out crc32fast because it only generates a 32-bit hash, and it mainly exists because of compiler intrinsics. I think we should choose between XxHash and SipHash.

@robertbastian robertbastian removed the discuss Discuss at a future ICU4X-SC meeting label Sep 19, 2023
robertbastian
robertbastian previously approved these changes Sep 19, 2023
@sffc
Copy link
Member Author

sffc commented Sep 20, 2023

I'm merging this and leaving the normalization follow-up for #4031.

@sffc sffc merged commit 95350a4 into unicode-org:main Sep 20, 2023
25 checks passed
@sffc sffc deleted the new-iana branch September 20, 2023 05:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retrieve Canonical TZIDs
4 participants