Add IANA-to-BCP47 and reverse mappings #4024
Design Question: The data store I'm using for IANA-to-BCP47 is
pub struct IanaToBcp47MapV1<'data> {
    /// A map from IANA time zone identifiers to indexes of BCP-47 time zone identifiers.
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub map: ZeroTrie<ZeroVec<'data, u8>>,
    /// A sorted list of BCP-47 time zone identifiers.
    #[cfg_attr(feature = "serde", serde(borrow))]
    // Note: this is 9739B as ZeroVec<TinyStr8> and 9335B as VarZeroVec<str>
    pub bcp47_ids: ZeroVec<'data, TimeZoneBcp47Id>,
}
For BCP47-to-IANA, I could simply do
We have examples of multi-key dependencies, but I don't think they are the type that have invariants between them. In this case, the invariant would be that the lists correspond to each other (essentially the keys of the map are in one DataKey and the values of the map are in a different DataKey). Thoughts? @robertbastian @Manishearth
Seems brittle. |
I'm not a huge fan of interkey dependencies where a broken dependency is not detectable (in this case, 99% of the time a version mismatch would lead to buggy data in an undetectable way). Now, there is a way to solve that problem: we can store a hash of the index value with the map, and if they mismatch we throw an error. If we want to be really fancy we can even have the map store an
This is in line with previous decisions that datagen configurability should not be used to add or omit data in a user-visible way but can be used for optimizations.
Yeah, or how about this: we take the hash of the whole indices zerovec and store just that in both keys, like this:
struct Foo<'data> {
    bcp47_ids: ZeroVec<'data, TinyStr8>,
    bcp47_ids_hash: u32,
    // ... other data ...
}
struct Bar<'data> {
    bcp47_ids: Option<ZeroVec<'data, TinyStr8>>,
    bcp47_ids_hash: u32,
    // ... other data ...
}
The behavior would be:
Is this a great idea to save a kilobyte or two, or is it overengineering? |
Okay, here is what I ended up implementing. I made a checksum for the ZeroVec that was duplicated between the two keys, but I did not wire it up to be able to be duplicated in the second key. It is just always absent, and if the checksum is inconsistent between the two keys, the constructor fails. I didn't want to fiddle with datagen options for a situation that is unlikely to happen in the real world. If we ever do encounter this, we can add a new key or a V2 of the current key. The ZeroVec in question is about 3.5 KB which seems about the size where this type of thing could be justifiable. The data in postcard is 9749B (primary direction) and 7569B (reverse direction) which I'm quite happy with. For comparison, the non-ZeroTrie version was 14475B (primary direction) and 11249B (reverse with the deduplicated ZeroVec which ZeroTrie enabled). |
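To make the shape of that check concrete, here is a minimal sketch of a constructor that refuses to pair two payloads whose checksums disagree. It is an illustration only, not the code in this PR: `IanaToBcp47Data`, `Bcp47ToIanaData`, `TwoWayMapper`, `ChecksumMismatch`, and all field names are hypothetical stand-ins, and plain `Vec<String>` replaces the zero-copy containers.

```rust
// Hypothetical stand-ins for the two data keys; names are assumptions, not the PR's API.
struct IanaToBcp47Data {
    /// Sorted BCP-47 time zone identifiers (the list shared by both directions).
    bcp47_ids: Vec<String>,
    /// Checksum of `bcp47_ids`, computed when the data was generated.
    bcp47_ids_checksum: u64,
}

struct Bcp47ToIanaData {
    /// Checksum of the `bcp47_ids` list this payload was generated against.
    bcp47_ids_checksum: u64,
    /// Canonical IANA identifiers, parallel to `bcp47_ids` in the other payload.
    canonical_iana_ids: Vec<String>,
}

#[derive(Debug, PartialEq)]
struct ChecksumMismatch {
    forward: u64,
    reverse: u64,
}

struct TwoWayMapper {
    forward: IanaToBcp47Data,
    reverse: Bcp47ToIanaData,
}

impl TwoWayMapper {
    /// Fails if the two payloads were not generated from the same BCP-47 list.
    fn try_new(
        forward: IanaToBcp47Data,
        reverse: Bcp47ToIanaData,
    ) -> Result<Self, ChecksumMismatch> {
        if forward.bcp47_ids_checksum != reverse.bcp47_ids_checksum {
            return Err(ChecksumMismatch {
                forward: forward.bcp47_ids_checksum,
                reverse: reverse.bcp47_ids_checksum,
            });
        }
        Ok(Self { forward, reverse })
    }

    /// Looks up the index in the shared list, then reads the parallel IANA list.
    fn bcp47_to_iana(&self, bcp47: &str) -> Option<&str> {
        let idx = self
            .forward
            .bcp47_ids
            .binary_search_by(|probe| probe.as_str().cmp(bcp47))
            .ok()?;
        self.reverse.canonical_iana_ids.get(idx).map(String::as_str)
    }
}

fn sample_forward(checksum: u64) -> IanaToBcp47Data {
    IanaToBcp47Data {
        bcp47_ids: vec!["gblon".into(), "uschi".into(), "usnyc".into()],
        bcp47_ids_checksum: checksum,
    }
}

fn sample_reverse(checksum: u64) -> Bcp47ToIanaData {
    Bcp47ToIanaData {
        bcp47_ids_checksum: checksum,
        canonical_iana_ids: vec![
            "Europe/London".into(),
            "America/Chicago".into(),
            "America/New_York".into(),
        ],
    }
}

fn main() {
    // Matching checksums: construction succeeds and lookups work.
    let mapper = TwoWayMapper::try_new(sample_forward(42), sample_reverse(42))
        .ok()
        .expect("checksums match");
    assert_eq!(mapper.bcp47_to_iana("uschi"), Some("America/Chicago"));

    // Mismatched checksums: the constructor reports the inconsistency.
    assert!(TwoWayMapper::try_new(sample_forward(42), sample_reverse(7)).is_err());
}
```

The point of the design is that a version skew between the two keys becomes a constructor error instead of silently wrong lookups.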
Are all the following statements true?
|
✅
I also still think
✅ |
let mut hasher = twox_hash::XxHash64::with_seed(0);
for bcp47 in bcp47_ids.iter() {
    hasher.write(bcp47.0.all_bytes());
}
let checksum2 = hasher.finish();
Uhm, this basically tests that the implementation of as_bytes is the same as all_bytes for each entry. While this is useful to test, it should be a test in zeroslice, not here.
This should test things like:
- If the order changes, the checksums are different
- The checksum for "abc", "def" is different from the one for "abcd", "ef"
- The checksum for "abc", "def", "" is different from the one for "abc", "def", ""
- The checksum for the hardcoded list is equal to a hardcoded checksum (stability)
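A rough sketch of what such property tests could look like follows. It makes several assumptions: the hand-written FNV-1a hasher is only a dependency-free stand-in (it is not the hasher chosen in this PR), `pad8` and `checksum` are hypothetical helpers that mimic hashing fixed-width 8-byte entries, and the third property is read here as "with versus without a trailing empty entry", since the two lists in that bullet appear identical in this copy.

```rust
use std::hash::Hasher;

/// Stand-in 64-bit FNV-1a hasher, used only so the sketch needs no external crate.
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Pads an ASCII id to 8 bytes with trailing NULs (hypothetical helper).
/// Panics if the id is longer than 8 bytes.
fn pad8(id: &str) -> [u8; 8] {
    let mut out = [0u8; 8];
    out[..id.len()].copy_from_slice(id.as_bytes());
    out
}

/// Hashes every entry at a fixed width, so entry boundaries affect the result.
fn checksum(ids: &[&str]) -> u64 {
    let mut hasher = Fnv1a::new();
    for id in ids {
        hasher.write(&pad8(id));
    }
    hasher.finish()
}

fn main() {
    // Reordering entries changes the checksum.
    assert_ne!(checksum(&["abc", "def"]), checksum(&["def", "abc"]));
    // Moving a byte across an entry boundary changes the checksum.
    assert_ne!(checksum(&["abc", "def"]), checksum(&["abcd", "ef"]));
    // A trailing empty entry changes the checksum (assumed reading of the third bullet).
    assert_ne!(checksum(&["abc", "def"]), checksum(&["abc", "def", ""]));
    // The checksum is deterministic for identical input.
    assert_eq!(checksum(&["abc", "def"]), checksum(&["abc", "def"]));
}
```

The boundary property holds here only because each entry is hashed at a fixed width; feeding the raw, unpadded bytes of "abc", "def" and "abcd", "ef" to the hasher would produce the same byte stream. A real stability test would additionally compare against a hardcoded constant recorded from the actual hasher.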
done
Discussion:
|
CC @Manishearth to weigh in on the hash choice (current decision is to use |
fn load(&self, _: DataRequest) -> Result<DataResponse<Bcp47ToIanaMapV1Marker>, DataError> {
    let resource: &cldr_serde::time_zones::bcp47_tzid::Resource =
        self.cldr()?.bcp47().read_and_parse("timezone.json")?;
    // Note: The BTreeMap retains the order of the aliases, which is important for establishing
Note that the next CLDR release will include a new iana attribute that, if present, overrides the alias order. See unicode-org/cldr#3105.
Does this PR handle that attribute?
No, CLDR 44 Alpha is not in scope for the 1.3 release, but this will be a priority for the 1.4 release. Filed #4044.
@Manishearth said in #4030 (comment):
|
Based on @Manishearth's comment, I pre-emptively pushed another commit reverting SipHash back to XxHash, pending additional feedback from @robertbastian or @zbraniecki |
I think we should rule out |
I'm merging this and leaving the normalization follow-up for #4031. |
Fixes #2909
Depends on #4021
Depends on #4022
Depends on #4023
Replaces #3499