Revisit resource_key identifier being TinyStr16 #1148

zbraniecki · 2021-10-04T22:52:02Z

In DateTimeFormat we are growing the number of keys and the 16 char limit becomes noticeable.

We'd like to use identifiers like gregory_pattern_lengths and gregory_skeletons but we can't.

We should discuss increasing the length.


sffc · 2021-12-09T19:45:31Z

Discussion: the current situation is that the TinyStr16 is trying to be both human-readable and machine-readable at the same time, and it does not serve either use case very well. We should instead give each data key both a machine-readable and a human-readable version. The machine-readable version could be a globally unique TinyStr4; @hsivonen suggested making a registry similar to macOS. The human-readable version can simply be a &'static str.
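A minimal sketch of the two-identifier idea described above, assuming a hypothetical `DualKey` type and invented four-byte codes (`gpln`, `gskl`); none of these names come from ICU4X itself:

```rust
/// Hypothetical sketch: a data key carrying both identifiers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DualKey {
    /// Machine-readable ID: four ASCII bytes, assumed to be globally unique.
    pub machine: [u8; 4],
    /// Human-readable name, free of the 16-byte limit.
    pub human: &'static str,
}

impl DualKey {
    pub const fn new(machine: [u8; 4], human: &'static str) -> Self {
        Self { machine, human }
    }
}

// Illustrative entries; the four-byte codes are invented for this example.
pub const GREGORY_PATTERN_LENGTHS: DualKey =
    DualKey::new(*b"gpln", "gregory_pattern_lengths");
pub const GREGORY_SKELETONS: DualKey = DualKey::new(*b"gskl", "gregory_skeletons");

/// A tiny stand-in for the central registry of machine IDs.
static REGISTRY: [DualKey; 2] = [GREGORY_PATTERN_LENGTHS, GREGORY_SKELETONS];

/// Look up a key by its machine-readable ID.
pub fn find_by_machine(id: [u8; 4]) -> Option<&'static DualKey> {
    REGISTRY.iter().find(|k| k.machine == id)
}
```

A blob-style provider would compare only the four `machine` bytes, while a file-system provider could build paths from `human`.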

CC @iainireland

sffc · 2021-12-30T19:31:16Z

Here is a caveat. Static data slicing is built on top of the principle that it can scan an executable file and recover well-defined ResourceKey instances that are present inside. I think this means that the ResourceKey needs to be "self-contained" with no references, so that we can find a block of bytes in the executable and parse it into a ResourceKey.
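To illustrate why the key has to be self-contained, here is a rough sketch of such a scan. The marker `DEMOTAG!`, the 16-byte payload size, and the function name `scan_keys` are assumptions made for the example, not what the slicing tool actually uses:

```rust
/// Hypothetical 8-byte marker assumed to precede every key in the binary.
const TAG: &[u8] = b"DEMOTAG!";
/// Assumed fixed size of the key payload that follows the marker.
const PAYLOAD_LEN: usize = 16;

/// Scan raw executable bytes and return every payload that follows the marker.
/// This only works because each key is a plain block of bytes with no pointers.
pub fn scan_keys(binary: &[u8]) -> Vec<[u8; PAYLOAD_LEN]> {
    let mut found = Vec::new();
    let mut i = 0;
    while i + TAG.len() + PAYLOAD_LEN <= binary.len() {
        if binary[i..].starts_with(TAG) {
            let start = i + TAG.len();
            let mut payload = [0u8; PAYLOAD_LEN];
            payload.copy_from_slice(&binary[start..start + PAYLOAD_LEN]);
            found.push(payload);
            // Skip past the record that was just consumed.
            i = start + PAYLOAD_LEN;
        } else {
            i += 1;
        }
    }
    found
}
```

If the key held a `&str` instead of inline bytes, the block after the marker would contain an address rather than the name, and the scan would recover nothing useful.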

I can therefore see two possible representations:

1. A Copy + Sized representation like we currently have
2. A VarULE-like unsized representation

The problem with (1) is that we still need to choose a maximum length. If we choose something too big, it's going to start having a code size and performance impact (since we need to store and copy around a lot of bytes). (2) solves this problem but comes with its own challenges.

I think (2) would look something like

#[repr(C)]
pub struct ResourceKey {
    /// A prefix tag that we can find in the executable
    _prefix: [u8; 8],
    /// Number of bytes in the human_key field (max 255, the largest value a u8 can hold)
    _human_key_len: u8,
    /// Machine-readable key used for lookup in, e.g., BlobDataProvider
    pub machine_key: [u8; 4],
    /// Human-readable key used for lookup in, e.g., FsDataProvider
    pub human_key: str,
}

With (2), ResourceKey would become something passed by reference, which means we would need to either:

Always pass them by &'static reference (don't create one on the fly), or
Pass them by normal (short-lived) reference and Clone if you need to keep it for longer

Seeking input from:

Manishearth · 2021-12-30T19:42:30Z

I'm not a fan of using unsized types too often. Furthermore, if "copying around the data" is going to be a problem in (1) it's definitely going to be a problem for (2) since a lot of those copies will turn into allocations. Overall stack copies are not expensive in my experience, but dealing with unsized types often is.

Modern computers can perform single-instruction copies for more than just u128; even TinyStr32 would be quite manageable when it comes to stack copies. We're talking about the copy tradeoff between a pointer move and at worst a 4-qword move.
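As a rough illustration of the sizes involved, the sketch below compares a 32-byte inline value with a `&str` reference. `Inline32` is a stand-in defined here for the example, not the real TinyStr32 type:

```rust
use std::mem::size_of;

/// Stand-in for a 32-byte inline string such as the TinyStr32 mentioned above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Inline32(pub [u8; 32]);

/// Returns (bytes moved when copying the inline key, bytes moved for a `&str`).
/// A `&str` is a fat pointer: one word for the address and one for the length.
pub fn sizes() -> (usize, usize) {
    (size_of::<Inline32>(), size_of::<&str>())
}
```

The inline value is 32 bytes, i.e. four qwords, while the reference is two machine words; whether that difference matters in practice would need measuring.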

If you want something efficient, consider using some kind of perfect hash function paired with the tinystr; we can use (u16, TinyStr16).
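One way the (u16, TinyStr16) pairing might look is sketched below. Note the substitutions: FNV-1a folded to 16 bits stands in for a real perfect hash function, a plain `[u8; 16]` stands in for TinyStr16, and `HashedKey`, `hash16`, and `is_collision_free` are names invented for this example:

```rust
use std::collections::HashSet;

/// FNV-1a folded down to 16 bits. A stand-in, not a true perfect hash:
/// it is only "perfect" for a key set that has been checked for collisions.
pub const fn hash16(s: &str) -> u16 {
    let bytes = s.as_bytes();
    let mut h: u32 = 0x811c_9dc5;
    let mut i = 0;
    while i < bytes.len() {
        h ^= bytes[i] as u32;
        h = h.wrapping_mul(0x0100_0193);
        i += 1;
    }
    ((h >> 16) ^ (h & 0xffff)) as u16
}

/// Hash plus a NUL-padded 16-byte name, playing the role of (u16, TinyStr16).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct HashedKey {
    pub hash: u16,
    pub name: [u8; 16],
}

impl HashedKey {
    /// Panics if `name` is longer than 16 bytes.
    pub const fn new(name: &str) -> Self {
        let bytes = name.as_bytes();
        assert!(bytes.len() <= 16, "name longer than 16 bytes");
        let mut buf = [0u8; 16];
        let mut i = 0;
        while i < bytes.len() {
            buf[i] = bytes[i];
            i += 1;
        }
        Self { hash: hash16(name), name: buf }
    }
}

/// True when no two keys share a hash, i.e. the hash is perfect for this set.
pub fn is_collision_free(keys: &[HashedKey]) -> bool {
    let mut seen = HashSet::new();
    keys.iter().all(|k| seen.insert(k.hash))
}
```

Lookups could then compare the two-byte hash first and fall back to the name only for verification; a check like `is_collision_free` would have to run over the full key set at build time.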

robertbastian · 2022-01-03T10:35:54Z

> Here is a caveat. Static data slicing is built on top of the principle that it can scan an executable file and recover well-defined ResourceKey instances that are present inside. I think this means that the ResourceKey needs to be "self-contained" with no references, so that we can find a block of bytes in the executable and parse it into a ResourceKey.

If we put the tagging behind a feature, it wouldn't restrict our runtime representation.

zbraniecki · 2022-01-07T19:37:57Z

I have a 0.7 opposition to option (2).

To add to what Manish and Robert said, I'd also say that the problem statement for (1):

> The problem with (1) is that we still need to choose a maximum length

is, IMHO, solvable. I believe preferable key lengths follow an exponential distribution, with the majority comfortably within 16 chars and most of the outliers comfortably within 24-26 chars.
I'd expect that 32 may be a sweet spot, and if we can accept that hit, we wouldn't need to consider extensions in the future.

sffc · 2022-01-07T19:56:48Z

If you all are happy with 32 bytes for the human-readable version, I'm happy with that, too. It's much easier to implement.

sffc · 2022-01-07T21:24:23Z

The underlying types we want are

pub struct ResourceKey {
    _tag0: [u8; 8],
    machine: [u8; 4],
    human: [u8; 32],
    _tag1: [u8; 4],
}

The typed version is

pub struct ResourceKey {
    _tag0: AsciiULE<8>,
    machine: RawBytesULE<4>,
    human: AsciiULE<32>,
    _tag1: AsciiULE<4>,
}
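For concreteness, here is a sketch of how the byte-array layout above could be built and parsed back. The tag values `demokey[` and `]end`, the name `TaggedKey`, and the NUL-padding convention are assumptions made for illustration:

```rust
// Hypothetical tag values; the real tags are not specified in this thread.
const TAG0: [u8; 8] = *b"demokey[";
const TAG1: [u8; 4] = *b"]end";

/// Same field layout as the struct above: 8 + 4 + 32 + 4 = 48 bytes.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TaggedKey {
    _tag0: [u8; 8],
    pub machine: [u8; 4],
    human: [u8; 32],
    _tag1: [u8; 4],
}

impl TaggedKey {
    /// Build a key, NUL-padding the human name. Panics if it exceeds 32 bytes.
    pub const fn new(machine: [u8; 4], human: &str) -> Self {
        let src = human.as_bytes();
        assert!(src.len() <= 32, "human key longer than 32 bytes");
        let mut buf = [0u8; 32];
        let mut i = 0;
        while i < src.len() {
            buf[i] = src[i];
            i += 1;
        }
        Self { _tag0: TAG0, machine, human: buf, _tag1: TAG1 }
    }

    /// Human-readable name with the NUL padding removed.
    pub fn human(&self) -> &str {
        let end = self.human.iter().position(|&b| b == 0).unwrap_or(32);
        std::str::from_utf8(&self.human[..end]).unwrap_or("")
    }

    /// Parse a 48-byte block, rejecting it unless both tags match.
    pub fn from_bytes(block: &[u8; 48]) -> Option<Self> {
        if block[..8] != TAG0 || block[44..] != TAG1 {
            return None;
        }
        let mut machine = [0u8; 4];
        machine.copy_from_slice(&block[8..12]);
        let mut human = [0u8; 32];
        human.copy_from_slice(&block[12..44]);
        Some(Self { _tag0: TAG0, machine, human, _tag1: TAG1 })
    }
}
```

Checking the trailing tag as well as the leading one lets a scanner discard blocks where the leading bytes match by coincidence.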

robertbastian · 2022-01-07T22:19:40Z

If we include a human readable string anyway we can just use strings to extract the keys from a binary. See https://github.com/unicode-org/icu4x/pull/1480/files for a working prototype. Then we can also use &'static str and don't have to juggle bytes around.

sffc · 2022-01-08T00:08:59Z

Yes, although we lose the link between human and machine in the static data slicing tool if we only dig for the human. But maybe that's alright.

I guess my other concern is that we don't have control over how the compiler lays out strings. It could intern the strings so that we miss keys that are overlapping, for example. We may be able to solve this by tagging the strings with both a prefix and a suffix.
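A sketch of the prefix-and-suffix idea, assuming hypothetical markers `<key:` and `:key>` and a made-up function name `extract_tagged`. With both markers present, no tagged string can be a head or tail of another tagged string, so merged or adjacent string data can still be split apart:

```rust
// Hypothetical markers; they are assumed never to occur inside a key name.
const PREFIX: &[u8] = b"<key:";
const SUFFIX: &[u8] = b":key>";

/// Pull every `<key:NAME:key>` occurrence out of raw binary data.
pub fn extract_tagged(binary: &[u8]) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i + PREFIX.len() <= binary.len() {
        if binary[i..].starts_with(PREFIX) {
            let start = i + PREFIX.len();
            // Find the closing marker; stop if the string is unterminated.
            let len = match binary[start..]
                .windows(SUFFIX.len())
                .position(|w| w == SUFFIX)
            {
                Some(len) => len,
                None => break,
            };
            if let Ok(name) = std::str::from_utf8(&binary[start..start + len]) {
                out.push(name.to_owned());
            }
            i = start + len + SUFFIX.len();
        } else {
            i += 1;
        }
    }
    out
}
```

The suffix also marks where a name ends when the compiler places strings back to back without a NUL terminator, which a prefix alone cannot do.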

zbraniecki added the C-data-infra, A-design, discuss, and S-small labels Oct 4, 2021

zbraniecki mentioned this issue Oct 4, 2021

Separate SkeletonPatterns into its own key #1139

Merged

sffc self-assigned this Dec 9, 2021

sffc added the v1 label Dec 9, 2021

sffc added this to the ICU4X 0.5 milestone Dec 9, 2021

sffc mentioned this issue Dec 21, 2021

Make ResourcePath compatible with [Var]ZeroVec #243

Closed

sffc added the needs-approval label Dec 30, 2021

sffc mentioned this issue Dec 30, 2021

Make from_repr_c be safe and fallible #1457

Closed

sffc mentioned this issue Jan 6, 2022

Make calendars work better with static data slicing #1461

Closed


This was referenced Jan 11, 2022

Add FFI for constructing Data Structs, including decimal data structs #1497

Merged

Add fxhash_32 #1504

Merged

Add icu4x-key-extract for Static Data Slicing #1460

Closed

Re-write ResourceKey #1511

Merged

sffc modified the milestones: ICU4X 0.5, 2021 Q4 0.5 Sprint F Jan 18, 2022

sffc closed this as completed in #1511 Jan 19, 2022

sffc removed the discuss and needs-approval labels Jan 20, 2022

robertbastian mentioned this issue Feb 2, 2022

Implementing IterableProvider for FsDataProvider and BlobDataProvider #1506

Closed

sffc removed the v1 label Apr 1, 2022
