Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit resource_key identifier being TinyStr16 #1148

Closed
zbraniecki opened this issue Oct 4, 2021 · 9 comments · Fixed by #1511
Closed

Revisit resource_key identifier being TinyStr16 #1148

zbraniecki opened this issue Oct 4, 2021 · 9 comments · Fixed by #1511
Assignees
Labels
A-design Area: Architecture or design C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement)

Comments

@zbraniecki
Copy link
Member

In DateTimeFormat we are growing the number of keys and the 16 char limit becomes noticable.

We'd like to use idenifiers like gregory_pattern_lengths and gregory_skeletons but we can't.

We should discuss increasing the length.

@zbraniecki zbraniecki added C-data-infra Component: provider, datagen, fallback, adapters A-design Area: Architecture or design discuss Discuss at a future ICU4X-SC meeting S-small Size: One afternoon (small bug fix or enhancement) labels Oct 4, 2021
@sffc
Copy link
Member

sffc commented Dec 9, 2021

Discussion: the current situation is that the TinyStr16 is trying to be both human-readable and machine-readable at the same time, and it does not serve either use case very well. We should instead give each data key both a machine-readable and a human-readable version. The machine-readable version could be a globally unique TinyStr4; @hsivonen suggested making a registry similar to macOS. The human-readable version can simply be a &'static str.

CC @iainireland

@sffc sffc self-assigned this Dec 9, 2021
@sffc sffc added the v1 label Dec 9, 2021
@sffc sffc added this to the ICU4X 0.5 milestone Dec 9, 2021
@sffc
Copy link
Member

sffc commented Dec 30, 2021

Here is a caveat. Static data slicing is built on top of the principle that it can scan an executable file and recover well-defined ResourceKey instances that are present inside. I think this means that the ResourceKey needs to be "self-contained" with no references, so that we can find a block of bytes in the executable and parse it into a ResourceKey.

I can therefore see two possible representations:

  1. A Copy + Sized representation like we currently have
  2. A VarULE-like unsized representation

The problem with (1) is that we still need to choose a maximum length. If we choose something too big, it's going to start having a code size and performance impact (since we need to store and copy around a lot of bytes). (2) solves this problem but comes with its own challenges.

I think (2) would look something like

#[repr(C)]
pub struct ResourceKey {
    /// A prefix tag that we can find in the executable
    _prefix: [u8; 8],
    /// Number of bytes in the human_key field (max 256)
    _human_key_len: u8,
    /// Machine-readable key used for lookup in, e.g., BlobDataProvider
    pub machine_key: [u8; 4],
    /// Human-readable key used for lookup in, e.g., FsDataProvider
    pub human_key: str,
}

With (2), ResourceKey would become something passed by reference, which means we would need to either:

  1. Always pass them by &'static reference (don't create one on the fly), or
  2. Pass them by normal (short-lived) reference and Clone if you need to keep it for longer

Seeking input from:

@sffc sffc added the needs-approval One or more stakeholders need to approve proposal label Dec 30, 2021
@Manishearth
Copy link
Member

I'm not a fan of using unsized types too often. Furthermore, if "copying around the data" is going to be a problem in (1) it's definitely going to be a problem for (2) since a lot of those copies will turn into allocations. Overall stack copies are not expensive in my experience, but dealing with unsized types often is.

Modern computers can perform single instruction copies for more than just u128, even TinyStr32 would be quite manageable when it comes to stack copies. We're talking about the copy tradeoff between a pointer move and at worst a 4-qword move.

If you want something efficient, consider using some kind of perfect hash function paired with the tinystr, we can use (u16, TinyStr16)

@robertbastian
Copy link
Member

Here is a caveat. Static data slicing is built on top of the principle that it can scan an executable file and recover well-defined ResourceKey instances that are present inside. I think this means that the ResourceKey needs to be "self-contained" with no references, so that we can find a block of bytes in the executable and parse it into a ResourceKey.

If we put the tagging behind a feature, it wouldn't restrict our runtime representation.

@zbraniecki
Copy link
Member Author

I have a 0.7 opposition to option (2).

To add to what Manish and Robert said, I'd also say that the problem statement for (1):

The problem with (1) is that we still need to choose a maximum length

is imho solvable. I believe there is an exponential distribution of preferable lengths of keys with majority comfortably within 16 chars, and most of the outliers comfortably within 24-26 chars.
I'd expect that 32 may be a sweet spot and if we can accept that hit, we wouldn't need to consider extensions in the future.

@sffc
Copy link
Member

sffc commented Jan 7, 2022

If you all are happy with 32 bytes for the human-readable version, I'm happy with that, too. It's much easier to implement.

@sffc
Copy link
Member

sffc commented Jan 7, 2022

The underlying types we want are

pub struct ResourceKey {
    _tag0: [u8; 8],
    machine: [u8; 4],
    human: [u8; 32],
    _tag1: [u8; 4],
}

The typed version is

pub struct ResourceKey {
    _tag0: AsciiULE<8>,
    machine: RawBytesULE<4>,
    human: AsciiULE<32>,
    _tag1: AsciiULE<4>,
}

@robertbastian
Copy link
Member

If we include a human readable string anyway we can just use strings to extract the keys from a binary. See https://github.com/unicode-org/icu4x/pull/1480/files for a working prototype. Then we can also use &'static str and don't have to juggle bytes around.

@sffc
Copy link
Member

sffc commented Jan 8, 2022

Yes, although we lose the link between human and machine in the static data slicing tool if we only dig for the human. But maybe that's alright.

I guess my other concern is that we don't have control over how the compiler lays out strings. It could intern the strings so that we miss keys that are overlapping, for example. We may be able to solve this by tagging the strings with both a prefix and a suffix.

@sffc sffc modified the milestones: ICU4X 0.5, 2021 Q4 0.5 Sprint F Jan 18, 2022
@sffc sffc removed discuss Discuss at a future ICU4X-SC meeting needs-approval One or more stakeholders need to approve proposal labels Jan 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-design Area: Architecture or design C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants