Revisit resource_key identifier being TinyStr16 #1148
Discussion: the current situation is that the TinyStr16 is trying to be both human-readable and machine-readable at the same time, and it does not serve either use case very well. We should instead give each data key both a machine-readable and a human-readable version. The machine-readable version could be a globally unique TinyStr4; @hsivonen suggested making a registry similar to macOS. The human-readable version can simply be a CC @iainireland |
Here is a caveat. Static data slicing is built on top of the principle that it can scan an executable file and recover well-defined I can therefore see two possible representations:
The problem with (1) is that we still need to choose a maximum length. If we choose something too big, it's going to start having a code size and performance impact (since we need to store and copy around a lot of bytes). (2) solves this problem but comes with its own challenges. I think (2) would look something like:
#[repr(C)]
pub struct ResourceKey {
    /// A prefix tag that we can find in the executable
    _prefix: [u8; 8],
    /// Number of bytes in the human_key field (max 255, the largest value a u8 can hold)
    _human_key_len: u8,
    /// Machine-readable key used for lookup in, e.g., BlobDataProvider
    pub machine_key: [u8; 4],
    /// Human-readable key used for lookup in, e.g., FsDataProvider
    pub human_key: str,
}
With (2), ResourceKey would become something passed by reference, which means we would need to either:
Seeking input from: |
I'm not a fan of using unsized types too often. Furthermore, if "copying around the data" is going to be a problem in (1) it's definitely going to be a problem for (2) since a lot of those copies will turn into allocations. Overall stack copies are not expensive in my experience, but dealing with unsized types often is. Modern computers can perform single instruction copies for more than just u128, even TinyStr32 would be quite manageable when it comes to stack copies. We're talking about the copy tradeoff between a pointer move and at worst a 4-qword move. If you want something efficient, consider using some kind of perfect hash function paired with the tinystr, we can use |
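To put rough numbers on the copy tradeoff described above, here is a minimal sketch. `FixedKey` is a hypothetical stand-in for option (1), not a type from the codebase; it only illustrates how many bytes move when a fixed-size key is passed by value versus when a fat pointer to a `str` is passed.

```rust
use std::mem::size_of;

/// Hypothetical fixed-size key (assumption for illustration only):
/// 4 machine-readable bytes plus a 32-byte human-readable buffer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FixedKey {
    pub machine: [u8; 4],
    pub human: [u8; 32],
}

/// Number of bytes moved when a `FixedKey` is passed by value.
pub fn by_value_size() -> usize {
    size_of::<FixedKey>()
}

/// Number of bytes in a fat pointer to a `str` (data pointer + length),
/// which is what passing an unsized, `str`-tailed key by reference costs.
pub fn fat_pointer_size() -> usize {
    size_of::<&str>()
}
```

The by-value key is a few machine words larger than the fat pointer, but it stays `Copy` and never needs an allocation to be stored.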
If we put the tagging behind a feature, it wouldn't restrict our runtime representation. |
I have a 0.7 opposition to option (2). To add to what Manish and Robert said, I'd also say that the problem statement for (1):
is imho solvable. I believe there is an exponential distribution of preferable lengths of keys with majority comfortably within 16 chars, and most of the outliers comfortably within 24-26 chars. |
If you all are happy with 32 bytes for the human-readable version, I'm happy with that, too. It's much easier to implement. |
The underlying types we want are:
pub struct ResourceKey {
    _tag0: [u8; 8],
    machine: [u8; 4],
    human: [u8; 32],
    _tag1: [u8; 4],
}
The typed version is:
pub struct ResourceKey {
    _tag0: AsciiULE<8>,
    machine: RawBytesULE<4>,
    human: AsciiULE<32>,
    _tag1: AsciiULE<4>,
}
|
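One way the fixed-size layout could be constructed is sketched below. The tag byte values (`TAG0`, `TAG1`), the zero-padding convention, and the `new`/`human`/`machine` methods are all assumptions made for illustration; the thread does not specify any of them.

```rust
/// Hypothetical tag values; the real bytes are not given in this thread.
const TAG0: [u8; 8] = *b"icu4xkey";
const TAG1: [u8; 4] = *b"\nend";

#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ResourceKey {
    _tag0: [u8; 8],
    machine: [u8; 4],
    human: [u8; 32],
    _tag1: [u8; 4],
}

impl ResourceKey {
    /// Builds a key, zero-padding the human-readable name to 32 bytes.
    /// Panics if the name is longer than 32 bytes or is not ASCII; when
    /// evaluated in a `const` item, the panic becomes a compile error.
    pub const fn new(machine: [u8; 4], human: &str) -> Self {
        let src = human.as_bytes();
        assert!(src.len() <= 32, "human-readable key exceeds 32 bytes");
        let mut buf = [0u8; 32];
        let mut i = 0;
        while i < src.len() {
            assert!(src[i].is_ascii(), "human-readable key must be ASCII");
            buf[i] = src[i];
            i += 1;
        }
        Self { _tag0: TAG0, machine, human: buf, _tag1: TAG1 }
    }

    /// Returns the human-readable name without its zero padding.
    pub fn human(&self) -> &str {
        let len = self.human.iter().position(|&b| b == 0).unwrap_or(32);
        std::str::from_utf8(&self.human[..len]).unwrap_or("")
    }

    /// Returns the 4-byte machine-readable key.
    pub fn machine(&self) -> [u8; 4] {
        self.machine
    }
}
```

Because every field is a byte array, the struct has alignment 1 and no padding, so the two tags sit at fixed offsets from each other in the compiled binary.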
If we include a human readable string anyway we can just use |
Yes, although we lose the link between human and machine in the static data slicing tool if we only dig for the human. But maybe that's alright. I guess my other concern is that we don't have control over how the compiler lays out strings. It could intern the strings so that we miss keys that are overlapping, for example. We may be able to solve this by tagging the strings with both a prefix and a suffix. |
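A rough sketch of how a slicing tool might recover keys that are tagged with both a prefix and a suffix follows. The tag bytes, the 8 + 4 + 32 + 4 record layout, and the `find_keys` function are assumptions for illustration, not the actual tool.

```rust
/// Hypothetical tag bytes; the real values are not given in this thread.
const PREFIX: &[u8] = b"icu4xkey";
const SUFFIX: &[u8] = b"\nend";
const MACHINE_LEN: usize = 4;
const HUMAN_LEN: usize = 32;

/// Scans `haystack` for records laid out as
/// PREFIX | 4 machine bytes | 32 zero-padded human bytes | SUFFIX
/// and returns each (machine, human) pair that was found.
pub fn find_keys(haystack: &[u8]) -> Vec<([u8; 4], String)> {
    let record_len = PREFIX.len() + MACHINE_LEN + HUMAN_LEN + SUFFIX.len();
    let mut found = Vec::new();
    let mut i = 0;
    while i + record_len <= haystack.len() {
        let rec = &haystack[i..i + record_len];
        // Requiring both tags at a fixed distance keeps a partial or
        // overlapping match from being reported as a key.
        if rec.starts_with(PREFIX) && rec.ends_with(SUFFIX) {
            let body = &rec[PREFIX.len()..];
            let mut machine = [0u8; MACHINE_LEN];
            machine.copy_from_slice(&body[..MACHINE_LEN]);
            let human = &body[MACHINE_LEN..MACHINE_LEN + HUMAN_LEN];
            let len = human.iter().position(|&b| b == 0).unwrap_or(HUMAN_LEN);
            found.push((machine, String::from_utf8_lossy(&human[..len]).into_owned()));
            i += record_len;
        } else {
            i += 1;
        }
    }
    found
}
```

Keeping the machine and human keys inside one tagged record also preserves the link between them that would be lost by searching for the human-readable string alone.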
In DateTimeFormat we are growing the number of keys, and the 16-char limit is becoming noticeable.
We'd like to use identifiers like gregory_pattern_lengths and gregory_skeletons, but we can't. We should discuss increasing the length.
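As a quick check of the lengths involved, the sketch below uses a hypothetical `fits` helper (not part of any existing API) to test the two identifiers against a 16-byte capacity and against the 32-byte capacity proposed in the discussion.

```rust
/// Hypothetical helper: returns true if `name` fits in a fixed buffer of
/// `capacity` bytes.
pub const fn fits(name: &str, capacity: usize) -> bool {
    name.len() <= capacity
}

/// The two identifiers mentioned in the issue.
pub const WANTED: [&str; 2] = ["gregory_pattern_lengths", "gregory_skeletons"];
```

Both names are longer than 16 bytes (23 and 17 respectively) and fit within 32.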