-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add icu4x-key-extract for Static Data Slicing #1460
Conversation
version: u8, | ||
) -> ResourceKey { | ||
ResourceKey { | ||
_tag0: tinystr8!("ICURES[["), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to work for a BE binary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will work so long as the fields are endian-agnostic. I changed the version field to be u8
instead of u16
to address this problem. The only field I'm not sure about is ResourceCategory
, but I left a TODO to fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For TinyStr in particular, there's a conversation about it. Short answer is that it actually is endian-safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For ResourceCategory it will be fine as long as it's repr(u8)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ResourceCategory is dataful, which is why it is hard. The other problem is that I intend to remove ResourceCategory, so I don't want to spend a lot of time solving it. Also, even if ResourceCategory were repr(u8)
, it's still not safe to transmute, because the u8 could be out-of-range.
What do you suggest as the short-term solution?
- Check in this unsafe code first and remove ResourceCategory later
- Remove ResourceCategory first before merging this PR
- Make ResourceCategory safe, and then delete that code when I remove ResourceCategory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So tinystr8!("foo")
has the same bit pattern whether it's compiled for LE or BE?
Regarless of the answer to that question, I think it would be better to use byte arrays (and byte string literals). They are a primitive, are endian-agnostic, and don't require a future reader to be familiar with tinystr
's implementation details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I did not realize ResourceCategory is dataful.
I'm okay with having this code for now but I'd prioritize fixing this up.
@robertbastian yes, because it has the same bit pattern as the string, and strings are endianness-independent. In fact I've thought about making tinystr internally use byte arrays in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think it overcomplicates things to write a TinyStr
but then compare to a byte array.
tools/datagen/src/bin/datagen.rs
Outdated
allowed_keys.extend(keys.map(|s| Cow::Borrowed(s))); | ||
} | ||
if let Some(key_file_path) = matches.value_of_os("KEY_FILE") { | ||
// eyre::bail!("Key file is not yet supported (see #192)",); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
#[repr(C)] | ||
pub struct ResourceKey { | ||
_tag0: TinyStr8, | ||
pub category: ResourceCategory, | ||
pub sub_category: TinyStr16, | ||
pub version: u16, | ||
pub version: u8, | ||
_tag1: TinyStr4, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to put these changes behind a feature? If this is going to be transparent to the client, we can also set the feature correctly and avoid the tags in production code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, I would like from_repr_c
to be made safe though.
version: u8, | ||
) -> ResourceKey { | ||
ResourceKey { | ||
_tag0: tinystr8!("ICURES[["), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For ResourceCategory it will be fine as long as it's repr(u8)
.
pub struct ResourceKey { | ||
_tag0: TinyStr8, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: not a huge fan of this, but I guess it's no big deal and it does seem to be the best way to achieve this
``` | ||
|
||
Generate ICU4X Postcard blob (single file): | ||
Extract the keys from an executable into a key file: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "Extract the keys used by an executable into a key file"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generally looks good to me.
Interesting to read the concerns about the endianness, but we seem to be aware of the potential/existing issues there.
I'm still unable to run datagen locally via:
cargo run --bin icu4x-datagen -- \
--cldr-tag 39.0.0 \
--all-keys \
--all-locales \
--format blob \
--out /tmp/icu4x_data/icu4x_data.postcard
I still get the "Please call close before dropping FilesystemExporter" bracktrace.
let mut allowed_keys: Option<HashSet<Cow<str>>> = None; | ||
if let Some(keys) = matches.values_of("KEYS") { | ||
let allowed_keys = allowed_keys.get_or_insert_with(Default::default); | ||
allowed_keys.extend(keys.map(|s| Cow::Borrowed(s))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion (non-blocking)
This could probably be just
let mut allowed_keys = matches
.values_of("KEYS")
.map(|keys| keys.map(Cow::Borrowed).collect::<HashSet<_>>());
if let Some(key_file_path) = matches.value_of_os("KEY_FILE") {
...
Since I don't think get get
case of get_or_insert_with
can ever happen here when we always initialize it to None
.
version: u8, | ||
) -> ResourceKey { | ||
ResourceKey { | ||
_tag0: tinystr8!("ICURES[["), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So tinystr8!("foo")
has the same bit pattern whether it's compiled for LE or BE?
Regarless of the answer to that question, I think it would be better to use byte arrays (and byte string literals). They are a primitive, are endian-agnostic, and don't require a future reader to be familiar with tinystr
's implementation details.
/// ``` | ||
pub fn from_repr_c(bytes: &[u8; 40]) -> Option<ResourceKey> { | ||
// Smoke check | ||
if &bytes[0..8] != b"ICURES[[" || &bytes[36..40] != b"]]**" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
b"ICURES[["
and b"]]**
are hardcoded in three places (new
, from_repr_c
, and extract.rs
), maybe define a const
?
I'm going to let this PR bitrot a little and re-do it after the ResourceKey changes from #1148 are implemented. |
Superseded by #1480 |
Fixes #948
Fixes #1106
It works!
You can even see DCE doing its job. On a release build, the extra plurals key goes away:
The same tool works on both x86 binaries and WASM binaries (and presumably most other binary formats as well). You can run the following command to generate key files for all examples from WASM files: