Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add icu4x-key-extract for Static Data Slicing #1460

Closed
wants to merge 11 commits into from

Conversation

sffc
Copy link
Member

@sffc sffc commented Dec 31, 2021

Fixes #948
Fixes #1106

It works!

$ cargo build --example work_log
$ cargo run --bin icu4x-key-extract -- \
    -i target/debug/examples/work_log \
    -o /tmp/icu4x_data/work_log+keys.txt
$ cat /tmp/icu4x_data/work_log+keys.txt
datetime/lengths@1
datetime/skeletons@1
datetime/symbols@1
plurals/cardinal@1
plurals/ordinal@1

You can even see DCE doing its job. On a release build, the extra plurals key goes away:

$ cargo build --example work_log --release
$ cargo run --bin icu4x-key-extract -- \
    -i target/release/examples/work_log \
    -o /tmp/icu4x_data/work_log+keys.txt
$ cat /tmp/icu4x_data/work_log+keys.txt
datetime/lengths@1
datetime/skeletons@1
datetime/symbols@1
plurals/ordinal@1

The same tool works on both x86 binaries and WASM binaries (and presumably most other binary formats as well). You can run the following command to generate key files for all examples from WASM files:

$ cargo make wasm-key-extract-examples

@sffc sffc requested a review from a team as a code owner December 31, 2021 02:12
@sffc sffc requested a review from iainireland December 31, 2021 02:13
version: u8,
) -> ResourceKey {
ResourceKey {
_tag0: tinystr8!("ICURES[["),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to work for a BE binary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will work so long as the fields are endian-agnostic. I changed the version field to be u8 instead of u16 to address this problem. The only field I'm not sure about is ResourceCategory, but I left a TODO to fix that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For TinyStr in particular, there's a conversation about it. Short answer is that it actually is endian-safe.

zbraniecki/tinystr#45

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ResourceCategory it will be fine as long as it's repr(u8).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResourceCategory is dataful, which is why it is hard. The other problem is that I intend to remove ResourceCategory, so I don't want to spend a lot of time solving it. Also, even if ResourceCategory were repr(u8), it's still not safe to transmute, because the u8 could be out-of-range.

What do you suggest as the short-term solution?

  1. Check in this unsafe code first and remove ResourceCategory later
  2. Remove ResourceCategory first before merging this PR
  3. Make ResourceCategory safe, and then delete that code when I remove ResourceCategory

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So tinystr8!("foo") has the same bit pattern whether it's compiled for LE or BE?

Regarless of the answer to that question, I think it would be better to use byte arrays (and byte string literals). They are a primitive, are endian-agnostic, and don't require a future reader to be familiar with tinystr's implementation details.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I did not realize ResourceCategory is dataful.

I'm okay with having this code for now but I'd prioritize fixing this up.

@robertbastian yes, because it has the same bit pattern as the string, and strings are endianness-independent. In fact I've thought about making tinystr internally use byte arrays in the first place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it overcomplicates things to write a TinyStr but then compare to a byte array.

allowed_keys.extend(keys.map(|s| Cow::Borrowed(s)));
}
if let Some(key_file_path) = matches.value_of_os("KEY_FILE") {
// eyre::bail!("Key file is not yet supported (see #192)",);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Comment on lines +83 to +90
#[repr(C)]
pub struct ResourceKey {
_tag0: TinyStr8,
pub category: ResourceCategory,
pub sub_category: TinyStr16,
pub version: u16,
pub version: u8,
_tag1: TinyStr4,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to put these changes behind a feature? If this is going to be transparent to the client, we can also set the feature correctly and avoid the tags in production code.

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, I would like from_repr_c to be made safe though.

version: u8,
) -> ResourceKey {
ResourceKey {
_tag0: tinystr8!("ICURES[["),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ResourceCategory it will be fine as long as it's repr(u8).

pub struct ResourceKey {
_tag0: TinyStr8,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: not a huge fan of this, but I guess it's no big deal and it does seem to be the best way to achieve this

```

Generate ICU4X Postcard blob (single file):
Extract the keys from an executable into a key file:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "Extract the keys used by an executable into a key file"

@nordzilla nordzilla removed their request for review January 4, 2022 23:12
Copy link
Member

@nordzilla nordzilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks good to me.

Interesting to read the concerns about the endianness, but we seem to be aware of the potential/existing issues there.

I'm still unable to run datagen locally via:

cargo run --bin icu4x-datagen -- \        
    --cldr-tag 39.0.0 \
    --all-keys \
    --all-locales \
    --format blob \
    --out /tmp/icu4x_data/icu4x_data.postcard

I still get the "Please call close before dropping FilesystemExporter" bracktrace.

let mut allowed_keys: Option<HashSet<Cow<str>>> = None;
if let Some(keys) = matches.values_of("KEYS") {
let allowed_keys = allowed_keys.get_or_insert_with(Default::default);
allowed_keys.extend(keys.map(|s| Cow::Borrowed(s)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (non-blocking)

This could probably be just

    let mut allowed_keys = matches
        .values_of("KEYS")
        .map(|keys| keys.map(Cow::Borrowed).collect::<HashSet<_>>());
    if let Some(key_file_path) = matches.value_of_os("KEY_FILE") {
    ...

Since I don't think get get case of get_or_insert_with can ever happen here when we always initialize it to None.

version: u8,
) -> ResourceKey {
ResourceKey {
_tag0: tinystr8!("ICURES[["),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So tinystr8!("foo") has the same bit pattern whether it's compiled for LE or BE?

Regarless of the answer to that question, I think it would be better to use byte arrays (and byte string literals). They are a primitive, are endian-agnostic, and don't require a future reader to be familiar with tinystr's implementation details.

/// ```
pub fn from_repr_c(bytes: &[u8; 40]) -> Option<ResourceKey> {
// Smoke check
if &bytes[0..8] != b"ICURES[[" || &bytes[36..40] != b"]]**" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b"ICURES[[" and b"]]** are hardcoded in three places (new, from_repr_c, and extract.rs), maybe define a const?

@sffc
Copy link
Member Author

sffc commented Jan 14, 2022

I'm going to let this PR bitrot a little and re-do it after the ResourceKey changes from #1148 are implemented.

@sffc sffc marked this pull request as draft January 20, 2022 18:25
@sffc sffc added the discuss-priority Discuss at the next ICU4X meeting label Jan 20, 2022
@sffc
Copy link
Member Author

sffc commented Feb 3, 2022

Superseded by #1480

@sffc sffc closed this Feb 3, 2022
@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Feb 3, 2022
@sffc sffc deleted the key-extract branch August 24, 2022 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Build tool for static code analysis for data slicing Implement static data slicing
4 participants