Re-write ResourceKey #1511

Merged: sffc merged 19 commits into unicode-org:main from neo-datakey on Jan 19, 2022
Conversation

sffc
Member

@sffc sffc commented Jan 15, 2022

Fixes #1148
Depends on #1504
Depends on #1514

I implemented it as

#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Copy, Clone, Hash)]
#[repr(transparent)]
pub struct ResourceKeyHash([u8; 4]);

#[repr(C)]
pub struct ResourceKey {
    path: &'static str,
    _tag0: [u8; 8],
    hash: ResourceKeyHash,
    _tag1: [u8; 2],
}

When we search the executables, we can find the hash, which is enough for what we need. I anticipate adding a global hash-to-ResourceKey map for all ICU4X components in the tools directory so that we can reproduce the ResourceKey from a ResourceKeyHash. I think this is the best path forward.
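
To illustrate how a hash could be located in an executable, here is a minimal sketch. It assumes the tag bytes that appear in the review snippets of this PR (`ICU4XK[\x02` before the hash and `\x03]` after it); the function name `find_tagged_hashes` is hypothetical and not part of ICU4X.

```rust
/// Opening tag placed directly before the 4-byte hash (assumed layout).
const TAG0: &[u8; 8] = b"ICU4XK[\x02";
/// Closing tag placed directly after the 4-byte hash (assumed layout).
const TAG1: &[u8; 2] = b"\x03]";

/// Scans `binary` and returns every 4-byte hash found between the two tags.
pub fn find_tagged_hashes(binary: &[u8]) -> Vec<[u8; 4]> {
    let total = TAG0.len() + 4 + TAG1.len();
    binary
        .windows(total)
        // Keep only windows that start with TAG0 and end with TAG1.
        .filter(|w| w.starts_with(TAG0) && w.ends_with(TAG1))
        // The hash occupies the four bytes right after the opening tag.
        .map(|w| [w[8], w[9], w[10], w[11]])
        .collect()
}
```

A tool built this way only recovers hashes, which is why the description above anticipates a separate hash-to-ResourceKey map.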

I have not yet updated all the components to the new ResourceKey because I would like approval on the approach first.

@sffc
Member Author

sffc commented Jan 15, 2022

A few more notes:

  1. I got rid of the macro and replaced it with a const function.
  2. I am aware of @robertbastian's #1480 (Implement icu4x-key-extract by string tagging). However, in this PR I am implementing an approach that differs from both that one and my original attempt. I will post more comments over there.

///
/// Panics if the syntax of the path is not valid.
#[inline]
pub const fn new(path: &'static str) -> Self {
Member

Note that the Rust compiler will happily generate (potentially expensive) runtime code for this when it is called in a non-const context. I recommend using a macro, the way my tinystr PR does, to force it to be a static. That way you can also guarantee that the hash will be in the source and not generated at runtime.

Member Author

@sffc sffc Jan 15, 2022

I don't know if I agree.

  1. We cannot prevent people from using the function, because the macro is just codegen and it needs to wrap a public function (although we could rename or hide the function).
  2. The type name is included in the prelude and it's more clear what the function is for (it creates a new ResourceKey, by definition); macros are relatively less friendly for documentation and tooling.
  3. My next step is going to be adding a const ResourceKey field to DataMarker, which will force these into a const context.
  4. The problem of const functions "accidentally" resolving at runtime instead of build time is not unique to us. I'd rather not introduce the concept of using macros to solve this problem unless you can point to something that says that this is the "official" way to solve the footgun.

Member

In this case because it's going to be rare to construct these on your own I'm less worried about preventing people from using this, and more about giving people a convenient way to instantiate a const. There is no such convenient way currently.

But given that most users will just use the exported const, I guess it's fine.

I don't have anything saying this is the "official" way to solve the footgun, but it's basically the only way, and I asked other community members. The const issue arises because Rust doesn't decide whether to evaluate a function at compile time based on whether the argument is a literal, and the only way to say "my arguments must be literals" is via a macro. It's the right tool for the job.
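
The footgun discussed in this thread can be sketched as follows. This is a minimal illustration, not the code from the PR: the `Key` type, the `key!` macro, and the FNV-1a hash (standing in for the PR's hash helper) are all assumptions made for the example. The macro accepts only a literal and binds the result to a `const` item, so the constructor has to be evaluated at compile time.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Key {
    pub path: &'static str,
    pub hash: u32,
}

impl Key {
    /// Usable in const contexts, but nothing stops a caller from invoking it
    /// at runtime, where the hash loop becomes ordinary runtime code.
    pub const fn try_new(path: &'static str) -> Result<Self, ()> {
        let bytes = path.as_bytes();
        if bytes.is_empty() {
            return Err(());
        }
        // FNV-1a, used here only as a stand-in hash (assumption).
        let mut hash: u32 = 0x811c_9dc5;
        let mut i = 0;
        while i < bytes.len() {
            hash = (hash ^ bytes[i] as u32).wrapping_mul(0x0100_0193);
            i += 1;
        }
        Ok(Key { path, hash })
    }
}

/// Hypothetical macro: only literals are accepted, and the `const` item
/// forces evaluation (and any panic) to happen at compile time.
macro_rules! key {
    ($path:literal) => {{
        const KEY: Key = match Key::try_new($path) {
            Ok(key) => key,
            Err(()) => panic!("invalid key path"),
        };
        KEY
    }};
}
```

With this shape, `key!("core/helloworld@1")` fails to compile if the path is rejected, whereas `Key::try_new(some_str)` called from ordinary code may still run at runtime.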

Member

Is there any way to disallow runtime key creation?

Member

Yes, the typical way to do that is to remove all constructors (except often Default::default()), leave in a #[doc(hidden)] constructor with a name that has underscores in it (ResourceKey::__internal_construct() or something), and use it from the macro.

I don't think we need to here though

Member Author

I added back the macro, and kept the try_new constructor but not the new constructor, with appropriate documentation.

Member

@robertbastian robertbastian left a comment

> I anticipate adding a global hash-to-ResourceKey map for all ICU4X components in the tools directory so that we can reproduce the ResourceKey from a ResourceKeyHash. I think this is the best path forward.

I think requiring a central directory of keys is a big downside of tagging the hashes instead of the strings. I've updated #1480 to have a similar API to this PR, but tag the path instead.

path: &'static str,
_tag0: [u8; 8],
hash: ResourceKeyHash,
_tag1: [u8; 2],
Member

ResourceKeyHash has a fixed size, so we don't need the closing tag.

You could also use a single array for tag and key instead of making them separate fields and forcing them to be consecutive with repr(C):

    #[inline]
    pub const fn try_new(path: &'static str) -> Result<Self, ()> {
        match Self::check_path_syntax(path) {
            Ok(_) => {
              let mut bytes = *b"ICU4XK[\x02XXXX\x03]";
              bytes[9..=12] = helpers::fxhash_32(path.as_bytes()).to_le_bytes();
              Ok(Self { path, bytes })
            },
            Err(_) => Err(())
        }
    }

    #[inline]
    pub const fn get_hash(&self) -> ResourceKeyHash {
        self.bytes[9..=12] // modulo wrapping
    }

Member Author

@sffc sffc Jan 17, 2022

I had the closing tag to make it less likely to have false positives (based on the assumption that one instance of a tag is more likely to occur than two tags separated by a specific number of bytes).

I'll need to check to make sure that self.bytes[9..=12] doesn't generate panicky code. Since it's an array, theoretically it should figure out that the operation can't fail, but in general the index operator [] is panicky.

What's wrong with repr(C)?

Member

Slicing is actually done by a trait, so self.bytes[9..=12] isn't const.

Regarding repr(C) it just seems like it's the wrong hammer. We're not interacting with C code, we're just trying to put data in a specific order, which I think is clearer by using a single array if the array types match like they do here.
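
A const-compatible version of the single-array idea might look like the sketch below. It is an illustration under assumptions, not the PR's code: `TaggedKey` is a hypothetical type, and the hash is passed in rather than computed. Because range indexing is not available in `const fn`, the hash bytes are copied one element at a time; with this 14-byte template the placeholder bytes sit at indices 8 through 11.

```rust
/// Tag template: 8 tag bytes, 4 placeholder bytes for the hash, 2 closing bytes.
const TEMPLATE: [u8; 14] = *b"ICU4XK[\x02XXXX\x03]";
/// Offset of the first hash byte inside the template.
const HASH_START: usize = 8;

pub struct TaggedKey {
    pub path: &'static str,
    bytes: [u8; 14],
}

impl TaggedKey {
    /// Builds the tagged byte array at compile time when used in a const context.
    pub const fn new(path: &'static str, hash: u32) -> Self {
        let hash_bytes = hash.to_le_bytes();
        let mut bytes = TEMPLATE;
        // Single-element indexing on arrays is allowed in `const fn`,
        // unlike range indexing, so copy the four bytes in a loop.
        let mut i = 0;
        while i < 4 {
            bytes[HASH_START + i] = hash_bytes[i];
            i += 1;
        }
        TaggedKey { path, bytes }
    }

    /// Returns the four little-endian hash bytes without any slicing.
    pub const fn hash_bytes(&self) -> [u8; 4] {
        [
            self.bytes[HASH_START],
            self.bytes[HASH_START + 1],
            self.bytes[HASH_START + 2],
            self.bytes[HASH_START + 3],
        ]
    }
}
```

Since every index is a constant within the array length, the accessor should not need a runtime bounds check, although that depends on the optimizer.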

Member

> based on the assumption that one instance of a tag is more likely to occur than two tags separated by a specific number of bytes

I don't think that holds if the single tag has the same length as the two tags combined.

match Self::check_path_syntax(path) {
Ok(_) => Ok(Self {
path,
_tag0: *b"ICU4XK[\x02",
Member

If we're making tags human readable (instead of just being magic numbers), why use control characters?

Member Author

They may as well be magic numbers. I have no scientific reason for doing what I did here other than to make it less likely to have collisions.

Member

Is a 2 any less likely than a 57? I don't see how any string is less or more likely than any other (maybe other than all NUL), so I'd prefer if these were all printable bytes.

If I'm not mistaken the choice of tag length came from the previous PR because it fits exactly into a tinystr8 and not any architectural or probabilistic arguments. If the bytes in a binary are uniformly distributed then it's very unlikely that 64 consecutive bits match this.

In any case, I'd appreciate removing _tag1 and making the tag printable (alternatively, use arbitrary numbers).
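
The probabilistic argument in this thread can be made concrete with a back-of-the-envelope sketch. It assumes uniformly random bytes, which real binaries are not, so the result is only a rough guide. Under that assumption a pattern of k fixed bytes matches at a given offset with probability 256^(-k), whether the bytes are contiguous or split into two groups at a fixed distance. The function name below is hypothetical.

```rust
/// Expected number of chance matches of a `tag_len`-byte pattern in a binary
/// of `binary_len` bytes, assuming every byte is uniformly random.
pub fn expected_false_positives(binary_len: u64, tag_len: u32) -> f64 {
    if binary_len < tag_len as u64 {
        return 0.0;
    }
    // Number of offsets at which the pattern could start.
    let positions = (binary_len - tag_len as u64 + 1) as f64;
    // Probability of a match at one offset is 256^(-tag_len).
    positions * 256f64.powi(-(tag_len as i32))
}
```

For a binary of about 44 KB and an 8-byte tag this comes out on the order of 1e-15, so neither the closing tag nor the choice of specific byte values changes the picture much under this model.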

///
/// Panics if the syntax of the path is not valid.
#[inline]
pub const fn new(path: &'static str) -> Self {
Member

I'd prefer to keep category, sub_category, version as the public API, rather than having the client assemble the string.

This implies using a macro that concat!s the literals, and then delegates to this const fn (as it cannot concat).
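
The suggestion above could be sketched like this. The `resource_key!` macro name, the stand-in `ResourceKey` type, and the `category/sub_category@version` path format are assumptions for illustration; the real constructor also validates the path syntax, which this stand-in omits.

```rust
/// Stand-in for the PR's type; the real one also stores a hash (assumption).
pub struct ResourceKey {
    path: &'static str,
}

impl ResourceKey {
    /// A `const fn` cannot concatenate strings, so it takes the full path.
    pub const fn new(path: &'static str) -> Self {
        ResourceKey { path }
    }

    pub const fn get_path(&self) -> &'static str {
        self.path
    }
}

/// Hypothetical macro: joins the three literals at compile time with
/// `concat!` and delegates to the const constructor.
macro_rules! resource_key {
    ($category:literal, $sub_category:literal, $version:literal) => {
        ResourceKey::new(concat!($category, "/", $sub_category, "@", $version))
    };
}
```

A call such as `resource_key!("core", "helloworld", 1)` would then produce a key whose path is `core/helloworld@1`.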

Member Author

@sffc sffc Jan 17, 2022

I did it this way because:

  1. This new model allows multiple namespaces in key strings, rather than just category/subcategory, which was an artificial limitation
  2. I find that writing the string in code is actually shorter and more readable than using an opaque concatenation algorithm

/// );
/// ```
-pub fn get_components(&self) -> ResourceKeyComponents {
-    self.into()
+pub fn iter_components(&self) -> impl Iterator<Item = Cow<str>> {
Member

This should be &str now as it's always borrowed.

Member Author

Done.
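
As a small illustration of why `&str` suffices here: when the key stores its whole path as one string, splitting it yields borrowed slices and never needs to allocate. The free function below is a hypothetical stand-in for the method, and the `/` separator is assumed from the doc examples in this PR.

```rust
/// Yields each `/`-separated component as a slice borrowed from `path`.
pub fn iter_components<'a>(path: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    path.split('/')
}
```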

@sffc
Member Author

sffc commented Jan 17, 2022

> I think requiring a central directory of keys is a big downside of tagging the hashes instead of the strings.

  1. The existence of a central directory is a separate decision from tagging hashes vs strings. We could easily write out hashes in hex notation into the keys file. If a central directory existed, we would simply have more flexibility in what we put in the key file.
  2. The primary purpose of a central directory is to prevent collisions. People would need to add their keys to the central directory only after hooking them up to icu4x-datagen.
  3. A central directory has other use cases. For example, it could be used to make BlobDataProvider capable of exporting data, something you expressed a desire to do in #1506 (Implementing IterableProvider for FsDataProvider and BlobDataProvider).
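
A central directory of the kind described in these points might be sketched as a map from hash to path that also reports collisions. Everything here is hypothetical: the function names are invented for the example and FNV-1a stands in for the PR's hash helper.

```rust
use std::collections::HashMap;

/// Stand-in 32-bit hash (FNV-1a); the PR uses its own helper (assumption).
fn hash32(path: &str) -> u32 {
    let mut hash: u32 = 0x811c_9dc5;
    for &byte in path.as_bytes() {
        hash = (hash ^ byte as u32).wrapping_mul(0x0100_0193);
    }
    hash
}

/// Builds a hash-to-path directory. If two different paths share a hash,
/// returns that colliding pair as the error.
pub fn build_directory(
    paths: &[&'static str],
) -> Result<HashMap<u32, &'static str>, (&'static str, &'static str)> {
    let mut directory = HashMap::new();
    for &path in paths {
        // `insert` returns the previous value stored under the same hash.
        if let Some(previous) = directory.insert(hash32(path), path) {
            if previous != path {
                return Err((previous, path));
            }
        }
    }
    Ok(directory)
}
```

The same map serves both purposes mentioned above: registering a key detects collisions, and looking up a hash recovers the key path.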

@sffc
Member Author

sffc commented Jan 18, 2022

I would like to decouple this PR from the static data slicing / keyextract questions. I will remove the tags and the repr(C) from the data model and add those back in a future PR.

@sffc sffc requested review from nordzilla and a team as code owners January 18, 2022 19:34
@sffc sffc mentioned this pull request Jan 18, 2022
@sffc sffc removed the request for review from zbraniecki January 18, 2022 20:09
@sffc sffc removed the request for review from nordzilla January 18, 2022 20:09
@sffc
Member Author

sffc commented Jan 18, 2022

Note: This PR comes with a code size win of roughly 5 to 6 percent on the two binaries below, most likely because the key string is computed at compile time.

Old:

-rwxr-xr-x 1 runner docker    43952 Jan 18 18:13 optim4.elf
-rwxr-xr-x 1 runner docker    33688 Jan 18 18:13 optim5.elf

New:

-rwxr-xr-x 1 runner docker    41584 Jan 18 20:12 optim4.elf
-rwxr-xr-x 1 runner docker    31736 Jan 18 20:12 optim5.elf

Manishearth
Manishearth previously approved these changes Jan 18, 2022
@@ -32,3 +32,7 @@ serde = { version = "1.0.123", optional = true, default-features = false, featur
serde_json = { version = "1.0", default-features = false, features = ["alloc"] }
bincode = "1.3"
postcard = { version = "0.7", features = ["use-std"] }

Member

Wrong PR?

Member Author

This was from #1514 which is now merged.

/// Therefore, users should not generally create ResourceKey instances; they should instead use
/// the ones exported by a component.
#[derive(PartialEq, Eq, Copy, Clone)]
#[repr(C)]
Member

Remove?

}
/// A compact hash of a [`ResourceKey`]. Useful for keys in maps.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Copy, Clone, Hash)]
#[repr(transparent)]
Member

We don't need the reprs here

Member Author

I'd like to keep the repr(transparent) on ResourceKeyHash because that one will be used as a ULE in the BlobProvider. But I removed the repr(C) on ResourceKey.
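
What `repr(transparent)` provides here can be checked directly. The sketch below repeats the struct from the PR description and asserts that it has the same size and alignment as `[u8; 4]`, which is the kind of guarantee a byte-level format needs. The `read_hash` helper is hypothetical and only shows one safe way to build the type from raw bytes.

```rust
use std::mem::{align_of, size_of};

#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Copy, Clone, Hash)]
#[repr(transparent)]
pub struct ResourceKeyHash([u8; 4]);

// Compile-time checks: the wrapper has the layout of its single field.
const _: () = assert!(size_of::<ResourceKeyHash>() == 4);
const _: () = assert!(align_of::<ResourceKeyHash>() == 1);

/// Reads a hash from the first four bytes of `bytes`, if present.
pub fn read_hash(bytes: &[u8]) -> Option<ResourceKeyHash> {
    match bytes {
        [a, b, c, d, ..] => Some(ResourceKeyHash([*a, *b, *c, *d])),
        _ => None,
    }
}
```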

/// );
/// ```
-pub fn get_components(&self) -> ResourceKeyComponents {
-    self.into()
+pub fn iter_components(&self) -> impl Iterator<Item = &str> {
Member

Now that we have get_path, I don't see why we still (publicly) need this. Before this was the cheapest way to return the path, but now we can just return the static ref. So instead of doing path_buf.extend(key.iter_components()) we can do path_buf.push(key.get_path()).
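
The equivalence claimed above can be sketched with plain strings standing in for the key. Both helper functions are hypothetical; they only show that extending a `PathBuf` with the split components and pushing the whole relative path produce the same result.

```rust
use std::path::PathBuf;

/// Old style: extend the buffer with each `/`-separated component.
pub fn path_via_components(root: &str, key_path: &str) -> PathBuf {
    let mut buf = PathBuf::from(root);
    buf.extend(key_path.split('/'));
    buf
}

/// New style: push the whole static path in one call.
pub fn path_via_push(root: &str, key_path: &str) -> PathBuf {
    let mut buf = PathBuf::from(root);
    buf.push(key_path);
    buf
}
```

Note that `PathBuf::push` replaces the buffer when given an absolute path, so this relies on key paths being relative.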

Member Author

Comment on lines 272 to 302
pub fn get_component_0(&self) -> &str {
// This cannot fail because of the preconditions on path (at least one '/')
self.iter_components().next().unwrap()
}

/// Gets the second path component of a [`ResourceKey`].
///
/// # Examples
///
/// ```
/// use icu_provider::prelude::*;
///
/// let resc_key = icu_provider::hello_world::key::HELLO_WORLD_V1;
/// assert_eq!("helloworld@1", resc_key.get_component_1());
/// ```
pub fn get_component_1(&self) -> &str {
// This cannot fail because of the preconditions on path (at least one '/')
self.iter_components().nth(1).unwrap()
}

/// Gets the last path component of a [`ResourceKey`] without the version suffix.
///
/// # Examples
///
/// ```
/// use icu_provider::prelude::*;
///
/// let resc_key = icu_provider::hello_world::key::HELLO_WORLD_V1;
/// assert_eq!("helloworld", resc_key.get_last_component_no_version());
/// ```
pub fn get_last_component_no_version(&self) -> &str {
Member

I don't like these names at all...

Also, you just told me that you prefer the constructor to take a string instead of (category, subcategory, version) so there's more flexibility in how many namespaces a client uses, but this API very rigidly only supports the latter.

Member Author

I deleted get_component_0 and get_component_1. I need get_last_component_no_version for compatibility with uprops provider. I filed #1515 to follow up on that dependency.
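
The body of `get_last_component_no_version` is not shown in this thread. One way such a helper could work, independent of how many namespaces the path has, is sketched below as a free function; this is an assumption about the behavior based on the doc example above, not the PR's implementation.

```rust
/// Returns the last `/`-separated component with any `@version` suffix removed.
pub fn last_component_no_version(path: &str) -> &str {
    // `rsplit` always yields at least one item, even for an empty string.
    let last = path.rsplit('/').next().unwrap_or(path);
    // Keep only the part before the first '@', if there is one.
    last.split('@').next().unwrap_or(last)
}
```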

/// );
/// ```
-pub fn get_components(&self) -> ResourceOptionsComponents {
-    self.into()
+pub fn iter_components(&self) -> impl Iterator<Item = Cow<str>> {
Member

@robertbastian robertbastian Jan 18, 2022

This is only ever used to extend a PathBuf. How about not constructing iterators and cows and instead doing something like:

pub fn push_to(&self, path_buf: &mut PathBuf) {
  match self.variant {
    Some(variant) => path_buf.push(*variant),
    _ => {}
  }
  match self.langid {
    Some(langid) => path_buf.push(langid.to_string()),
    _ => {}
  }
}

We could do something similar for ResourceKey (instead of exposing get_path and iter_components).

Member Author

Hmm. Good observation. I agree that we should clean this up.

One issue is that std::path::PathBuf is not available in no_std. Currently these APIs are no_std and are used indirectly by other no_std code.

I filed #1516 to follow up.

@sffc sffc requested a review from robertbastian January 18, 2022 22:34
robertbastian
robertbastian previously approved these changes Jan 18, 2022
Manishearth
Manishearth previously approved these changes Jan 18, 2022
@dpulls

dpulls bot commented Jan 18, 2022

🎉 All dependencies have been resolved !

@sffc sffc dismissed stale reviews from Manishearth and robertbastian via aaf85e0 January 18, 2022 23:43
@sffc sffc merged commit 26089b2 into unicode-org:main Jan 19, 2022
@sffc sffc deleted the neo-datakey branch January 19, 2022 00:30
@sffc sffc linked an issue Jan 20, 2022 that may be closed by this pull request
Successfully merging this pull request may close these issues.

Remove the fixed ResourceCategory enum? Revisit resource_key identifier being TinyStr16
3 participants