-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add icu_provider::fxhash_32 #4028
Conversation
/// the benefit of a cryptographically secure algorithm | ||
// The indexing operations in this function have been reviewed in detail and won't panic. | ||
#[allow(clippy::indexing_slicing)] | ||
pub(crate) const fn fxhash_32_trim( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this used to be called _manual_slice
and that's also what the const fns in locid are called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should also document this function's behavior
I think the comments about why fxhash should go on the module or on the other function, this function should have short documentation saying something like "runs fxhash_32
with a manual slice" and then document what it means for it to be a manual slice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It's not manual slice; it takes "skip start" and "skip end" instead of "start" and "limit". This function has been around for a while and this PR does not seek to change its call sites
- This function is not public; I just moved it. I made a new public API for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, can you still move the docs?
So I'm not a big fan of putting this on the public API. The implementation we have here is great for the use case we wrote it for, allowing cheap comparison for data keys at runtime and in const. If I understand the plan in #4024 correctly, we would generate a hash at datagen time and never calculate it at runtime, so these restrictions don't really apply and we can use any hash. Note that fxhash isn't great, we've had collisions with keys before. |
The advantage of the proposed solution is
However I acknowledge @robertbastian's concerns. I don't know if they outweigh the advantages listed above. The alternative would be to introduce another small stable hash; t1ha is what we've been using in ZeroHashMap, but I think @Manishearth found problems with it. The biggest concern I do share is that we've had problems with fxhash collisions before. I hope it was a fluke but it does raise concerns. If we agree to add a new dependency to crates that need this functionality, I'm happy to do that. I made this PR though to solve my problem in an okay way without rocking the boat too much. |
(the problem with t1ha is that it seems unmaintained and fails miri in subtle ways that I haven't been able to debug well since the code isn't commented that well, I'd oppose adding it as a default dependency. I don't care too much about the optional ZHM dep as long as we're not using it) |
@pdogr investigated a number of hash functions that are stable and suitable for serialization. I again looked at these when building ZeroTrie (45e0ff1). The two hash functions that produced the fewest collisions were t1ha and wyhash. These are also the two that @pdogr found to be the smallest and fastest. It appears that the maintainer of t1ha would welcome PRs to fix Miri issues based on flier/rust-t1ha#6. However, one signal in favor of wyhash is that the t1ha parent repository (the C implementation) has been archived by the owner on Sep 21, 2022. In the Rust ecosystem, t1ha and wyhash are roughly the same order of magnitude of stars and downloads. wyhash has more downloads but t1ha has more stars. There are PRs on the wyhash repo that have been open for a month with no follow ups or reviews. Both t1ha and wyhash seem to be invented in approximately 2019 by independent programmers (one from Russia, one from China). Does wyhash have the Miri problems @Manishearth? |
do we need small and fast if they're datagen only? we could just use sha |
which sha? It'd be really nice to have just one hash function in icu4x and not think hard about it again. Putting up a PR where I compare these and also xxhash which seems popular both in Rust and out. |
I can't tell if wyhash has the miri problems without first migration ZHM over to it, unfortunately. It may just be worth fixing t1ha. |
Just exporting this pre-existing function in order to implement the solution for #4024 suggested by @Manishearth in #4024 (comment)