Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add icu_provider::fxhash_32 #4028

Closed
wants to merge 4 commits into from
Closed

Add icu_provider::fxhash_32 #4028

wants to merge 4 commits into from

Conversation

sffc
Copy link
Member

@sffc sffc commented Sep 12, 2023

Just exporting this pre-existing function in order to implement the solution for #4024 suggested by @Manishearth in #4024 (comment)

@sffc sffc requested a review from Manishearth as a code owner September 12, 2023 08:23
@sffc sffc requested a review from robertbastian September 12, 2023 08:23
/// the benefit of a cryptographically secure algorithm
// The indexing operations in this function have been reviewed in detail and won't panic.
#[allow(clippy::indexing_slicing)]
pub(crate) const fn fxhash_32_trim(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this used to be called _manual_slice and that's also what the const fns in locid are called.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also document this function's behavior

I think the comments about why fxhash should go on the module or on the other function, this function should have short documentation saying something like "runs fxhash_32 with a manual slice" and then document what it means for it to be a manual slice

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It's not manual slice; it takes "skip start" and "skip end" instead of "start" and "limit". This function has been around for a while and this PR does not seek to change its call sites
  2. This function is not public; I just moved it. I made a new public API for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, can you still move the docs?

@robertbastian
Copy link
Member

So I'm not a big fan of putting this on the public API. The implementation we have here is great for the use case we wrote it for, allowing cheap comparison for data keys at runtime and in const. If I understand the plan in #4024 correctly, we would generate a hash at datagen time and never calculate it at runtime, so these restrictions don't really apply and we can use any hash. Note that fxhash isn't great, we've had collisions with keys before.

@sffc
Copy link
Member Author

sffc commented Sep 12, 2023

The advantage of the proposed solution is

  1. No additional code or dependencies
  2. Establishes a standard solution across ICU4X when situations like Add IANA-to-BCP47 and reverse mappings #4024 arise
  3. fxhash is stable / deterministic

However I acknowledge @robertbastian's concerns. I don't know if they outweigh the advantages listed above. The alternative would be to introduce another small stable hash; t1ha is what we've been using in ZeroHashMap, but I think @Manishearth found problems with it.

The biggest concern I do share is that we've had problems with fxhash collisions before. I hope it was a fluke but it does raise concerns.

If we agree to add a new dependency to crates that need this functionality, I'm happy to do that. I made this PR though to solve my problem in an okay way without rocking the boat too much.

@Manishearth
Copy link
Member

(the problem with t1ha is that it seems unmaintained and fails miri in subtle ways that I haven't been able to debug well since the code isn't commented that well, I'd oppose adding it as a default dependency. I don't care too much about the optional ZHM dep as long as we're not using it)

@sffc
Copy link
Member Author

sffc commented Sep 12, 2023

@pdogr investigated a number of hash functions that are stable and suitable for serialization. I again looked at these when building ZeroTrie (45e0ff1). The two hash functions that produced the fewest collisions were t1ha and wyhash. These are also the two that @pdogr found to be the smallest and fastest.

It appears that the maintainer of t1ha would welcome PRs to fix Miri issues based on flier/rust-t1ha#6.

However, one signal in favor of wyhash is that the t1ha parent repository (the C implementation) has been archived by the owner on Sep 21, 2022.

In the Rust ecosystem, t1ha and wyhash are roughly the same order of magnitude of stars and downloads. wyhash has more downloads but t1ha has more stars.

There are PRs on the wyhash repo that have been open for a month with no follow ups or reviews.

Both t1ha and wyhash seem to be invented in approximately 2019 by independent programmers (one from Russia, one from China).

Does wyhash have the Miri problems @Manishearth?

@robertbastian
Copy link
Member

robertbastian commented Sep 12, 2023

do we need small and fast if they're datagen only? we could just use sha

@sffc
Copy link
Member Author

sffc commented Sep 12, 2023

which sha?

It'd be really nice to have just one hash function in icu4x and not think hard about it again.

Putting up a PR where I compare these and also xxhash which seems popular both in Rust and out.

@Manishearth
Copy link
Member

I can't tell if wyhash has the miri problems without first migration ZHM over to it, unfortunately.

It may just be worth fixing t1ha.

@sffc sffc closed this Sep 13, 2023
@sffc sffc deleted the fxhash-pub branch September 13, 2023 07:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants