Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace hash with faster and better finalized hash #37

Merged
merged 7 commits into from
Jun 2, 2024

Conversation

orlp
Copy link
Contributor

@orlp orlp commented May 14, 2024

The current FxHasher design has a few issues:

  1. It is relatively slow for longer strings/slices.
  2. It has no finalization at all, which means that for the last integer (or in the case of a single integer, the only integer) any differences in bit i do not affect bits less significant than i. Since most hash table implementations (including hashbrown) compute their bucket indices from the least significant bits, this means everything can collide for a hash table of size 2^n if all differences are in bits n or above.

With this PR I aim to fix 1 by adding a proper string hash, and mitigate 2 by adding a bit rotation as a finalizer. This bit rotation moves the high-entropy high bits into the low bits expected by hash table implementations. I also simplified the hasher to a polynomial hash, which further improved compile times and makes the hasher less of an arbitrary mish-mash of operations and more a combination of justifiable and understandable components.

I will make a dummy PR on rust-lang to do a perf run for this PR.

bors added a commit to rust-lang-ci/rust that referenced this pull request May 14, 2024
[DO NOT MERGE] perf run for rustc-hash candidate

See rust-lang/rustc-hash#37.
@orlp
Copy link
Contributor Author

orlp commented May 14, 2024

Note that should this get merged for optimal performance the nightly feature should be added to all places that include rustc-hash in the Rust repository. It avoids a bit of overhead in the string / slice hash by avoiding the prefix-free encoding that's not necessary.

bors added a commit to rust-lang-ci/rust that referenced this pull request May 15, 2024
[DO NOT MERGE] perf run for rustc-hash candidate

See rust-lang/rustc-hash#37.
@lqd
Copy link
Member

lqd commented May 15, 2024

My recollection is that rustc doesn't hash a lot of strings so issue 1 didn't seem very impactful in practice, did your investigations show the opposite maybe?

FWIW, a while back I also tried this same hash function for the integers, but with a different constant from the same paper and various finalization ops -- as such a sequence is simple enough that my rust lookalike of hash-prospector finds it very quickly. It wasn't an obvious improvement over the current hash at the time, and had less promising results than #18.

(Edit: for comparison, here are #18's results, cycles here https://perf.rust-lang.org/compare.html?start=d53f1e8fbf891cf84fcb11eb078a27e528df795a&end=71812917089730aa5abc2f6c5978c6718c9b0f5f&stat=cycles%3Au)

@orlp
Copy link
Contributor Author

orlp commented May 15, 2024

@orlp
Copy link
Contributor Author

orlp commented May 15, 2024

@lqd The integer hash is the more important one for sure, but there are also some string hashes in the top 99% of usage:

     calls         %       cum%
  38646076 (24.3749%) (24.3749%): int(8)
  38216888 (24.1042%) (48.4791%): int(4)
  24624533 (15.5312%) (64.0103%): int(8), int(8)
  16260169 (10.2556%) (74.2660%): int(8), int(8), int(8)
  10060303 ( 6.3452%) (80.6112%): int(8), int(8), int(8), int(8)
   6120759 ( 3.8605%) (84.4717%): int(4), int(8)
   4004318 ( 2.5256%) (86.9973%): int(8), int(8), int(8), int(8), int(8), int(8)
   2629811 ( 1.6587%) (88.6560%): int(4), int(4)
   2531435 ( 1.5966%) (90.2526%): int(8), int(8), int(8), int(8), int(8)
   2394405 ( 1.5102%) (91.7628%): int(2), int(2), int(4)
   2037351 ( 1.2850%) (93.0478%): int(4), int(8), int(8), int(8), int(8)
   1721253 ( 1.0856%) (94.1335%): int(4), int(4), int(8)
   1593315 ( 1.0049%) (95.1384%): int(4), int(8), int(8)
    803773 ( 0.5070%) (95.6454%): int(4), int(4), int(4), int(8)
    652177 ( 0.4113%) (96.0567%): int(1), int(8)
    585849 ( 0.3695%) (96.4262%): int(4), int(4), int(8), int(8)
    542942 ( 0.3424%) (96.7687%): str(32), int(8), int(8), int(8), int(8)
    404925 ( 0.2554%) (97.0241%): str(4), int(1)
    350200 ( 0.2209%) (97.2449%): str(12), int(1)
    303208 ( 0.1912%) (97.4362%): str(24), int(1)
    243336 ( 0.1535%) (97.5896%): str(6), int(1)
    206939 ( 0.1305%) (97.7202%): str(3), int(1)
    202990 ( 0.1280%) (97.8482%): str(16), int(1)
    186403 ( 0.1176%) (97.9658%): int(4), int(4), int(8), int(8), int(8)
    178159 ( 0.1124%) (98.0781%): int(1), int(8), int(8), int(8), int(8), int(8)
    177280 ( 0.1118%) (98.1899%): str(5), int(1)
    170946 ( 0.1078%) (98.2978%): int(4), int(8), int(8), int(8), int(8), int(8)
    166919 ( 0.1053%) (98.4030%): int(4), int(8), int(8), int(8)
    155574 ( 0.0981%) (98.5012%): str(2), int(1)
    149823 ( 0.0945%) (98.5957%): str(20), int(1)
    138569 ( 0.0874%) (98.6831%): str(80), int(1)
    125819 ( 0.0794%) (98.7624%): str(7), int(1)
    124243 ( 0.0784%) (98.8408%): int(8), int(8), int(8), int(8), int(8), int(8), int(8)
    123919 ( 0.0782%) (98.9189%): str(8), int(1)
    103888 ( 0.0655%) (98.9845%): int(16)
     94770 ( 0.0598%) (99.0442%): int(1), int(8), int(8), int(8), int(16)

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@nnethercote
Copy link
Contributor

It would be good to squash all the commits together before merging this, because they're not logically separate changes.

@nnethercote
Copy link
Contributor

With the additional clarifications via comments and squashing I am happy with this. I don't normally give r+ to PRs in this repo so I will leave that to @Nilstrieb(?)

@nnethercote
Copy link
Contributor

@orlp And thanks for answering my questions and helping me get past my initial skepticism :) There have been multiple unsuccessful attempts by multiple people to improve on FxHash, some mentioned in this blog post, so it's nice to finally see success!

@WaffleLapkin
Copy link
Member

Do you think it might be worth renaming Fx* to something else? We are not really bound by semver here and could just make a breaking change.

@nnethercote
Copy link
Contributor

git grep tells me there are 2766 occurences of FxHash/FxIndex in the codebase. I don't think it's worth changing the name.

@orlp
Copy link
Contributor Author

orlp commented May 19, 2024

It's just a whole lot of pain for little gain. I think my comment in the readme is fine (and also explains why its called FxHash):

The original hash algorithm provided by this crate was one taken from Firefox,
hence the hasher it provides is called FxHasher. This name is kept for backwards
compatibility, but the underlying hash has since been replaced. The current
design for the hasher is a polynomial hash finished with a single bit rotation,
together with a wyhash-inspired compression function for strings/slices, both
designed by Orson Peters.

Cargo.toml Outdated Show resolved Hide resolved
Copy link
Member

@Noratrieb Noratrieb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome, thank you very much! I doubted you at RustNL but I will never doubt you again😊😊

@Noratrieb Noratrieb merged commit 3e39d8e into rust-lang:master Jun 2, 2024
3 checks passed
rust-cloud-vms bot pushed a commit to aDotInTheVoid/rust that referenced this pull request Jul 29, 2024
This was implemented in rust-lang/rustc-hash#37
by Orson Peters and benchmarked in
rust-lang#125133.
rust-cloud-vms bot pushed a commit to aDotInTheVoid/rust that referenced this pull request Jul 29, 2024
This was implemented in rust-lang/rustc-hash#37
by Orson Peters and benchmarked in
rust-lang#125133.
bors added a commit to rust-lang-ci/rust that referenced this pull request Jul 29, 2024
…ogaloo, r=<try>

Use new faster fxhash.

This was implemented in rust-lang/rustc-hash#37 by Orson Peters and benchmarked in
rust-lang#125133.

r? `@ghost`

`@bors` try `@rust-timer` queue
bors added a commit to rust-lang-ci/rust that referenced this pull request Aug 24, 2024
Update rustc-hash to version 2

This brings in the new algorithm.

see rust-lang/rustc-hash#37 and rust-lang#125133
bors added a commit to rust-lang/rust-analyzer that referenced this pull request Sep 4, 2024
Update rustc-hash to version 2

This brings in the new optimized algorithm that was shown to have small performance benefits for rustc. I haven't run the rust-analyzer benchmarks.

See rust-lang/rustc-hash#37.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants