-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changed HashMap's internal layout. Cleanup. #21973
Conversation
dc6bc2d
to
64e8b91
Compare
// middle of a cache line, this strategy pulls in one cache line of hashes on | ||
// most lookups (64-byte cache line with 8-byte hash). I think this choice is | ||
// pretty good, but α could go up to 0.95, or down to 0.84 to trade off some | ||
// space. | ||
// | ||
// > Wait, what? Where did you get 1-α^k from? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be updated, too.
64e8b91
to
407572c
Compare
let size = table.size(); | ||
let mut probe = Bucket::new(table, hash); | ||
let mut probe = if let Some(probe) = Bucket::new(table, hash) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/if let/match
Can this PR be split up into "In-place growth for HashMap" and "Cleanup"? |
Just leaving a note: I am concerned about how this change will negatively affect memory consumption for certain choices of K and V. That said, I think speed is more important than memory consumption, to some limit. |
It will also affect the performance of the "keys" and "values" iterators. |
49bfe41
to
f495fb8
Compare
Don't the posted benchmark results indicate that insertion ( (This outcome makes some sense to me, at least for a heavily loaded table, since the unused interleaved values for non-matching keys are going to occupy portions of the cache line when we are doing a probe sequence over a series of keys.)
I don't know how to evaluate whether the gain is worth the cost here. I just want to make sure that everyone is on board for this shift (or find out if my interpretation of the results is wrong). |
Just a note that I believe @pczarn is currently reworking this PR. |
f495fb8
to
6218462
Compare
I did the reworking. Some small details still need attention.
Sounds bad. Can you find a real world example of this? To keep the performance the same, I could make hashmap use two allocations, Here's a relevant post on data layout: http://www.reedbeta.com/blog/2015/01/12/data-oriented-hash-table/ This is how benchmark results have changed:
@pnkfelix: Lookup has become slower after the first commit because of refactoring. Keep in mind that the benchmark does 1000 lookups per iteration and the difference between 41.7ns and 42.4ns per lookup is small. The improvement from in-place growth is suprisingly low. I'll have to check why. |
@pczarn The performance of key/value iterators is just because with the current design they walk over a compact array, and with your proposed design they eat twice as much cache as they do this. If doing a small per-key or per-value operation, this essentially halves memory bandwidth. |
Well strictly speaking it's already walking over the hashes checking for |
Oh shoot, I let this slip through the cracks. Needs a rebase (hopefully the last one, since we should be done with crazy API churn). |
6218462
to
316b300
Compare
Updated. I'm going to test and make corrections when new snapshots land. So iteration over small keys/values is already 2x-4x more cache-intensive than in an array. With larger values, like in To avoid the issue, keys/values can be stored in an array such as |
627837f
to
edcddb1
Compare
edcddb1
to
b7317cd
Compare
Done, tested. |
Great! Will review tonight. |
Ack, a few of my issues are addressed by later commits. Doing this commit-by-commit isn't the right strategy here. Shifting gears. |
// inform rustc that in fact instances of K and V are reachable from here. | ||
marker: marker::PhantomData<(K,V)>, | ||
// NB. The table will probably need manual impls of Send and Sync if this | ||
// field ever changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that needs saying? That's true for pretty much all Uniques.
☔ The latest upstream changes (presumably #24674) made this pull request unmergeable. Please resolve the merge conflicts. |
* use of NonZero hashes * refactoring * correct explanation of the load factor * better nomenclature * 'probe distance' -> 'displacement'
3f67ec1
to
776d23e
Compare
@gankro (this was rebased and needs review) |
@Manishearth Yes I tried to get someone else to review over two months ago. :/ Today's that last day of my pseudo-vacation so I'm free to tackle this again this week. |
Alright I've started just reading the full source at the last commit's hash just because the changes have been so comprehensive that it borders on a rewrite. Was Deref vs Borrow ever fully addressed? Borrow is generally just a trait for Deref-like+Hash/Eq/Ord equivalence. |
pub fn into_bucket(self) -> Bucket<K, V, M> { | ||
/// Duplicates the current position. This can be useful for operations | ||
/// on two or more buckets. | ||
pub fn stash(self) -> Bucket<K, V, Bucket<K, V, M, S>, S> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about -> Bucket<K, V, Self, S>
?
So, to the best of my knowledge this code seems to be correct, but I have strong reservations about where we're going with respect to the proliferation of type-complexity. I used to be able to grok this code pretty ok: We have HashMap, RawTable, and some Buckets. Now everything's super generic over some "M" type and there's Partial* types all over. It's not clear that these abstractions are pulling their weight: are they preventing real bugs or enabling simpler or more maintainable code? |
I share that sentiment. |
It's been a couple weeks with no response on any of my comments, and I'm not a huge fan of the general design changes. As such I'm closing this for now. We can continue discussion on the PR and maybe re-open if we get somewhere. |
…chton Cache conscious hashmap table Right now the internal HashMap representation is 3 unziped arrays hhhkkkvvv, I propose to change it to hhhkvkvkv (in further iterations kvkvkvhhh may allow inplace grow). A previous attempt is at rust-lang#21973. This layout is generally more cache conscious as it makes the value immediately accessible after a key matches. The separated hash arrays is a _no-brainer_ because of how the RH algorithm works and that's unchanged. **Lookups**: Upon a successful match in the hash array the code can check the key and immediately have access to the value in the same or next cache line (effectively saving a L[1,2,3] miss compared to the current layout). **Inserts/Deletes/Resize**: Moving values in the table (robin hooding it) is faster because it touches consecutive cache lines and uses less instructions. Some backing benchmarks (besides the ones bellow) for the benefits of this layout can be seen here as well http://www.reedbeta.com/blog/2015/01/12/data-oriented-hash-table/ The obvious drawbacks is: padding can be wasted between the key and value. Because of that keys(), values() and contains() can consume more cache and be slower. Total wasted padding between items (C being the capacity of the table). * Old layout: C * (K-K padding) + C * (V-V padding) * Proposed: C * (K-V padding) + C * (V-K padding) In practice padding between K-K and V-V *can* be smaller than K-V and V-K. The overhead is capped(ish) at sizeof u64 - 1 so we can actually measure the worst case (u8 at the end of key type and value with aliment of 1, _hardly the average case in practice_). Starting from the worst case the memory overhead is: * `HashMap<u64, u8>` 46% memory overhead. (aka *worst case*) * `HashMap<u64, u16>` 33% memory overhead. * `HashMap<u64, u32>` 20% memory overhead. * `HashMap<T, T>` 0% memory overhead * Worst case based on sizeof K + sizeof V: | x | 16 | 24 | 32 | 64 | 128 | |----------------|--------|--------|--------|-------|-------| | (8+x+7)/(8+x) | 1.29 | 1.22 | 1.18 | 1.1 | 1.05 | I've a test repo here to run benchmarks https://github.com/arthurprs/hashmap2/tree/layout ``` ➜ hashmap2 git:(layout) ✗ cargo benchcmp hhkkvv:: hhkvkv:: bench.txt name hhkkvv:: ns/iter hhkvkv:: ns/iter diff ns/iter diff % grow_10_000 922,064 783,933 -138,131 -14.98% grow_big_value_10_000 1,901,909 1,171,862 -730,047 -38.38% grow_fnv_10_000 443,544 418,674 -24,870 -5.61% insert_100 2,469 2,342 -127 -5.14% insert_1000 23,331 21,536 -1,795 -7.69% insert_100_000 4,748,048 3,764,305 -983,743 -20.72% insert_10_000 321,744 290,126 -31,618 -9.83% insert_int_bigvalue_10_000 749,764 407,547 -342,217 -45.64% insert_str_10_000 337,425 334,009 -3,416 -1.01% insert_string_10_000 788,667 788,262 -405 -0.05% iter_keys_100_000 394,484 374,161 -20,323 -5.15% iter_keys_big_value_100_000 402,071 620,810 218,739 54.40% iter_values_100_000 424,794 373,004 -51,790 -12.19% iterate_100_000 424,297 389,950 -34,347 -8.10% lookup_100_000 189,997 186,554 -3,443 -1.81% lookup_100_000_bigvalue 192,509 189,695 -2,814 -1.46% lookup_10_000 154,251 145,731 -8,520 -5.52% lookup_10_000_bigvalue 162,315 146,527 -15,788 -9.73% lookup_10_000_exist 132,769 128,922 -3,847 -2.90% lookup_10_000_noexist 146,880 144,504 -2,376 -1.62% lookup_1_000_000 137,167 132,260 -4,907 -3.58% lookup_1_000_000_bigvalue 141,130 134,371 -6,759 -4.79% lookup_1_000_000_bigvalue_unif 567,235 481,272 -85,963 -15.15% lookup_1_000_000_unif 589,391 453,576 -135,815 -23.04% merge_shuffle 1,253,357 1,207,387 -45,970 -3.67% merge_simple 40,264,690 37,996,903 -2,267,787 -5.63% new 6 5 -1 -16.67% with_capacity_10e5 3,214 3,256 42 1.31% ``` ``` ➜ hashmap2 git:(layout) ✗ cargo benchcmp hhkkvv:: hhkvkv:: bench.txt name hhkkvv:: ns/iter hhkvkv:: ns/iter diff ns/iter diff % iter_keys_100_000 391,677 382,839 -8,838 -2.26% iter_keys_1_000_000 10,797,360 10,209,898 -587,462 -5.44% iter_keys_big_value_100_000 414,736 662,255 247,519 59.68% iter_keys_big_value_1_000_000 10,147,837 12,067,938 1,920,101 18.92% iter_values_100_000 440,445 377,080 -63,365 -14.39% iter_values_1_000_000 10,931,844 9,979,173 -952,671 -8.71% iterate_100_000 428,644 388,509 -40,135 -9.36% iterate_1_000_000 11,065,419 10,042,427 -1,022,992 -9.24% ```
Cache conscious hashmap table Right now the internal HashMap representation is 3 unziped arrays hhhkkkvvv, I propose to change it to hhhkvkvkv (in further iterations kvkvkvhhh may allow inplace grow). A previous attempt is at #21973. This layout is generally more cache conscious as it makes the value immediately accessible after a key matches. The separated hash arrays is a _no-brainer_ because of how the RH algorithm works and that's unchanged. **Lookups**: Upon a successful match in the hash array the code can check the key and immediately have access to the value in the same or next cache line (effectively saving a L[1,2,3] miss compared to the current layout). **Inserts/Deletes/Resize**: Moving values in the table (robin hooding it) is faster because it touches consecutive cache lines and uses less instructions. Some backing benchmarks (besides the ones bellow) for the benefits of this layout can be seen here as well http://www.reedbeta.com/blog/2015/01/12/data-oriented-hash-table/ The obvious drawbacks is: padding can be wasted between the key and value. Because of that keys(), values() and contains() can consume more cache and be slower. Total wasted padding between items (C being the capacity of the table). * Old layout: C * (K-K padding) + C * (V-V padding) * Proposed: C * (K-V padding) + C * (V-K padding) In practice padding between K-K and V-V *can* be smaller than K-V and V-K. The overhead is capped(ish) at sizeof u64 - 1 so we can actually measure the worst case (u8 at the end of key type and value with aliment of 1, _hardly the average case in practice_). Starting from the worst case the memory overhead is: * `HashMap<u64, u8>` 46% memory overhead. (aka *worst case*) * `HashMap<u64, u16>` 33% memory overhead. * `HashMap<u64, u32>` 20% memory overhead. * `HashMap<T, T>` 0% memory overhead * Worst case based on sizeof K + sizeof V: | x | 16 | 24 | 32 | 64 | 128 | |----------------|--------|--------|--------|-------|-------| | (8+x+7)/(8+x) | 1.29 | 1.22 | 1.18 | 1.1 | 1.05 | I've a test repo here to run benchmarks https://github.com/arthurprs/hashmap2/tree/layout ``` ➜ hashmap2 git:(layout) ✗ cargo benchcmp hhkkvv:: hhkvkv:: bench.txt name hhkkvv:: ns/iter hhkvkv:: ns/iter diff ns/iter diff % grow_10_000 922,064 783,933 -138,131 -14.98% grow_big_value_10_000 1,901,909 1,171,862 -730,047 -38.38% grow_fnv_10_000 443,544 418,674 -24,870 -5.61% insert_100 2,469 2,342 -127 -5.14% insert_1000 23,331 21,536 -1,795 -7.69% insert_100_000 4,748,048 3,764,305 -983,743 -20.72% insert_10_000 321,744 290,126 -31,618 -9.83% insert_int_bigvalue_10_000 749,764 407,547 -342,217 -45.64% insert_str_10_000 337,425 334,009 -3,416 -1.01% insert_string_10_000 788,667 788,262 -405 -0.05% iter_keys_100_000 394,484 374,161 -20,323 -5.15% iter_keys_big_value_100_000 402,071 620,810 218,739 54.40% iter_values_100_000 424,794 373,004 -51,790 -12.19% iterate_100_000 424,297 389,950 -34,347 -8.10% lookup_100_000 189,997 186,554 -3,443 -1.81% lookup_100_000_bigvalue 192,509 189,695 -2,814 -1.46% lookup_10_000 154,251 145,731 -8,520 -5.52% lookup_10_000_bigvalue 162,315 146,527 -15,788 -9.73% lookup_10_000_exist 132,769 128,922 -3,847 -2.90% lookup_10_000_noexist 146,880 144,504 -2,376 -1.62% lookup_1_000_000 137,167 132,260 -4,907 -3.58% lookup_1_000_000_bigvalue 141,130 134,371 -6,759 -4.79% lookup_1_000_000_bigvalue_unif 567,235 481,272 -85,963 -15.15% lookup_1_000_000_unif 589,391 453,576 -135,815 -23.04% merge_shuffle 1,253,357 1,207,387 -45,970 -3.67% merge_simple 40,264,690 37,996,903 -2,267,787 -5.63% new 6 5 -1 -16.67% with_capacity_10e5 3,214 3,256 42 1.31% ``` ``` ➜ hashmap2 git:(layout) ✗ cargo benchcmp hhkkvv:: hhkvkv:: bench.txt name hhkkvv:: ns/iter hhkvkv:: ns/iter diff ns/iter diff % iter_keys_100_000 391,677 382,839 -8,838 -2.26% iter_keys_1_000_000 10,797,360 10,209,898 -587,462 -5.44% iter_keys_big_value_100_000 414,736 662,255 247,519 59.68% iter_keys_big_value_1_000_000 10,147,837 12,067,938 1,920,101 18.92% iter_values_100_000 440,445 377,080 -63,365 -14.39% iter_values_1_000_000 10,931,844 9,979,173 -952,671 -8.71% iterate_100_000 428,644 388,509 -40,135 -9.36% iterate_1_000_000 11,065,419 10,042,427 -1,022,992 -9.24% ```
Changes HashMap's memory layout from
[hhhh...KKKK...VVVV...]
to[KVKVKVKV...hhhh...]
. This makes in-place growth easy to implement (and efficient).The removal of
find_with_or_insert_with
has made more cleanup possible.20 benchmark runs, averaged:
thanks to @gankro for the Entry interface, and to @thestinger for improving jemalloc's in-place realloc!
cc @cgaebel
r? @gankro