
Decrease HashMap Load Factor #38003

Closed
pssalmeida opened this issue Nov 25, 2016 · 34 comments
Labels
C-feature-request (Category: A feature request, i.e. not implemented / a PR) · I-needs-decision (Issue: In need of a decision) · T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue)

Comments

@pssalmeida

pssalmeida commented Nov 25, 2016

Currently HashMap uses a load factor of approximately 0.9. While this is appropriate for reads (get), as Robin Hood Hashing gives a low distance from the starting point with very low, effectively fixed variance, such is not the case for writes (insert).

When writing over an occupied bucket, it will incur a displacement involving reads and writes over several buckets, either for the new value or for values already in the map. Not only is this number larger than for reads, it also has an increasing variance, quickly becoming very large as the load approaches 1.

For a load factor a: (UPDATED, previous lookup values were for random probing)

  • the average number of probes in a read is (1 + 1/(1-a))/2 (the previously listed (1/a)·ln(1/(1-a)) applies to random probing)
  • the average number of written buckets in a write is 1/(1-a)
  • the probability of writing no more than k buckets in a write is 1 - a^k
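As a sanity check, the table below can be reproduced directly from these three formulas with a small, hypothetical Rust snippet (not part of the original issue); the 95th-percentile column is the k solving 1 - a^k = 0.95.

// Hypothetical helper: reproduces the table below from the formulas above.
fn main() {
    let loads = [0.500, 0.666, 0.750, 0.800, 0.833, 0.857, 0.875, 0.888, 0.900, 0.950, 0.990];
    println!("{:>6}  {:>6}  {:>6}  {:>6}", "a", "Avg W", "95% W", "Avg R");
    for &a in &loads {
        let avg_w = 1.0 / (1.0 - a);               // average buckets written per insert
        let w95 = (0.05f64).ln() / a.ln();         // k such that 1 - a^k = 0.95
        let avg_r = (1.0 + 1.0 / (1.0 - a)) / 2.0; // average buckets probed per get
        println!("{:>6.3}  {:>6.1}  {:>6.1}  {:>6.1}", a, avg_w, w95, avg_r);
    }
}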

The table below shows the average buckets written, the 95th-percentile buckets written, and the average buckets probed in a get, for different load factors:

a      Avg W   95% W   Avg R
0.500      2     4.3     1.5
0.666      3     7.4     2.0
0.750      4    10.4     2.5
0.800      5    13.4     3.0
0.833      6    16.4     3.5
0.857      7    19.4     4.0
0.875      8    22.4     4.5
0.888      9    25.4     5.0
0.900     10    28.4     5.5
0.950     20    58.4    10.5
0.990    100   298.1    50.5

It can be seen that for a = 0.9 the average number of probes when reading is 5.5, but the average number of buckets written when inserting is 10 and the 95th percentile is 28 buckets. Deletes will not be much better when doing the backwards shift, as most keys will be displaced from their origin (the very-low-variance behaviour of RH).

While in many cases the average over the map's lifetime may not be much affected, if a map is operated near capacity, with elements constantly being inserted and others removed, e.g. when implementing an LRU cache, the number of buckets (re)written per insert (and also per delete) will be markedly higher than desired.

While the exact load factor can be argued about, 0.9 seems manifestly too high. Going from 0.9 to 0.8 adds a 12.5% memory overhead, while going from 0.8 to 0.9 doubles the average number of buckets written per insert and more than doubles the 95th percentile. Therefore, a load factor around 0.8 seems more reasonable.
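For concreteness, the insert/remove churn scenario described above (a map kept at a constant size near its maximum load, as in an LRU cache) looks roughly like the minimal sketch below. It is only an illustration of the workload, not one of the benchmarks discussed in this thread.

use std::collections::HashMap;

// Illustrative churn loop: the map is filled once and then kept at a constant
// size, so every insert happens at the highest load factor the table reaches.
fn churn(capacity: usize, iterations: usize) -> usize {
    let mut map: HashMap<usize, u64> = HashMap::with_capacity(capacity);
    for k in 0..capacity {
        map.insert(k, k as u64);
    }
    for i in 0..iterations {
        map.remove(&i);                     // evict the oldest key (always present)
        map.insert(capacity + i, i as u64); // insert a fresh key at full load
    }
    map.len() // stays equal to `capacity` throughout
}

fn main() {
    assert_eq!(churn(10_000, 100_000), 10_000);
}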

@sfackler
Member

Concrete performance numbers for various load factors would probably be helpful.

@abonander
Contributor

Honestly, why isn't the load factor configurable to begin with? That would make it quite a bit easier to tune for specific use cases.

@arthurprs
Contributor

I was going to propose 0.833 (5/6) based on some tests I ran earlier, but we need more numbers to make an informed decision.

@funny-falcon

Looks like 0.8 gives a 20% improvement on reads (OK, the number of buckets read does not translate directly into performance, but still).

@pczarn
Contributor

pczarn commented Nov 27, 2016

I don't know the distribution for the number of buckets accessed by insertion, but here are accurate numbers for lookup.

a        50% R   90% R   95% R
0.8        1.8     5.5     7.0
0.833      2.1     6.7     8.7
0.9        3.5    10.9    14.5

[figure: lookup_cost_with_circles2 — lookup cost distribution plot]

@funny-falcon

funny-falcon commented Nov 27, 2016

So, 0.8 compared to 0.833 consumes 5% more memory and gives 15-20% shorter collision chains.

@arthurprs
Contributor

@funny-falcon how did you come up with these numbers?

@pssalmeida
Author

@abonander I also found it surprising that the load factor cannot be chosen; making it configurable would be a good idea. Anyway, the question of what the sensible default should be remains.

@pczarn I find such long tails for the lookup distribution strange ... I had the impression the variance would be much lower. Also, the shape does not look right: for a high load factor it should start with a gentle slope and have a steep slope near the average.

Anyway, lookups are not the problem, as we are talking about 1 cache line most of the time, sometimes 2, rarely more than 2. I would also point out that for inserts the cost is "amplified" by the payload size. We are not talking about only traversing hashes: an average of 10 means, in addition to the hashes, displacing 10 times the size of the key and value. This makes such large numbers even scarier.

@funny-falcon

@arthurprs only with mathematics and the numbers from the table above. (Note: I've edited my comment to say 15-20% shorter collision chains, not performance. Because of the great cache locality of Robin Hood hashing, the performance difference will usually be smaller.)

For used space (a quick check of this arithmetic is sketched below):

  • a 0.833 fill rate means one needs 1/0.833 = 1.2× the space, i.e. 20% excess space;
  • a 0.8 fill rate means one needs 1/0.8 = 1.25× the space, i.e. 25% excess space;
  • 1.25/1.2 = 1.042, so a 0.8 fill rate uses about 4.2% more space (on average) than a 0.833 fill rate.
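A trivial check of this arithmetic, as a hypothetical snippet rather than anything from the thread:

// Excess space of a 0.8 fill rate relative to 5/6 ≈ 0.833.
fn main() {
    let space = |fill: f64| 1.0 / fill;        // table slots needed per element
    let ratio = space(0.8) / space(5.0 / 6.0); // 1.25 / 1.20
    println!("0.8 uses {:.1}% more space than 0.833", (ratio - 1.0) * 100.0);
    // prints: 0.8 uses 4.2% more space than 0.833
}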

@matklad
Member

matklad commented Nov 28, 2016

@pssalmeida are those back-of-the-envelope calculations? They feel a bit odd to me, because they say that the average probe count for reading at a 0.9 load factor should be 2.56. I've recently measured plain linear probing and Robin Hood hashing and I get 5.48 as the average probe length (for the elements which are in the table). Perhaps my measurements are wrong, but it still might be a good idea to instrument and measure the current implementation, as suggested by @sfackler.

@arthurprs
Contributor

@matklad I was checking your test repo and I can't wrap my head around the timing/cache results. It should take less time and put less pressure on the cache; it makes no sense to me whatsoever.

@matklad
Member

matklad commented Nov 28, 2016

@arthurprs this puzzles me as well! I want to plot the actual distribution of probe lengths and to run some specific tests about cache access (basically execute queries like "sum of the elements from i to k + i" with different distributions of k). But at least the paper you've linked to from another issue seems to imply that RH is not automatically better than plain LP?

We can conclude that RH provides a very interesting trade-off: for a small penalty (often within 1-5%) in peak performance on the best of cases (all lookups successful), RH significantly improves on the worst-case over LP in general, up to more than a factor 4.

@funny-falcon

A possible solution to the puzzle:

  • lower variance not only means there are fewer long chains, it can also mean there are fewer short chains;
  • longer chains have more impact on the mean value;
  • so, given that the mean value over all chains is the same, RH hashing has a "larger mean of short chains";
  • in other words: while RH has a better worst case for a single lookup, it has a worse average case.

And this is a direct result of its algorithm: it prefers to make short chains longer in order to cut the probability of a "too long" chain.

I think that if you compare the distribution of the number of buckets probed between plain linear probing and RH hashing (as @pczarn did above for RH), it will explain your timing and cache-miss results.

@arthurprs
Contributor

arthurprs commented Nov 28, 2016

That's it @funny-falcon. It's probably worth pointing out that the hash algorithm is "~perfect" and there are no deletions involved, so that's the best case for LP.

Percentile   RH LP   plain LP
25'th            2          1
50'th            4          1
85'th           10          6
90'th           12         10
95'th           15         21
99'th           23         75
100'th          56       1271
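For anyone wanting to reproduce numbers in this ballpark, below is a self-contained simulation sketch (assumptions: random hashes from RandomState, a power-of-two table, ~0.9 load, no deletions); it is not the actual harness used for the figures above.

use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

// Insert `n` synthetic keys into a table of `cap` slots using linear probing,
// optionally with the Robin Hood displacement rule, and return the sorted
// displacements (distance from the ideal bucket) of all stored entries.
fn displacements(robin_hood: bool, cap: usize, n: usize) -> Vec<usize> {
    let mut slots: Vec<Option<usize>> = vec![None; cap];
    let state = RandomState::new();
    for key in 0..n {
        let mut h = state.build_hasher();
        h.write_usize(key);
        let mut idx = (h.finish() as usize) % cap;
        let mut dist = 0usize;
        loop {
            match slots[idx] {
                None => {
                    slots[idx] = Some(dist);
                    break;
                }
                // Robin Hood: steal the slot from the "richer" entry and keep
                // probing on behalf of the displaced one.
                Some(existing) if robin_hood && existing < dist => {
                    slots[idx] = Some(dist);
                    dist = existing;
                }
                Some(_) => {}
            }
            idx = (idx + 1) % cap;
            dist += 1;
        }
    }
    let mut dists: Vec<usize> = slots.into_iter().flatten().collect();
    dists.sort_unstable();
    dists
}

fn main() {
    let cap = 1 << 16;
    let n = cap * 9 / 10; // ~0.9 load factor
    for (name, rh) in [("RH LP", true), ("plain LP", false)] {
        let d = displacements(rh, cap, n);
        for p in [25usize, 50, 85, 90, 95, 99, 100] {
            let i = (d.len() - 1) * p / 100;
            println!("{name}: {p}'th percentile displacement = {}", d[i]);
        }
    }
}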

@pssalmeida
Author

@matklad I made a mistake. I was using the results from https://arxiv.org/abs/1605.04031, which I had just skimmed, but now I noticed that they discuss random probing. Meanwhile I found this talk https://www.ime.usp.br/~yoshi/FoCM2014/talks/viola.pdf, which states that the mean is (1 + 1/(1-a))/2, which gives exactly your experimental results!

So, the results are worse, even for lookups, which is another reason to decrease the load factor.

@pssalmeida
Author

Updated table with new formula for average number of buckets in lookup, for linear probing.

@sfackler
Member

Again, I care far more about actual timings of real operations rather than probe counts.

@arthurprs
Contributor

@pssalmeida Let me know if you'd like me to run numbers. I've been running HashMap benchmarks extensively over the last couple of months.

@pssalmeida
Author

@sfackler sure. If someone who has already been doing such measurements on HashMap could volunteer, that would be nice. The experiments by @matklad already show that timing and cache misses can increase relative to simple linear probing.

Intuitively, for simple LP most hashes will have smaller distances to origin than with RH LP, and will incur a single cache-line read. Even if some are farther away, as long as they remain within a second cache line it won't matter much, and only rarely will a third cache line be visited.

For RH LP, most hashes will have a distance to origin closer to the mean. If that is not very small, a non-negligible number of lookups may incur a second cache line read, giving worse results.

@pssalmeida
Author

@arthurprs If you could run some tests it would be great. An interesting case would be doing lots of insert-delete cycles on a table loaded to full capacity.

@arthurprs
Contributor

arthurprs commented Nov 28, 2016

EDIT: updated numbers below

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Nov 28, 2016
@brson brson added the relnotes Marks issues that should be documented in the release notes of the next release. label Nov 28, 2016
@brson
Contributor

brson commented Nov 28, 2016

Numbers look good to me.

@sfackler
Member

@abonander One potential concern there is that the load factor you'd select is somewhat tied to the implementation. If we significantly changed HashMap's internals (as we're probably going to do to resolve the n^2 thing), the load factor someone selected may all of a sudden be unacceptably slow.

@abonander
Contributor

abonander commented Nov 29, 2016

@sfackler Then it may be better to abstract the load factor behind an enum that just lets them select a load factor that produces the desired effect from the implementation, maybe like the following:

/// An abstraction over a hashtable's load factor which allows the user to choose which aspect they
/// would prefer to optimize, possibly at a detriment to other metrics.
pub enum LoadFactor {
    /// Use a load factor that produces the fastest lookup times without seriously degrading other aspects.
    OptimizeLookup,
    /// Use a load factor that produces the fastest insertion times without seriously degrading other aspects.
    OptimizeInsert,
    /// Use a load factor that produces the lowest memory usage without seriously degrading other aspects.
    SaveMemory,
    /// Use a load-factor with good all-round performance.
    Balanced,
    /// A custom load factor whose performance may be affected by changes to the implementation.
    Custom(f64),
}

Then the default would be Balanced. People using the Custom variant will be acknowledging that their chosen load-factor may change performance characteristics when the implementation is changed.
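A hypothetical sketch of how such an enum might be consumed internally (nothing below exists in std, and the numeric factors are placeholders, not proposed values):

// Placeholder mapping from the abstract preference to a concrete load factor.
// The concrete numbers would be chosen (and freely changed) by the
// implementation; only `Custom` pins an exact value.
pub enum LoadFactor {
    OptimizeLookup,
    OptimizeInsert,
    SaveMemory,
    Balanced,
    Custom(f64),
}

impl LoadFactor {
    fn resolve(&self) -> f64 {
        match *self {
            LoadFactor::OptimizeLookup => 0.80, // placeholder value
            LoadFactor::OptimizeInsert => 0.75, // placeholder value
            LoadFactor::SaveMemory => 0.90,     // placeholder value
            LoadFactor::Balanced => 0.833,      // placeholder value
            LoadFactor::Custom(f) => f.clamp(0.5, 0.99),
        }
    }
}

fn main() {
    // A hypothetical `HashMap::with_load_factor(LoadFactor::Balanced, hasher)`
    // would call something like `resolve()` when sizing its table.
    println!("Balanced resolves to {}", LoadFactor::Balanced.resolve());
}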

@arthurprs
Contributor

arthurprs commented Nov 29, 2016

The fill factor isn't guaranteed or exposed directly by the API, so it shouldn't be a problem. An API for exposing/changing the load factor will likely need an RFC.

@arthurprs
Contributor

arthurprs commented Nov 29, 2016

Complete results now, code is here https://gist.github.com/arthurprs/97dabb2c28f8d778dc377d69d0b95758

It's really important to emphasize that only the grow benchmark is realistic; the others simulate the worst-case performance of each load factor.

removed due to broken numbers, see below

@pssalmeida
Author

@arthurprs Thanks for the benchmarks. The lru_sim benchmark, even if not realistic, shows an impressive difference, but even the others show that there is considerable impact going from 0.91 to lower load factors.

@arthurprs
Contributor

Going over some issues I noticed that people subscribed to this might also be interested in #38368.

@arthurprs
Contributor

arthurprs commented Apr 19, 2017

Now that #40561 is in, this is the only remaining place I can think of where we can extract nice gains (at least without some kind of major rewrite with its own trade-offs).

With the updated code above I conducted several benchmarks with various load factors: 0.909 (today), 0.875, 0.857, 0.833 and 0.8.
The results can be seen in the gist (edit: removed).

The current implementation is reasonably compact, so it wouldn't be unreasonable to set it down to 0.80, as that's still high (the highest?) compared to most stdlibs. Something in between is probably the way to go though.

@arthurprs
Contributor

arthurprs commented May 1, 2017

I found out that the lookup benches are broken. I will do another round of tests soon.

@arthurprs
Contributor

arthurprs commented May 26, 2017

Ok, I found some time today and here are the final results. It turns out that benchmarking across load factors is really hard.

Tested load factors: 90.9% (current), 87.5%, 85.7%, 83.3% and 80%.

 name                            s909_ ns/iter  s875_ ns/iter  diff ns/iter   diff % 
 ::grow_10_000                   711,918        629,273             -82,645  -11.61% 
 ::grow_big_value_10_000         1,420,685      1,286,508          -134,177   -9.44% 
 ::insert_100_000                3,504,832      3,453,408           -51,424   -1.47% 
 ::insert_int_big_value_100_000  7,560,059      7,538,186           -21,873   -0.29% 
 ::insert_str_100_000            3,761,604      3,632,067          -129,537   -3.44% 
 ::lookup_100_000                341,234        344,626               3,392    0.99% 
 ::lookup_100_000_unif           344,869        345,345                 476    0.14% 
 ::lru_sim                       9,748,455      9,800,613            52,158    0.54% 
 ::merge_shuffle                 1,062,361      970,109             -92,252   -8.68% 
➜  hashmap2 git:(adapt) ✗ cargo benchcmp s909_ s857_ bench.txt
 name                            s909_ ns/iter  s857_ ns/iter  diff ns/iter   diff % 
 ::grow_10_000                   711,918        590,330            -121,588  -17.08% 
 ::grow_big_value_10_000         1,420,685      1,231,806          -188,879  -13.29% 
 ::insert_100_000                3,504,832      3,543,471            38,639    1.10% 
 ::insert_int_big_value_100_000  7,560,059      7,601,703            41,644    0.55% 
 ::insert_str_100_000            3,761,604      3,688,459           -73,145   -1.94% 
 ::lookup_100_000                341,234        347,284               6,050    1.77% 
 ::lookup_100_000_unif           344,869        345,188                 319    0.09% 
 ::lru_sim                       9,748,455      9,665,578           -82,877   -0.85% 
 ::merge_shuffle                 1,062,361      956,885            -105,476   -9.93% 
➜  hashmap2 git:(adapt) ✗ cargo benchcmp s909_ s833_ bench.txt
 name                            s909_ ns/iter  s833_ ns/iter  diff ns/iter   diff % 
 ::grow_10_000                   711,918        564,378            -147,540  -20.72% 
 ::grow_big_value_10_000         1,420,685      1,177,490          -243,195  -17.12% 
 ::insert_100_000                3,504,832      3,484,783           -20,049   -0.57% 
 ::insert_int_big_value_100_000  7,560,059      7,541,711           -18,348   -0.24% 
 ::insert_str_100_000            3,761,604      3,615,516          -146,088   -3.88% 
 ::lookup_100_000                341,234        347,486               6,252    1.83% 
 ::lookup_100_000_unif           344,869        352,215               7,346    2.13% 
 ::lru_sim                       9,748,455      9,739,924            -8,531   -0.09% 
 ::merge_shuffle                 1,062,361      908,930            -153,431  -14.44% 
➜  hashmap2 git:(adapt) ✗ cargo benchcmp s909_ s80_ bench.txt
 name                            s909_ ns/iter  s80_ ns/iter  diff ns/iter   diff % 
 ::grow_10_000                   711,918        517,805           -194,113  -27.27% 
 ::grow_big_value_10_000         1,420,685      1,132,740         -287,945  -20.27% 
 ::insert_100_000                3,504,832      3,542,416           37,584    1.07% 
 ::insert_int_big_value_100_000  7,560,059      7,613,803           53,744    0.71% 
 ::insert_str_100_000            3,761,604      3,911,995          150,391    4.00% 
 ::lookup_100_000                341,234        345,034              3,800    1.11% 
 ::lookup_100_000_unif           344,869        343,880               -989   -0.29% 
 ::lru_sim                       9,748,455      9,656,683          -91,772   -0.94% 
 ::merge_shuffle                 1,062,361      863,767           -198,594  -18.69%

lru_sim (delete + insert), insert and lookup are expected to perform the same because the load factor is effectively fixed in those benchmarks.

@arthurprs
Contributor

And some extra worst-case tests, where the constants are set to 95% of .capacity() after creation with with_capacity(X): https://gist.github.com/arthurprs/ae084a0c5a119ddd732d93eb6fd31d05

@Mark-Simulacrum Mark-Simulacrum added the C-feature-request Category: A feature request, i.e: not implemented / a PR. label Jul 26, 2017
@dtolnay dtolnay added I-needs-decision Issue: In need of a decision. and removed relnotes Marks issues that should be documented in the release notes of the next release. labels Nov 18, 2017
@arthurprs
Contributor

This may not be as significant with the new hashmap (hashbrown) implementation. @Amanieu might want to add his thoughts here.

@Amanieu
Member

Amanieu commented Nov 8, 2019

The main concern here was that the high load factor would cause many entries to be moved during an insertion. Since the new hash table implementation does not move existing entries on insert, I think this can be closed.

@Amanieu Amanieu closed this as completed Nov 8, 2019