WIP: Robin Hood hashing #1429
Conversation
This changes the collision handling strategy in _hashindex.c to robin-hood hashing. Incidentally this should be mutually compatible with the old version. This hasn't been properly tested yet, except by the unittests, but I'm sharing this early to get some feedback. Some testing and a before-and-after performance measurement will follow.
The 2 failing tests seem to have a hardcoded hash that no longer matches the repo hash once the robin-hood collision handling has been implemented. registry.RemoteRepositoryCheckTestCase setUp will fail if any selftest fails thus preventing a bunch of tests from running. Obviously this commit is not intended to be merged, but it's here to allow the rest of the tests to run, to show that the proposed change doesn't break borg's tests in fundamental ways.
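For readers new to the scheme, the core robin-hood rule can be sketched on a toy table roughly like this (purely illustrative C, not the actual _hashindex.c code; `Bucket`, `rh_insert`, `NUM_BUCKETS` and friends are made-up names, and duplicate keys and a completely full table are not handled):

```c
#include <stdint.h>

#define NUM_BUCKETS 1024
#define EMPTY 0                      /* key 0 is reserved to mark a free bucket */

typedef struct {
    uint32_t key;
    uint32_t value;
} Bucket;

static Bucket table[NUM_BUCKETS];    /* zero-initialized, i.e. all buckets EMPTY */

static uint32_t ideal_index(uint32_t key) { return key % NUM_BUCKETS; }

/* how far idx is past the bucket the key would ideally live in */
static uint32_t displacement(uint32_t key, uint32_t idx) {
    return (idx + NUM_BUCKETS - ideal_index(key)) % NUM_BUCKETS;
}

/* probe linearly; whenever the resident entry is closer to its ideal bucket
   than we are to ours, it hands over its bucket and we continue inserting
   the displaced entry instead ("take from the rich, give to the poor") */
static void rh_insert(uint32_t key, uint32_t value) {
    uint32_t idx = ideal_index(key);
    uint32_t offset = 0;             /* our displacement so far */

    while (table[idx].key != EMPTY) {
        uint32_t other_offset = displacement(table[idx].key, idx);
        if (other_offset < offset) {
            Bucket tmp = table[idx]; /* steal the richer entry's bucket */
            table[idx].key = key;
            table[idx].value = value;
            key = tmp.key;
            value = tmp.value;
            offset = other_offset;
        }
        idx = (idx + 1) % NUM_BUCKETS;
        offset++;
    }
    table[idx].key = key;
    table[idx].value = value;
}
```

The swap rule is what bounds the worst-case probe length: displacements get evened out across the table instead of piling up on a few unlucky keys.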
src/borg/_hashindex.c
Outdated
@@ -68,7 +68,7 @@ static int hash_sizes[] = {
 };

 #define HASH_MIN_LOAD .25
-#define HASH_MAX_LOAD .75 /* don't go higher than 0.75, otherwise performance severely suffers! */
+#define HASH_MAX_LOAD 0.95 /* don't go higher than 0.75, otherwise performance severely suffers! */
Why is this changed? Also, make sure to update the comment.
Minor nitpicking, don't put a leading zero if the definition above doesn't have it.
> Why is this changed? Also, make sure to update the comment.
The point of the Robin Hood hashing is to minimize the worst case for collisions by spreading the pain across all addresses. This should allow high loads in the hash table without performance degrading much. Also, I should add that the idea for this change isn't mine; @ThomasWaldmann suggested it as something interesting to do at the EuroPython sprints.
I intentionally didn't update the comments yet; I'll do that once I've run some benchmarks to find an appropriate value.
> Minor nitpicking, don't put a leading zero if the definition above doesn't have it.
Will do. BTW, the code style for C in this project isn't 100% clear to me, so if there are any other style no-no's in my PR, please let me know.
Thanks for working on this! :)
If you change HASH_MAX_LOAD, do a full text search for it, there is another place depending on its value.
> If you change HASH_MAX_LOAD, do a full text search for it, there is another place depending on its value.
Had a look. All I can see is the comment next to the value and docs/internals.rst. I would update both once I identify a good value for this constant. Let me know if there are any places I've missed.
search for 1.35 in the source.
Thanks for the feedback so far. I'll follow up once I get some performance numbers in as well.
src/borg/_hashindex.c
Outdated
/* we have a collision */
other_offset = distance(
    index, idx, hashindex_index(index, bucket_ptr));
if ( other_offset < offset) {
nitpick: no blank after (
Really looking forward to the effects of this CS. The paper sounded very promising about performance / efficiency.
Related: #536
@rciorba can you do a separate fixup commit with fixes for all the feedback you got?
Sure thing. I'll have some time later today.
src/borg/_hashindex.c
Outdated
@@ -111,7 +111,7 @@ hashindex_index(HashIndex *index, const void *key)
 static int
 hashindex_lookup(HashIndex *index, const void *key)
This function could also be optimized. Currently for the worst case scenario (not finding a key in the index) we scan all buckets until we find an empty one. At high fill ratios this might get close to O(N). However if we track the maximum offset in the entire hashmap we could bail after at most max_offset iterations.
As the PR currently stands, we could just load an old hashmap and start operating on it with the new collision handling code, and it would just work; hashmaps created by this would also still be usable by older borg versions. Changing hashindex_lookup, however, would require us to convert the hashmap explicitly, and also change the HashHeader to track this max offset. That would be a bigger deal because it would impact backwards compatibility, so some planning needs to go into this.
One potential idea would be to use the MAGIC string in the header to also encode a version. For example, if we turn BORG_IDX into BORG_I plus 2 bytes for versioning, we could determine whether this version of the index is fully robin-hood compliant and, if not, convert it on load from disk.
@ThomasWaldmann I'd like to hear your thoughts on this.
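As a rough illustration of that versioning idea (the header layout sketched here is invented for illustration only; the real HashHeader and any future on-disk format may well differ):

```c
#include <string.h>

/* hypothetical check: first 6 bytes identify the file type, the last 2 carry
   a format version, so a legacy "BORG_IDX" header (last two bytes "DX") can
   be told apart from a converted, versioned index */
static int index_format_version(const unsigned char magic[8]) {
    if (memcmp(magic, "BORG_I", 6) != 0)
        return -1;                          /* not an index file at all */
    if (memcmp(magic + 6, "DX", 2) == 0)
        return 0;                           /* legacy layout: convert on load */
    return (magic[6] << 8) | magic[7];      /* version of the new layout */
}
```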
> we could bail after at most max_offset iterations.
Actually, if the offset of the key is smaller than the number of buckets we've looked at, we can bail. There's no way the next bucket will contain our desired key, since it would have been swapped on insert.
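A sketch of that early exit, continuing the toy table from the description above (hypothetical names, not the real hashindex_lookup):

```c
/* returns the bucket index of key, or -1 on a miss; once the resident entry
   is closer to its ideal bucket than we are to ours, our key cannot be any
   further along the chain, because inserting it would have displaced that
   entry, so we can stop probing early */
static int rh_lookup(uint32_t key) {
    uint32_t idx = ideal_index(key);
    uint32_t offset = 0;

    while (table[idx].key != EMPTY) {
        if (table[idx].key == key)
            return (int)idx;
        if (displacement(table[idx].key, idx) < offset)
            return -1;                   /* bail out early */
        idx = (idx + 1) % NUM_BUCKETS;
        offset++;
    }
    return -1;
}
```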
Scanning a big part of the hashindex sounds evil. Guess at 95% fill rate, we would run into having to always scan about 20% of all buckets.
Maybe that was the real reason for the perf breakdown I was sometimes seeing?
Maybe first keep the code compatible until it is accepted / merged, but we'll keep the idea for later.
Well, one way to keep it compatible and still speed up hashindex_lookup is to always reinsert all items when loading from disk (no more expensive than a resize). I'll do some measurements of performance with and without this implemented.
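A sketch of that re-insert-on-load step on the same toy table (the real code would walk the buckets of the freshly read on-disk index instead; `loaded` and `n` are stand-ins for that):

```c
#include <string.h>

/* rebuild the table with robin-hood probing by re-inserting every live entry
   from an index written by the old code; roughly the cost of one resize pass */
static void rh_rebuild(const Bucket *loaded, size_t n) {
    size_t i;
    memset(table, 0, sizeof(table));    /* every bucket back to EMPTY */
    for (i = 0; i < n; i++) {
        if (loaded[i].key != EMPTY)
            rh_insert(loaded[i].key, loaded[i].value);
    }
}
```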
* correctly track offset when stealing the location of existing bucket on collision handling
* extract function to swap memory locations
* avoid malloc/free on each call to hashindex_set
* fixed many typos
@enkore the backshift deletion does not need tombstones, so it would solve that potential issue. There are two cases of a bit more aggressive HT use:
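For reference, a sketch of backshift deletion on the toy table from earlier: instead of writing a tombstone, the entries following the deleted one are shifted back one bucket until an empty bucket or an entry already sitting in its ideal position is reached (again illustrative names, not _hashindex.c code):

```c
/* backshift ("eager") deletion: no tombstone is left behind */
static void rh_delete(uint32_t key) {
    int found = rh_lookup(key);
    if (found < 0)
        return;                              /* key not present */
    uint32_t idx = (uint32_t)found;
    for (;;) {
        uint32_t next = (idx + 1) % NUM_BUCKETS;
        if (table[next].key == EMPTY ||
            displacement(table[next].key, next) == 0) {
            table[idx].key = EMPTY;          /* end of the chain: just free it */
            return;
        }
        table[idx] = table[next];            /* pull the successor one step closer to home */
        idx = next;
    }
}
```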
@enkore The current HT implementation of hashindex_lookup, when it finds a key, will move said key over the first tombstone it encountered during the scan. This means the impact of tombstones is not that big. A potential worst-case scenario: deleting a large number of keys would, for a while, make every get also perform a memory swap. Also, while thinking about an answer to your comment I realized my other PR did something stupid with tombstones. So thanks for your comment, and here's a PR to fix that issue: #2116
But doesn't that leave another tombstone in the place where the key was found (since further items with the same key could be located beyond the found item, the bucket-chain can't be broken)?
@enkore You're right, that does leave another tombstone, but it's hard to reason about what the impact of them is. The only really bad case I can imagine is having the hash at 75% fill rate, but having the 25% free space be mostly tombstones. And I'm not sure if that's likely to happen, since inserting keys would replace the tombstones with real values. My reasoning about this:
How about we add a simple tracking of tombstones on an experimental 'telemetry' branch, then we get some actual numbers (basically find out what X is)? Also, an eager delete could be implemented for the current master branch as well. I'll have a crack at it tonight/tomorrow if time allows.
Yes, my reasoning so far always has been in the direction of tombstones creating very long bucket chains that make look-ups slow, since that would correlate well with the "worst case" issues observed.
Sounds like a good idea. See also #1429 (comment) and #1429 (comment). Perhaps that would be somewhat efficient at combating the supposed problem as well (resize / refill table if
A rebase would be useful so this can be practically tested.
Having tombstones in hash maps with robin hood hashing makes even less sense than in other hash maps.
I suggest a load factor between 80% and 90%. The difference between 90% and 93% may seem small, but having 7% free space instead of 10% is a big deal. A similar effect slows down hard drive file systems that are almost full. For a discussion about decreasing the load factor, see rust-lang/rust#38003
You may want to remove the use of
Are keys influenced by foreign input? I'm not sure what the purpose of this implementation is.
src/borg/_hashindex.c
Outdated
while(!BUCKET_IS_EMPTY(index, idx) && !BUCKET_IS_DELETED(index, idx)) {
    idx = (idx + 1) % index->num_buckets;
/* we have a collision */
This loop skips over some number of entries, then continues by displacing other entries. Our inserted entry will take the bucket of the first displaced entry. Instead, you can forward-shift all buckets from the first displaced entry until the end of the chunk, using memmove. By increasing the displacement of all these buckets by 1, you keep the invariant of robin hood hashing, which relies on comparing displacements, not on their absolute value (other_offset < offset).
Tell me if something in my explanation is unclear. I'm using different names. My word for "distance" is "displacement".
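A sketch of that memmove variant on the toy table from earlier (for brevity it does not handle a chunk that wraps around the end of the array, which the real code would have to):

```c
#include <string.h>

/* find the insertion point (the first resident entry that is closer to its
   ideal bucket than we are to ours), then forward-shift the whole contiguous
   chunk up to the next empty bucket with one memmove; every shifted entry's
   displacement grows by 1, so the robin-hood invariant is preserved */
static void rh_insert_shift(uint32_t key, uint32_t value) {
    uint32_t idx = ideal_index(key);
    uint32_t offset = 0;

    while (table[idx].key != EMPTY &&
           displacement(table[idx].key, idx) >= offset) {
        idx = (idx + 1) % NUM_BUCKETS;
        offset++;
    }
    if (table[idx].key != EMPTY) {
        uint32_t end = idx;                  /* find the end of the chunk */
        while (table[end].key != EMPTY)
            end++;                           /* no wrap-around handling here */
        memmove(&table[idx + 1], &table[idx], (end - idx) * sizeof(Bucket));
    }
    table[idx].key = key;                    /* place the new entry at the insertion point */
    table[idx].value = value;
}
```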
So basically take the block (the entire contiguous section of buckets up to the first empty bucket) and memmove it forward one address, then insert at the ideal location. Oh my, yes, such an elegant solution! Thanks!
> not on their absolute value (other_offset < offset).
Not sure what you mean by absolute value. The offsets are the relative distance from the ideal bucket a key would be in vs which bucket it is now, so I think it's the same as what you call "displacement".
> Tell me if something in my explanation is unclear. I'm using different names. My word for "distance" is "displacement".
Indeed, displacement is a better name for it.
Thanks for the feedback!
Thanks for the feedback!
There are no tombstones in this implementation. This PR also became a place to discuss the problems of the current implementation, hence the discussion about them.
Thanks. I'll check those out. I did benchmark at different load levels, so the performance impact should be visible there as well.
I'll benchmark without the distance as well, but the periodic run of
This PR adds a pointer param so we can 'return' not just the found index but also the address to start the search in case of a miss, so in that respect it should behave the same as Rust's hashmap.
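Roughly what such a lookup-with-hint could look like on the toy table from earlier (a hypothetical helper to illustrate the idea, not the actual signature added by this PR):

```c
/* on a hit, return the bucket index; on a miss, return -1 and report via
   *start_idx the bucket where the probe stopped, so a following insert does
   not have to re-scan from the ideal bucket */
static int rh_lookup_hint(uint32_t key, uint32_t *start_idx) {
    uint32_t idx = ideal_index(key);
    uint32_t offset = 0;

    while (table[idx].key != EMPTY) {
        if (table[idx].key == key)
            return (int)idx;                 /* hit */
        if (displacement(table[idx].key, idx) < offset)
            break;                           /* key cannot be further along */
        idx = (idx + 1) % NUM_BUCKETS;
        offset++;
    }
    *start_idx = idx;                        /* where a subsequent set would continue */
    return -1;
}
```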
I don't understand your question, could you please elaborate?
I believe the end goal was to use less memory by allowing larger fill rates. Implementing RH was mentioned at a EuroPython sprint for volunteers to pick up, and it sounded interesting, but this is my first contact with their code base. Since I don't know the project that well, I'm not sure what the usage patterns of the hashmap are. I was hoping the maintainers of borg could help guide me with that, hence the multiple variations of the implementation being benchmarked to see what the tradeoffs are for different use cases (like the distance short-cutting in lookup and its impact on getitem with no misses vs getitem with misses).
Keys are the first few bits of a (usually keyed) cryptographic hash of input data. In the unkeyed case keys can thus be influenced to some extent; in the keyed case (used whenever the data is stored encrypted) this isn't possible at all. I'd say that for all intents and purposes the keys can be seen as uniformly distributed values across the range.
@enkore That makes sense. Thank you!
Insert a few keys, delete some of them, check we still have the values we expect, check the deleted ones aren't there.
Because I was calling hashindex_init with different capacities (based on desired fill rate), the actual fill rate didn't match expected.
setup would always finish by returning a new empty index, regardless of what was prepared.
Closing this PR and starting a fresh one since this one has grown to massive proportions and many of the intermediate changesets are no longer relevant to the final form.
This changes the collision handling strategy in _hashindex.c to robin-hood hashing. Incidentally this should be mutually compatible with the old version.
This hasn't been properly tested yet, except by the unittests, but I'm sharing this early to get some feedback. Some testing and a before-and-after performance measurement will follow.
Also please note this temporarily disables a couple of tests that hard-coded hashes which now no longer match.