
WIP: Robin Hood hashing #1429

Closed
wants to merge 50 commits into from

Conversation

@rciorba (Contributor) commented Aug 2, 2016

This changes the collision handling strategy in _hashindex.c to
Robin Hood hashing. Incidentally, this should be mutually compatible
with the old version.

This hasn't been properly tested yet, except by the unit tests, but
I'm sharing this early to get some feedback. Some testing and a
before-and-after performance measurement will follow.

Also, please note this temporarily disables a couple of tests that
had hard-coded hashes which no longer match.

rciorba added 2 commits August 2, 2016 21:11
This changes the collision handling strategy in _hashindex.c to
robin-hood hashing. Incidentally this should be mutually compatible
with the old version.
This hasn't been properly tested yet, except by the unittests, but
I'm sharing this early to get some feedback. Some testing and a
before-and-after performance measurement will follow.
The 2 failing tests seem to have a hardcoded hash that no longer
matches the repo hash once the robin-hood collision handling has
been implemented.

registry.RemoteRepositoryCheckTestCase setUp will fail if any
selftest fails, thus preventing a bunch of tests from running.

Obviously this commit is not intended to be merged, but it's here
to allow the rest of the tests to run, to show that the proposed
change doesn't break borg's tests in fundamental ways.
@@ -68,7 +68,7 @@ static int hash_sizes[] = {
 };

 #define HASH_MIN_LOAD .25
-#define HASH_MAX_LOAD .75 /* don't go higher than 0.75, otherwise performance severely suffers! */
+#define HASH_MAX_LOAD 0.95 /* don't go higher than 0.75, otherwise performance severely suffers! */
@PlasmaPower (Contributor) commented Aug 2, 2016

Why is this changed? Also, make sure to update the comment.

Minor nitpicking, don't put a leading zero if the definition above doesn't have it.

@rciorba (Contributor, Author) replied:

> Why is this changed? Also, make sure to update the comment.

The point of Robin Hood hashing is to minimize the worst case for collisions by spreading the pain across all addresses. This should allow high load factors in the hash table without performance degrading much. Also, I should add that the idea for this change isn't mine; @ThomasWaldmann suggested it as something interesting to do at the EuroPython sprints.
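
For illustration, a minimal sketch of that insertion rule (illustrative names, not the PR's actual _hashindex.c code; assumes the table always keeps at least one empty bucket):

/* Sketch of the Robin Hood insert rule. Illustrative names only. */
typedef struct { int used; unsigned key; unsigned value; } Bucket;

/* displacement: how far bucket idx is from an entry's ideal bucket */
static int displacement(int ideal, int idx, int num_buckets) {
    return (idx - ideal + num_buckets) % num_buckets;
}

static void rh_insert(Bucket *tab, int num_buckets, unsigned key, unsigned value) {
    int idx = key % num_buckets;   /* ideal bucket */
    int offset = 0;                /* our displacement so far */
    while (tab[idx].used) {
        int other = displacement(tab[idx].key % num_buckets, idx, num_buckets);
        if (other < offset) {      /* resident is "richer": steal its bucket */
            Bucket evicted = tab[idx];
            tab[idx].key = key;
            tab[idx].value = value;
            key = evicted.key;
            value = evicted.value;
            offset = other;        /* keep probing on behalf of the evicted entry */
        }
        idx = (idx + 1) % num_buckets;
        offset++;
    }
    tab[idx].used = 1;
    tab[idx].key = key;
    tab[idx].value = value;
}

No entry ends up much further from its ideal bucket than any other, which is what keeps worst-case probe lengths short even at high load factors.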

I intentionally didn't update the comments; I'll do that once I've run some benchmarks to find an appropriate value.

> Minor nitpicking, don't put a leading zero if the definition above doesn't have it.

Will do. BTW, the code style for C in this project isn't 100% clear to me, so if there are any other style no-nos in my PR, please let me know.

Member commented:

Thanks for working on this! :)
If you change HASH_MAX_LOAD, do a full text search for it, there is another place depending on its value.

@rciorba (Contributor, Author) replied:

> If you change HASH_MAX_LOAD, do a full text search for it, there is another place depending on its value.

Had a look. All I can see is the comment next to the value and docs/internals.rst.
I'll update both once I identify a good value for this constant. Let me know if there are any places I've missed.

Member replied:

Search for 1.35 in the source.

@rciorba (Contributor, Author) commented Aug 2, 2016

Thanks for the feedback so far. I'll follow up once I get some performance numbers in as well.

        /* we have a collision */
        other_offset = distance(
            index, idx, hashindex_index(index, bucket_ptr));
        if ( other_offset < offset) {
Member commented:

nitpick: no blank after (

@ThomasWaldmann (Member) commented:

Really looking forward to the effects of this changeset. The paper sounded very promising about performance/efficiency.

@ThomasWaldmann (Member) commented:

Related: #536

@ThomasWaldmann (Member) commented:

@rciorba can you do a separate fixup commit with fixes for all the feedback you got?

@rciorba (Contributor, Author) commented Aug 4, 2016

Sure thing. I'll have some time later today.

@@ -111,7 +111,7 @@ hashindex_index(HashIndex *index, const void *key)
static int
hashindex_lookup(HashIndex *index, const void *key)
@rciorba (Contributor, Author) commented:

This function could also be optimized. Currently, for the worst-case scenario (not finding a key in the index), we scan buckets until we find an empty one. At high fill ratios this might get close to O(N). However, if we track the maximum offset in the entire hashmap, we could bail after at most max_offset iterations.

As the PR currently stands, we could just load an old hashmap and start operating on it with the new collision handling code, and it would just work; hashmaps created by this code would also still be usable by older borg versions. Changing hashindex_lookup, however, would require us to convert the hashmap explicitly and also change the HashHeader to track this max offset. That would be a bigger deal because it would impact backwards compatibility, so some planning needs to go into this.

One potential idea would be to use the MAGIC string in the header to also encode a version. For example, if we turn BORG_IDX into BORG_I plus 2 bytes for versioning, we could determine whether this version of the index is fully robin-hood compliant and, if not, convert it on load from disk.

@ThomasWaldmann I'd like to hear your thoughts on this.
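
For illustration, a hypothetical sketch of that magic-plus-version check (the field layout and names are illustrative, not the actual HashHeader):

#include <string.h>

/* Hypothetical sketch: shorten the magic to "BORG_I" and use the last
 * two bytes as a version number. Illustrative layout only. */
typedef struct {
    char magic[8];  /* "BORG_IDX" (legacy) or "BORG_I" + 2 version bytes */
    /* ... num_entries, num_buckets, key_size, value_size ... */
} SketchHeader;

static int header_version(const SketchHeader *hdr) {
    if (memcmp(hdr->magic, "BORG_IDX", 8) == 0)
        return 0;   /* legacy index: convert to robin-hood layout on load */
    /* checked second, since "BORG_IDX" also begins with "BORG_I" */
    if (memcmp(hdr->magic, "BORG_I", 6) == 0)
        return ((unsigned char)hdr->magic[6] << 8)
             | (unsigned char)hdr->magic[7];  /* versioned, robin-hood aware */
    return -1;      /* unknown format */
}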

@rciorba (Contributor, Author) commented Aug 4, 2016

> we could bail after at most max_offset iterations.

Actually, if the offset of the key we find in a bucket is smaller than the number of buckets we've looked at, we can bail. There's no way the next bucket will contain our desired key, since it would have been swapped on insert.
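
A minimal sketch of that early exit, reusing the Bucket type and displacement() helper from the earlier sketch (illustrative names, not the PR's actual hashindex_lookup):

static int rh_lookup(Bucket *tab, int num_buckets, unsigned key) {
    int idx = key % num_buckets;
    int offset = 0;                /* number of buckets we've looked at */
    while (tab[idx].used) {
        if (tab[idx].key == key)
            return idx;            /* hit */
        if (displacement(tab[idx].key % num_buckets, idx, num_buckets) < offset)
            return -1;             /* resident is closer to home than we would
                                      be, so our key can't be further along */
        idx = (idx + 1) % num_buckets;
        offset++;
    }
    return -1;                     /* empty bucket: key is absent */
}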

Member replied:

Scanning a big part of the hashindex sounds evil. I guess at a 95% fill rate we would always have to scan about 20% of all buckets.

Maybe that was the real reason for the perf breakdown I was sometimes seeing?

Maybe first keep the code compatible until it is accepted / merged, but we'll keep the idea for later.

@rciorba (Contributor, Author) replied:

Well, one way to keep it compatible and still speed up hashindex_lookup is to always reinsert all items when loading from disk (no more expensive than a resize). I'll do some measurements of performance with and without this implemented.
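
A sketch of that reinsert-on-load idea, reusing Bucket and rh_insert() from the earlier sketch (illustrative names):

#include <string.h>

/* Re-home every entry after loading a table from disk, so the in-memory
 * table satisfies the robin-hood invariant even if the on-disk copy was
 * written by an older borg. Cost is comparable to one resize pass. */
static void rh_rehome_all(const Bucket *old_tab, Bucket *new_tab, int num_buckets) {
    memset(new_tab, 0, num_buckets * sizeof(Bucket));
    for (int i = 0; i < num_buckets; i++)
        if (old_tab[i].used)
            rh_insert(new_tab, num_buckets, old_tab[i].key, old_tab[i].value);
}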

 * correctly track offset when stealing the location of existing bucket
   on collision handling
 * extract function to swap memory locations
 * avoid malloc/free on each call to hashindex_set
 * fixed many typos
@ThomasWaldmann modified the milestones: 1.1rc1, 1.1.0b2 Aug 27, 2016
@ThomasWaldmann (Member) commented Jan 28, 2017

@enkore the backshift deletion does not need tombstones, so it would solve that potential issue.

There are two cases of somewhat more aggressive HT use:

  • when borg decides that a file has not changed (mtime/inode/size same), it takes the chunk id list from the files cache and verifies that the chunks are all still present in the repo (by looking them up in the hashindex, which is assumed to be correct). So a potentially long list of hash ids gets looked up in a burst with no I/O. Since for many backups most files don't change, this is the usual case.
  • when the chunkindex gets resynced, hashtables get merged

@rciorba (Contributor, Author) commented Jan 30, 2017

@enkore The current HT implementation of hashindex_lookup, when it finds a key, moves said key over the first tombstone it encountered during the scan. This means the impact of tombstones is not that big.

A potential worst-case scenario: deleting a large number of keys would, for a while, make every get also perform a memory swap.
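
For readers following along, a minimal self-contained sketch of that move-over-the-first-tombstone behaviour (illustrative names, not the actual _hashindex.c code):

enum { EMPTY, USED, DELETED };  /* DELETED buckets are the tombstones */
typedef struct { int state; unsigned key; unsigned value; } TBucket;

static int ts_lookup(TBucket *tab, int num_buckets, unsigned key) {
    int idx = key % num_buckets;
    int first_deleted = -1;
    while (tab[idx].state != EMPTY) {
        if (tab[idx].state == DELETED) {
            if (first_deleted < 0)
                first_deleted = idx;       /* remember first tombstone on the path */
        } else if (tab[idx].key == key) {
            if (first_deleted >= 0) {      /* move the hit over the tombstone so
                                              future lookups take a shorter path */
                tab[first_deleted] = tab[idx];
                tab[idx].state = DELETED;  /* old slot becomes a tombstone, so
                                              the bucket chain stays unbroken */
                return first_deleted;
            }
            return idx;
        }
        idx = (idx + 1) % num_buckets;
    }
    return -1;
}

Note that the swap leaves a tombstone at the old slot, which is exactly the point @enkore raises below.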

Also, while thinking about an answer to your comment I realized my other PR did something stupid with tombstones. So thanks for your comment and here's a PR to fix that issue: #2116

@enkore (Contributor) commented Jan 30, 2017

> @enkore The current HT implementation of hashindex_lookup, when it finds a key, moves said key over the first tombstone it encountered during the scan. This means the impact of tombstones is not that big.

But doesn't that leave another tombstone in the place where the key was found (since further items with the same key could be located beyond the found item, the bucket chain can't be broken)?

@rciorba (Contributor, Author) commented Jan 30, 2017

@enkore You're right, that does leave another tombstone, but it's hard to reason about their actual impact. The only really bad case I can imagine is having the hash at a 75% fill rate but having the 25% free space be mostly tombstones. And I'm not sure that's likely to happen, since inserting keys would replace the tombstones with real values.

My reasoning about this:

  • say we're at 75% fill, no tombstones,
  • delete half the keys: 37.5% fill, 37.5% tombstones (performance is no worse than at a 75% fill rate, after the swaps have settled down),
  • insert back up all the way to 75%: X% tombstones, since the inserts will have replaced a bunch of them.

How about we add simple tracking of tombstones on an experimental 'telemetry' branch, so we get some actual numbers (basically find out what X is)?

Also, an eager delete could be implemented for the current master branch as well. I'll have a crack at it tonight/tomorrow if time allows.

@enkore (Contributor) commented Jan 30, 2017

Yes, my reasoning so far has always been in the direction of tombstones creating very long bucket chains that make look-ups slow, since that would correlate well with the "worst case" issues observed.

> How about we add simple tracking of tombstones on an experimental 'telemetry' branch, so we get some actual numbers (basically find out what X is)?

Sounds like a good idea. See also #1429 (comment) and #1429 (comment)

Perhaps that would be somewhat efficient at combating the supposed problem as well (resize / refill table if found_bucket - supposed_bucket > some n).

@ThomasWaldmann (Member) commented:

A rebase would be useful so this can be practically tested.

@pczarn commented Jan 31, 2017

Having tombstones in hash maps with Robin Hood hashing makes even less sense than in other hash maps.

I suggest a load factor between 80% and 90%. The difference between 90% and 93% may seem small, but having 7% free space instead of 10% is a big deal. A similar effect slows down hard drive file systems that are almost full. For a discussion about decreasing load factor, see rust-lang/rust#38003

You may want to remove the use of distance from hashindex_lookup, since that conditional break optimizes cases where the lookup fails, and you expect a hit rate of 100% in your application. Keep in mind that hashindex_lookup is used in hashindex_set. (In Rust's hashmap, we don't restart the search like in this PR. Whenever a new element is inserted, we continue at the offset from the failed initial lookup. However, I think the difference is tiny.)

Are keys influenced by foreign input? I'm not sure what the purpose of this implementation is.

    while(!BUCKET_IS_EMPTY(index, idx) && !BUCKET_IS_DELETED(index, idx)) {
        idx = (idx + 1) % index->num_buckets;
        /* we have a collision */
@pczarn commented:

This loop skips over some number of entries, then continues by displacing other entries. Our inserted entry will take the bucket of the first displaced entry. Instead, you can forward-shift all buckets from the first displaced entry to the end of the chunk using memmove. By increasing the displacement of all these buckets by 1, you keep the invariant of Robin Hood hashing, which relies on comparing displacements (other_offset < offset), not on their absolute values.

Tell me if something in my explanation is unclear. I'm using different names. My word for "distance" is "displacement".
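
A minimal sketch of the forward shift described here, reusing the Bucket type from the earlier sketch (assumes the occupied run does not wrap past the end of the table; wraparound handling is omitted for brevity):

#include <string.h>

/* Shift the whole occupied run forward by one bucket with memmove, then
 * drop the new entry into the freed slot. Every displaced entry's
 * displacement grows by exactly 1, preserving the robin-hood invariant.
 * Illustrative names only; idx is the bucket of the first displaced
 * entry, found by the usual probe. */
static void rh_insert_shift(Bucket *tab, int idx, unsigned key, unsigned value) {
    int end = idx;
    while (tab[end].used)
        end++;                     /* find the empty bucket ending the run */
    memmove(&tab[idx + 1], &tab[idx], (end - idx) * sizeof(Bucket));
    /* tab[idx] still reads as used; overwrite it with the new entry */
    tab[idx].used = 1;
    tab[idx].key = key;
    tab[idx].value = value;
}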

@rciorba (Contributor, Author) replied:

So basically take the block (the entire contiguous section of buckets up to the first empty bucket) and memmove it forward one address, then insert at the ideal location. Oh my, yes, such an elegant solution! Thanks!

> not on their absolute values.

Not sure what you mean by absolute value. The offsets are the relative distance between the ideal bucket a key would be in and the bucket it is in now, so I think it's the same as what you call "displacement".

> Tell me if something in my explanation is unclear. I'm using different names. My word for "distance" is "displacement".

Indeed, displacement is a better name for it.

Thanks for the feedback!

@rciorba (Contributor, Author) commented Jan 31, 2017

Thanks for the feedback!

> Having tombstones in hash maps with Robin Hood hashing makes even less sense than in other hash maps.

There are no tombstones in this implementation. This PR also became a place to discuss the problems of the current implementation, hence the discussion about them.

> I suggest a load factor between 80% and 90%. The difference between 90% and 93% may seem small, but having 7% free space instead of 10% is a big deal. A similar effect slows down hard drives. For a discussion about decreasing load factor, see rust-lang/rust#38003

Thanks. I'll check those out. I did benchmark at different load levels, so the performance impact should be visible there as well.

https://cdn.rawgit.com/rciorba/e3eded290e91a361f74eefdbc84416bb/raw/076524eb4a68697deafb7c1217113e65b526ebff/results.html

> You may want to remove the use of distance from hashindex_lookup, since that conditional break optimizes cases where the lookup fails, and you expect a hit rate of 100% in your application.

I'll benchmark without the distance check as well, but the periodic run of distance (every 16 or every 8 buckets) seems to have a small enough impact to be worth it (rh8 and rh16 in the linked graphs).

> Keep in mind that hashindex_lookup is used in hashindex_set. (In Rust's hashmap, we don't restart the search like in this PR. Whenever a new element is inserted, we continue at the offset from the failed initial lookup. However, I think the difference is tiny.)

This PR adds a pointer param so we can "return" not just the found index but also the address at which to start the search in case of a miss, so in that respect it should behave the same as Rust's hashmap.
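
A hypothetical sketch of that shape, reusing Bucket and displacement() from the earlier sketches (not the PR's exact signature or code):

/* Lookup that also reports, via *start_idx, where an insert should resume
 * probing after a miss, so the set operation need not rescan from the
 * ideal bucket. Illustrative names only. */
static int rh_lookup2(Bucket *tab, int num_buckets, unsigned key, int *start_idx) {
    int idx = key % num_buckets;
    int offset = 0;
    while (tab[idx].used) {
        if (tab[idx].key == key)
            return idx;            /* hit */
        if (displacement(tab[idx].key % num_buckets, idx, num_buckets) < offset)
            break;                 /* miss: a "richer" resident means our key
                                      cannot appear further along the chain */
        idx = (idx + 1) % num_buckets;
        offset++;
    }
    *start_idx = idx;              /* insert can start probing here */
    return -1;
}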

> Are keys influenced by foreign input?

I don't understand your question, could you please elaborate?

> I'm not sure what the purpose of this implementation is.

I believe the end goal was to use less memory by allowing larger fill rates. Implementing RH was mentioned at a EuroPython sprint as something for volunteers to pick up, and it sounded interesting, but this is my first contact with the code base. Since I don't know the project that well, I'm not sure what the usage patterns of the hashmap are. I was hoping the maintainers of borg could help guide me with that, hence the multiple variations of the implementation being benchmarked to see what the tradeoffs are for different use cases (like the distance short-cutting in lookup and its impact on getitem with no misses versus getitem with misses).

@enkore (Contributor) commented Jan 31, 2017

> Are keys influenced by foreign input? I'm not sure what the purpose of this implementation is.

Keys are the first few bits of a (usually keyed) cryptographic hash of input data. In the unkeyed case keys can thus be influenced to some extent; in the keyed case (used whenever the data is stored encrypted) this isn't possible at all.

I'd say that for all intents and purposes the keys can be seen as uniformly distributed values across the range.

@rciorba (Contributor, Author) commented Feb 1, 2017

@enkore That makes sense. Thank you!

@rciorba (Contributor, Author) commented Feb 27, 2017

Closing this PR and starting a fresh one since this one has grown to massive proportions and many of the intermediate changesets are no longer relevant to the final form.

@rciorba closed this Feb 27, 2017
@rciorba mentioned this pull request Feb 27, 2017