hashtable local performance breakdowns #536

Closed
ThomasWaldmann opened this issue Jan 9, 2016 · 14 comments
@ThomasWaldmann
Member

seems like some circumstance leads to local performance breakdowns depending on the amount N (and kind?) of input and the hashtable size T, while performance is usually good otherwise.

e.g.:

T = 262147 (initial size, grows when load>0.75)
H1.merge(H2) times:
N=100000 16ms
N=200000 108ms
N=240000 18385ms

Having a smaller max load factor seems to help a bit, but it only appears to make the problem less likely / less intense(?), it does not make it go away. Also, we should not go too low with the max load factor, to avoid wasting memory / running into out-of-memory conditions.

I am wondering what causes this and what we can do about it.

The hash function takes the upper 32 bits of the sha256 and computes it modulo T to get the bucket index. T was even a prime in one of my experiments.

Source: borg/_hashindex.c
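
For illustration, a minimal sketch of that bucket-index computation (names and byte handling are assumptions for the sketch; the real code is in borg/_hashindex.c):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: take the first 4 bytes of the sha256 key (its upper 32 bits,
 * byte order left aside) and reduce modulo the number of buckets T.
 * Hypothetical names, not the actual _hashindex.c code. */
static uint32_t bucket_index(const unsigned char *sha256_key, uint32_t num_buckets)
{
    uint32_t upper;
    memcpy(&upper, sha256_key, sizeof(upper));
    return upper % num_buckets;    /* T = num_buckets, e.g. 262147 */
}
```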

@ThomasWaldmann
Member Author

http://planetmath.org/goodhashtableprimes is interesting, esp. property 3:

1. each number in the list is prime
2. each number is slightly less than twice the size of the previous
3. each number is as far as possible from the nearest two powers of two
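
For illustration, growing could step through a precomputed table of such primes. A sketch, using only the first few entries of the list from that page (not how _hashindex.c currently sizes its table):

```c
#include <stddef.h>
#include <stdint.h>

/* First few entries of the "good primes" list from the linked page;
 * the full table there continues much further. Sketch only. */
static const uint32_t good_primes[] = { 53, 97, 193, 389, 769, /* ... */ };

static uint32_t next_table_size(uint32_t current)
{
    for (size_t i = 0; i < sizeof(good_primes) / sizeof(good_primes[0]); i++)
        if (good_primes[i] > current)
            return good_primes[i];
    return current * 2 + 1;    /* fallback once the sketch table is exhausted */
}
```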

@ThomasWaldmann
Member Author

https://github.com/fragglet/c-algorithms includes src/hash-table.c/h, which implements separate chaining with linked lists (a minimal sketch of that layout follows below).

good:

  • little memory needed for the hashtable itself (N * pointersize), so we can afford to grow early (the author grows the table at load > 0.33!)

bad:

  • lots of malloc(), lots of tiny objects
  • harder/slower to (un)serialize for disk storage
  • no easy mmap approach (in case we ever want to retry that)
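
A minimal sketch of such a separately-chained layout, mainly to illustrate the pointer-per-bucket table cost and the per-entry malloc (hypothetical names and layout; not the actual c-algorithms API):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: the table itself is just N pointers, but every entry is a
 * separately malloc'd node. */
typedef struct chain_node {
    unsigned char key[32];         /* e.g. a sha256 chunk id */
    uint32_t value;
    struct chain_node *next;
} chain_node;

typedef struct {
    chain_node **buckets;          /* N * pointersize for the table itself */
    uint32_t num_buckets;
} chain_table;

static int chain_insert(chain_table *t, const unsigned char *key, uint32_t value)
{
    uint32_t idx;
    memcpy(&idx, key, sizeof(idx));
    idx %= t->num_buckets;

    chain_node *node = malloc(sizeof(*node));    /* one malloc per entry */
    if (!node)
        return -1;
    memcpy(node->key, key, sizeof(node->key));
    node->value = value;
    node->next = t->buckets[idx];                /* prepend to the chain */
    t->buckets[idx] = node;
    return 0;
}
```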

@ThomasWaldmann
Member Author

currently, it seems to me that all the "magic" rules and mechanisms are just there to cope with:

  • bad hash functions that don't distribute well (does not apply, as we have a cryptographic hash to start from)
  • "open addressing" collision handling using linear probing with an increment > 1 (does not apply, we use 1; see the probing sketch below)

So the options for us seem to be:

  • expose load factors, so people with lots of memory can use a low load factor for high speed, and people with tight memory can use a high one and sacrifice speed. The reasonable max load factor range seems to be 0.3 .. 0.8.
  • use a different hashtable implementation with separate chaining
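
For reference, a minimal sketch of open-addressing lookup with linear probing (increment 1) and tombstones, as described above; the names and entry layout are assumptions, not the actual _hashindex.c code:

```c
#include <stdint.h>
#include <string.h>

#define KEY_SIZE 32

typedef struct {
    unsigned char key[KEY_SIZE];
    uint32_t value;
    uint8_t state;                 /* 0 = empty, 1 = used, 2 = deleted (tombstone) */
} bucket;

/* Sketch: probe linearly with increment 1, wrapping around; stop at an
 * empty bucket (key not present) or at a matching used bucket. */
static bucket *probe_lookup(bucket *table, uint32_t num_buckets,
                            const unsigned char *key)
{
    uint32_t start;
    memcpy(&start, key, sizeof(start));
    uint32_t idx = start % num_buckets;

    for (uint32_t i = 0; i < num_buckets; i++) {
        bucket *b = &table[(idx + i) % num_buckets];
        if (b->state == 0)
            return NULL;                                /* hit empty: not present */
        if (b->state == 1 && memcmp(b->key, key, KEY_SIZE) == 0)
            return b;                                   /* found */
        /* tombstone or key mismatch: keep probing */
    }
    return NULL;                                        /* no empty bucket found */
}
```

Note that a lookup only stops at an empty bucket or a match, never at a tombstone, which is why long runs of used/deleted buckets translate directly into long probe lengths.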

@infectormp
Contributor

This may be off topic, but maybe we can use jemalloc to improve performance?

@ThomasWaldmann
Member Author

I didn't try jemalloc yet, but the _hashindex.c code only calls malloc/calloc when the hashtable is at the max load factor and needs to increase its size. So it is relatively infrequent (i.e. not like calling malloc per hashtable entry).

@ThomasWaldmann
Member Author

See #1429 for a promising approach.

@ThomasWaldmann ThomasWaldmann added this to the 1.1rc1 milestone Aug 15, 2016
@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1.0b3, 1.1rc1 Sep 16, 2016
@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1 - near future goals, 1.1.0b3 Nov 28, 2016
@ThomasWaldmann
Member Author

I moved this to a later milestone, so it does not block beta3.
If we can't get this in by the rc phase, it'll have to wait until 1.2.

@enkore
Contributor

enkore commented Mar 2, 2017

I wrote some instrumentation code and ran a couple of operations; even in trivial cases with small tables I see probe lengths of 40-100. For a larger table (around a million chunks) with a resize trigger of 32 probed slots, I see eight resizes and a maximum probe length of 341 (which means that for a single lookup almost 16 KB of memory was scanned), and that was for a simple create op that only inserted a couple of chunks.

These don't always go away when refilling the table (hence deciding "Oh, this probe length was long! Let's rebuild the table!" does not work: it rebuilds the table CONSTANTLY). Only changing the table size slightly (num_buckets + 1) seems to avoid this.
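
For context, instrumentation along these lines might look roughly like the sketch below (hypothetical names and entry layout, matching the earlier probing sketch; not the code that was actually run):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define KEY_SIZE 32

typedef struct {
    unsigned char key[KEY_SIZE];
    uint32_t value;
    uint8_t state;                 /* 0 = empty, 1 = used, 2 = tombstone */
} bucket;

static uint32_t max_probe_len = 0;    /* global high-water mark */

/* Sketch: same linear probing as above, but count how many buckets a
 * single lookup touches and report a new maximum. */
static bucket *probe_lookup_instrumented(bucket *table, uint32_t num_buckets,
                                         const unsigned char *key)
{
    uint32_t start, probes = 0;
    memcpy(&start, key, sizeof(start));
    uint32_t idx = start % num_buckets;

    bucket *result = NULL;
    for (uint32_t i = 0; i < num_buckets; i++) {
        bucket *b = &table[(idx + i) % num_buckets];
        probes++;
        if (b->state == 0)
            break;                               /* empty bucket ends the probe */
        if (b->state == 1 && memcmp(b->key, key, KEY_SIZE) == 0) {
            result = b;
            break;
        }
    }
    if (probes > max_probe_len) {
        max_probe_len = probes;
        fprintf(stderr, "new max probe length: %u\n", probes);
    }
    return result;
}
```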

@ThomasWaldmann
Member Author

@enkore you measured the current hashtable implementation from master? It would be interesting to do the same for robin hood hashing. Somewhat strange: @rciorba's last benchmarks didn't show a big advantage for "get" ops, but in theory RH should not have such long probe lengths.
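
For comparison, the core idea of Robin Hood hashing is that an entry being inserted steals the slot of a resident entry that sits closer to its own home bucket, which bounds the variance of probe lengths. A minimal sketch of that insertion rule, reusing the bucket layout from the earlier probing sketch (hypothetical names; not @rciorba's implementation):

```c
#include <stdint.h>
#include <string.h>

#define KEY_SIZE 32

typedef struct {
    unsigned char key[KEY_SIZE];
    uint32_t value;
    uint8_t state;                 /* 0 = empty, 1 = used, 2 = tombstone */
} bucket;

/* Sketch of Robin Hood insertion: if the entry being inserted has probed
 * further from its home slot than the entry currently occupying the
 * bucket, they swap and the displaced entry continues probing.
 * Assumes the table is not full and the key is not already present. */
static void rh_insert(bucket *table, uint32_t num_buckets,
                      const unsigned char *key, uint32_t value)
{
    unsigned char cur_key[KEY_SIZE];
    uint32_t cur_value = value;
    memcpy(cur_key, key, KEY_SIZE);

    uint32_t home;
    memcpy(&home, cur_key, sizeof(home));
    uint32_t idx = home % num_buckets;
    uint32_t dist = 0;                          /* how far we have probed */

    for (;;) {
        bucket *b = &table[idx];
        if (b->state != 1) {                    /* empty or tombstone: take it */
            memcpy(b->key, cur_key, KEY_SIZE);
            b->value = cur_value;
            b->state = 1;
            return;
        }
        /* distance of the resident entry from its own home slot */
        uint32_t b_home;
        memcpy(&b_home, b->key, sizeof(b_home));
        uint32_t b_dist = (idx + num_buckets - b_home % num_buckets) % num_buckets;
        if (b_dist < dist) {                    /* resident is "richer": swap */
            unsigned char tmp_key[KEY_SIZE];
            uint32_t tmp_value = b->value;
            memcpy(tmp_key, b->key, KEY_SIZE);
            memcpy(b->key, cur_key, KEY_SIZE);
            b->value = cur_value;
            memcpy(cur_key, tmp_key, KEY_SIZE);
            cur_value = tmp_value;
            dist = b_dist;
        }
        idx = (idx + 1) % num_buckets;
        dist++;
    }
}
```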

@rciorba
Contributor

rciorba commented Mar 3, 2017

@ThomasWaldmann One mitigating factor is that test_chunk_indexer_c_getitem never has to deal with deletions, therefore no tombstones are touched in that benchmark. Hence the smaller benefit. If you look at the setitem_after_deletion benchmark (which deletes 1/5 of the keys and then benchmarks updating the values of all remaining keys) you'll see a big improvement for RH.

@ThomasWaldmann
Member Author

btw, one major performance issue related to the hashtable filling up (completely) with tombstones (deleted entries) was recently fixed: a new constant MAX_EFF_LOAD 0.93 was introduced (it considers used + deleted entries, not just used entries like MAX_LOAD 0.75 does).

maybe this fixes some of the hashtable and merge performance issues seen in practice.
while merging H2 into H1 doesn't copy over tombstones from H2 (thus eradicating them in the result), if we start from an existing hashtable for H1, the tombstones of H1 were still there.

the above-mentioned fix will rebuild the hashtable once MAX_EFF_LOAD is exceeded (see the sketch below).
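
A minimal sketch of that rebuild trigger, assuming hypothetical struct and field names (only the two constants come from the fix described above):

```c
#include <stdint.h>

#define MAX_LOAD     0.75   /* fraction of used entries that triggers growing */
#define MAX_EFF_LOAD 0.93   /* fraction of used + deleted entries that triggers a rebuild */

typedef struct {
    uint32_t num_buckets;
    uint32_t num_used;      /* live entries */
    uint32_t num_deleted;   /* tombstones */
} table_stats;

typedef enum { KEEP, GROW, REBUILD } resize_action;

/* Sketch: decide after an insert/delete whether to grow the table or to
 * rebuild it at the same size just to drop the tombstones. */
static resize_action check_load(const table_stats *t)
{
    double used = (double)t->num_used / t->num_buckets;
    double eff  = (double)(t->num_used + t->num_deleted) / t->num_buckets;

    if (used > MAX_LOAD)
        return GROW;        /* bigger table; rehashing drops tombstones too */
    if (eff > MAX_EFF_LOAD)
        return REBUILD;     /* same size, just get rid of tombstones */
    return KEEP;
}
```

Rebuilding at the same size is enough here because rehashing never copies tombstones, so probes terminate at empty buckets again afterwards.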

@ThomasWaldmann
Member Author

I am closing this for now. As mentioned in the previous comment, one major issue (that may have been the cause of these sudden perf breakdowns) was fixed.

If something else shows up, we'll open a fresh ticket.
