ARROW-2653: [C++] Refactor hash table support #3005
Conversation
Dict building benchmark (gcc 7.3.0, AMD Ryzen 7):
Bottom line: the new code is around 10% to 40% faster.
Compute benchmark:
Bottom line: hash kernels are 2x faster on integers and competitive on strings.
For the record, the performance of the underlying hash functions:
The integer benchmark uses 64-bit ints, so this means we're hashing 1.5 billion 64-bit ints per second.
Force-pushed from 74e86cc to 3d46455
I would like to refactor/simplify hardware CRC32 support but there's some intriguing comment in |
Also I don't understand why ARROW_USE_SSE is off by default, but we still do macro-based switching in |
@pitrou this is great! I will review in more detail today and also respond to your comments. Indeed, there is some cruft from some of the Apache Impala code that was originally imported into parquet-cpp. I have some hashing work I want to do in some places (e.g. deduplicating strings when converting to pandas), so this should put the hashing code on a more stable footing.
cc @timarmstrong if you can comment on what you've seen as far as CPUs without SSE4.2 support. We need to support arm64, so we can't hard-code SSE4.2 instructions, but we should optimistically use them if they are available.
Ideally I would like to enable SSE4.2 at compile time on all x86-64 targets (so just using preprocessor switches based on compiler-enabled macros, without either ARROW_USE_SSE or cpuid-based runtime switching). Also, there are intrinsic functions that can avoid hand-written inline asm.
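That compile-time approach can be sketched as follows (an illustrative sketch, not the PR's actual code): use the hardware CRC32C intrinsic when the compiler defines `__SSE4_2__`, and fall back to a portable bitwise loop otherwise. Both paths compute the same CRC32C (Castagnoli), so results are identical either way.

```cpp
#include <cstdint>

#if defined(__SSE4_2__)
// The compiler was invoked with SSE4.2 enabled (e.g. -msse4.2), so the
// intrinsic is available unconditionally -- no runtime CPUID check needed.
#include <nmmintrin.h>

inline uint32_t Crc32U64(uint32_t crc, uint64_t data) {
  return static_cast<uint32_t>(_mm_crc32_u64(crc, data));
}
#else
// Portable bit-by-bit fallback computing the same CRC32C (reflected
// polynomial 0x82F63B78), just much slower than the instruction.
inline uint32_t Crc32U64(uint32_t crc, uint64_t data) {
  for (int i = 0; i < 8; ++i) {
    crc ^= static_cast<uint32_t>(data >> (i * 8)) & 0xFFu;
    for (int j = 0; j < 8; ++j) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return crc;
}
#endif
```

Because `__SSE4_2__` is only defined when the target actually supports the instructions, no ARROW_USE_SSE-style configuration flag and no hand-written inline asm are needed.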
Got it. I would be ok with adding |
Ok, I enabled SSE4.2 unconditionally on x86-64. The non-small-string benchmarks show a small improvement (~10%). Will update the numbers above.
Again a tedious compiler error:
Should I just remove HashException? It doesn't seem used.
Yes, go ahead.
Force-pushed from e192ae4 to a4fbdc2
As an aside, at some point we might investigate if we should put a Bloom filter in front of some of our hash tables to minimize the amount of probing that's necessary. We do have a Bloom filter that came along with the parquet-cpp merge, but it has virtual calls in the hot path which wouldn't be ideal for this https://github.com/apache/arrow/blob/master/cpp/src/parquet/bloom_filter.h#L34. NB: Bloom filters would probably only be useful in certain algorithms, like vector set membership queries ( |
Yeah, only quite old CPUs would be missing SSE4.2. We do occasionally hear from people running on such systems, often AMD processors because those got support later. I don't think these are production systems for the most part, but rather "I'm trying out Impala on some old servers we had lying around". The convoluted code with #defines and inline assembly came out of a requirement to build 3 kinds of artifacts:
The inline assembly was a workaround for the fact that you can't use the intrinsics without enabling the -m flag. The problem with setting the -m flag is that the compiler can emit instructions in other places where they're not guarded by the runtime checks. Anyway, that's the story. Not sure the solution chosen was ideal, but it's not total madness. If you can change to compile-time dispatching to the different implementations, which it looks like this PR is doing, then you avoid a lot of the complication.
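The runtime-dispatch alternative described above (select an implementation once at startup via a CPUID check, so a single binary runs everywhere) can be sketched like this. This is a hedged sketch using the GCC/Clang builtin `__builtin_cpu_supports`; the hardware branch is left as a comment, since a real hardware path would live in a separate translation unit (or target-attribute function) compiled with SSE4.2 enabled:

```cpp
#include <cstdint>

using HashFn = uint32_t (*)(uint64_t);

// Portable implementation, always safe to execute on any CPU.
static uint32_t HashPortable(uint64_t x) {
  // Murmur-style finalizer as a stand-in hash function.
  x ^= x >> 33;
  x *= 0xff51afd7ed558ccdULL;
  x ^= x >> 33;
  return static_cast<uint32_t>(x);
}

// Chosen once at startup; every call thereafter is one indirect call.
static HashFn SelectHashImpl() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  if (__builtin_cpu_supports("sse4.2")) {
    // return HashCrc32Sse42;  // hypothetical SSE4.2 version, built separately
  }
#endif
  return HashPortable;
}

static const HashFn kHash = SelectHashImpl();
```

The downside, as noted, is the per-call indirection and the care needed to keep unguarded SSE4.2 instructions out of the portable code paths; compile-time dispatching avoids both.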
Codecov Report
@@ Coverage Diff @@
## master #3005 +/- ##
==========================================
- Coverage 86.71% 86.71% -0.01%
==========================================
Files 493 493
Lines 69891 69890 -1
==========================================
- Hits 60607 60606 -1
Misses 9188 9188
Partials 96 96
Continue to review full report at Codecov.
I've added a "dual CRC" hashing scheme that does two independent CRC computations in parallel and returns a 64-bit result. I expect it to be quite a bit faster on Intel CPUs which are able to do several CRCs at once (my AMD CPU doesn't).
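A sketch of the dual-CRC idea (illustrative only, not the exact PR code): two CRC32C streams with different seeds consume the same input words, so a superscalar CPU can overlap the two CRC instructions, and the two 32-bit results are concatenated into one 64-bit hash. A portable CRC step stands in for the hardware instruction here:

```cpp
#include <cstdint>
#include <cstring>

// Portable CRC32C step; on SSE4.2 hardware this would be _mm_crc32_u64.
static uint32_t Crc32cStep(uint32_t crc, uint64_t data) {
  for (int i = 0; i < 8; ++i) {
    crc ^= static_cast<uint32_t>(data >> (i * 8)) & 0xFFu;
    for (int j = 0; j < 8; ++j) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return crc;
}

// Two independent CRC streams over the same data. With hardware CRC, the two
// updates per iteration have no data dependency on each other, so they can
// execute in parallel on CPUs with enough CRC throughput.
static uint64_t DualCrcHash(const void* data, size_t length, uint64_t seed) {
  uint32_t h1 = static_cast<uint32_t>(seed);
  uint32_t h2 = static_cast<uint32_t>(seed >> 32) ^ 0x9E3779B9u;  // decorrelate
  const uint8_t* p = static_cast<const uint8_t*>(data);
  while (length >= 8) {
    uint64_t word;
    std::memcpy(&word, p, 8);
    h1 = Crc32cStep(h1, word);
    h2 = Crc32cStep(h2, word);
    p += 8;
    length -= 8;
  }
  if (length > 0) {  // zero-padded tail
    uint64_t word = 0;
    std::memcpy(&word, p, length);
    h1 = Crc32cStep(h1, word);
    h2 = Crc32cStep(h2, word);
  }
  return (static_cast<uint64_t>(h1) << 32) | h2;
}
```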
This is a very minimal initial review; I'm still looking at the whole implementation and just wanted to point out that I'm on it.
Status Append(const char* value) {
  return Append(reinterpret_cast<const uint8_t*>(value));
}

Status Append(util::string_view view) {
#ifndef NDEBUG
If you make CheckValueSize inline, defined in the header, and move the #ifdef into the body of the function, you will not need to use the macro at every callsite. The compiler should elide the call to the empty function.
Unfortunately DCHECK macros are forbidden in header files.
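The pattern under discussion can be sketched like this (names are illustrative, and plain assert stands in for DCHECK, which can't live in the header): the #ifndef moves into the body of an inline function, so callers need no guard, and release builds compile the call down to nothing.

```cpp
#include <cassert>
#include <cstdint>

// Inline check defined in the header. In release builds (NDEBUG) the body is
// empty and the compiler elides the call, so there is no #ifndef at callsites.
inline void CheckValueSize(int64_t size) {
#ifndef NDEBUG
  assert(size >= 0);  // placeholder invariant; the real bound is an assumption
#else
  (void)size;  // no-op in release builds
#endif
}
```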
Note that I may still try to refactor a couple of things (hence the "WIP" in the issue title), but the algorithms and general philosophy will remain the same.
I'll wait for your final refactor.
With regards to the Bloom filter, I suspect it won't provide benefits in this specific case, since we're acting in a GetOrInsert mode: in both cases (positive or negative) we're still going to hit the cacheline(s) affected by the key, either to retrieve the index or to insert the new mapping. In selection/filtering/join mode, it's a different story (assuming we keep a Bloom filter per array).
Refactor in progress. I may change a bit more later.
@pitrou I saw that you toyed with Abseil previously; did you manage to make it work? It'd be interesting to compare with SwissTable.
(Note that Abseil might also be heavily useful in the future when we'll have to deal with civil time and other time/calendar-related computations.)
The problem with Abseil is that we can't use it until the Python packaging ecosystem moves to something newer than the manylinux1 spec. Unfortunately, the manylinux2010 spec is not fully implemented.
As for dates and times, PR #2952 vendors a |
I still think it would be OK to break with manylinux1 and use CentOS 6 as the base build image for wheels (as many other projects have already done, so we wouldn't be the first). A couple of people have already privately registered criticisms with me about why we are stressing about CentOS 5 compatibility when we are getting no financial support from organizations that need it. I would rather hear from the people for whom RHEL5/CentOS5 is a requirement.
Moving this (manylinux) discussion into #2501.
I'm gonna call this PR done. Any remaining refactor will be deferred to a later PR. I still need to fix the CI linting failure, though :-)
Force-pushed from 67ddf62 to 26159e9
Working on this review today. Needs a rebase, I guess, after d06d0d0.
Rebased.
This looks great. I left a number of comments / questions.
I would be satisfied leaving things to follow-up work. The memory-doubling issue with the memo tables is probably the most significant thing. This puts us in a much cleaner and more sustainable place.
Let me know if you want to make any further changes, and we can get this merged today.
hash_table_load_threshold_ =
    static_cast<int64_t>(static_cast<double>(capacity) * kMaxHashTableLoad);
// Initialize hash table
// XXX should we let the user pass additional size heuristics?
Yes, but we can look into this later.
if (array.IsNull(i)) {
  RETURN_NOT_OK(AppendNull());
} else {
  RETURN_NOT_OK(Append(typed_array.GetValue(i)));
Could add a Reserve call and use Unsafe* methods.
Will look at this.
Ok, I don't think I will bother with this now, as it would require adding some methods to the builder and array classes. This is as in the old code anyway.
// ScalarHelper specialization for reals

static bool CompareScalars(Scalar u, Scalar v) {
  if (std::isnan(u)) {
This could be unbranched, though probably not worth worrying about
Yes, I added the specialization for floats, but doing associative lookups on floats is not very common anyway.
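The float specialization being discussed can be sketched as follows (illustrative, mirroring the snippet above): NaN never compares equal to itself under ==, so a memo table needs an explicit NaN check to avoid inserting a fresh entry for every NaN key.

```cpp
#include <cmath>

// All NaNs are treated as equal, so repeated NaN keys map to one memo entry.
template <typename Scalar>
static bool CompareScalars(Scalar u, Scalar v) {
  if (std::isnan(u)) {
    // Any NaN matches any other NaN (unlike the == operator).
    return std::isnan(v);
  }
  return u == v;
}
```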
explicit ScalarMemoTable(int64_t entries = 0)
    : hash_table_(static_cast<uint64_t>(entries)) {}

int32_t Get(const Scalar value) const {
const Scalar&?
Well, these are always C scalars.
Ok wasn't sure if string_view or another type would appear here
Note that string_view is designed to be a very small struct, so it probably gets passed in a couple of registers.
using HashTableEntry = typename HashTable<Payload>::Entry;
HashTableType hash_table_;
std::vector<int32_t> offsets_;
std::string values_;
Hmm: values_ could grow large; additionally, in a number of places in the codebase, we will want the result of building the hash table to be "released" into an arrow::BinaryArray with zero copy. Can this be changed to use arrow::BufferBuilder (https://github.com/apache/arrow/blob/master/cpp/src/arrow/buffer.h#L372)? You'd perhaps want to track the reserved size so that you can call UnsafeAppend and skip Status checks (though it may not matter; benchmarks will settle the case).
I'll try to.
Hmm... the problem is that the hash kernel, as I understand it, is supposed to be reusable. But if we detach the current buffer, the hash table can't be fed anymore (we can't grow it anymore).
Let's leave this for follow-up work for now. I'll take a closer look.
Note that reuse of a hash table is already happening, actually, when building delta dictionaries... We could freeze the buffer when exported and follow up in a new buffer.
RETURN_NOT_OK(
    AllocateBuffer(pool, TypeTraits<T>::bytes_required(dict_length), &dict_buffer));
memo_table.CopyValues(static_cast<int32_t>(start_offset),
                      reinterpret_cast<c_type*>(dict_buffer->mutable_data()));
See my comments about this in BinaryMemoTable. I think we should try to avoid memory doubling here -- the memory-doubling issue is more significant than the copying performance (which, as you point out, is not significant relative to the cost of building the table). It is true that in many cases the dictionary is small, but that won't stop people from creating multi-gigabyte dictionaries (and they do).
DCHECK_EQ(raw_offsets[0], 0);
RETURN_NOT_OK(AllocateBuffer(pool, raw_offsets[dict_length], &dict_data));
memo_table.CopyValues(static_cast<int32_t>(start_offset), dict_data->size(),
                      dict_data->mutable_data());
As above. It would be nice to be able to call
... memo_table.DetachOffsets()
... memo_table.DetachValues()
1. Get rid of all macros and sprinkled-out hash table handling code.
2. Improve performance through more careful selection of hash functions (and a better collision resolution strategy). Integer hashing benefits from a very fast specialization. Small string hashing benefits from a fast specialization with fewer branches and less computation. Generic string hashing falls back on Murmur2-64, which probably has sufficient performance given the typical distribution of string key lengths.
3. Add some tests and benchmarks.
As far as I'm concerned, this PR is ready :-)
Get rid of all macros and sprinkled out hash table handling code
Improve performance by more careful selection of hash functions
(and better collision resolution strategy)
Integer hashing benefits from a very fast specialization.
Small string hashing benefits from a fast specialization with fewer branches
and less computation.
Generic string hashing falls back on hardware CRC32 or Murmur2-64, which probably has
sufficient performance given the typical distribution of string key lengths.