
ARROW-2653: [C++] Refactor hash table support #3005

Closed

Conversation


@pitrou pitrou commented Nov 20, 2018

  1. Get rid of all macros and scattered hash table handling code

  2. Improve performance by more careful selection of hash functions
    (and better collision resolution strategy)

Integer hashing benefits from a very fast specialization.
Small string hashing benefits from a fast specialization with fewer branches
and less computation.
Generic string hashing falls back on hardware CRC32 or Murmur2-64, which probably
has sufficient performance given the typical distribution of string key lengths.

  3. Add some tests and benchmarks
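To make the integer case concrete, here is a minimal sketch (illustrative, not Arrow's actual code; the function name and constant are made up) of the kind of very fast integer specialization described above, using classic multiplicative ("Fibonacci") hashing: a single multiply by a large odd constant mixes all input bits into the high output bits.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch, not Arrow's implementation: a fast integer hash
// specialization. One multiply by 2^64 / golden ratio (an odd constant)
// spreads the input bits into the high bits of the result, which are the
// bits a power-of-two-sized table uses for bucket selection.
inline uint64_t HashInt64(uint64_t x) {
  const uint64_t kMult = 0x9e3779b97f4a7c15ULL;
  return x * kMult;
}
```

A table with 2^k buckets would then take the top k bits, `HashInt64(x) >> (64 - k)`, as the bucket index, so even consecutive keys land in well-separated buckets.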


pitrou commented Nov 20, 2018

Dict building benchmark (gcc 7.3.0, AMD Ryzen 7):

  • before:
[...]
-----------------------------------------------------------------------------------------------
Benchmark                                                        Time           CPU Iterations
-----------------------------------------------------------------------------------------------
BM_BuildInt64DictionaryArrayRandom/repeats:2                   185 us        185 us       3701   411.539MB/s
BM_BuildInt64DictionaryArrayRandom/repeats:2                   187 us        187 us       3701   407.759MB/s
BM_BuildInt64DictionaryArrayRandom/repeats:2_mean              186 us        186 us       3701   409.649MB/s
BM_BuildInt64DictionaryArrayRandom/repeats:2_median            186 us        186 us       3701   409.649MB/s
BM_BuildInt64DictionaryArrayRandom/repeats:2_stddev              1 us          1 us       3701   2.67294MB/s
BM_BuildInt64DictionaryArraySequential/repeats:2               187 us        187 us       3696   408.102MB/s
BM_BuildInt64DictionaryArraySequential/repeats:2               183 us        183 us       3696   416.262MB/s
BM_BuildInt64DictionaryArraySequential/repeats:2_mean          185 us        185 us       3696   412.182MB/s
BM_BuildInt64DictionaryArraySequential/repeats:2_median        185 us        185 us       3696   412.182MB/s
BM_BuildInt64DictionaryArraySequential/repeats:2_stddev          3 us          3 us       3696   5.77062MB/s
BM_BuildInt64DictionaryArraySimilar/repeats:2                  194 us        194 us       3617   394.101MB/s
BM_BuildInt64DictionaryArraySimilar/repeats:2                  193 us        193 us       3617   394.449MB/s
BM_BuildInt64DictionaryArraySimilar/repeats:2_mean             194 us        194 us       3617   394.275MB/s
BM_BuildInt64DictionaryArraySimilar/repeats:2_median           194 us        194 us       3617   394.275MB/s
BM_BuildInt64DictionaryArraySimilar/repeats:2_stddev             0 us          0 us       3617   251.506kB/s
BM_BuildStringDictionaryArray/repeats:2                        524 us        524 us       1342    190.84MB/s
BM_BuildStringDictionaryArray/repeats:2                        525 us        525 us       1342    190.48MB/s
BM_BuildStringDictionaryArray/repeats:2_mean                   524 us        524 us       1342    190.66MB/s
BM_BuildStringDictionaryArray/repeats:2_median                 524 us        524 us       1342    190.66MB/s
BM_BuildStringDictionaryArray/repeats:2_stddev                   1 us          1 us       1342   260.545kB/s
  • after:
-----------------------------------------------------------------------------------------------
Benchmark                                                        Time           CPU Iterations
-----------------------------------------------------------------------------------------------
BM_BuildInt64DictionaryArrayRandom/repeats:2                   168 us        168 us       4164   453.871MB/s
BM_BuildInt64DictionaryArrayRandom/repeats:2                   168 us        168 us       4164   453.462MB/s
BM_BuildInt64DictionaryArrayRandom/repeats:2_mean              168 us        168 us       4164   453.667MB/s
BM_BuildInt64DictionaryArrayRandom/repeats:2_median            168 us        168 us       4164   453.667MB/s
BM_BuildInt64DictionaryArrayRandom/repeats:2_stddev              0 us          0 us       4164   295.764kB/s
BM_BuildInt64DictionaryArraySequential/repeats:2               159 us        159 us       4392   478.758MB/s
BM_BuildInt64DictionaryArraySequential/repeats:2               160 us        160 us       4392   478.048MB/s
BM_BuildInt64DictionaryArraySequential/repeats:2_mean          160 us        159 us       4392   478.403MB/s
BM_BuildInt64DictionaryArraySequential/repeats:2_median        160 us        159 us       4392   478.403MB/s
BM_BuildInt64DictionaryArraySequential/repeats:2_stddev          0 us          0 us       4392   514.183kB/s
BM_BuildInt64DictionaryArraySimilar/repeats:2                  160 us        160 us       4372   476.579MB/s
BM_BuildInt64DictionaryArraySimilar/repeats:2                  160 us        160 us       4372   476.578MB/s
BM_BuildInt64DictionaryArraySimilar/repeats:2_mean             160 us        160 us       4372   476.578MB/s
BM_BuildInt64DictionaryArraySimilar/repeats:2_median           160 us        160 us       4372   476.578MB/s
BM_BuildInt64DictionaryArraySimilar/repeats:2_stddev             0 us          0 us       4372    797.195B/s
BM_BuildStringDictionaryArray/repeats:2                        367 us        367 us       1912   272.437MB/s
BM_BuildStringDictionaryArray/repeats:2                        367 us        367 us       1912    272.62MB/s
BM_BuildStringDictionaryArray/repeats:2_mean                   367 us        367 us       1912   272.529MB/s
BM_BuildStringDictionaryArray/repeats:2_median                 367 us        367 us       1912   272.529MB/s
BM_BuildStringDictionaryArray/repeats:2_stddev                   0 us          0 us       1912     132.8kB/s

Bottom line: the new code is around 10% to 40% faster.


pitrou commented Nov 20, 2018

Compute benchmark:

  • before:
---------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time           CPU Iterations
---------------------------------------------------------------------------------------------------------
BM_BuildDictionary/min_time:1.000                                       1570 us       1569 us        898   2.48717GB/s
BM_BuildStringDictionary/min_time:1.000                                 4435 us       4434 us        316   68.0927MB/s
BM_UniqueInt64NoNulls/16777216/50/min_time:1.000/real_time             54623 us      54602 us         26    2.2884GB/s
BM_UniqueInt64NoNulls/16777216/1024/min_time:1.000/real_time          100073 us     100038 us         14   1.24908GB/s
BM_UniqueInt64NoNulls/16777216/10240/min_time:1.000/real_time         136464 us     136424 us         10   937.977MB/s
BM_UniqueInt64NoNulls/16777216/1048576/min_time:1.000/real_time       804117 us     803776 us          2   159.181MB/s
BM_UniqueInt64WithNulls/16777216/50/min_time:1.000/real_time           81952 us      81918 us         17   1.52528GB/s
BM_UniqueInt64WithNulls/16777216/1024/min_time:1.000/real_time        134001 us     133959 us         10   955.218MB/s
BM_UniqueInt64WithNulls/16777216/10240/min_time:1.000/real_time       177680 us     177629 us          8   720.398MB/s
BM_UniqueInt64WithNulls/16777216/1048576/min_time:1.000/real_time     972930 us     972477 us          2   131.561MB/s
BM_UniqueString10bytes/16777216/50/min_time:1.000/real_time           209702 us     209629 us          7   762.988MB/s
BM_UniqueString10bytes/16777216/1024/min_time:1.000/real_time         266693 us     266613 us          5   599.942MB/s
BM_UniqueString10bytes/16777216/10240/min_time:1.000/real_time        362979 us     362893 us          4   440.797MB/s
BM_UniqueString10bytes/16777216/1048576/min_time:1.000/real_time     2886708 us    2885530 us          1   55.4265MB/s
BM_UniqueString100bytes/16777216/50/min_time:1.000/real_time          737258 us     736968 us          2   2.11934GB/s
BM_UniqueString100bytes/16777216/1024/min_time:1.000/real_time        867057 us     866734 us          2   1.80207GB/s
BM_UniqueString100bytes/16777216/10240/min_time:1.000/real_time      1143734 us    1143263 us          1   1.36614GB/s
BM_UniqueString100bytes/16777216/1048576/min_time:1.000/real_time    6824159 us    6821196 us          1   234.461MB/s
BM_UniqueUInt8NoNulls/16777216/200/min_time:1.000/real_time            11718 us      11713 us        119   1.33345GB/s
BM_UniqueUInt8WithNulls/16777216/200/min_time:1.000/real_time          32380 us      32371 us         44    494.13MB/s
  • after:
---------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time           CPU Iterations
---------------------------------------------------------------------------------------------------------
BM_BuildDictionary/min_time:1.000                                       1352 us       1352 us       1033   2.88675GB/s
BM_BuildStringDictionary/min_time:1.000                                 5185 us       5182 us        270   58.2594MB/s
BM_UniqueInt64NoNulls/16777216/50/min_time:1.000/real_time             30674 us      30593 us         46   4.07515GB/s
BM_UniqueInt64NoNulls/16777216/1024/min_time:1.000/real_time           32040 us      31970 us         44   3.90134GB/s
BM_UniqueInt64NoNulls/16777216/10240/min_time:1.000/real_time          43753 us      43733 us         32   2.85697GB/s
BM_UniqueInt64NoNulls/16777216/1048576/min_time:1.000/real_time       516782 us     516368 us          3   247.686MB/s
BM_UniqueInt64WithNulls/16777216/50/min_time:1.000/real_time           48618 us      48597 us         29   2.57107GB/s
BM_UniqueInt64WithNulls/16777216/1024/min_time:1.000/real_time         48776 us      48754 us         29   2.56271GB/s
BM_UniqueInt64WithNulls/16777216/10240/min_time:1.000/real_time        63645 us      63616 us         22   1.96402GB/s
BM_UniqueInt64WithNulls/16777216/1048576/min_time:1.000/real_time     614967 us     614258 us          2   208.141MB/s
BM_UniqueString10bytes/16777216/50/min_time:1.000/real_time           212362 us     212290 us          7    753.43MB/s
BM_UniqueString10bytes/16777216/1024/min_time:1.000/real_time         220632 us     220554 us          6   725.189MB/s
BM_UniqueString10bytes/16777216/10240/min_time:1.000/real_time        338742 us     338649 us          4   472.335MB/s
BM_UniqueString10bytes/16777216/1048576/min_time:1.000/real_time     2683194 us    2682169 us          1   59.6304MB/s
BM_UniqueString100bytes/16777216/50/min_time:1.000/real_time          700496 us     700232 us          2   2.23056GB/s
BM_UniqueString100bytes/16777216/1024/min_time:1.000/real_time        792821 us     792524 us          2   1.97081GB/s
BM_UniqueString100bytes/16777216/10240/min_time:1.000/real_time      1051323 us    1050941 us          1   1.48622GB/s
BM_UniqueString100bytes/16777216/1048576/min_time:1.000/real_time    5878603 us    5875782 us          1   272.173MB/s
BM_UniqueUInt8NoNulls/16777216/200/min_time:1.000/real_time            13559 us      13555 us        103   1.15233GB/s
BM_UniqueUInt8WithNulls/16777216/200/min_time:1.000/real_time          23543 us      23536 us         59   679.612MB/s

Bottom line: hash kernels are 2x faster on integers and competitive on strings.


pitrou commented Nov 20, 2018

For the record, the performance of the underlying hash functions:

----------------------------------------------------------------------
Benchmark                               Time           CPU Iterations
----------------------------------------------------------------------
BM_HashIntegers/repeats:1              12 us         12 us      56757    12.076GB/s
BM_HashSmallStrings/repeats:1          80 us         80 us       8729   2.54535GB/s
BM_HashMediumStrings/repeats:1        465 us        465 us       1506   2.78903GB/s
BM_HashLargeStrings/repeats:1         318 us        318 us       2211   6.03259GB/s

The integer benchmark uses 64-bit ints, so at 12 GB/s we're hashing roughly 1.5 billion 64-bit ints per second (12 GB/s ÷ 8 bytes per int).

@pitrou pitrou force-pushed the ARROW-2653 branch 2 times, most recently from 74e86cc to 3d46455 Compare November 20, 2018 16:18

pitrou commented Nov 20, 2018

I would like to refactor/simplify hardware CRC32 support, but there's an intriguing comment in sse-util.h about Impala and SSE4.2-less CPUs. What's the status of that? All recent x86-64 CPUs have SSE4.2 support. Both IMPALA-1399 and IMPALA-1646 are about extremely old CPUs (probably 10 years old by now, or more).

@wesm @dhecht


pitrou commented Nov 20, 2018

Also I don't understand why ARROW_USE_SSE is off by default, but we still do macro-based switching in sse-util.h. That sounds convoluted.


wesm commented Nov 20, 2018

@pitrou this is great! I will review in more detail today and also respond to your comments. Indeed, there is some cruft from some of the Apache Impala code that was originally imported into parquet-cpp.

I have some hashing work I want to do in some places (e.g. deduplicating strings when converting to pandas), so this should put the hashing code on a more stable footing.


wesm commented Nov 20, 2018

cc @timarmstrong if you can comment on what you've seen as far as CPUs without SSE4.2 support.

We need to support arm64 so we can't hard-code sse4.2 instructions, but we should optimistically use them if they are available.


pitrou commented Nov 20, 2018

Ideally I would like to enable SSE4.2 at compile-time on all x86-64 targets (so just using preprocessor switches based on compiler-enabled macros, without either ARROW_USE_SSE or cpuid-based runtime switching). Also there are intrinsic functions that can avoid hand-written inline asm.
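A minimal sketch of that approach (illustrative names, not Arrow's code): guard on the compiler-defined `__SSE4_2__` macro and use the `<nmmintrin.h>` CRC32C intrinsics, with a portable fallback, so neither `ARROW_USE_SSE`, a cpuid check, nor inline asm is needed.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative sketch, not Arrow's actual code: compile-time SSE4.2
// dispatch. When SSE4.2 is enabled at compile time (-msse4.2, which the
// proposal above would make the default on x86-64), the compiler defines
// __SSE4_2__ and the CRC32C intrinsics are usable directly; otherwise a
// portable fallback is compiled in instead.
#ifdef __SSE4_2__
#include <nmmintrin.h>
inline uint32_t HashBytes(const uint8_t* data, size_t n, uint32_t seed) {
  uint32_t h = seed;
  while (n >= 8) {
    uint64_t v;
    std::memcpy(&v, data, 8);  // safe unaligned load
    h = static_cast<uint32_t>(_mm_crc32_u64(h, v));
    data += 8;
    n -= 8;
  }
  while (n-- > 0) h = _mm_crc32_u8(h, *data++);
  return h;
}
#else
inline uint32_t HashBytes(const uint8_t* data, size_t n, uint32_t seed) {
  uint32_t h = seed ^ 2166136261u;  // FNV-1a fallback for non-SSE4.2 targets
  for (size_t i = 0; i < n; ++i) h = (h ^ data[i]) * 16777619u;
  return h;
}
#endif
```

Both branches compile to straight-line code with no runtime feature detection, which is exactly what a preprocessor-only dispatch buys.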


wesm commented Nov 20, 2018

Got it. I would be ok with adding -msse4.2 by default when compiling on x86-64. I haven't tried running the Arrow unit tests on arm64 lately (I've tested both on a Tegra X1 and Raspberry Pi in the past), so after we do this someone should check that things still work there.


pitrou commented Nov 20, 2018

Ok, I enabled SSE4.2 unconditionally on x86-64. The non-small string benchmarks show a small improvement (~10%). Will update numbers above.


pitrou commented Nov 20, 2018

Again a tedious compiler error:

/Users/travis/build/apache/arrow/cpp/src/arrow/compute/kernels/hash.cc:70:12: error: unused member function 'HashException' [-Werror,-Wunused-member-function]
  explicit HashException(const std::string& msg, StatusCode code = StatusCode::Invalid)
           ^

Should I just remove HashException? It doesn't seem used.


wesm commented Nov 20, 2018

Yes go ahead

@pitrou pitrou force-pushed the ARROW-2653 branch 2 times, most recently from e192ae4 to a4fbdc2 Compare November 20, 2018 21:04

wesm commented Nov 20, 2018

As an aside, at some point we might investigate if we should put a Bloom filter in front of some of our hash tables to minimize the amount of probing that's necessary. We do have a Bloom filter that came along with the parquet-cpp merge, but it has virtual calls in the hot path which wouldn't be ideal for this https://github.com/apache/arrow/blob/master/cpp/src/parquet/bloom_filter.h#L34.

NB: Bloom filters would probably only be useful in certain algorithms, like vector set membership queries (a.isin(b)) or joins.

cc @fsaintjacques
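To make the virtual-call concern concrete, here is a minimal non-virtual Bloom filter sketch (a hypothetical class, not the parquet-cpp one): the membership test is just a couple of bit probes that inline completely, so it could sit in front of a hash table probe in, say, an isin kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch, not parquet-cpp's BloomFilter: a non-virtual,
// fully inlinable Bloom filter suitable for a hash-table hot path.
class SimpleBloomFilter {
 public:
  explicit SimpleBloomFilter(size_t num_bits) : bits_((num_bits + 63) / 64) {}

  void Add(uint64_t hash) {
    for (uint64_t h : {hash, hash >> 17}) {  // two derived probe positions
      size_t bit = h % (bits_.size() * 64);
      bits_[bit / 64] |= uint64_t(1) << (bit % 64);
    }
  }

  bool MightContain(uint64_t hash) const {
    for (uint64_t h : {hash, hash >> 17}) {
      size_t bit = h % (bits_.size() * 64);
      if (!(bits_[bit / 64] & (uint64_t(1) << (bit % 64)))) {
        return false;  // definitely absent: the hash table probe can be skipped
      }
    }
    return true;  // maybe present (false positives are possible)
  }

 private:
  std::vector<uint64_t> bits_;
};
```

There are no false negatives, so a `false` answer safely skips the table probe; a `true` answer still requires the real lookup.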

@timarmstrong

Yeah only quite old CPUs would be missing SSE4.2. We do occasionally hear from people running on such systems, often AMD processors because those got support later. I don't think these are production systems for the most part, but rather "I'm trying out Impala on some old servers we had lying around".

The convoluted code with #defines and inline assembly came out of a requirement to build 3 kinds of artifacts:

  • A compiled binary that can run on non-SSE4.2 systems (so is not compiled with -msse4.2) but switch to using SSE4.2 instructions at runtime. Same for AVX, AVX2, etc.
  • Cross-compiled LLVM IR with SSE4.2 support, i.e. -msse4.2
  • Cross-compiled LLVM IR without SSE4.2 support, i.e. without -msse4.2

The inline assembly was a workaround for the fact that you can't use the intrinsics without enabling the -m flag. The problem with setting the -m flag is that the compiler can emit those instructions in other places where they're not guarded by the runtime checks.

Anyway, that's the story. Not sure if the solution chosen was ideal but it's not total madness.

If you can change to compile-time dispatching to the different implementations, which it looks like this PR is doing, then you avoid a lot of the complication.


codecov-io commented Nov 21, 2018

Codecov Report

Merging #3005 into master will decrease coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #3005      +/-   ##
==========================================
- Coverage   86.71%   86.71%   -0.01%     
==========================================
  Files         493      493              
  Lines       69891    69890       -1     
==========================================
- Hits        60607    60606       -1     
  Misses       9188     9188              
  Partials       96       96
Impacted Files Coverage Δ
cpp/src/arrow/util/bit-util-test.cc 99.56% <0%> (-0.01%) ⬇️

Powered by Codecov. Last update dc6da3a...0c2dcc3.


pitrou commented Nov 21, 2018

I've added a "dual CRC" hashing scheme that does two independent CRC computations in parallel and returns a 64-bit result. I expect it to be quite a bit faster on Intel CPUs which are able to do several CRCs at once (my AMD CPU doesn't).
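Roughly, the scheme looks like this (an illustrative sketch, not the actual patch; the seeds, names, and tail handling are made up). The two CRC chains share no data dependency, which is what lets wide cores overlap them in the pipeline:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative sketch of a "dual CRC" hash, not Arrow's exact code: two
// independent CRC32C chains with different seeds run over the same input
// and are combined into a 64-bit result.
#ifdef __SSE4_2__
#include <nmmintrin.h>
static inline uint32_t CrcStep64(uint32_t h, uint64_t v) {
  return static_cast<uint32_t>(_mm_crc32_u64(h, v));
}
#else
// Portable stand-in so the sketch compiles without -msse4.2; this is a
// simple multiply-xor mixing step, NOT a real CRC.
static inline uint32_t CrcStep64(uint32_t h, uint64_t v) {
  uint64_t x = (h ^ v) * 0x9e3779b97f4a7c15ULL;
  return static_cast<uint32_t>(x >> 32);
}
#endif

inline uint64_t DualCrcHash(const uint8_t* data, size_t n) {
  uint32_t h1 = 0xDEADBEEFu;  // two arbitrary, distinct seeds
  uint32_t h2 = 0xCAFEF00Du;
  while (n >= 8) {
    uint64_t v;
    std::memcpy(&v, data, 8);
    h1 = CrcStep64(h1, v);  // the two chains have no dependency on each
    h2 = CrcStep64(h2, v);  // other, so the CPU can execute them in parallel
    data += 8;
    n -= 8;
  }
  if (n > 0) {  // tail: pad the last partial word with zeros
    uint64_t v = 0;
    std::memcpy(&v, data, n);
    h1 = CrcStep64(h1, v);
    h2 = CrcStep64(h2, v);
  }
  return (static_cast<uint64_t>(h1) << 32) | h2;
}
```

Getting 64 hash bits instead of 32 also matters for the collision rate once tables grow into the millions of entries.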


@fsaintjacques fsaintjacques left a comment


This is a very minimal initial review; I'm still looking at the whole implementation and just wanted to note that I'm on it.

cpp/src/arrow/builder.h (outdated)
Status Append(const char* value) {
return Append(reinterpret_cast<const uint8_t*>(value));
}

Status Append(util::string_view view) {
#ifndef NDEBUG
Contributor:

If you make CheckValueSize inline, defined in the header, and move the #ifdef into the body of the function, you will not need to use the macro at every callsite. The compiler should elide the call to the empty function.
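A sketch of the suggested pattern (hypothetical names, with plain `assert()` standing in for Arrow's DCHECK): the NDEBUG switch lives inside an inline function's body, so call sites need no macro and release builds see an empty function that the compiler elides.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the suggestion above, with illustrative names: move the
// debug/release switch into the body of an inline function defined in the
// header. In release builds (NDEBUG defined) the body is empty and the
// compiler removes the call at every call site.
inline void CheckValueSize(int64_t size) {
#ifndef NDEBUG
  assert(size >= 0);  // e.g. guard against negative or overflowing sizes
#else
  (void)size;  // avoid an unused-parameter warning in release builds
#endif
}
```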

Member Author:

Unfortunately DCHECK macros are forbidden in header files.


pitrou commented Nov 21, 2018

Note I may try to refactor a couple of things still (hence the "WIP" in the issue title), but the algorithms and general philosophy will remain the same.

@fsaintjacques
Contributor

I'll wait for your final refactor.

@fsaintjacques
Contributor

With regards to the Bloom filter, I suspect it won't provide benefits in this specific case, since we're acting in a GetOrInsert mode: in both cases (positive or negative) we're still going to hit the cacheline(s) affected by the key, either to retrieve the index or to insert the new mapping.

In selection/filtering/join mode, it's a different story (assuming we keep a bloom filter per array).


pitrou commented Nov 21, 2018

Refactor in progress. I may change a bit more later.

@fsaintjacques
Contributor

@pitrou I saw that you toyed with Abseil previously. Did you manage to make it work? It'd be interesting to compare with SwissTable.

@fsaintjacques
Contributor

(Note that Abseil might also be very useful in the future, when we'll have to deal with civil time and other time/calendar-related computations.)


pitrou commented Nov 21, 2018

The problem with Abseil is that we can't use it until the Python packaging ecosystem moves to something newer than the manylinux1 spec. Unfortunately the manylinux2010 spec is not fully implemented.


pitrou commented Nov 21, 2018

As for dates and times, PR #2952 vendors a date.h implementation ;-)


wesm commented Nov 21, 2018

I still think it would be OK to break with manylinux1 and use CentOS 6 as the base build image for wheels (as many other projects have already done, so we wouldn't be the first). A couple of people have already privately registered criticisms with me about why we are stressing about CentOS 5 compatibility when we are getting no financial support from organizations that need it. I would rather hear from the people for whom RHEL5/CentOS5 is a requirement.

@fsaintjacques
Contributor

Moving this (manylinux) discussion into #2501.


pitrou commented Nov 22, 2018

I'm gonna call this PR done. Any remaining refactor will be deferred to a later PR. I still need to fix the CI linting failure, though :-)

@pitrou pitrou changed the title [WIP] ARROW-2653: [C++] Refactor hash table support ARROW-2653: [C++] Refactor hash table support Nov 22, 2018
@pitrou pitrou force-pushed the ARROW-2653 branch 2 times, most recently from 67ddf62 to 26159e9 Compare November 22, 2018 11:22

wesm commented Nov 22, 2018

Working on this review today. Needs a rebase I guess after d06d0d0


pitrou commented Nov 22, 2018

Rebased.


@wesm wesm left a comment


This looks great. I left a number of comments / questions.

I would be satisfied leaving things to follow-up work. The memory doubling issue with the memo tables is probably the most significant thing. This puts us in a much cleaner and more sustainable place.

Let me know if you want to make any further changes, and we can get this merged today.

cpp/src/arrow/array.h
hash_table_load_threshold_ =
static_cast<int64_t>(static_cast<double>(capacity) * kMaxHashTableLoad);
// Initialize hash table
// XXX should we let the user pass additional size heuristics?
Member:

Yes, but can look into this later

if (array.IsNull(i)) {
RETURN_NOT_OK(AppendNull());
} else {
RETURN_NOT_OK(Append(typed_array.GetValue(i)));
Member:

Could add Reserve call and use Unsafe* methods

Member Author:

Will look at this.

Member Author:

Ok, I don't think I will bother with this now, as it would require adding some methods to the builder and array classes. This is as in the old code anyway.

cpp/src/arrow/builder.h (outdated)
cpp/src/arrow/compute/kernels/hash.cc (outdated)
// ScalarHelper specialization for reals

static bool CompareScalars(Scalar u, Scalar v) {
if (std::isnan(u)) {
Member:

This could be unbranched, though probably not worth worrying about

Member Author:

Yes, I added the specialization for floats, but doing associative lookups on floats is not very common anyway.
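For reference, a branch-light variant could look like this (an illustrative sketch, not the code in this PR): `(x != x)` is true exactly when `x` is NaN, so two NaN keys can compare equal without an explicit `std::isnan` branch.

```cpp
#include <cassert>
#include <cmath>

// Illustrative sketch, not the PR's exact code: a NaN-aware comparison
// for floating-point keys in a memo table. Two NaNs must compare equal so
// a NaN key can be found again after insertion. Bitwise &/| on the
// boolean results avoids short-circuit branches.
inline bool CompareScalars(double u, double v) {
  return (u == v) | ((u != u) & (v != v));  // (x != x) is true iff x is NaN
}
```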

explicit ScalarMemoTable(int64_t entries = 0)
: hash_table_(static_cast<uint64_t>(entries)) {}

int32_t Get(const Scalar value) const {
Member:

const Scalar&?

Member Author:

Well, these are always C scalars.

Member:

Ok wasn't sure if string_view or another type would appear here

Member Author:

Note that string_view is designed to be a very small struct, so it probably gets passed in a couple of registers.

using HashTableEntry = typename HashTable<Payload>::Entry;
HashTableType hash_table_;
std::vector<int32_t> offsets_;
std::string values_;
Member:

Hmm: values_ could grow large; additionally, in a number of places in the codebase, we will want the result of building the hash table to be "released" into an arrow::BinaryArray with zero copy. Can this be changed to use arrow::BufferBuilder https://github.com/apache/arrow/blob/master/cpp/src/arrow/buffer.h#L372? You'd perhaps want to track the reserved size so that you can call UnsafeAppend and skip Status checks (though it may not matter, benchmarks will settle the case)

Member Author:

I'll try to.

Member Author:

Hmm... the problem is that the hash kernel, as I understand it, is supposed to be reusable. But if we detach the current buffer, the hash table can't be fed anymore (we can't grow it anymore).

Member:

Let's leave this for follow up work for now. I'll take a closer look

@pitrou pitrou Nov 22, 2018

Note that reuse of a hash table is already happening, actually, when building delta dictionaries... We could freeze the buffer when exported and follow up in a new buffer.

RETURN_NOT_OK(
AllocateBuffer(pool, TypeTraits<T>::bytes_required(dict_length), &dict_buffer));
memo_table.CopyValues(static_cast<int32_t>(start_offset),
reinterpret_cast<c_type*>(dict_buffer->mutable_data()));
Member:

See my comments about this in BinaryMemoTable. I think we should try to avoid memory doubling here -- the memory doubling issue is more significant than the copying performance (which as you point out is not significant relative to the cost of building the table). It is true that in many cases the dictionary is small, but that won't stop people from creating multi-gigabyte dictionaries (and they do)

DCHECK_EQ(raw_offsets[0], 0);
RETURN_NOT_OK(AllocateBuffer(pool, raw_offsets[dict_length], &dict_data));
memo_table.CopyValues(static_cast<int32_t>(start_offset), dict_data->size(),
dict_data->mutable_data());
Member:

As above. It would be nice to be able to call

... memo_table.DetachOffsets()
... memo_table.DetachValues()

1. Get rid of all macros and sprinkled out hash table handling code

2. Improve performance by more careful selection of hash functions
   (and better collision resolution strategy)

Integer hashing benefits from a very fast specialization.
Small string hashing benefits from a fast specialization with fewer branches
and less computation.
Generic string hashing falls back on Murmur2-64, which probably has sufficient
performance given the typical distribution of string key lengths.

3. Add some tests and benchmarks

pitrou commented Nov 22, 2018

Let me know if you want to make any further changes

As far as I'm concerned, this PR is ready :-)

@wesm wesm closed this in eaf8d32 Nov 22, 2018
@pitrou pitrou deleted the ARROW-2653 branch November 22, 2018 19:58