[C++][Compute] Add scalar_hash function #17211

asfimport · 2020-05-31T14:58:05Z

The purpose of this function is to compute a 32- or 64-bit hash value for each cell in an Array. Hashes for nested types can be computed recursively by combining the hash values of their children.
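To make the recursive idea concrete, here is a minimal standalone sketch that does not use any Arrow APIs. The names `CombineHashes`, `HashCell`, `Row`, and `HashColumn` are hypothetical, `std::hash` stands in for whatever hash function the kernel would really use, and the mixing step is modeled on the widely used hash_combine recipe rather than anything Arrow prescribes.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper (not an Arrow API): folds a child hash into a running hash.
inline uint64_t CombineHashes(uint64_t seed, uint64_t h) {
  return seed ^ (h + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

// Hash a single cell of a primitive type. std::hash is only a stand-in here.
inline uint64_t HashCell(int64_t v) {
  return static_cast<uint64_t>(std::hash<int64_t>{}(v));
}
inline uint64_t HashCell(const std::string& v) {
  return static_cast<uint64_t>(std::hash<std::string>{}(v));
}

// A struct-like nested value: its hash is built from the hashes of its children.
struct Row {
  int64_t id;
  std::string name;
};
inline uint64_t HashCell(const Row& r) {
  uint64_t h = 0;
  h = CombineHashes(h, HashCell(r.id));
  h = CombineHashes(h, HashCell(r.name));
  return h;
}

// Element-wise application: one output hash per input cell.
inline std::vector<uint64_t> HashColumn(const std::vector<Row>& rows) {
  std::vector<uint64_t> out;
  out.reserve(rows.size());
  for (const Row& r : rows) out.push_back(HashCell(r));
  return out;
}
```

The output has the same length as the input, and equal cells always map to equal hashes, which is the property grouping-type algorithms rely on.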

Reporter: Wes McKinney / @wesm
Assignee: Aldrin Montana / @drin

Note: This issue was originally created as ARROW-8991. Please see the migration documentation for further details.

asfimport · 2020-06-03T19:15:37Z

Antoine Pitrou / @pitrou:
Right, this is what I had proposed in ARROW-3978 as well.

asfimport · 2020-06-03T19:37:10Z

Wes McKinney / @wesm:
If you're interested in working on this, I'll be tied up with some other things for the next few days; otherwise I'll tackle it after that.

asfimport · 2020-06-03T19:38:59Z

Antoine Pitrou / @pitrou:
It'll depend on the other things I have on my plate. Is this a dependency of something else?

asfimport · 2020-06-03T19:41:53Z

Wes McKinney / @wesm:
It's needed for implementing hash aggregations (and any other grouping-type algorithm). No need to rearrange priorities, just wanted to mention.

asfimport · 2021-06-21T23:45:53Z

Wes McKinney / @wesm:
Seems like this could be implemented now?

asfimport · 2021-11-03T20:09:45Z

Niranda Perera / @nirandaperera:
Is anyone working on this at the moment? If not, I can take this up.
@wesm is there a preferred hash function, e.g. Murmur?

asfimport · 2021-11-03T20:10:54Z

Antoine Pitrou / @pitrou:
The underlying idea is to reuse the hash functions already used for hash kernels.

asfimport · 2021-11-03T20:11:25Z

Neal Richardson / @nealrichardson:
Don't we already have this for the group-by aggregation and joining? As in, the algorithms may already be there, you would just have to expose a scalar kernel. (Alternatively, since we already have those functions, is this still valuable?)

asfimport · 2022-08-22T17:22:50Z

Aldrin Montana / @drin:
this PR is ready for review if anyone has time

drin · 2023-02-27T17:21:05Z

converted the PR to a draft; I can come back to it in about a week

drin · 2023-03-14T23:50:58Z

okay, it was a bit longer than I hoped for, but I'll try to pick this back up next week

This commit ports the latest state of the scalar_hash kernels without pulling in a long development history. The kernel is an element-wise function that uses an xxHash-like algorithm; it prioritizes speed and is not suitable for cryptographic purposes. The function is implemented by the `FastHashScalar` struct, which is templated on the output type (assumed to be either UInt32 or UInt64, although nothing validates that at the moment).

The benchmarks in scalar_hash_benchmark.cc use hashing_benchmark.cc (in cpp/src/arrow/util/) as a reference, but only cover various input types and the key hashing functions (from key_hash.h).

The tests in scalar_hash_test.cc use a simplified version of the hashing implemented in key_hash.cc. The idea is that the high-level entry points for high-level types should eventually reduce to an expected application of the low-level hash functions on simple data types; the tests make exactly this comparison. At the moment, the tests pass for simple cases, but they do not work for nested types with non-trivial row layouts (e.g. ListTypes).

Issue: ARROW-8991
Issue: apacheGH-17211
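As an illustration of the missing output-type validation mentioned in the commit message, the sketch below shows one way a hasher templated on its output type could reject anything other than 32- or 64-bit unsigned integers at compile time. It is not the `FastHashScalar` code: `ElementHasher` is a hypothetical name, and FNV-1a is swapped in for the xxHash-like algorithm purely because it is short enough to show in full.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a hasher templated on the output type.
template <typename OutT>
struct ElementHasher {
  // Compile-time validation of the output type.
  static_assert(std::is_same<OutT, uint32_t>::value ||
                    std::is_same<OutT, uint64_t>::value,
                "OutT must be uint32_t or uint64_t");

  // FNV-1a over a byte range (a stand-in, not the xxHash-like kernel).
  static OutT HashBytes(const uint8_t* data, size_t length) {
    const bool is32 = sizeof(OutT) == 4;
    OutT h = is32 ? static_cast<OutT>(2166136261u)
                  : static_cast<OutT>(14695981039346656037ULL);
    const OutT prime = is32 ? static_cast<OutT>(16777619u)
                            : static_cast<OutT>(1099511628211ULL);
    for (size_t i = 0; i < length; ++i) {
      h ^= data[i];
      h *= prime;
    }
    return h;
  }

  // Element-wise: one hash per fixed-width input value.
  template <typename T>
  static std::vector<OutT> HashValues(const std::vector<T>& values) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only fixed-width, trivially copyable values are supported");
    std::vector<OutT> out;
    out.reserve(values.size());
    for (const T& v : values) {
      out.push_back(HashBytes(reinterpret_cast<const uint8_t*>(&v), sizeof(T)));
    }
    return out;
  }
};
```

With this in place, `ElementHasher<int16_t>` fails to compile instead of silently producing truncated hashes.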

This commit pulls in the latest changes to key_hash.h and the implementations in light_array without the burden of a long development history. The only change in key_hash.h is the addition of a friend function that is used in scalar_hash_test.cc.

The changes in light_array.[h,cc] accommodate two scenarios: (1) the use of ArraySpan, which was introduced after light_array was written; and (2) the need for a KeyColumnArray to allocate data in order to interpret (or decode) the structure of a nested type.

The main reason for the second scenario is that a ListArray may have many values represented in a single row, and these should be hashed together; however, if the ListArray contains a nested ListArray or other nested type, the row may have further structure. In the simplest interpretation, only the highest-level structure (the "outer" ListArray) needs to be preserved, and any further nested structure must be handled explicitly by custom kernels (or by any future options that are upstreamed into Arrow). ArraySpan can be useful for interpreting complex nested types efficiently because it is non-owning, which is the main reason for the first scenario.

Although they are unfinished, any tests added to light_array_test.cc should accommodate the two scenarios above.
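The "many values in a single row, hashed together" case can be sketched without Arrow as follows. This assumes a list layout with an offsets buffer of `num_rows + 1` entries over a flat buffer of int64 child values, ignores nulls, and handles only the outer level of nesting. `ListRowsView`, `Fnv1a64`, and `HashListRows` are hypothetical names, and FNV-1a again stands in for the real hash function. The view is non-owning, in the spirit of the ArraySpan motivation above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical non-owning view over a list column of int64 values.
struct ListRowsView {
  const int32_t* offsets;  // num_rows + 1 entries
  size_t num_rows;
  const int64_t* values;   // flat child values
};

// FNV-1a, 64-bit (a stand-in for the real hash function).
inline uint64_t Fnv1a64(const uint8_t* data, size_t length) {
  uint64_t h = 14695981039346656037ULL;
  for (size_t i = 0; i < length; ++i) {
    h ^= data[i];
    h *= 1099511628211ULL;
  }
  return h;
}

// One hash per list row: all child values in [offsets[i], offsets[i + 1])
// are contiguous in the values buffer, so they are hashed as one byte range.
inline std::vector<uint64_t> HashListRows(const ListRowsView& list) {
  assert(list.offsets != nullptr);
  std::vector<uint64_t> out;
  out.reserve(list.num_rows);
  for (size_t row = 0; row < list.num_rows; ++row) {
    const int32_t begin = list.offsets[row];
    const int32_t end = list.offsets[row + 1];
    assert(begin <= end);
    const auto* bytes = reinterpret_cast<const uint8_t*>(list.values + begin);
    out.push_back(Fnv1a64(bytes, static_cast<size_t>(end - begin) * sizeof(int64_t)));
  }
  return out;
}
```

A child that is itself a list would need its own offsets walked inside each row, which is the extra structure the commit message says must be handled separately.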

This commit includes changes to register a new compute function without the burden of a long development history.

The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in compilation, as it is used by the new Hash64 function defined in api_scalar.[h,cc]. The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes scalar_hash_test.cc in compilation for tests and also adds a new benchmark binary implemented by scalar_hash_benchmark.cc. The registry files are updated to register the kernel implementations in scalar_hash.cc with the function definitions in api_scalar.[h,cc]. Finally, docs/source/cpp/compute.rst adds documentation for the Hash64 function.

Issue: apacheGH-17211
Issue: ARROW-8991

This commit includes additions to the general hashing benchmarks that cover the use of the hashing functions in key_hash.h, without carrying the burden of a long development history.

Some existing benchmark names were changed to distinguish between the use of Int32 and Int64 types, and new benchmarks were added that use the functions declared in key_hash.h. The new benchmarks are added because these functions are claimed to prioritize speed over cryptographic strength, as they are primarily used for join algorithms and other processing tasks; the hashing benchmark can now provide observability for that claim.

Issue: apacheGH-17211
Issue: ARROW-8991