Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Compute] Add scalar_hash function #17211

Open
1 task done
asfimport opened this issue May 31, 2020 · 12 comments · May be fixed by #39836
Open
1 task done

[C++][Compute] Add scalar_hash function #17211

asfimport opened this issue May 31, 2020 · 12 comments · May be fixed by #39836

Comments

@asfimport
Copy link
Collaborator

asfimport commented May 31, 2020

The purpose of this function is to compute 32- or 64-bit hash values for each cell in an Array. Hashes for nested types can be computed recursively by combining the hash values of their children

Reporter: Wes McKinney / @wesm
Assignee: Aldrin Montana / @drin

Subtasks:

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-8991. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Right, this is what I had proposed in ARROW-3978 as well.

@asfimport
Copy link
Collaborator Author

Wes McKinney / @wesm:
If you're interested in working on this, I'll be tied up with some other things for the next few days, otherwise I'll tackle it after that

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
It'll depend on the other things I have on my plate. Is this a dependency of something else?

@asfimport
Copy link
Collaborator Author

Wes McKinney / @wesm:
It's needed for implementing hash aggregations (and any other grouping-type algorithm). No need to rearrange priorities, just wanted to mention.

@asfimport
Copy link
Collaborator Author

Wes McKinney / @wesm:
Seems like this could be implemented now?

@asfimport
Copy link
Collaborator Author

Niranda Perera / @nirandaperera:
Is there anyone working on this ATM? If not, I can take this up.
@wesm is there a preference of a hash function, ex: Murmur etc?

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
The underlying idea is to reuse the hash functions already used for hash kernels.

@asfimport
Copy link
Collaborator Author

Neal Richardson / @nealrichardson:
Don't we already have this for the group-by aggregation and joining? As in, the algorithms may already be there, you would just have to expose a scalar kernel. (Alternatively, since we already have those functions, is this still valuable?)

@asfimport
Copy link
Collaborator Author

Aldrin Montana / @drin:
this PR is ready for review if anyone has time

@drin
Copy link
Contributor

drin commented Feb 27, 2023

converted the PR to a draft; I can come back to it in about a week

@drin
Copy link
Contributor

drin commented Mar 14, 2023

okay, it was a bit longer than I hoped for, but I'll try to pick this back up next week

drin added a commit to drin/arrow that referenced this issue Jan 12, 2024
This commit ports the latest state of scalar_hash kernels without
pulling a long development history.

This kernel is an element-wise function that uses an xxHash-like
algorithm, prioritizing speed and not suitable for cryptographic
purposes. The function is implemented by the `FastHashScalar` struct
which is templated by the output type (which is assumed to be either
UInt32 or UInt64, but there is no validation of that at the moment).

The benchmarks in scalar_hash_benchmark.cc uses the hashing_benchmark.cc
file as a reference (in cpp/src/arrow/util/), but only covers various
input types and the key hashing functions (from key_hash.h).

The tests in scalar_hash_test.cc use a simplified version of hashing
based on what is implemented in the key_hash.cc. The idea being that the
high-level entry points for high-level types should eventually reach an
expected application of the low-level hash functions on simple data
types; the tests do this exact comparison. At the moment, the tests pass
for simple cases, but they do not work for nested types with non-trivial
row layouts (e.g. ListTypes).

Issue: ARROW-8991
Issue: apacheGH-17211
drin added a commit to drin/arrow that referenced this issue Jan 12, 2024
This commit pulls the latest changes to key_hash.h and implementations
in light_array without the burden of a long development history.

The only change in key_hash.h is the addition of a friend function which
is used in scalar_hash_test.cc.

Changes in light_array.[h,cc] are to accommodate two scenarios: (1) the
use of ArraySpan, which was introduced after light_array was written;
and (2) the need for a KeyColumnArray to allocate data for the purposes
of interpreting (or decoding) the structure of a nested type.

The main reason for the 2nd scenario is that a ListArray may have many
values represented in a single row which should be hashed together;
however, if the ListArray has a nested ListArray or other type, the row
may have further structure. In the simplest interpretation, only the
highest-level structure (the "outer" ListArray) needs to be preserved,
and any further nested structures must be explicitly handled by custom
kernels (or any future options, etc. that are upstreamed into Arrow).

In trying to efficiently interpret complex nested types, ArraySpan can
be useful because it is non-owning, thus the main reason for the 1st
aforementioned scenario.

Although unfinished, any tests added to light_array_test.cc should
accommodate the 2 scenarios above.
drin added a commit to drin/arrow that referenced this issue Jan 12, 2024
This commit includes changes to register a new compute function without
the burden of a long development history.

The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in
compilation as it is used by the new Hash64 function defined in
api_scalar.[h,cc].

The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes
scalar_hash_test.cc in compilation for tests and it also adds a new
benchmark binary that is implemented by scalar_hash_benchmark.cc.

The registry files are updated to register the kernel implementations in
scalar_hash.cc with the function definitions in api_scalar.[h,cc].

Finally, docs/source/cpp/compute.rst adds documentation for the Hash64
function.

Issue: apacheGH-17211
Issue: ARROW-8991
drin added a commit to drin/arrow that referenced this issue Jan 12, 2024
This commit includes additions to the general hashing benchmarks that
cover the use of hashing functions in key_hash.h without carrying the
burden of a long dev history.

Some existing benchmark names were changed to distinguish between the
use of Int32 and Int64 types, new benchmarks were added that use the
functions declared in key_hash.h. The reason the new benchmarks are
added is because it is claimed they prioritize speed over cryptography
as they're primarily used for join algorithms and other processing
tasks, which the hashing benchmark can now provide observability for.

Issue: apacheGH-17211
Issue: ARROW-8991
drin added a commit to drin/arrow that referenced this issue Jan 12, 2024
This commit ports the latest state of scalar_hash kernels without
pulling a long development history.

This kernel is an element-wise function that uses an xxHash-like
algorithm, prioritizing speed and not suitable for cryptographic
purposes. The function is implemented by the `FastHashScalar` struct
which is templated by the output type (which is assumed to be either
UInt32 or UInt64, but there is no validation of that at the moment).

The benchmarks in scalar_hash_benchmark.cc uses the hashing_benchmark.cc
file as a reference (in cpp/src/arrow/util/), but only covers various
input types and the key hashing functions (from key_hash.h).

The tests in scalar_hash_test.cc use a simplified version of hashing
based on what is implemented in the key_hash.cc. The idea being that the
high-level entry points for high-level types should eventually reach an
expected application of the low-level hash functions on simple data
types; the tests do this exact comparison. At the moment, the tests pass
for simple cases, but they do not work for nested types with non-trivial
row layouts (e.g. ListTypes).

Issue: ARROW-8991
Issue: apacheGH-17211
drin added a commit to drin/arrow that referenced this issue Jan 12, 2024
This commit pulls the latest changes to key_hash.h and implementations
in light_array without the burden of a long development history.

The only change in key_hash.h is the addition of a friend function which
is used in scalar_hash_test.cc.

Changes in light_array.[h,cc] are to accommodate two scenarios: (1) the
use of ArraySpan, which was introduced after light_array was written;
and (2) the need for a KeyColumnArray to allocate data for the purposes
of interpreting (or decoding) the structure of a nested type.

The main reason for the 2nd scenario is that a ListArray may have many
values represented in a single row which should be hashed together;
however, if the ListArray has a nested ListArray or other type, the row
may have further structure. In the simplest interpretation, only the
highest-level structure (the "outer" ListArray) needs to be preserved,
and any further nested structures must be explicitly handled by custom
kernels (or any future options, etc. that are upstreamed into Arrow).

In trying to efficiently interpret complex nested types, ArraySpan can
be useful because it is non-owning, thus the main reason for the 1st
aforementioned scenario.

Although unfinished, any tests added to light_array_test.cc should
accommodate the 2 scenarios above.
drin added a commit to drin/arrow that referenced this issue Jan 12, 2024
This commit includes changes to register a new compute function without
the burden of a long development history.

The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in
compilation as it is used by the new Hash64 function defined in
api_scalar.[h,cc].

The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes
scalar_hash_test.cc in compilation for tests and it also adds a new
benchmark binary that is implemented by scalar_hash_benchmark.cc.

The registry files are updated to register the kernel implementations in
scalar_hash.cc with the function definitions in api_scalar.[h,cc].

Finally, docs/source/cpp/compute.rst adds documentation for the Hash64
function.

Issue: apacheGH-17211
Issue: ARROW-8991
drin added a commit to drin/arrow that referenced this issue Jan 12, 2024
This commit includes additions to the general hashing benchmarks that
cover the use of hashing functions in key_hash.h without carrying the
burden of a long dev history.

Some existing benchmark names were changed to distinguish between the
use of Int32 and Int64 types, new benchmarks were added that use the
functions declared in key_hash.h. The reason the new benchmarks are
added is because it is claimed they prioritize speed over cryptography
as they're primarily used for join algorithms and other processing
tasks, which the hashing benchmark can now provide observability for.

Issue: apacheGH-17211
Issue: ARROW-8991
drin added a commit to drin/arrow that referenced this issue Jan 29, 2024
This commit ports the latest state of scalar_hash kernels without
pulling a long development history.

This kernel is an element-wise function that uses an xxHash-like
algorithm, prioritizing speed and not suitable for cryptographic
purposes. The function is implemented by the `FastHashScalar` struct
which is templated by the output type (which is assumed to be either
UInt32 or UInt64, but there is no validation of that at the moment).

The benchmarks in scalar_hash_benchmark.cc uses the hashing_benchmark.cc
file as a reference (in cpp/src/arrow/util/), but only covers various
input types and the key hashing functions (from key_hash.h).

The tests in scalar_hash_test.cc use a simplified version of hashing
based on what is implemented in the key_hash.cc. The idea being that the
high-level entry points for high-level types should eventually reach an
expected application of the low-level hash functions on simple data
types; the tests do this exact comparison. At the moment, the tests pass
for simple cases, but they do not work for nested types with non-trivial
row layouts (e.g. ListTypes).

Issue: ARROW-8991
Issue: apacheGH-17211
drin added a commit to drin/arrow that referenced this issue Jan 29, 2024
This commit pulls the latest changes to key_hash.h and implementations
in light_array without the burden of a long development history.

The only change in key_hash.h is the addition of a friend function which
is used in scalar_hash_test.cc.

Changes in light_array.[h,cc] are to accommodate two scenarios: (1) the
use of ArraySpan, which was introduced after light_array was written;
and (2) the need for a KeyColumnArray to allocate data for the purposes
of interpreting (or decoding) the structure of a nested type.

The main reason for the 2nd scenario is that a ListArray may have many
values represented in a single row which should be hashed together;
however, if the ListArray has a nested ListArray or other type, the row
may have further structure. In the simplest interpretation, only the
highest-level structure (the "outer" ListArray) needs to be preserved,
and any further nested structures must be explicitly handled by custom
kernels (or any future options, etc. that are upstreamed into Arrow).

In trying to efficiently interpret complex nested types, ArraySpan can
be useful because it is non-owning, thus the main reason for the 1st
aforementioned scenario.

Although unfinished, any tests added to light_array_test.cc should
accommodate the 2 scenarios above.
drin added a commit to drin/arrow that referenced this issue Jan 29, 2024
This commit includes changes to register a new compute function without
the burden of a long development history.

The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in
compilation as it is used by the new Hash64 function defined in
api_scalar.[h,cc].

The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes
scalar_hash_test.cc in compilation for tests and it also adds a new
benchmark binary that is implemented by scalar_hash_benchmark.cc.

The registry files are updated to register the kernel implementations in
scalar_hash.cc with the function definitions in api_scalar.[h,cc].

Finally, docs/source/cpp/compute.rst adds documentation for the Hash64
function.

Issue: apacheGH-17211
Issue: ARROW-8991
drin added a commit to drin/arrow that referenced this issue Jan 29, 2024
This commit includes additions to the general hashing benchmarks that
cover the use of hashing functions in key_hash.h without carrying the
burden of a long dev history.

Some existing benchmark names were changed to distinguish between the
use of Int32 and Int64 types, new benchmarks were added that use the
functions declared in key_hash.h. The reason the new benchmarks are
added is because it is claimed they prioritize speed over cryptography
as they're primarily used for join algorithms and other processing
tasks, which the hashing benchmark can now provide observability for.

Issue: apacheGH-17211
Issue: ARROW-8991
drin added a commit to drin/arrow that referenced this issue Jun 28, 2024
This commit pulls the latest changes to key_hash.h and implementations
in light_array without the burden of a long development history.

The only change in key_hash.h is the addition of a friend function which
is used in scalar_hash_test.cc.

Changes in light_array.[h,cc] are to accommodate two scenarios: (1) the
use of ArraySpan, which was introduced after light_array was written;
and (2) the need for a KeyColumnArray to allocate data for the purposes
of interpreting (or decoding) the structure of a nested type.

The main reason for the 2nd scenario is that a ListArray may have many
values represented in a single row which should be hashed together;
however, if the ListArray has a nested ListArray or other type, the row
may have further structure. In the simplest interpretation, only the
highest-level structure (the "outer" ListArray) needs to be preserved,
and any further nested structures must be explicitly handled by custom
kernels (or any future options, etc. that are upstreamed into Arrow).

In trying to efficiently interpret complex nested types, ArraySpan can
be useful because it is non-owning, thus the main reason for the 1st
aforementioned scenario.

Although unfinished, any tests added to light_array_test.cc should
accommodate the 2 scenarios above.
drin added a commit to drin/arrow that referenced this issue Jun 28, 2024
This commit includes changes to register a new compute function without
the burden of a long development history.

The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in
compilation as it is used by the new Hash64 function defined in
api_scalar.[h,cc].

The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes
scalar_hash_test.cc in compilation for tests and it also adds a new
benchmark binary that is implemented by scalar_hash_benchmark.cc.

The registry files are updated to register the kernel implementations in
scalar_hash.cc with the function definitions in api_scalar.[h,cc].

Finally, docs/source/cpp/compute.rst adds documentation for the Hash64
function.

Issue: apacheGH-17211
Issue: ARROW-8991
drin added a commit to drin/arrow that referenced this issue Jun 28, 2024
This commit includes additions to the general hashing benchmarks that
cover the use of hashing functions in key_hash.h without carrying the
burden of a long dev history.

Some existing benchmark names were changed to distinguish between the
use of Int32 and Int64 types, new benchmarks were added that use the
functions declared in key_hash.h. The reason the new benchmarks are
added is because it is claimed they prioritize speed over cryptography
as they're primarily used for join algorithms and other processing
tasks, which the hashing benchmark can now provide observability for.

Issue: apacheGH-17211
Issue: ARROW-8991
@drin
Copy link
Contributor

drin commented Jun 28, 2024

For documentation purposes, the scalar_hash function implementation this issue covers is a generalized compute function that can take any type of arrow::Array.

At the time I started this work:

For this issue, I'll try to simply close the loop by finishing implementation of a scalar_hash function that works on any type of arrow::Array by converting it to a KeyColumnArray even if that means flattening it from a nested structure to a non-nested structure.

The reason I mention this, is that the documentation for KeyColumnArray says:

A "key" column is a non-nested, non-union column \see KeyColumnMetadata

This may become a bit misleading or inaccurate once this implementation is complete. Then, either:

  1. the language will need to be updated to match usage
    • in case a KeyColumnArray is logically a nested array that is physically a non-nested array for the purposes of hashing
    • maybe this is what is already is and no real changes are necessary?
  2. the code should be re-architected for clarity
    • in case it is better to create a similar, but differently named structure like HashStructArray that is a physically similar to KeyColumnArray but semantically treated like a StructArray that can be hashed directly.

I will create a new issue and circulate a discussion email to the dev ML when the time comes.

drin added a commit to drin/arrow that referenced this issue Oct 7, 2024
This commit ports the latest state of scalar_hash kernels without
pulling a long development history.

This kernel is an element-wise function that uses an xxHash-like
algorithm, prioritizing speed and not suitable for cryptographic
purposes. The function is implemented by the `FastHashScalar` struct
which is templated by the output type (which is assumed to be either
UInt32 or UInt64, but there is no validation of that at the moment).

The benchmarks in scalar_hash_benchmark.cc uses the hashing_benchmark.cc
file as a reference (in cpp/src/arrow/util/), but only covers various
input types and the key hashing functions (from key_hash.h).

The tests in scalar_hash_test.cc use a simplified version of hashing
based on what is implemented in the key_hash.cc. The idea being that the
high-level entry points for high-level types should eventually reach an
expected application of the low-level hash functions on simple data
types; the tests do this exact comparison. At the moment, the tests pass
for simple cases, but they do not work for nested types with non-trivial
row layouts (e.g. ListTypes).

Issue: ARROW-8991
Issue: apacheGH-17211
drin added a commit to drin/arrow that referenced this issue Oct 7, 2024
This commit pulls the latest changes to key_hash.h and implementations
in light_array without the burden of a long development history.

The only change in key_hash.h is the addition of a friend function which
is used in scalar_hash_test.cc.

Changes in light_array.[h,cc] are to accommodate two scenarios: (1) the
use of ArraySpan, which was introduced after light_array was written;
and (2) the need for a KeyColumnArray to allocate data for the purposes
of interpreting (or decoding) the structure of a nested type.

The main reason for the 2nd scenario is that a ListArray may have many
values represented in a single row which should be hashed together;
however, if the ListArray has a nested ListArray or other type, the row
may have further structure. In the simplest interpretation, only the
highest-level structure (the "outer" ListArray) needs to be preserved,
and any further nested structures must be explicitly handled by custom
kernels (or any future options, etc. that are upstreamed into Arrow).

In trying to efficiently interpret complex nested types, ArraySpan can
be useful because it is non-owning, thus the main reason for the 1st
aforementioned scenario.

Although unfinished, any tests added to light_array_test.cc should
accommodate the 2 scenarios above.
drin added a commit to drin/arrow that referenced this issue Oct 7, 2024
This commit includes changes to register a new compute function without
the burden of a long development history.

The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in
compilation as it is used by the new Hash64 function defined in
api_scalar.[h,cc].

The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes
scalar_hash_test.cc in compilation for tests and it also adds a new
benchmark binary that is implemented by scalar_hash_benchmark.cc.

The registry files are updated to register the kernel implementations in
scalar_hash.cc with the function definitions in api_scalar.[h,cc].

Finally, docs/source/cpp/compute.rst adds documentation for the Hash64
function.

Issue: apacheGH-17211
Issue: ARROW-8991
drin added a commit to drin/arrow that referenced this issue Oct 7, 2024
This commit includes additions to the general hashing benchmarks that
cover the use of hashing functions in key_hash.h without carrying the
burden of a long dev history.

Some existing benchmark names were changed to distinguish between the
use of Int32 and Int64 types, new benchmarks were added that use the
functions declared in key_hash.h. The reason the new benchmarks are
added is because it is claimed they prioritize speed over cryptography
as they're primarily used for join algorithms and other processing
tasks, which the hashing benchmark can now provide observability for.

Issue: apacheGH-17211
Issue: ARROW-8991
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 10, 2024
This commit ports the latest state of scalar_hash kernels without
pulling a long development history.

This kernel is an element-wise function that uses an xxHash-like
algorithm, prioritizing speed and not suitable for cryptographic
purposes. The function is implemented by the `FastHashScalar` struct
which is templated by the output type (which is assumed to be either
UInt32 or UInt64, but there is no validation of that at the moment).

The benchmarks in scalar_hash_benchmark.cc uses the hashing_benchmark.cc
file as a reference (in cpp/src/arrow/util/), but only covers various
input types and the key hashing functions (from key_hash.h).

The tests in scalar_hash_test.cc use a simplified version of hashing
based on what is implemented in the key_hash.cc. The idea being that the
high-level entry points for high-level types should eventually reach an
expected application of the low-level hash functions on simple data
types; the tests do this exact comparison. At the moment, the tests pass
for simple cases, but they do not work for nested types with non-trivial
row layouts (e.g. ListTypes).

Issue: ARROW-8991
Issue: apacheGH-17211
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 10, 2024
This commit pulls the latest changes to key_hash.h and implementations
in light_array without the burden of a long development history.

The only change in key_hash.h is the addition of a friend function which
is used in scalar_hash_test.cc.

Changes in light_array.[h,cc] are to accommodate two scenarios: (1) the
use of ArraySpan, which was introduced after light_array was written;
and (2) the need for a KeyColumnArray to allocate data for the purposes
of interpreting (or decoding) the structure of a nested type.

The main reason for the 2nd scenario is that a ListArray may have many
values represented in a single row which should be hashed together;
however, if the ListArray has a nested ListArray or other type, the row
may have further structure. In the simplest interpretation, only the
highest-level structure (the "outer" ListArray) needs to be preserved,
and any further nested structures must be explicitly handled by custom
kernels (or any future options, etc. that are upstreamed into Arrow).

In trying to efficiently interpret complex nested types, ArraySpan can
be useful because it is non-owning, thus the main reason for the 1st
aforementioned scenario.

Although unfinished, any tests added to light_array_test.cc should
accommodate the 2 scenarios above.
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 10, 2024
This commit includes changes to register a new compute function without
the burden of a long development history.

The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in
compilation as it is used by the new Hash64 function defined in
api_scalar.[h,cc].

The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes
scalar_hash_test.cc in compilation for tests and it also adds a new
benchmark binary that is implemented by scalar_hash_benchmark.cc.

The registry files are updated to register the kernel implementations in
scalar_hash.cc with the function definitions in api_scalar.[h,cc].

Finally, docs/source/cpp/compute.rst adds documentation for the Hash64
function.

Issue: apacheGH-17211
Issue: ARROW-8991
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 10, 2024
This commit includes additions to the general hashing benchmarks that
cover the use of hashing functions in key_hash.h without carrying the
burden of a long dev history.

Some existing benchmark names were changed to distinguish between the
use of Int32 and Int64 types, new benchmarks were added that use the
functions declared in key_hash.h. The reason the new benchmarks are
added is because it is claimed they prioritize speed over cryptography
as they're primarily used for join algorithms and other processing
tasks, which the hashing benchmark can now provide observability for.

Issue: apacheGH-17211
Issue: ARROW-8991
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 10, 2024
This commit ports the latest state of scalar_hash kernels without
pulling a long development history.

This kernel is an element-wise function that uses an xxHash-like
algorithm, prioritizing speed and not suitable for cryptographic
purposes. The function is implemented by the `FastHashScalar` struct
which is templated by the output type (which is assumed to be either
UInt32 or UInt64, but there is no validation of that at the moment).

The benchmarks in scalar_hash_benchmark.cc uses the hashing_benchmark.cc
file as a reference (in cpp/src/arrow/util/), but only covers various
input types and the key hashing functions (from key_hash.h).

The tests in scalar_hash_test.cc use a simplified version of hashing
based on what is implemented in the key_hash.cc. The idea being that the
high-level entry points for high-level types should eventually reach an
expected application of the low-level hash functions on simple data
types; the tests do this exact comparison. At the moment, the tests pass
for simple cases, but they do not work for nested types with non-trivial
row layouts (e.g. ListTypes).

Issue: ARROW-8991
Issue: apacheGH-17211
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 10, 2024
This commit pulls the latest changes to key_hash.h and implementations
in light_array without the burden of a long development history.

The only change in key_hash.h is the addition of a friend function which
is used in scalar_hash_test.cc.

Changes in light_array.[h,cc] are to accommodate two scenarios: (1) the
use of ArraySpan, which was introduced after light_array was written;
and (2) the need for a KeyColumnArray to allocate data for the purposes
of interpreting (or decoding) the structure of a nested type.

The main reason for the 2nd scenario is that a ListArray may have many
values represented in a single row which should be hashed together;
however, if the ListArray has a nested ListArray or other type, the row
may have further structure. In the simplest interpretation, only the
highest-level structure (the "outer" ListArray) needs to be preserved,
and any further nested structures must be explicitly handled by custom
kernels (or any future options, etc. that are upstreamed into Arrow).

In trying to efficiently interpret complex nested types, ArraySpan can
be useful because it is non-owning, thus the main reason for the 1st
aforementioned scenario.

Although unfinished, any tests added to light_array_test.cc should
accommodate the 2 scenarios above.
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 10, 2024
This commit includes changes to register a new compute function without
the burden of a long development history.

The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in
compilation as it is used by the new Hash64 function defined in
api_scalar.[h,cc].

The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes
scalar_hash_test.cc in compilation for tests and it also adds a new
benchmark binary that is implemented by scalar_hash_benchmark.cc.

The registry files are updated to register the kernel implementations in
scalar_hash.cc with the function definitions in api_scalar.[h,cc].

Finally, docs/source/cpp/compute.rst adds documentation for the Hash64
function.

Issue: apacheGH-17211
Issue: ARROW-8991
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 10, 2024
This commit includes additions to the general hashing benchmarks that
cover the use of hashing functions in key_hash.h without carrying the
burden of a long dev history.

Some existing benchmark names were changed to distinguish between the
use of Int32 and Int64 types, new benchmarks were added that use the
functions declared in key_hash.h. The reason the new benchmarks are
added is because it is claimed they prioritize speed over cryptography
as they're primarily used for join algorithms and other processing
tasks, which the hashing benchmark can now provide observability for.

Issue: apacheGH-17211
Issue: ARROW-8991
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 16, 2024
This commit ports the latest state of scalar_hash kernels without
pulling a long development history.

This kernel is an element-wise function that uses an xxHash-like
algorithm, prioritizing speed and not suitable for cryptographic
purposes. The function is implemented by the `FastHashScalar` struct
which is templated by the output type (which is assumed to be either
UInt32 or UInt64, but there is no validation of that at the moment).

The benchmarks in scalar_hash_benchmark.cc uses the hashing_benchmark.cc
file as a reference (in cpp/src/arrow/util/), but only covers various
input types and the key hashing functions (from key_hash.h).

The tests in scalar_hash_test.cc use a simplified version of hashing
based on what is implemented in the key_hash.cc. The idea being that the
high-level entry points for high-level types should eventually reach an
expected application of the low-level hash functions on simple data
types; the tests do this exact comparison. At the moment, the tests pass
for simple cases, but they do not work for nested types with non-trivial
row layouts (e.g. ListTypes).

Issue: ARROW-8991
Issue: apacheGH-17211
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 16, 2024
This commit pulls the latest changes to key_hash.h and implementations
in light_array without the burden of a long development history.

The only change in key_hash.h is the addition of a friend function which
is used in scalar_hash_test.cc.

Changes in light_array.[h,cc] are to accommodate two scenarios: (1) the
use of ArraySpan, which was introduced after light_array was written;
and (2) the need for a KeyColumnArray to allocate data for the purposes
of interpreting (or decoding) the structure of a nested type.

The main reason for the 2nd scenario is that a ListArray may have many
values represented in a single row which should be hashed together;
however, if the ListArray has a nested ListArray or other type, the row
may have further structure. In the simplest interpretation, only the
highest-level structure (the "outer" ListArray) needs to be preserved,
and any further nested structures must be explicitly handled by custom
kernels (or any future options, etc. that are upstreamed into Arrow).

In trying to efficiently interpret complex nested types, ArraySpan can
be useful because it is non-owning, thus the main reason for the 1st
aforementioned scenario.

Although unfinished, any tests added to light_array_test.cc should
accommodate the 2 scenarios above.
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 16, 2024
This commit includes changes to register a new compute function without
the burden of a long development history.

The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in
compilation as it is used by the new Hash64 function defined in
api_scalar.[h,cc].

The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes
scalar_hash_test.cc in compilation for tests and it also adds a new
benchmark binary that is implemented by scalar_hash_benchmark.cc.

The registry files are updated to register the kernel implementations in
scalar_hash.cc with the function definitions in api_scalar.[h,cc].

Finally, docs/source/cpp/compute.rst adds documentation for the Hash64
function.

Issue: apacheGH-17211
Issue: ARROW-8991
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 16, 2024
This commit includes additions to the general hashing benchmarks that
cover the use of hashing functions in key_hash.h without carrying the
burden of a long dev history.

Some existing benchmark names were changed to distinguish between the
use of Int32 and Int64 types, new benchmarks were added that use the
functions declared in key_hash.h. The reason the new benchmarks are
added is because it is claimed they prioritize speed over cryptography
as they're primarily used for join algorithms and other processing
tasks, which the hashing benchmark can now provide observability for.

Issue: apacheGH-17211
Issue: ARROW-8991
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 17, 2024
This commit ports the latest state of scalar_hash kernels without
pulling a long development history.

This kernel is an element-wise function that uses an xxHash-like
algorithm, prioritizing speed and not suitable for cryptographic
purposes. The function is implemented by the `FastHashScalar` struct
which is templated by the output type (which is assumed to be either
UInt32 or UInt64, but there is no validation of that at the moment).

The benchmarks in scalar_hash_benchmark.cc uses the hashing_benchmark.cc
file as a reference (in cpp/src/arrow/util/), but only covers various
input types and the key hashing functions (from key_hash.h).

The tests in scalar_hash_test.cc use a simplified version of hashing
based on what is implemented in the key_hash.cc. The idea being that the
high-level entry points for high-level types should eventually reach an
expected application of the low-level hash functions on simple data
types; the tests do this exact comparison. At the moment, the tests pass
for simple cases, but they do not work for nested types with non-trivial
row layouts (e.g. ListTypes).

Issue: ARROW-8991
Issue: apacheGH-17211
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 17, 2024
This commit pulls the latest changes to key_hash.h and implementations
in light_array without the burden of a long development history.

The only change in key_hash.h is the addition of a friend function which
is used in scalar_hash_test.cc.

Changes in light_array.[h,cc] are to accommodate two scenarios: (1) the
use of ArraySpan, which was introduced after light_array was written;
and (2) the need for a KeyColumnArray to allocate data for the purposes
of interpreting (or decoding) the structure of a nested type.

The main reason for the 2nd scenario is that a ListArray may have many
values represented in a single row which should be hashed together;
however, if the ListArray has a nested ListArray or other type, the row
may have further structure. In the simplest interpretation, only the
highest-level structure (the "outer" ListArray) needs to be preserved,
and any further nested structures must be explicitly handled by custom
kernels (or any future options, etc. that are upstreamed into Arrow).

In trying to efficiently interpret complex nested types, ArraySpan can
be useful because it is non-owning, thus the main reason for the 1st
aforementioned scenario.

Although unfinished, any tests added to light_array_test.cc should
accommodate the 2 scenarios above.
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 17, 2024
This commit includes changes to register a new compute function without
the burden of a long development history.

The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in
compilation as it is used by the new Hash64 function defined in
api_scalar.[h,cc].

The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes
scalar_hash_test.cc in compilation for tests and it also adds a new
benchmark binary that is implemented by scalar_hash_benchmark.cc.

The registry files are updated to register the kernel implementations in
scalar_hash.cc with the function definitions in api_scalar.[h,cc].

Finally, docs/source/cpp/compute.rst adds documentation for the Hash64
function.

Issue: apacheGH-17211
Issue: ARROW-8991
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 17, 2024
This commit includes additions to the general hashing benchmarks that
cover the use of hashing functions in key_hash.h without carrying the
burden of a long dev history.

Some existing benchmark names were changed to distinguish between the
use of Int32 and Int64 types, new benchmarks were added that use the
functions declared in key_hash.h. The reason the new benchmarks are
added is because it is claimed they prioritize speed over cryptography
as they're primarily used for join algorithms and other processing
tasks, which the hashing benchmark can now provide observability for.

Issue: apacheGH-17211
Issue: ARROW-8991
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 20, 2024
This commit ports the latest state of scalar_hash kernels without
pulling a long development history.

This kernel is an element-wise function that uses an xxHash-like
algorithm, prioritizing speed and not suitable for cryptographic
purposes. The function is implemented by the `FastHashScalar` struct
which is templated by the output type (which is assumed to be either
UInt32 or UInt64, but there is no validation of that at the moment).

The benchmarks in scalar_hash_benchmark.cc uses the hashing_benchmark.cc
file as a reference (in cpp/src/arrow/util/), but only covers various
input types and the key hashing functions (from key_hash.h).

The tests in scalar_hash_test.cc use a simplified version of hashing
based on what is implemented in the key_hash.cc. The idea being that the
high-level entry points for high-level types should eventually reach an
expected application of the low-level hash functions on simple data
types; the tests do this exact comparison. At the moment, the tests pass
for simple cases, but they do not work for nested types with non-trivial
row layouts (e.g. ListTypes).

Issue: ARROW-8991
Issue: apacheGH-17211
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 20, 2024
This commit pulls the latest changes to key_hash.h and implementations
in light_array without the burden of a long development history.

The only change in key_hash.h is the addition of a friend function which
is used in scalar_hash_test.cc.

Changes in light_array.[h,cc] are to accommodate two scenarios: (1) the
use of ArraySpan, which was introduced after light_array was written;
and (2) the need for a KeyColumnArray to allocate data for the purposes
of interpreting (or decoding) the structure of a nested type.

The main reason for the 2nd scenario is that a ListArray may have many
values represented in a single row which should be hashed together;
however, if the ListArray has a nested ListArray or other type, the row
may have further structure. In the simplest interpretation, only the
highest-level structure (the "outer" ListArray) needs to be preserved,
and any further nested structures must be explicitly handled by custom
kernels (or any future options, etc. that are upstreamed into Arrow).

In trying to efficiently interpret complex nested types, ArraySpan can
be useful because it is non-owning, thus the main reason for the 1st
aforementioned scenario.

Although unfinished, any tests added to light_array_test.cc should
accommodate the 2 scenarios above.
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 20, 2024
This commit includes changes to register a new compute function without
the burden of a long development history.

The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in
compilation as it is used by the new Hash64 function defined in
api_scalar.[h,cc].

The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes
scalar_hash_test.cc in compilation for tests and it also adds a new
benchmark binary that is implemented by scalar_hash_benchmark.cc.

The registry files are updated to register the kernel implementations in
scalar_hash.cc with the function definitions in api_scalar.[h,cc].

Finally, docs/source/cpp/compute.rst adds documentation for the Hash64
function.

Issue: apacheGH-17211
Issue: ARROW-8991
kszucs pushed a commit to kszucs/arrow that referenced this issue Dec 20, 2024
This commit includes additions to the general hashing benchmarks that
cover the use of hashing functions in key_hash.h without carrying the
burden of a long dev history.

Some existing benchmark names were changed to distinguish between the
use of Int32 and Int64 types, new benchmarks were added that use the
functions declared in key_hash.h. The reason the new benchmarks are
added is because it is claimed they prioritize speed over cryptography
as they're primarily used for join algorithms and other processing
tasks, which the hashing benchmark can now provide observability for.

Issue: apacheGH-17211
Issue: ARROW-8991
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants