Migrate hashing operations to pylibcudf #15418

Merged on Oct 31, 2024 (69 commits)
Changes from 7 commits
Commits
968aef5
hashing - initial
brandon-b-miller Apr 1, 2024
f4c953c
minor cleanup for now
brandon-b-miller Apr 1, 2024
eeee5ee
add hash top level function
brandon-b-miller Apr 1, 2024
0576423
begin tests
brandon-b-miller Apr 2, 2024
306ad1d
some untested code worth saving
brandon-b-miller Apr 3, 2024
a15dd45
tests run and fail
brandon-b-miller Apr 3, 2024
30b0f2b
todo
brandon-b-miller Apr 3, 2024
eeb4edb
Apply suggestions from code review
brandon-b-miller Apr 5, 2024
6762279
remove hash_id
brandon-b-miller Apr 5, 2024
8ab4afa
docs
brandon-b-miller Apr 5, 2024
ccf64d4
small lint
brandon-b-miller Apr 5, 2024
9ce384a
add DEFAULT_HASH_SEED from hpp
brandon-b-miller Apr 5, 2024
bed5792
fix xxhash_64_string test
brandon-b-miller Apr 5, 2024
992cba1
raise for unimplemented hash test functions on the python side for now
brandon-b-miller Apr 8, 2024
b61d2cc
Merge branch 'branch-24.06' into pylibcudf-hashing
brandon-b-miller Apr 10, 2024
2977b63
fix up some tests
brandon-b-miller Apr 10, 2024
c0bb09a
separate md5 test
brandon-b-miller Apr 10, 2024
874f317
cleanup
brandon-b-miller Apr 11, 2024
edcde76
more cleanup
brandon-b-miller Apr 11, 2024
3a442cb
begin hashing tests
brandon-b-miller Apr 12, 2024
6841de9
Merge branch 'branch-24.06' into pylibcudf-hashing
brandon-b-miller Apr 15, 2024
eef3616
fix up murmurhash3_x64_128 test, list struct error sha test
brandon-b-miller Apr 16, 2024
ab5870d
add mmh3_x86_32 tests that currently fail
brandon-b-miller Apr 16, 2024
41c0ae6
Apply suggestions from code review
brandon-b-miller May 2, 2024
3fdd04f
Merge branch 'branch-24.06' into pylibcudf-hashing
brandon-b-miller May 2, 2024
774b093
update cpp errors
brandon-b-miller May 2, 2024
2f131b3
address some reviews
brandon-b-miller May 3, 2024
af6e59d
uncomment xxhash_64
brandon-b-miller May 3, 2024
ee56145
Merge branch 'branch-24.06' into pylibcudf-hashing
brandon-b-miller May 3, 2024
2e6743e
add mmh3 to test_python_cudf
brandon-b-miller May 3, 2024
8d8bef9
fix murmurhash3_x86_32
brandon-b-miller May 3, 2024
5930c7a
add xxhash testing dependency
brandon-b-miller May 3, 2024
cedc89f
depandasify
brandon-b-miller May 3, 2024
840f3e4
Merge branch 'branch-24.06' into pylibcudf-hashing
brandon-b-miller May 15, 2024
17e63eb
merge latest/resolve conflicts/fix
brandon-b-miller May 16, 2024
751e5f3
fix pylibcudf tests
brandon-b-miller May 16, 2024
642b444
update dependencies
brandon-b-miller May 16, 2024
cbeb9f9
linting
brandon-b-miller May 16, 2024
174af9d
refactor
brandon-b-miller May 16, 2024
24c72a3
Merge branch 'branch-24.06' into pylibcudf-hashing
brandon-b-miller May 22, 2024
9f5355f
merge latest/resolve conflicts
brandon-b-miller Jun 28, 2024
3d10495
Merge branch 'branch-24.08' into pylibcudf-hashing
brandon-b-miller Jul 3, 2024
37a91bf
debug commit
brandon-b-miller Jul 8, 2024
a5ec407
merged but not building yet
brandon-b-miller Sep 30, 2024
bab6bb5
merge/resolve
brandon-b-miller Oct 2, 2024
b406b41
small updates
brandon-b-miller Oct 2, 2024
d730b6f
refactor/pass, missing a few tests
brandon-b-miller Oct 4, 2024
751a89c
extra test
brandon-b-miller Oct 4, 2024
d99b5cf
Merge branch 'branch-24.12' into pylibcudf-hashing
brandon-b-miller Oct 14, 2024
68c7a49
missing test
brandon-b-miller Oct 14, 2024
370405b
Merge branch 'branch-24.12' into pylibcudf-hashing
brandon-b-miller Oct 17, 2024
cdd41db
prune moves
brandon-b-miller Oct 17, 2024
2dc49b2
Merge branch 'branch-24.12' into pylibcudf-hashing
brandon-b-miller Oct 18, 2024
a6ded88
fixes
brandon-b-miller Oct 18, 2024
382c2dc
Update docs/cudf/source/user_guide/api_docs/pylibcudf/hashing.rst
brandon-b-miller Oct 22, 2024
a0f9d07
Merge branch 'branch-24.12' into pylibcudf-hashing
brandon-b-miller Oct 22, 2024
a55048a
Apply suggestions from code review
brandon-b-miller Oct 22, 2024
23cd5fe
combine sha/md5 tests
brandon-b-miller Oct 22, 2024
1a4cfad
struct and list tests, struct still fails
brandon-b-miller Oct 27, 2024
4c37de9
pass.
brandon-b-miller Oct 27, 2024
5423074
clean
brandon-b-miller Oct 27, 2024
f0ec39b
latest
brandon-b-miller Oct 27, 2024
46b27a1
style
brandon-b-miller Oct 28, 2024
60e5c4c
Update python/pylibcudf/pylibcudf/tests/conftest.py
brandon-b-miller Oct 28, 2024
dcd38c6
update docstrings
brandon-b-miller Oct 28, 2024
5d1c2f1
Merge branch 'branch-24.12' into pylibcudf-hashing
brandon-b-miller Oct 29, 2024
7f3157b
enforce uint32
brandon-b-miller Oct 30, 2024
08b8818
adjust doxygen tags
brandon-b-miller Oct 30, 2024
d0234d4
doc fixes
brandon-b-miller Oct 31, 2024
3 changes: 2 additions & 1 deletion cpp/src/hash/sha_hash.cuh
@@ -512,7 +512,8 @@ std::unique_ptr<column> sha_hash(table_view const& input,
   CUDF_EXPECTS(
     std::all_of(
       input.begin(), input.end(), [](auto const& col) { return sha_leaf_type_check(col.type()); }),
-    "Unsupported column type for hash function.");
+    "Unsupported column type for hash function.",
+    cudf::data_type_error);

   // Result column allocation and creation
   auto begin = thrust::make_constant_iterator(Hasher::digest_size);
6 changes: 6 additions & 0 deletions docs/cudf/source/user_guide/api_docs/pylibcudf/hashing.rst
@@ -0,0 +1,6 @@
========
hashing
========

.. automodule:: cudf._lib.pylibcudf.hashing
:members:
1 change: 1 addition & 0 deletions docs/cudf/source/user_guide/api_docs/pylibcudf/index.rst
@@ -16,6 +16,7 @@ This page provides API documentation for pylibcudf.
filling
gpumemoryview
groupby
hashing
join
lists
merge
28 changes: 20 additions & 8 deletions python/cudf/cudf/_lib/cpp/hash.pxd
@@ -7,40 +7,52 @@ from libcpp.vector cimport vector
 from cudf._lib.cpp.column.column cimport column
 from cudf._lib.cpp.table.table cimport table
 from cudf._lib.cpp.table.table_view cimport table_view
+from cudf._lib.exception_handler cimport cudf_exception_handler


 cdef extern from "cudf/hashing.hpp" namespace "cudf::hashing" nogil:

     cdef unique_ptr[column] murmurhash3_x86_32 "cudf::hashing::murmurhash3_x86_32" (
         const table_view& input,
         const uint32_t seed
-    ) except +
+    ) except +cudf_exception_handler

+    cdef unique_ptr[table] murmurhash3_x64_128 "cudf::hashing::murmurhash3_x64_128" (
+        const table_view& input,
+        const uint64_t seed
+    ) except +cudf_exception_handler
+
     cdef unique_ptr[column] md5 "cudf::hashing::md5" (
         const table_view& input
-    ) except +
+    ) except +cudf_exception_handler

     cdef unique_ptr[column] sha1 "cudf::hashing::sha1" (
         const table_view& input
-    ) except +
+    ) except +cudf_exception_handler

     cdef unique_ptr[column] sha224 "cudf::hashing::sha224" (
         const table_view& input
-    ) except +
+    ) except +cudf_exception_handler

     cdef unique_ptr[column] sha256 "cudf::hashing::sha256" (
         const table_view& input
-    ) except +
+    ) except +cudf_exception_handler

     cdef unique_ptr[column] sha384 "cudf::hashing::sha384" (
         const table_view& input
-    ) except +
+    ) except +cudf_exception_handler

     cdef unique_ptr[column] sha512 "cudf::hashing::sha512" (
         const table_view& input
-    ) except +
+    ) except +cudf_exception_handler

     cdef unique_ptr[column] xxhash_64 "cudf::hashing::xxhash_64" (
         const table_view& input,
         const uint64_t seed
-    ) except +
+    ) except +cudf_exception_handler
+
+    cpdef enum class hash_id(int):
+        HASH_IDENTITY
+        HASH_MURMUR3
+        HASH_SPARK_MURMUR3
+        HASH_MD5
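Some context on the switch from `except +` to `except +cudf_exception_handler`: a custom handler lets the bindings choose which Python exception a given C++ exception becomes. The plain-Python sketch below models only that idea. It assumes the handler turns `cudf::data_type_error` into `TypeError` and `cudf::logic_error` into `RuntimeError`; the class names, the `TRANSLATION` table, the decorator, and the toy `sha_hash` are all hypothetical stand-ins, not cudf code.

```python
import functools


class LogicError(Exception):
    """Hypothetical stand-in for the C++ cudf::logic_error."""


class DataTypeError(Exception):
    """Hypothetical stand-in for the C++ cudf::data_type_error."""


# Assumed mapping from "C++" exception types to Python exception types.
TRANSLATION = {LogicError: RuntimeError, DataTypeError: TypeError}


def translate_exceptions(func):
    """Re-raise known stand-in exceptions as their mapped Python type."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except tuple(TRANSLATION) as exc:
            raise TRANSLATION[type(exc)](str(exc)) from None
    return wrapper


@translate_exceptions
def sha_hash(column_types):
    """Toy model of the type check in sha_hash; the type names are illustrative."""
    unsupported = {"list", "struct"}
    if any(t in unsupported for t in column_types):
        raise DataTypeError("Unsupported column type for hash function.")
    return "digest column"
```

With this shape, a caller can catch `TypeError` for an unsupported column type instead of a generic runtime failure, which is what the `cudf::data_type_error` argument added to `CUDF_EXPECTS` appears to be aiming for.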
Empty file.
45 changes: 19 additions & 26 deletions python/cudf/cudf/_lib/hash.pyx
@@ -7,10 +7,14 @@ from libcpp.pair cimport pair
from libcpp.utility cimport move
from libcpp.vector cimport vector

from cudf._lib import pylibcudf

cimport cudf._lib.cpp.types as libcudf_types
from cudf._lib.column cimport Column
from cudf._lib.cpp.column.column cimport column
from cudf._lib.cpp.hash cimport (
from cudf._lib.cpp.partitioning cimport hash_partition as cpp_hash_partition
from cudf._lib.cpp.table.table cimport table
from cudf._lib.cpp.table.table_view cimport table_view
from cudf._lib.pylibcudf.hashing cimport (
md5,
murmurhash3_x86_32,
sha1,
@@ -20,9 +24,6 @@ from cudf._lib.cpp.hash cimport (
sha512,
xxhash_64,
)
from cudf._lib.cpp.partitioning cimport hash_partition as cpp_hash_partition
from cudf._lib.cpp.table.table cimport table
from cudf._lib.cpp.table.table_view cimport table_view
from cudf._lib.utils cimport columns_from_unique_ptr, table_view_from_columns


@@ -51,32 +52,24 @@ def hash_partition(list source_columns, object columns_to_hash,

 @acquire_spill_lock()
 def hash(list source_columns, str method, int seed=0):
-    cdef table_view c_source_view = table_view_from_columns(source_columns)
-    cdef unique_ptr[column] c_result
+    ctbl = pylibcudf.Table([c.to_pylibcudf(mode="read") for c in source_columns])
     if method == "murmur3":
-        with nogil:
-            c_result = move(murmurhash3_x86_32(c_source_view, seed))
+        return Column.from_pylibcudf(murmurhash3_x86_32(ctbl, seed))
+    elif method == "xxhash64":
+        return Column.from_pylibcudf(xxhash_64(ctbl, seed))
     elif method == "md5":
-        with nogil:
-            c_result = move(md5(c_source_view))
+        return Column.from_pylibcudf(md5(ctbl))
     elif method == "sha1":
-        with nogil:
-            c_result = move(sha1(c_source_view))
+        return Column.from_pylibcudf(sha1(ctbl))
     elif method == "sha224":
-        with nogil:
-            c_result = move(sha224(c_source_view))
+        return Column.from_pylibcudf(sha224(ctbl))
     elif method == "sha256":
-        with nogil:
-            c_result = move(sha256(c_source_view))
+        return Column.from_pylibcudf(sha256(ctbl))
     elif method == "sha384":
-        with nogil:
-            c_result = move(sha384(c_source_view))
+        return Column.from_pylibcudf(sha384(ctbl))
     elif method == "sha512":
-        with nogil:
-            c_result = move(sha512(c_source_view))
-    elif method == "xxhash64":
-        with nogil:
-            c_result = move(xxhash_64(c_source_view, seed))
+        return Column.from_pylibcudf(sha512(ctbl))
     else:
-        raise ValueError(f"Unsupported hash function: {method}")
-    return Column.from_unique_ptr(move(c_result))
+        raise ValueError(
+            f"Unsupported hashing algorithm {method}."
+        )
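The `if`/`elif` chain in `hash` could also be written as a lookup table. The sketch below is only an illustration of that alternative, not what this PR does. `_fake`, `SEEDED`, `UNSEEDED`, and `hash_columns` are hypothetical names, and the stand-in callables just record their arguments rather than calling pylibcudf.

```python
def _fake(name):
    """Build a hypothetical stand-in that records how it was called."""
    def run(table, seed=None):
        return (name, table, seed)
    return run


# Method names accepted by `hash`, mapped to stand-ins for the
# pylibcudf.hashing functions they would dispatch to.
SEEDED = {
    "murmur3": _fake("murmurhash3_x86_32"),
    "xxhash64": _fake("xxhash_64"),
}
UNSEEDED = {
    name: _fake(name)
    for name in ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")
}


def hash_columns(table, method, seed=0):
    """Dispatch on `method`; only the seeded algorithms receive `seed`."""
    if method in SEEDED:
        return SEEDED[method](table, seed)
    if method in UNSEEDED:
        return UNSEEDED[method](table)
    raise ValueError(f"Unsupported hashing algorithm {method}.")
```

Splitting the table into seeded and unseeded entries keeps the rule from the diff: only `murmur3` and `xxhash64` are passed `seed`.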
1 change: 1 addition & 0 deletions python/cudf/cudf/_lib/pylibcudf/CMakeLists.txt
@@ -21,6 +21,7 @@ set(cython_sources
filling.pyx
gpumemoryview.pyx
groupby.pyx
hashing.pyx
interop.pyx
join.pyx
lists.pyx
2 changes: 2 additions & 0 deletions python/cudf/cudf/_lib/pylibcudf/__init__.pxd
@@ -8,6 +8,7 @@ from . cimport (
copying,
filling,
groupby,
hashing,
join,
lists,
merge,
@@ -40,6 +41,7 @@ __all__ = [
"filling",
"gpumemoryview",
"groupby",
"hashing",
"join",
"lists",
"merge",
2 changes: 2 additions & 0 deletions python/cudf/cudf/_lib/pylibcudf/__init__.py
@@ -7,6 +7,7 @@
copying,
filling,
groupby,
hashing,
interop,
join,
lists,
@@ -39,6 +40,7 @@
"filling",
"gpumemoryview",
"groupby",
"hashing",
"interop",
"join",
"lists",
34 changes: 34 additions & 0 deletions python/cudf/cudf/_lib/pylibcudf/hashing.pxd
@@ -0,0 +1,34 @@
# Copyright (c) 2024, NVIDIA CORPORATION.

from libc.stdint cimport uint32_t, uint64_t
from cudf._lib.cpp.types cimport size_type
from cudf._lib.cpp.table cimport table
from libcpp.vector cimport vector

#from cudf._lib.cpp.hash cimport hash_id
Review comment (Contributor): This can be removed.


from .column cimport Column
from .table cimport Table


cpdef Column murmurhash3_x86_32(
    Table input,
    uint32_t seed
)

cpdef Table murmurhash3_x64_128(
    Table input,
    uint64_t seed
)

cpdef Column md5(Table input)
cpdef Column sha1(Table input)
cpdef Column sha224(Table input)
cpdef Column sha256(Table input)
cpdef Column sha384(Table input)
cpdef Column sha512(Table input)

cpdef Column xxhash_64(
    Table input,
    uint64_t seed
)
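The declarations above type the seed as `uint32_t` for `murmurhash3_x86_32` and `uint64_t` for `murmurhash3_x64_128` and `xxhash_64`. Cython converts the Python integer at the call boundary and raises `OverflowError` when the value is negative or too large for the C type. The sketch below mimics that check in plain Python so the accepted ranges are explicit. `check_seed` is a hypothetical helper and not part of pylibcudf.

```python
def check_seed(seed: int, bits: int) -> int:
    """Return `seed` unchanged if it fits an unsigned integer of `bits` bits."""
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError(f"seed must be an int, got {type(seed).__name__}")
    if seed < 0 or seed >= (1 << bits):
        raise OverflowError(f"seed {seed} does not fit in uint{bits}_t")
    return seed


# Largest seeds accepted by the 32-bit and 64-bit variants respectively.
UINT32_MAX = (1 << 32) - 1
UINT64_MAX = (1 << 64) - 1
```

A seed of `2**32` would therefore be valid for `xxhash_64` but rejected for `murmurhash3_x86_32`.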
105 changes: 105 additions & 0 deletions python/cudf/cudf/_lib/pylibcudf/hashing.pyx
@@ -0,0 +1,105 @@
# Copyright (c) 2024, NVIDIA CORPORATION.
from libc.stdint cimport uint32_t, uint64_t
from libcpp.memory cimport unique_ptr
from libcpp.utility cimport move

from cudf._lib.cpp.column.column cimport column
from cudf._lib.cpp.hash cimport (
    md5 as cpp_md5,
    murmurhash3_x64_128 as cpp_murmurhash3_x64_128,
    murmurhash3_x86_32 as cpp_murmurhash3_x86_32,
    sha1 as cpp_sha1,
    sha224 as cpp_sha224,
    sha256 as cpp_sha256,
    sha384 as cpp_sha384,
    sha512 as cpp_sha512,
    xxhash_64 as cpp_xxhash_64,
)
from cudf._lib.cpp.table.table cimport table

from .column cimport Column
from .table cimport Table


cpdef Column murmurhash3_x86_32(
    Table input,
    uint32_t seed
):
    cdef unique_ptr[column] c_result
    with nogil:
        c_result = move(
            cpp_murmurhash3_x86_32(
                input.view(),
                seed
            )
        )

    return Column.from_libcudf(move(c_result))

cpdef Table murmurhash3_x64_128(
    Table input,
    uint64_t seed
):
    cdef unique_ptr[table] c_result
    with nogil:
        c_result = move(
            cpp_murmurhash3_x64_128(
                input.view(),
                seed
            )
        )

    return Table.from_libcudf(move(c_result))


cpdef Column xxhash_64(
    Table input,
    uint64_t seed
):
    cdef unique_ptr[column] c_result
    with nogil:
        c_result = move(
            cpp_xxhash_64(
                input.view(),
                seed
            )
        )

    return Column.from_libcudf(move(c_result))


cpdef Column md5(Table input):
    cdef unique_ptr[column] c_result
    with nogil:
        c_result = move(cpp_md5(input.view()))
    return Column.from_libcudf(move(c_result))

cpdef Column sha1(Table input):
    cdef unique_ptr[column] c_result
    with nogil:
        c_result = move(cpp_sha1(input.view()))
    return Column.from_libcudf(move(c_result))

cpdef Column sha224(Table input):
    cdef unique_ptr[column] c_result
    with nogil:
        c_result = move(cpp_sha224(input.view()))
    return Column.from_libcudf(move(c_result))

cpdef Column sha256(Table input):
    cdef unique_ptr[column] c_result
    with nogil:
        c_result = move(cpp_sha256(input.view()))
    return Column.from_libcudf(move(c_result))

cpdef Column sha384(Table input):
    cdef unique_ptr[column] c_result
    with nogil:
        c_result = move(cpp_sha384(input.view()))
    return Column.from_libcudf(move(c_result))

cpdef Column sha512(Table input):
    cdef unique_ptr[column] c_result
    with nogil:
        c_result = move(cpp_sha512(input.view()))
    return Column.from_libcudf(move(c_result))
1 change: 0 additions & 1 deletion python/cudf/cudf/pylibcudf_tests/common/utils.py
@@ -35,7 +35,6 @@ def assert_column_eq(plc_column: plc.Column, pa_array: pa.Array) -> None:
plc_pa = plc_pa.combine_chunks()
if isinstance(pa_array, pa.ChunkedArray):
pa_array = pa_array.combine_chunks()

assert plc_pa.equals(pa_array)


24 changes: 24 additions & 0 deletions python/cudf/cudf/pylibcudf_tests/conftest.py
@@ -11,6 +11,8 @@

from utils import DEFAULT_STRUCT_TESTING_TYPE

import cudf._lib.pylibcudf as plc


# This fixture defines the standard set of types that all tests should default to
# running on. If there is a need for some tests to run on a different set of types, that
@@ -29,3 +31,25 @@
)
def pa_type(request):
return request.param


# TODO: Test nullable data
Review comment (Contributor): Could we please open an issue for this TODO?

@pytest.fixture(scope="session")
def pa_input_column(pa_type):
    if pa.types.is_integer(pa_type) or pa.types.is_floating(pa_type):
        return pa.array([1, 2, 3], type=pa_type)
    elif pa.types.is_string(pa_type):
        return pa.array(["a", "b", "c"], type=pa_type)
    elif pa.types.is_boolean(pa_type):
        return pa.array([True, True, False], type=pa_type)
    elif pa.types.is_list(pa_type):
        # TODO: Add heterogenous sizes
        return pa.array([[1], [2], [3]], type=pa_type)
    elif pa.types.is_struct(pa_type):
        return pa.array([{"v": 1}, {"v": 2}, {"v": 3}], type=pa_type)
    raise ValueError("Unsupported type")


@pytest.fixture(scope="session")
def input_column(pa_input_column):
    return plc.interop.from_arrow(pa_input_column)
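A test built on these fixtures needs reference values to compare against. One possible way to get them for the string case is Python's `hashlib`, as sketched below. This assumes the GPU functions return lowercase hex digests of each row's UTF-8 bytes, which this diff does not show; `reference_digests` and `HEX_LENGTHS` are hypothetical names.

```python
import hashlib

# Same values as the string branch of pa_input_column.
rows = ["a", "b", "c"]

ALGORITHMS = ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")


def reference_digests(values, algorithm):
    """Hex digest of each value's UTF-8 bytes, computed on the CPU."""
    return [
        hashlib.new(algorithm, value.encode("utf-8")).hexdigest()
        for value in values
    ]


# Number of hex characters per digest: two per digest byte.
HEX_LENGTHS = {name: hashlib.new(name).digest_size * 2 for name in ALGORITHMS}

expected_md5 = reference_digests(rows, "md5")
```

The integer, boolean, list, and struct fixtures would need a different reference, because the bytes fed to the hash depend on how each value is laid out in memory.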