Migrate hashing operations to pylibcudf #15418

Merged
merged 69 commits into from
Oct 31, 2024

Conversation

brandon-b-miller
Contributor

This PR creates pylibcudf hashing APIs and modifies the cuDF Cython to leverage them. cc @vyasr

@brandon-b-miller brandon-b-miller added feature request New feature or request Python Affects Python cuDF API. non-breaking Non-breaking change labels Apr 1, 2024
@brandon-b-miller brandon-b-miller self-assigned this Apr 1, 2024
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner April 1, 2024 16:48

copy-pr-bot bot commented Apr 1, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the CMake CMake build issue label Apr 1, 2024
@brandon-b-miller
Contributor Author

Ah - pushed this from my workstation where I guess I don't have signing set up. Should be able to fix this.

python/cudf/cudf/_lib/cpp/hash.pxd (outdated, resolved)
from cudf._lib.cpp.table cimport table
from libcpp.vector cimport vector

#from cudf._lib.cpp.hash cimport hash_id
Contributor

This can be removed.

python/cudf/cudf/_lib/pylibcudf/hashing.pyx (outdated, resolved)
python/cudf/cudf/_lib/pylibcudf/hashing.pyx (outdated, resolved)
Signed-off-by: brandon-b-miller <brmiller@nvidia.com>
Signed-off-by: brandon-b-miller <brmiller@nvidia.com>
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner April 3, 2024 16:09
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 3, 2024
Member

@PointKernel PointKernel left a comment

Approving C++ changes.

Contributor

@vyasr vyasr left a comment

Overall looks really good.

python/cudf/cudf/_lib/cpp/hash.pxd (outdated, resolved)
python/cudf/cudf/_lib/pylibcudf/hashing.pxd (outdated, resolved)
python/cudf/cudf/_lib/pylibcudf/hashing.pyx (outdated, resolved)
python/cudf/cudf/_lib/pylibcudf/hashing.pyx (outdated, resolved)
python/cudf/cudf/pylibcudf_tests/conftest.py (outdated, resolved)
Comment on lines 21 to 23
@pytest.fixture(scope="module")
def input_column(pa_input_column):
    return plc.interop.from_arrow(pa_input_column)
Contributor

You probably want to delete this too if we do move the fixtures to conftest.

python/cudf/cudf/pylibcudf_tests/test_hashing.py (outdated, resolved)
brandon-b-miller and others added 2 commits April 5, 2024 08:05
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
@brandon-b-miller
Contributor Author

After picking apart the libcudf source code, I was able to construct tests for list and struct that pass. The tests try to be as faithful as possible to the libcudf implementation, including the extra mixing step that each of the two types appears to apply to the hash in a slightly different way.
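
As a rough illustration of the kind of mixing step mentioned above, here is a minimal pure-Python sketch of a boost-style 32-bit hash combine. It assumes the combine has the form `lhs ^ (rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2))`; whether the list and struct paths in libcudf use exactly this form is an assumption, and `MASK32`, `hash_combine_32`, and `combine_all` are hypothetical names that are not taken from this PR.

```python
MASK32 = 0xFFFFFFFF  # keep every intermediate value within 32 bits


def hash_combine_32(lhs: int, rhs: int) -> int:
    """Boost-style 32-bit hash combine (assumed form, see the note above)."""
    lhs &= MASK32
    rhs &= MASK32
    # Emulate uint32_t wraparound: the left shift and the sum are both masked.
    mixed = (rhs + 0x9E3779B9 + ((lhs << 6) & MASK32) + (lhs >> 2)) & MASK32
    return lhs ^ mixed


def combine_all(hashes, seed=0):
    """Fold a sequence of 32-bit element hashes into a single 32-bit value."""
    result = seed & MASK32
    for h in hashes:
        result = hash_combine_32(result, h)
    return result


if __name__ == "__main__":
    # The combined value always fits in an unsigned 32-bit integer.
    print(hex(combine_all([1, 2, 3])))
```

Python integers never wrap on their own, so masking with `0xFFFFFFFF` is what stands in for C++ `uint32_t` arithmetic here. Leaving a mask out is one way a reference value can grow past the uint32 range.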

@brandon-b-miller
Contributor Author

/ok to test

@brandon-b-miller
Contributor Author

/ok to test

python/pylibcudf/pylibcudf/tests/conftest.py (outdated, resolved)
@bdice
Contributor

bdice commented Oct 28, 2024

I reviewed the diff since my last review pass and approved to help move this forward. The code quality seems sufficient (most of the remaining cleanup/refactoring would be in the tests), and we don't want to let this drag out forever.

Thanks for your persistence @brandon-b-miller!

brandon-b-miller and others added 2 commits October 28, 2024 17:29
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
@brandon-b-miller
Contributor Author

/ok to test

@vyasr
Contributor

vyasr commented Oct 28, 2024

Unless I missed something, the test failures on the last commit where tests ran looked semi-legit. Not like something is wrong in our core hashing impl, but we might need to do a bit of work to get things wrapped up nicely in pyarrow objects.

@brandon-b-miller
Contributor Author

Unless I missed something, the test failures on the last commit where tests ran looked semi-legit. Not like something is wrong in our core hashing impl, but we might need to do a bit of work to get things wrapped up nicely in pyarrow objects.

Strange. It appears to affect both of the Python 3.10 test jobs. The error indicates that the hash value coming out of the Python implementation doesn't fit in the uint32 output array. I will need to spin up local CI and figure out what's going on.

@bdice
Contributor

bdice commented Oct 28, 2024

It's failing on "oldest dependencies" so this could be a difference in older/newer PyArrow versions -- not necessarily Python 3.10.

@vyasr
Contributor

vyasr commented Oct 28, 2024

Yes, that seems very likely. If you can narrow it down to a specific pyarrow version and the fix isn't easy, I'd be fine with xfailing the test on the unsupported pyarrow version. We have a couple of tests like that already (see #16681).

@brandon-b-miller
Contributor Author

I don't think the issue is with pyarrow. It's failing to construct the output array, but the reason is that it's trying to fit the number 6427104240 inside an array that is being typed as uint32. This number is coming out of hash_list and hash_struct.

I suspect numpy: something may have changed with tobytes between 1.23.5 and 2.1.2.

@brandon-b-miller
Contributor Author

Turns out numpy was indeed the problem.

>>> np.__version__
'2.0.2'
>>> type(np.uint32(0) >> 2)
<class 'numpy.uint32'>

vs

>>> np.__version__
'1.26.0'
>>> type(np.uint32(0) >> 2)
<class 'numpy.int64'>

Having a little trouble narrowing down exactly when this changed, but in any event enforcing uint32 in a few more places seems to fix the issue.
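
A minimal sketch of what "enforcing uint32" can look like, assuming the reference implementation shifts NumPy scalars as in the sessions above. `shift_right_u32` and `UINT32_MAX` are hypothetical names, not code from this PR. With both operands typed as `np.uint32`, no Python int takes part in dtype promotion, so the result type should not depend on the NumPy version.

```python
import numpy as np

UINT32_MAX = int(np.iinfo(np.uint32).max)  # 4294967295


def shift_right_u32(value, bits):
    """Right-shift with both operands explicitly typed as uint32."""
    return np.uint32(value) >> np.uint32(bits)


if __name__ == "__main__":
    result = shift_right_u32(0, 2)
    print(type(result))

    # The value reported earlier in the thread is outside the uint32 range,
    # which is why it could not be stored in a uint32 output array.
    print(6427104240 > UINT32_MAX)
```

The same idea applies to the other operators in the mixing arithmetic: wrapping each literal in `np.uint32(...)` keeps intermediate results from widening to a 64-bit type.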

@brandon-b-miller
Contributor Author

/ok to test

@bdice
Contributor

bdice commented Oct 30, 2024

Having a little trouble narrowing down exactly when this changed, but in any event enforcing uint32 in a few more places seems to fix the issue.

Casting behavior changed significantly in NumPy 2 (for the better, imo). See the numpy 2.0.0 release notes and NEP 50 about dtype promotion rules.

@wence-
Contributor

wence- commented Oct 30, 2024

Docs build is not finding all the links.

@vyasr have your concerns been addressed?

@brandon-b-miller
Contributor Author

Docs build is not finding all the links.

I think this might have something to do with the C++ hashing APIs not showing up in our Sphinx docs. I see a top-level namespace listed with a short description, but I don't see the individual hashing APIs anywhere.

@brandon-b-miller
Contributor Author

/ok to test

Contributor

@vyasr vyasr left a comment

I'm happy with this PR now, assuming we can get the docs build passing. The current issues are because the DEFAULT_HASH_SEED constant is defined outside of the @addtogroup column_hash in hashing.hpp, so it is not included in the group and therefore you have APIs in the group referencing an unknown constant. The same goes for the hash_value_type.

The pylibcudf doc failures are because (don't ask me why) Sphinx uses namespaced names for C++ types (e.g. cudf::aggregation) but it does not use the namespaces for the functions. You should be able to get them to resolve by removing the cpp::hashing:: prefixes.

@brandon-b-miller
Contributor Author

/ok to test

@brandon-b-miller
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit a69de57 into rapidsai:branch-24.12 Oct 31, 2024
105 checks passed
Labels
CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: Done
6 participants