-
Notifications
You must be signed in to change notification settings - Fork 921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate hashing operations to pylibcudf
#15418
Migrate hashing operations to pylibcudf
#15418
Conversation
Ah - pushed this from my workstation where I guess I don't have signing set up. Should be able to fix this. |
from cudf._lib.cpp.table cimport table | ||
from libcpp.vector cimport vector | ||
|
||
#from cudf._lib.cpp.hash cimport hash_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed.
f0a8695
to
c3c03be
Compare
Signed-off-by: brandon-b-miller <brmiller@nvidia.com>
Signed-off-by: brandon-b-miller <brmiller@nvidia.com>
c3c03be
to
f4c953c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving C++ changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks really good.
@pytest.fixture(scope="module") | ||
def input_column(pa_input_column): | ||
return plc.interop.from_arrow(pa_input_column) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably want to delete this too if we do move the fixtures to conftest.
Co-authored-by: Bradley Dice <bdice@bradleydice.com> Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
After some picking apart the libcudf source code I was able to construct tests for list and struct that pass. The tests attempt to be as faithful as possible to the libcudf implementation including some separate extra mixing both types appear to do to the hash somewhat differently. |
/ok to test |
/ok to test |
I reviewed the diff since my last pass of review, and approved to help move this forward. The code quality seems sufficient (the most possible cleanup / refactoring is in tests), and we don't want to let this drag out forever. Thanks for your persistence @brandon-b-miller! |
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
/ok to test |
Unless I missed something, the test failures on the last commit where tests ran looked semi-legit. Not like something is wrong in our core hashing impl, but we might need to do a bit of work to get things wrapped up nicely in pyarrow objects. |
Strange. Appears to be both the python 3.10 tests. The error indicates the hash value coming out of the python impl isn't fitting in the |
It's failing on "oldest dependencies" so this could be a difference in older/newer PyArrow versions -- not necessarily Python 3.10. |
Yes, that seems very likely. If you can narrow it down to a specific pyarrow version and the fix isn't easy, I'd be fine with xfailing the test on the unsupported pyarrow version. We have a couple of tests like that already (see #16681). |
I don't think the issue is with pyarrow. It's failing to construct the output array, but the reason is that it's trying to fit the number I suspect numpy and something having changed with |
Turns out numpy was indeed the problem. >>> np.__version__
'2.0.2'
>>> type(np.uint32(0) >> 2)
<class 'numpy.uint32'> vs >>> np.__version__
'1.26.0'
>>> type(np.uint32(0) >> 2)
<class 'numpy.int64'> Having a little trouble narrowing down exactly when this changed, but in any event enforcing uint32 in a few more places seems to fix the issue. |
/ok to test |
Casting behavior changed significantly in NumPy 2 (for the better, imo). See the numpy 2.0.0 release notes and NEP 50 about dtype promotion rules. |
Docs build is not finding all the links. @vyasr have your concerns been addressed? |
I think this might have something to do with the c++ hashing APIs not showing up in our sphinx docs. I see a top level namespace with a short description listed but I don't see the individual hashing APIs anywhere. |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with this PR now, assuming we can get the docs build passing. The current issues are because the DEFAULT_HASH_SEED
constant is defined outside of the @addtogroup column_hash
in hashing.hpp, so it is not included in the group and therefore you have APIs in the group referencing an unknown constant. The same goes for the hash_value_type
.
The pylibcudf doc failures are because (don't ask me why) Sphinx uses namespaced names for C++ types (e.g. cudf::aggregation
) but it does not use the namespaces for the functions. You should be able to get them to resolve by remove the cpp::hashing::
prefixes.
/ok to test |
/merge |
This PR creates
pylibcudf
hashing APIs and modifies the cuDF Cython to leverage them. cc @vyasr