Use drop_duplicates instead of unique for cudf's pandas compatibility mode #5639

vyasr · 2023-11-02T22:41:44Z

In pandas, Series.unique returns a numpy array (for non-extension types), while Series.drop_duplicates returns a Series. The two results otherwise contain the same set of values. Historically, both methods returned a Series in cudf, and at these stages of its pipeline cuml knows that it is working with cudf objects. However, if cudf has pandas compatibility mode enabled, unique returns an array to match pandas behavior. In that scenario the method chaining no longer works, because cuml calls methods on the result of unique on the assumption that it is a Series. To fix this, cuml needs to call drop_duplicates instead.
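For illustration, here is a minimal sketch of the type difference using plain pandas, which is the behavior cudf mirrors in pandas compatibility mode. The sample data and the variable names (`y`, `uniques`, `deduped`, `classes`) are made up for this example, and the chained `sort_values` call is only a stand-in for Series-only method chaining, not a copy of the actual cuml code.

```python
import numpy as np
import pandas as pd

# Hypothetical label column with repeated values.
y = pd.Series([2, 0, 2, 1, 0])

uniques = y.unique()           # numpy.ndarray for non-extension dtypes
deduped = y.drop_duplicates()  # pandas.Series, keeps first occurrences

print(type(uniques).__name__, type(deduped).__name__)

# Both results hold the same values in order of first appearance.
same_values = list(uniques) == list(deduped)

# Series-only methods cannot be chained onto the ndarray from unique() ...
array_can_chain = hasattr(uniques, "sort_values")

# ... but chaining works on the Series returned by drop_duplicates().
classes = y.drop_duplicates().sort_values(ignore_index=True)
print(classes.tolist())
```

Since drop_duplicates returns a Series whether or not the array-returning behavior of unique is in effect, code that chains Series methods does not need to branch on the mode.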


csadorf

In principle all good. However, I tried to check whether y is guaranteed to be a data frame at this point, and was wondering whether there is a reason that we don't just deduplicate directly.

python/cuml/preprocessing/LabelEncoder.py

vyasr · 2023-11-03T19:32:46Z

/merge

@csadorf

I accidentally committed but forgot to push some changes requested by @csadorf in #5639.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)
- Simon Adorf (https://github.com/csadorf)

Approvers:
- Simon Adorf (https://github.com/csadorf)

URL: #5648

Use drop_duplicates instead of unique for cudf's pandas compatibility mode (f4de575)

vyasr added the bug, 3 - Ready for Review, Cython / Python, and non-breaking labels Nov 2, 2023

vyasr self-assigned this Nov 2, 2023

vyasr requested a review from a team as a code owner November 2, 2023 22:41

csadorf approved these changes Nov 3, 2023


rapids-bot bot merged commit 0296043 into rapidsai:branch-23.12 Nov 3, 2023
53 checks passed

vyasr mentioned this pull request Nov 7, 2023

Simplify some logic in LabelEncoder #5648

Merged
