Fix DataFrame.drop(columns=cudf.Series/Index, axis=1) #16712
return array.unique.to_pandas()
elif isinstance(array, (str, numbers.Number)):
return [array]
yield from as_column(array).unique().values_host
Performance question: Do we want to run the `unique()` on GPU? These are columns and not rows, right? Kernel launch latency may exceed the time to run that unique step on CPU, if we expect this to be small.
I'm okay with running it on GPU if there's any uncertainty or if there are case-by-case decisions/tradeoffs we would need to consider; I just want to be sure we're not making a uniformly bad performance decision.
> These are columns and not rows, right?

Yeah, these are just column labels to drop that happen to be GPU backed.
I agree it might be worth doing this on the CPU instead. I'd assume `len(columns to drop) << len(columns)`, and we convert to host anyways to iterate over these labels to drop, so we might as well do the `unique` step there too.
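The host-side deduplication discussed above could look roughly like the following. This is a minimal pure-Python sketch, not cudf's code: `unique_labels_on_host` is a hypothetical helper, and it assumes the labels to drop have already been copied to host as an ordinary Python sequence.

```python
from __future__ import annotations

from collections.abc import Hashable, Iterable


def unique_labels_on_host(labels: Iterable[Hashable]) -> list[Hashable]:
    """Return labels with duplicates removed, keeping first-seen order.

    Hypothetical helper for illustration; assumes `labels` is already on host.
    """
    # dict preserves insertion order, so this deduplicates without sorting
    # and without launching any device work.
    return list(dict.fromkeys(labels))


print(unique_labels_on_host(["a", "b", "a", "c", "b"]))  # ['a', 'b', 'c']
```

For a handful of labels this is a single pass over a small list, which is the case the reviewers expect for `.drop`.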
@@ -150,24 +149,14 @@
)


def _get_host_unique(array):
def _get_unique_drop_labels(array):
What's the benefit of this being a generator? You could just return an iterable rather than `yield from` it if that makes sense.
Probably negligible in the context of `.drop`, but it was to avoid a case where `array` was a scalar so we were converting scalar -> iterable (`_get_unique_drop_labels`) -> scalar (`frame._drop_column(scalar)`). I can change it back to make this `_get_unique_drop_labels` return an iterable if preferred.
I’ll leave the choice to you. Just noting that `yield from` patterns tend to be dangerous for performance in cudf because host-device data copying is often involved.
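To make the trade-off in this thread concrete, here is a sketch of the two shapes the helper could take. Both functions are hypothetical stand-ins written in plain Python (no cudf involved), and the `calls` list exists only to show when each body actually runs.

```python
import numbers

calls = []  # records when each helper body actually executes


def drop_labels_generator(array):
    """Generator version: nothing runs until the caller iterates."""
    calls.append("generator")
    if isinstance(array, (str, numbers.Number)):
        yield array  # scalar passes straight through, no wrapping list
    else:
        yield from dict.fromkeys(array)  # unique labels, first-seen order


def drop_labels_list(array):
    """Eager version: all work happens at call time and a list is returned."""
    calls.append("list")
    if isinstance(array, (str, numbers.Number)):
        return [array]
    return list(dict.fromkeys(array))


gen = drop_labels_generator(["a", "b", "a"])
eager = drop_labels_list(["a", "b", "a"])
print(calls)      # ['list']  (the generator body has not started yet)
print(list(gen))  # ['a', 'b']
print(calls)      # ['list', 'generator']
print(eager)      # ['a', 'b']
```

The lazy version defers its cost to the iteration site, which can make it harder to see where expensive work (in cudf, potentially a device-to-host copy) actually happens.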
/merge
Before, when `columns=` was a `cudf.Series/Index`, we would call `return array.unique.to_pandas()`, but `.unique` is a method, not a property, so this would have raised an error. Also took the time to refactor the helper methods here and push down the `errors=` keyword to `Frame._drop_column`.

Authors:
- Matthew Roeschke (https://github.com/mroeschke)

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: rapidsai#16712
Description
Before, when `columns=` was a `cudf.Series/Index`, we would call `return array.unique.to_pandas()`, but `.unique` is a method, not a property, so this would have raised an error.

Also took the time to refactor the helper methods here and push down the `errors=` keyword to `Frame._drop_column`.
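The failure mode described above can be reproduced without cudf. In the sketch below, `FakeSeries` is a hypothetical stand-in whose `unique` is a method, as in the description; accessing `.unique` without calling it yields a bound method, which has no `to_pandas` attribute.

```python
class FakeSeries:
    """Hypothetical stand-in for a GPU-backed Series (not cudf's class)."""

    def __init__(self, values):
        self._values = list(values)

    def unique(self):
        # Method, not property: it must be called with parentheses.
        return FakeSeries(dict.fromkeys(self._values))

    def to_pandas(self):
        # Stand-in for a host conversion; returns a plain list here.
        return list(self._values)


labels = FakeSeries(["a", "b", "a"])

try:
    labels.unique.to_pandas()  # missing (): attribute lookup on a bound method
except AttributeError as exc:
    print(f"AttributeError: {exc}")

print(labels.unique().to_pandas())  # ['a', 'b']
```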