Add "collect" aggregation support to dask-cudf #15593
Some minor requests for clarification in the doc aspect.
if as_index is not True:
    raise NotImplementedError(
        f"`as_index` is not supported by dask-expr. Please disable "
        "query planning, or reset the index after aggregating.\n"
        f"{_LEGACY_WORKAROUND}"
    )
nit: This is a somewhat confusing error message. The only way to get past it with query planning enabled is to say as_index=True, but the error message seems to say "`as_index=True` is not handled by dask-expr".
Do you mean:
dask-expr only supports as_index=True. For as_index=False either disable query planning or reset the index with reset_index after aggregating.
WDYT?
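To illustrate the reset_index route mentioned above, here is a minimal sketch. It uses plain pandas as a CPU stand-in, which is an assumption made only so the snippet runs anywhere; the column names and data are made up, and the dask-cudf equivalent is not shown.

```python
import pandas as pd

# Toy data; "key" and "val" are illustrative names only.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# Default behavior (as_index=True): the group keys become the index.
indexed = df.groupby("key").agg({"val": "sum"})

# Resetting the index moves the keys back into a regular column,
# which matches what as_index=False would have produced.
flat = indexed.reset_index()

# For comparison: pandas itself still accepts as_index=False.
expected = df.groupby("key", as_index=False).agg({"val": "sum"})
```

In pandas, `flat` and `expected` hold the same frame, so resetting the index after aggregating is a drop-in replacement for the keyword.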
It sounds like dask-expr doesn't actually have support for the as_index keyword arg in general and always follows the behavior of as_index=False, so perhaps we should consider:
- checking if the kwarg is provided at all, emitting a FutureWarning if so
- if the kwarg isn't as_index=True, raise the error description suggested above
The upstream dask.dataframe API has always raised an error when as_index is used, but dask-cudf has used a distinct groupby API until now.
I agree that the real problem is that as_index is not supported at all by dask-expr. Therefore @charlesbluca's suggestion is probably the most "correct". With that said, I'm feeling a bit hesitant to add more noise for something that technically "works fine" :/
if "as_index" in kwargs:
    warnings.warn(
        "The `as_index` argument is no longer supported in "
        "dask-cudf when query-planning is enabled.",
        FutureWarning,
    )

if kwargs.pop("as_index", True) is not True:
    raise NotImplementedError(
        f"`as_index=False` is not supported. Please disable "
        "query planning, or reset the index after aggregating.\n"
        f"{_LEGACY_WORKAROUND}"
    )
This essentially does what @charlesbluca suggested - The message is still slightly confusing, but so is the behavior I guess.
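For anyone who wants to poke at the behavior outside of dask-cudf, the check can be exercised as a standalone sketch. The helper name `check_as_index` and the placeholder `_LEGACY_WORKAROUND` text are hypothetical; only the warning and error logic mirrors the diff above.

```python
import warnings

# Placeholder only: the real _LEGACY_WORKAROUND text lives in dask-cudf.
_LEGACY_WORKAROUND = "(legacy workaround instructions would go here)"


def check_as_index(kwargs):
    """Hypothetical helper that mirrors the as_index check from the diff."""
    # Warn whenever the caller passes as_index at all, even as_index=True.
    if "as_index" in kwargs:
        warnings.warn(
            "The `as_index` argument is no longer supported in "
            "dask-cudf when query-planning is enabled.",
            FutureWarning,
        )

    # Any value other than True cannot be honored, so fail loudly.
    if kwargs.pop("as_index", True) is not True:
        raise NotImplementedError(
            "`as_index=False` is not supported. Please disable "
            "query planning, or reset the index after aggregating.\n"
            f"{_LEGACY_WORKAROUND}"
        )
    # The remaining kwargs no longer contain as_index.
    return kwargs
```

With this sketch, omitting the keyword is silent, `as_index=True` emits only the FutureWarning, and `as_index=False` emits the warning and then raises NotImplementedError.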
Maybe something to the effect of "as_index is not supported with query-planning enabled; this will behave consistently with as_index=True" for both messages, with an additional blurb for the error showing the non-true value passed by the user?
Thanks for the changes, looks good. I agree the message is still a bit confusing but I do not obviously have a better suggestion :(
I really appreciate the reviews @wence- @charlesbluca ! I just cleaned up the messaging a bit - Definitely not perfect, but hopefully clear enough for most users.
/merge
Currently blocked by dask/dask#11064

Description
This PR (along with its upstream dependency) enables "collect" aggregations in dask-cudf when query-planning is enabled. It also adds a clearer error message for as_index usage (which is not supported in dask-dataframe, but was supported in legacy dask-cudf).

Checklist