Skip to content

[FEA] Add optional support for SQL style groupby null in key column handling #2627

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
randerzander opened this issue Aug 19, 2019 · 2 comments · Fixed by #2629
Closed

[FEA] Add optional support for SQL style groupby null in key column handling #2627

randerzander opened this issue Aug 19, 2019 · 2 comments · Fixed by #2629
Assignees
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@randerzander
Copy link
Contributor

randerzander commented Aug 19, 2019

Currently when you run a groupby, any nulls in the key column are dropped from the output:

import cudf

df = cudf.DataFrame()

df['id'] = [0, 1, 1, None, None, 3, 3]
df['val'] = [0, 1, 1, 2, 2, 3, 3]
df.groupby('id').val.sum()
id
0    0
1    2
3    6
Name: val, dtype: int64

This has the same behavior as Pandas, but sometimes it's desired to keep the null values as rows in the output. We should consider adding a dropna param which lets users control this behavior.

We can do this via a workaround with a sentinel and fillna and replace, but this is annoying to manage, and lowers overall workflow perf:

df['id'] = df['id'].fillna(-1)
res = df.groupby('id').val.sum().reset_index()
res['id'] = res['id'].replace(-1, None)
res
  | id | val
-- | -- | --
null | 4
0 | 0
1 | 2
3 | 6

EDIT - Changed suggested param from keep_nulls to dropna to match Pandas's API per Keith's comment.

@randerzander randerzander added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Aug 19, 2019
@jrhemstad
Copy link
Contributor

The C++ implementation already supports this behavior with the ignore_null_keys option:
https://github.com/rapidsai/cudf/blob/branch-0.10/cpp/include/cudf/groupby.hpp#L74

It's just a matter of exposing this in the Python API.

@jrhemstad jrhemstad removed the libcudf Affects libcudf (C++/CUDA) code. label Aug 19, 2019
@shwina shwina self-assigned this Aug 19, 2019
@kkraus14
Copy link
Collaborator

Looks like the Pandas community has had discussions surrounding this and the consensus was to use a dropna parameter: pandas-dev/pandas#21669

I'd argue we should do the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants