[FEA] Add optional support for SQL style groupby null in key column handling #2627

randerzander · 2019-08-19T16:18:25Z

Currently when you run a groupby, any nulls in the key column are dropped from the output:

import cudf

df = cudf.DataFrame()

df['id'] = [0, 1, 1, None, None, 3, 3]
df['val'] = [0, 1, 1, 2, 2, 3, 3]
df.groupby('id').val.sum()

id
0    0
1    2
3    6
Name: val, dtype: int64

This matches Pandas's behavior, but sometimes it's desirable to keep the null keys as rows in the output. We should consider adding a dropna param that lets users control this behavior.

We can work around this with a sentinel value via fillna and replace, but that is annoying to manage and lowers overall workflow perf:

df['id'] = df['id'].fillna(-1)
res = df.groupby('id').val.sum().reset_index()
res['id'] = res['id'].replace(-1, None)
res

id | val
-- | --
null | 4
0 | 0
1 | 2
3 | 6
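
For comparison, here is a rough sketch of what the requested behavior could look like without the sentinel. It uses pandas rather than cudf so it runs without a GPU, and it assumes pandas 1.1 or later, where groupby accepts a dropna keyword. Whether cudf ends up with exactly this signature is an assumption here, not something this issue confirms.

```python
import pandas as pd

# Same data as above; pandas stores the None keys as NaN (float64 column).
df = pd.DataFrame({
    "id": [0, 1, 1, None, None, 3, 3],
    "val": [0, 1, 1, 2, 2, 3, 3],
})

# Default behavior: rows whose key is null are dropped from the result.
dropped = df.groupby("id", dropna=True).val.sum()

# Requested behavior: null keys form their own group (SQL-style).
kept = df.groupby("id", dropna=False).val.sum()

print(dropped)
print(kept)
```

With dropna=False the null key shows up as one extra group whose sum covers the two null-keyed rows, so no fillna/replace round trip is needed.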

EDIT - Changed suggested param from keep_nulls to dropna to match Pandas's API per Keith's comment.

jrhemstad · 2019-08-19T16:20:20Z

The C++ implementation already supports this behavior with the ignore_null_keys option:
https://github.com/rapidsai/cudf/blob/branch-0.10/cpp/include/cudf/groupby.hpp#L74

It's just a matter of exposing this in the Python API.

kkraus14 · 2019-08-19T16:32:55Z

Looks like the Pandas community has had discussions surrounding this and the consensus was to use a dropna parameter: pandas-dev/pandas#21669

I'd argue we should do the same.

randerzander added the feature request, libcudf, and Python labels Aug 19, 2019

jrhemstad removed the libcudf label Aug 19, 2019

shwina self-assigned this Aug 19, 2019

shwina mentioned this issue Aug 19, 2019

Add groupby dropna= parameter #2629

Merged

kkraus14 closed this as completed in #2629 Aug 20, 2019

randerzander mentioned this issue Nov 5, 2019

[FEA] Make dask_cudf support SQL style "null" values in groupby key columns #3302

Closed

mrocklin mentioned this issue Nov 6, 2019

Add dropna keyword to groupby aggregations dask/dask#5565

Closed
