Inconsistent results for groupby on multiple columns with NaN value #17445

Closed
@linar-jether

When grouping a DataFrame on multiple columns, any row with a NaN value in one of the key columns is excluded from the result. This is already documented and has an open issue: #3729.

But the .groups attribute includes these NaN rows, which is inconsistent with the groupby results.

See this simple example:

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.nan, '6'], 'c': [7, 8, 9]})

df.groupby(['a', 'b']).groups
Out[75]: 
{('1', '4'): Int64Index([0], dtype='int64'),
 ('2', nan): Int64Index([1], dtype='int64'), ##### Is excluded when iterating the results
 ('3', '6'): Int64Index([2], dtype='int64')}


{k: g.index for k, g in df.groupby(['a', 'b'])}
Out[76]: 
{('1', '4'): Int64Index([0], dtype='int64'),
 ('3', '6'): Int64Index([2], dtype='int64')}
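
Until the two views agree, one possible workaround is to drop the rows with a missing key before grouping, so that no NaN key can reach .groups. The following is a minimal sketch and not a fix for the underlying inconsistency; it reuses the example frame above, and the names keys, clean, groups_keys and iterated_keys are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.nan, '6'], 'c': [7, 8, 9]})
keys = ['a', 'b']

# Drop rows with a missing value in any key column before grouping,
# so a NaN key never appears in either view of the groups.
clean = df.dropna(subset=keys)
grouped = clean.groupby(keys)

groups_keys = set(grouped.groups)            # keys reported by .groups
iterated_keys = {k for k, _ in grouped}      # keys seen when iterating

print(groups_keys == iterated_keys)
```

If the NaN rows should be kept as their own group instead, pandas 1.1 and later accept dropna=False in groupby.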

Labels: Docs, Groupby, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)
