Description
When grouping a DataFrame on multiple columns, any row with a NaN value in one of the grouping columns is excluded from the result. This is already documented here: #3729 and has an open issue.
However, the .groups attribute still includes these NaN rows, which is inconsistent with the groups produced when iterating over the groupby results.
See this simple example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.nan, '6'], 'c': [7, 8, 9]})
df.groupby(['a', 'b']).groups
Out[75]:
{('1', '4'): Int64Index([0], dtype='int64'),
('2', nan): Int64Index([1], dtype='int64'),  # excluded when iterating over the results
('3', '6'): Int64Index([2], dtype='int64')}
{k: g.index for k, g in df.groupby(['a', 'b'])}
Out[76]:
{('1', '4'): Int64Index([0], dtype='int64'),
('3', '6'): Int64Index([2], dtype='int64')}
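
The mismatch can also be checked programmatically. The following is only a minimal sketch using the same frame as above; the variable names (grouped, groups_keys, iterated_keys, extra_keys) are illustrative, and whether extra_keys ends up non-empty is an assumption that depends on the pandas version in use.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.nan, '6'], 'c': [7, 8, 9]})
grouped = df.groupby(['a', 'b'])

# Keys reported by the .groups attribute.
groups_keys = set(grouped.groups.keys())

# Keys actually yielded when iterating over the groupby object.
iterated_keys = {key for key, _ in grouped}

# Keys listed in .groups that iteration never produces.
# On affected versions this is expected to contain the ('2', nan) key.
extra_keys = groups_keys - iterated_keys
print(extra_keys)
```

If extra_keys is empty, the two views agree on the installed version.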