Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: Some dropna behaviors in DataFrame.pivot_table #53521

Open
rhshadrach opened this issue Jun 4, 2023 · 1 comment
Open

DEPR: Some dropna behaviors in DataFrame.pivot_table #53521

rhshadrach opened this issue Jun 4, 2023 · 1 comment
Labels
Deprecate Functionality to remove in pandas Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@rhshadrach
Copy link
Member

Currently dropna is used in four places within DataFrame.pivot_table:

  1. It takes the cartesian product of all index/column levels when there are multiple levels; this was the original use
  2. It is passed through to groupby
  3. After the groupby aggregation, any rows that are all null are dropped
  4. When computing the margins, rows in the original data where the keys and values are all null are dropped

1, 2, and 4 were all implemented for crosstab, which is essentially a call to pivot_table.

The API docs for crosstab document the dropna argument as:

Do not include columns whose entries are all NaN.

The only other documentation in the API and User Guide mentions using dropna=False to include rows/columns for categorical data with missing categorical values.

I think this is too much for a single Boolean argument to handle. I propose the following:

a. Add cartesian_product=[True|False] to pivot_table and crosstab
b. Add observed=[True|False] to crosstab for use with categoricals
c. Deprecate behavior (1) (with dropna), (3), and (4) above. The user may do each of these by dropping null values from the input data if they so desire.

We can implement (c) without affecting the behavior of crosstab by changing the data there to be a mixture of null/non-null values depending on the input and using the aggregation count instead of len.

@rhshadrach rhshadrach added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action labels Jun 4, 2023
@rhshadrach
Copy link
Member Author

rhshadrach commented Sep 24, 2023

Instead of adding cartesian_product, we could make observed=False take the cartesian product of both the index and the column, regardless of whether they are categorical. In my opinion, this would only be okay to do if #55261 gets implemented.

Doing this would mean more than just passing observed=False through to groupby - that would only handle the index. We would still need to take the cartesian product of the columns in the case they are a MultiIndex as is done with dropna=False today.

If we're going this route, then I think we should also adhere to groupby semantics with unobserved groupings for various ops. For example:

df = pd.DataFrame(
    {
        'idx1': 1,
        'idx2': [2, 3],
        'col1': 4,
        'col2': [5, 6],
        'val1': [7, 8],
        'val2': [9, 10],
    }
)
df.pivot_table(index=['idx1', 'idx2'], columns=['col1', 'col2'], values=['val1', 'val2'], aggfunc='sum', dropna=False)

currently results in NaN values in various locations; instead with observed=False it should be 0 because the specified aggfunc is sum.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

1 participant