BUG: df.stack() returns wrong data when NaT is in index (regression since 2.1.0, ok in <= 2.0.3) #57152

Open
3 tasks done
behrenhoff opened this issue Jan 30, 2024 · 3 comments
Labels
Bug Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@behrenhoff
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(
    data=[[1, 2, 3]],
    columns=pd.MultiIndex.from_tuples(
        [
            ("MAT", pd.Timestamp("2021-12-01"), "a"),
            ("ignore", pd.Timestamp("1970-12-01"), "a"),
            ("ignore", pd.NaT, "a"),
        ],
        names=("date_type", "date", "value_type"),
    ),
)

unique_dates_v1 = df.columns.get_level_values("date")[
    df.columns.get_level_values("date_type") == "MAT"
].unique()

unique_dates_via_stack = (
    df.stack(df.columns.names)
    .xs("MAT", level="date_type")
    .index.get_level_values("date")
    .unique()
)

print(pd.__version__)
print("v1", unique_dates_v1)
print("v2", unique_dates_via_stack)

assert all(unique_dates_v1 == pd.Timestamp("2021-12-01"))
assert all(unique_dates_via_stack == pd.Timestamp("2021-12-01"))
assert unique_dates_v1.equals(unique_dates_via_stack)
print("all ok")

Issue Description

First of all, sorry for the rather complex dataframe. It was already quite challenging to reduce it from the one I was actually using...

Consider a DataFrame with a column MultiIndex where a NaT appears in one of the levels.

The goal is to find the timestamps where date_type == "MAT". This can be done in two ways:
a) unique_dates_v1: a simple selection using get_level_values. This works fine.
b) unique_dates_via_stack: stack all column levels into a Series, then take a cross section with xs. This is the version that fails in pandas >= 2.1.0.

I know there is future_stack=True in newer pandas versions, and future_stack=True seems to work fine (and is usually what I prefer). However, I hit the error above while migrating older code. The legacy stack simply returns wrong data: the original frame has no MAT entry with a 1970 date at all. Even if the old stack variant introduces additional NaNs, it should never return wrong data, not even in a deprecated stack implementation.

Expected Behavior

Same behavior as in pandas 2.0, i.e. stack() does not assign wrong data to MAT.

Installed Versions

works fine with pandas <= 2.0.3
fails with pandas >= 2.1.0

@behrenhoff behrenhoff added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 30, 2024
@behrenhoff
Contributor Author

behrenhoff commented Jan 30, 2024

The output in Pandas 2.2 is

(...some warnings: about future_stack and performance...)
2.2.0
v1 DatetimeIndex(['2021-12-01'], dtype='datetime64[ns]', name='date', freq=None)
v2 DatetimeIndex(['1970-12-01'], dtype='datetime64[ns]', name='date', freq=None)
Traceback (most recent call last):
  File "/home/behrenhoff/stack-test.py", line 31, in <module>
    assert all(unique_dates_via_stack == pd.Timestamp("2021-12-01"))
AssertionError

v2 should never return the 1970 timestamp, but always the 2021 timestamp.

In Pandas 2.0.3 the output is

2.0.3
v1 DatetimeIndex(['2021-12-01'], dtype='datetime64[ns]', name='date', freq=None)
v2 DatetimeIndex(['2021-12-01'], dtype='datetime64[ns]', name='date', freq=None)
all ok

@rhshadrach
Member

Thanks for the report. Indeed, not using future_stack=True is now deprecated and will be removed in pandas 3.0. I'm not sure I see much value in spending effort to fix something that will be removed. But if anyone does want to put in the effort and is able to fix it, PRs are welcome!

@rhshadrach rhshadrach added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 30, 2024
@rhshadrach rhshadrach added this to the 2.2.1 milestone Jan 30, 2024
@behrenhoff
Contributor Author

Sure, I was just wondering why the behavior has changed at all when the old way is deprecated... This one is particularly dangerous because it returns a data pair ("MAT" with "1970-12-01") that didn't exist in the original. This should never happen.

If it is decided not to fix this (which I could understand), I'd at least suggest a warning that stacking NaN or NaT in the index results in undefined behavior. I happily ignored the existing performance and future warnings after the pandas upgrade, and I certainly did not expect this value pair.
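Until something like that exists in pandas itself, a user-side guard could look roughly like the sketch below. Both helpers (has_missing_column_labels, warn_before_legacy_stack) are hypothetical names made up for this comment and are not part of pandas; the only pandas pieces used are Index.hasnans and MultiIndex.get_level_values.

```python
import warnings

import pandas as pd


def has_missing_column_labels(df: pd.DataFrame) -> bool:
    """Return True if any level of df.columns contains NaN or NaT."""
    cols = df.columns
    if isinstance(cols, pd.MultiIndex):
        return any(
            bool(cols.get_level_values(i).hasnans) for i in range(cols.nlevels)
        )
    return bool(cols.hasnans)


def warn_before_legacy_stack(df: pd.DataFrame) -> None:
    """Hypothetical guard to call before a legacy stack(); not a pandas API."""
    if has_missing_column_labels(df):
        warnings.warn(
            "Column labels contain NaN/NaT; legacy stack() may misassign data.",
            UserWarning,
            stacklevel=2,
        )


columns = pd.MultiIndex.from_tuples(
    [("MAT", pd.Timestamp("2021-12-01")), ("ignore", pd.NaT)],
    names=("date_type", "date"),
)
df = pd.DataFrame([[1, 2]], columns=columns)
print(has_missing_column_labels(df))
```

Calling warn_before_legacy_stack(df) right before the old-style df.stack(...) would at least make the risky input visible during a migration.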

Also, future_stack=True works at least in this specific case, though I did NOT test it extensively. Maybe a few tests are missing here? For example, tests with Timestamps including NaT in an Index and in a MultiIndex.
