BUG: concat gives incorrect result when MultiIndex values are all NA #47802

Closed
2 of 3 tasks
RobinFiveWords opened this issue Jul 20, 2022 · 3 comments
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Comments

RobinFiveWords commented Jul 20, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

idx1 = pd.MultiIndex.from_arrays([[1.0],
                                  [2.0]],
                                 names=['a', 'b'])
ser1 = pd.Series([1], index=idx1, name='count1')

idx2 = pd.MultiIndex.from_arrays([[pd.NA],
                                  [pd.NA]],
                                 names=['a', 'b'])
ser2 = pd.Series([1], index=idx2, name='count2')

print(pd.concat((ser1, ser2), axis=1))
print()
print(pd.concat((ser2, ser1), axis=1))
print()
print(pd.__version__)

#          count1  count2
# a   b                  
# 1.0 2.0     1.0       1
# NaN NaN     NaN       1
# 
#          count2  count1
# a   b                  
# NaN NaN       1     NaN
# 1.0 2.0       1     1.0
# 
# 1.4.2

Issue Description

concat doesn't join correctly when all levels of a MultiIndex row are NA. In version 1.4.2 (later confirmed in 1.4.3), concat "over-matches" these all-NA rows to other rows.

I feel like there is previous discussion online of all-NA rows of a MultiIndex but I was unable to find it.

Please note:

  • Version 1.3.5 gives a different wrong result for the second concat, only returning the row with all-NA index values.
  • The wrong result also occurs when the MultiIndex has only one level and its value is NA, but I thought using a one-level MultiIndex in the example would confuse rather than simplify the issue. The correct result occurs with a regular Index that is not a MultiIndex.

Expected Behavior

The correct behavior would result in this:

print(pd.concat((ser1, ser2), axis=1))
#          count1  count2
# a   b                  
# 1.0 2.0     1.0     NaN
# NaN NaN     NaN     1.0
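
To check whether a given pandas installation produces the expected result, the comparison can be scripted. The following is a minimal sketch; the helper name `concat_matches_expected` is hypothetical, and it only tests the property visible in the expected output above (two rows, each with exactly one non-null cell), not the row order or dtypes.

```python
import pandas as pd


def concat_matches_expected():
    """Return True if concat keeps the all-NA row separate from the (1.0, 2.0) row."""
    idx1 = pd.MultiIndex.from_arrays([[1.0], [2.0]], names=['a', 'b'])
    ser1 = pd.Series([1], index=idx1, name='count1')

    idx2 = pd.MultiIndex.from_arrays([[pd.NA], [pd.NA]], names=['a', 'b'])
    ser2 = pd.Series([1], index=idx2, name='count2')

    result = pd.concat((ser1, ser2), axis=1)

    # Expected: each of the two rows holds a value in exactly one column.
    non_null_per_row = result.notna().sum(axis=1).tolist()
    return result.shape == (2, 2) and non_null_per_row == [1, 1]


print(pd.__version__, concat_matches_expected())
```

On the versions reported in this issue, the check would be expected to return False, because the `(1.0, 2.0)` row has values in both columns.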

Installed Versions

This was run in the interactive shell at

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.10.2.final.0
python-bits : 32
OS : Emscripten
OS-release : 1.0
Version : #1
machine : wasm32
byteorder : little
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 62.0.0
IPython : 8.4.0
matplotlib : 3.5.1

@RobinFiveWords RobinFiveWords added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 20, 2022
Author

RobinFiveWords commented Jul 20, 2022

A workaround is to replace the NAs in the MultiIndex with some unique value, do the concat, and then put the NAs back in.
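
A minimal sketch of that workaround follows. The names `SENTINEL`, `map_index_values`, and `concat_with_na_index` are hypothetical helpers, not pandas API, and the sketch assumes the sentinel value never appears as a real index value and that every input has a non-empty MultiIndex.

```python
import numpy as np
import pandas as pd

SENTINEL = -1.0  # hypothetical placeholder; must be absent from every real index


def map_index_values(obj, func):
    """Return a copy of obj with func applied to every value of its MultiIndex."""
    tuples = [tuple(func(v) for v in key) for key in obj.index]
    out = obj.copy()
    out.index = pd.MultiIndex.from_tuples(tuples, names=obj.index.names)
    return out


def concat_with_na_index(objs, sentinel=SENTINEL):
    # 1. Replace NA index values with the sentinel so they join as ordinary values.
    filled = [
        map_index_values(o, lambda v: sentinel if pd.isna(v) else v) for o in objs
    ]
    # 2. Concatenate column-wise; no index value is NA at this point.
    combined = pd.concat(filled, axis=1)
    # 3. Put the NAs back in place of the sentinel.
    return map_index_values(combined, lambda v: np.nan if v == sentinel else v)


idx1 = pd.MultiIndex.from_arrays([[1.0], [2.0]], names=['a', 'b'])
ser1 = pd.Series([1], index=idx1, name='count1')
idx2 = pd.MultiIndex.from_arrays([[pd.NA], [pd.NA]], names=['a', 'b'])
ser2 = pd.Series([1], index=idx2, name='count2')

fixed = concat_with_na_index((ser1, ser2))
print(fixed)
```

Going through tuples is slow for large indexes, but it avoids depending on how the NA values are stored inside the MultiIndex.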

From skimming concat.py I get the impression that the join keys are being transcribed into a new list (I'm using list informally here), and perhaps something about this causes an all-NA list to satisfy the join criteria when it should be seen as different. Also, if something in how the keys are determined or the join is done was changed between 1.3 and 1.4, that could explain the different wrong results.

Author

RobinFiveWords commented Jul 20, 2022

pd.MultiIndex.from_frame(pd.DataFrame({'a': [pd.NA]})).levels
# FrozenList([[]])

Is this the issue, that levels shows an empty list rather than a value on which to match?

(Is this the same issue as #30750?)
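
For context on the empty `levels`: a MultiIndex keeps the distinct non-missing values in `levels` and refers to them by position through `codes`, with a missing value recorded as code -1 rather than as an entry in `levels`. The short sketch below only illustrates that storage layout; it does not establish that this is the cause of the concat behavior.

```python
import pandas as pd

mi = pd.MultiIndex.from_frame(pd.DataFrame({'a': [pd.NA]}))

# The level holds no values, because NA is never stored as a level value.
print(mi.levels)

# The single row points at position -1, the marker for "missing".
print(mi.codes)

# The row itself still exists in the index.
print(len(mi))
```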

@mroeschke
Member

(Is this the same issue as #30750?)

Yeah, it appears #30750 is the core issue that affects this operation. Closing this issue to centralize the discussion there.


2 participants