Description
Thank you so much for your work on pandas 🤗
I don't sufficiently understand pd.concat
and MultiIndex (yet!) to explain what is going on here, so I hope someone more knowledgeable can shed light on these curious symptoms:
Code Sample
import pandas as pd
from numpy import nan
# Build some Series, where one has NaN in the index
left = pd.Series(["left-a", "left-nan", "left-c"], index=["a", nan, "c"], name="left")
# a left-a
# NaN left-nan
# c left-c
# Name: left, dtype: object
right = pd.Series(["right-a", "right-b", "right-d"], index=["a", "b", "d"], name="right")
# a right-a
# b right-b
# d right-d
# Name: right, dtype: object
# Concat'ing with the regular Index
single = pd.concat([left, right], axis=1, join="outer", sort=False).sort_index()
# left right
# a left-a right-a
# b NaN right-b
# c left-c NaN
# d NaN right-d
# NaN left-nan NaN
# `left_multi` and `right_multi` look just like `left` and `right`,
# and only differ in that they have a 1-level MultiIndex
left_multi = left.set_axis(pd.MultiIndex.from_arrays([left.index]), inplace=False)
right_multi = right.set_axis(pd.MultiIndex.from_arrays([right.index]), inplace=False)
# Concat'ing with the 1-level MultiIndex
multi = pd.concat([left_multi, right_multi], axis=1, join="outer", sort=False).sort_index()
# left right
# a left-a right-a
# b left-nan right-b <--- surprise left-nan!
# c left-c NaN
# d left-nan right-d <--- surprise left-nan!
# NaN left-nan NaN
The same results can be achieved with DataFrames in (at least) these other ways, so in that way it is very consistent 😀
left_frame = left.to_frame()
left_multi_frame = left_multi.to_frame()
right_frame = right.to_frame()
right_multi_frame = right_multi.to_frame()
# DataFrame.join
s_join = left_frame.join(right_frame, how="outer").sort_index()
assert single.equals(s_join)
m_join = left_multi_frame.join(right_multi_frame, how="outer").sort_index()
assert multi.equals(m_join)
# DataFrame.merge
s_merge = left_frame.merge(right_frame, left_index=True, right_index=True, how="outer").sort_index()
assert single.equals(s_merge)
m_merge = left_multi_frame.merge(right_multi_frame, left_index=True, right_index=True, how="outer").sort_index()
assert multi.equals(m_merge)
# pd.concat on DataFrames
s_concat = pd.concat([left_frame, right_frame], axis=1, join="outer", sort=False).sort_index()
assert single.equals(s_concat)
m_concat = pd.concat([left_multi_frame, right_multi_frame], axis=1, join="outer", sort=False).sort_index()
assert multi.equals(m_concat)
Problem description
I am surprised to see NaNs being filled out for left
when the index is a MultiIndex. This is because I would expect the same underlying logic for joining on MultiIndex as on a regular Index: Either NaN counts as a match-all value (MultiIndex behaviour), or NaN just matches the other
's NaN like a regular value (regular Index behaviour).
I failed to find docs describing either concat'ing of MultiIndex with NaNs or regular Index with NaNs, so I am unsure if this is intended behaviour.
Expected Output
I expect the two approaches to have the same output, be it NaN as a match-all or NaN as a regular value.
Output of pd.show_versions()
Tested in two pandas versions.
pandas 0.24.1:
[paste the output of pd.show_versions()
here below this line]
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-13-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.1
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
pandas 0.23.4:
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None