Skip to content

Should DataFrame.merge match NaN with NaN? #22491

Closed
@alexlenail

Description

@alexlenail

Code Sample, a copy-pastable example if possible

pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, np.nan]}).merge(pd.DataFrame({'c': [6, 7, 8, 9], 'd': [4, np.nan, np.nan, 5]}), how='left', left_on='b', right_on='d')

Problem description

df1:

  a b
0 1 4.0
1 2 5.0
2 3 NaN

df2:

  c d
0 6 4.0
1 7 NaN
2 8 NaN
3 9 5.0

Current output:

  a b c d
0 1 4.0 6 4.0
1 2 5.0 9 5.0
2 3 NaN 7 NaN
3 3 NaN 8 NaN

Expected Output

  a b c d
0 1 4.0 6 4.0
1 2 5.0 9 5.0

What's happening is the NaN is df1.b is matching the NaNs in df2.d.

I don't see a situation in which this would be desirable behavior, but if such a situation exists, surely the opposite is also conceivable, and so there should be some documented option in DataFrame.merge which accomplishes this.

What do you think?

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugMissing-datanp.nan, pd.NaT, pd.NA, dropna, isnull, interpolateReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions