Skip to content

join on tz-aware column and tz-aware index wrongly fails (regression from 0.22) #23931

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
louispotok opened this issue Nov 26, 2018 · 5 comments · Fixed by #25260
Closed

join on tz-aware column and tz-aware index wrongly fails (regression from 0.22) #23931

louispotok opened this issue Nov 26, 2018 · 5 comments · Fixed by #25260
Labels
Datetime Datetime data dtype Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype
Milestone

Comments

@louispotok
Copy link
Contributor

Code Sample, a copy-pastable example if possible

df1 = pd.DataFrame(
    {
        'date': pd.date_range(start='2018-01-01', periods=5,tz='America/Chicago'), 
        'vals':list('abcde')
    }
)

df2 = pd.DataFrame(
    {
        'date': pd.date_range(start='2018-01-03',periods=5, tz='America/Chicago'), 
        'vals_2': list('tuvwx')
    }
)
joined = df1.join(df2.set_index('date'),on='date')

Problem description

In 0.23.4, this raises:
ValueError: You are trying to merge on datetime64[ns, America/Chicago] and datetime64[ns] columns. If you wish to proceed you should use pd.concat

This is incorrect - the two columns are both tz-aware and indeed df2.set_index('date').index.dtype == df1.date.dtype returns True.

In 0.22.0, this does not raise, and returns the expected output. Haven't run git bisect yet but seems like the logical next step.

Expected Output

0  2018-01-01 00:00:00-06:00    a    NaN
1  2018-01-02 00:00:00-06:00    b    NaN
2  2018-01-03 00:00:00-06:00    c      t
3  2018-01-04 00:00:00-06:00    d      u
4  2018-01-05 00:00:00-06:00    e      v

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-1072-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 39.2.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: 0.4.0
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Copy link
Contributor

Somewhere in self._get_merge_keys we coerce right's index to a ndarray

ipdb> self.right_join_keys
[array(['2018-01-03T06:00:00.000000000', '2018-01-04T06:00:00.000000000',
       '2018-01-05T06:00:00.000000000', '2018-01-06T06:00:00.000000000',
       '2018-01-07T06:00:00.000000000'], dtype='datetime64[ns]')]
ipdb>

I'd recommend going through core/reshape/merge.py::_MergeOperation._get_merge_keys to see where things go wrong.

@TomAugspurger TomAugspurger added Datetime Datetime data dtype Reshaping Concat, Merge/Join, Stack/Unstack, Explode Dtype Conversions Unexpected or buggy dtype conversions labels Nov 26, 2018
@mroeschke
Copy link
Member

I think #21184 may be affected by the same root cause as this issue.

@louispotok
Copy link
Contributor Author

Thanks @TomAugspurger ! I think this patch should fix the issue (and added tests), if that looks about right I'll submit a PR.

louispotok@b29c6bc

@mroeschke haven't checked if this also fixed #21184 but will take a look.

@zhuoqiang
Copy link

same bug here:

while the following works fine

a = pd.DataFrame({'a': range(5), 'x': pd.date_range('2018-11-01', periods=5, freq='1T', tz=None)})
b = pd.DataFrame({'b': range(5)}, index=pd.date_range('2018-11-01', periods=5, freq='1T', tz=None))
a.join(b, on='x', lsuffix='_')

if we change tz to a specific time zone, like following:

a = pd.DataFrame({'a': range(5), 'x': pd.date_range('2018-11-01', periods=5, freq='1T', tz='utc')})
b = pd.DataFrame({'b': range(5)}, index=pd.date_range('2018-11-01', periods=5, freq='1T', tz='utc'))
a.join(b, on='x', lsuffix='_')

it failed:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-84-30f4c345c06a> in <module>
      1 a = pd.DataFrame({'a': range(5), 'x': pd.date_range('2018-11-01', periods=5, freq='1T', tz='utc')})
      2 b = pd.DataFrame({'b': range(5)}, index=pd.date_range('2018-11-01', periods=5, freq='1T', tz='utc'))
----> 3 a.join(b, on='x', lsuffix='_')

lib/python3.7/site-packages/pandas/core/frame.py in join(self, other, on, how, lsuffix, rsuffix, sort)
   6334         # For SparseDataFrame's benefit
   6335         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 6336                                  rsuffix=rsuffix, sort=sort)
   6337 
   6338     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',

lib/python3.7/site-packages/pandas/core/frame.py in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   6349             return merge(self, other, left_on=on, how=how,
   6350                          left_index=on is None, right_index=True,
-> 6351                          suffixes=(lsuffix, rsuffix), sort=sort)
   6352         else:
   6353             if on is not None:

lib/python3.7/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     59                          right_index=right_index, sort=sort, suffixes=suffixes,
     60                          copy=copy, indicator=indicator,
---> 61                          validate=validate)
     62     return op.get_result()
     63 

lib/python3.7/site-packages/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    553         # validate the merge keys dtypes. We may need to coerce
    554         # to avoid incompat dtypes
--> 555         self._maybe_coerce_merge_keys()
    556 
    557         # If argument passed to validate,

lib/python3.7/site-packages/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self)
    990                 raise ValueError(msg)
    991             elif is_datetime64tz_dtype(lk) and not is_datetime64tz_dtype(rk):
--> 992                 raise ValueError(msg)
    993             elif not is_datetime64tz_dtype(lk) and is_datetime64tz_dtype(rk):
    994                 raise ValueError(msg)

ValueError: You are trying to merge on datetime64[ns, UTC] and datetime64[ns] columns. If you wish to proceed you should use pd.concat

pandas.version == '0.23.4'

@TomAugspurger
Copy link
Contributor

FYI, pd.merge works fine for these. Something is buggy in join before it calls the merge code.

@mroeschke mroeschke added the Timezones Timezone data dtype label Jan 9, 2019
@jreback jreback added this to the 0.24.2 milestone Feb 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants