
BUG: DataFrame outer merge changes key columns from int64 to float64 #8596


Closed
miketkelly opened this issue Oct 21, 2014 · 18 comments
Labels
Bug · Dtype Conversions (Unexpected or buggy dtype conversions) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@miketkelly

import pandas as pd

df1 = pd.DataFrame({'key': [1,2,3,4], 'val1': [1,2,3,4]})
df2 = pd.DataFrame({'key': [1,2,3,5], 'val2': [1,2,3,4]})

df = df1.merge(df2, how='outer')

I was expecting key to stay int64, since an outer merge can't introduce missing key values when none were present in the inputs.

print df.dtypes

key     float64
val1    float64
val2    float64
dtype: object

Version 0.15.0-6-g403f38d

@jreback
Contributor

jreback commented Oct 21, 2014

hmm, this should work. dig in if you can.

@jreback jreback added the Bug, Dtype Conversions, and Reshaping labels Oct 21, 2014
@jreback jreback added this to the 0.15.1 milestone Oct 21, 2014
@immerrr
Contributor

immerrr commented Oct 22, 2014

I remember paying special attention to these cases when doing the blockmanager refactoring; it may be related to the categorical revamp in 0.15.

EDIT: no, it's not. concatenate_block_managers gets the following:

[(BlockManager
  Items: Index([u'key', u'val1'], dtype='object')
  Axis 1: Int64Index([0, 1, 2, 3], dtype='int64')
  IntBlock: slice(0, 2, 1), 2 x 4, dtype: int64,
  {1: array([ 0,  1,  2,  3, -1])}),
 (BlockManager
  Items: Index([u'val2'], dtype='object')
  Axis 1: Int64Index([0, 1, 2, 3], dtype='int64')
  IntBlock: slice(0, 1, 1), 1 x 4, dtype: int64,
  {1: array([ 0,  1,  2, -1,  3])})]

which means that the frames are merged along axis=1, but this operation requires adding rows to both frames, which introduces NaNs and upcasts the integer columns. We should probably concatenate columns present in both frames along axis=0 (and reorder the resulting rows as necessary) beforehand in merge._MergeOperation.
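
To see the upcast in isolation (a minimal sketch, using only stock pandas): reindexing an int64 column to a row that doesn't exist triggers the same NaN-driven int64 -> float64 conversion that the outer merge performs internally.

import pandas as pd

s = pd.Series([1, 2, 3, 4], dtype='int64')
# Asking for a row label that doesn't exist (4) introduces NaN,
# which silently upcasts the whole column to float64:
print(s.reindex([0, 1, 2, 3, 4]).dtype)  # float64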

@jreback
Contributor

jreback commented Oct 22, 2014

I think it makes sense to do this post-merge (and it can be done in concatenate_join_unit), something like this: https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L1063 (which is cheap unless it actually changes the dtype), though we may have to split out the results
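
For illustration, a rough sketch of that kind of cheap post-merge downcast (a hypothetical helper, not the actual pandas code linked above): cast float64 back to int64 only when it is provably lossless.

import numpy as np

def maybe_downcast_to_int64(values):
    # Hypothetical helper: downcast only when no NaNs are present and
    # the round trip reproduces every value exactly.
    if values.dtype == np.float64 and not np.isnan(values).any():
        as_int = values.astype(np.int64)
        if (as_int == values).all():
            return as_int
    return values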

@miketkelly
Author

The missing values aren't filled in until _maybe_add_join_keys, so I made the change there: 90ef7dfc528fffd4d71a5c544864891a44c2d096. One test failed initially, but it just seemed to be working around the issue. For now I commented out the workaround.
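
For context, the underlying idea is to build the merged key column directly from the two original int64 key arrays and the take-indexers (where -1 marks rows absent from that side), so the key values never pass through float64. A minimal sketch with hypothetical names, not the actual pandas internals:

import numpy as np

def join_keys(lkey, rkey, lindexer, rindexer):
    # Where the left indexer is -1 the row exists only on the right,
    # so take the key from rkey; otherwise take it from lkey.
    out = np.empty(len(lindexer), dtype=np.int64)
    lmask = lindexer != -1
    out[lmask] = lkey.take(lindexer[lmask])
    out[~lmask] = rkey.take(rindexer[~lmask])
    return out

# With the indexers from the BlockManager dump above:
join_keys(np.array([1, 2, 3, 4]), np.array([1, 2, 3, 5]),
          np.array([0, 1, 2, 3, -1]), np.array([0, 1, 2, -1, 3]))
# -> array([1, 2, 3, 4, 5]), still int64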

A quick look at performance didn't show anything bad:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
join_non_unique_equal                        |   0.6567 |   0.6644 |   0.9884 |
join_dataframe_index_multi                   |  15.1957 |  15.3310 |   0.9912 |
join_dataframe_index_single_key_bigger_sort  |  11.3726 |  11.4484 |   0.9934 |
strings_join_split                           |  32.7196 |  32.6459 |   1.0023 |
merge_2intkey_sort                           |  28.4477 |  28.3287 |   1.0042 |
merge_2intkey_nosort                         |  11.7714 |  11.6763 |   1.0081 |
join_dataframe_integer_2key                  |   3.6847 |   3.6399 |   1.0123 |
left_outer_join_index                        | 2123.6037 | 2092.0046 |   1.0151 |
join_dataframe_index_single_key_small        |   9.6610 |   9.4910 |   1.0179 |
join_dataframe_index_single_key_bigger       |  10.1710 |   9.8880 |   1.0286 |
join_dataframe_integer_key                   |   1.4577 |   1.4137 |   1.0311 |
-------------------------------------------------------------------------------

If you think the approach is sound, I'll add a few more tests (e.g. for multi-key joins), make sure we have a vbench test that exercises the new code, and tidy things up a bit.

@immerrr
Contributor

immerrr commented Oct 23, 2014

You may lose lower digits of int64 numbers > 2**53 when converting int64 -> float64 -> int64:

In [62]: np.float64((1<<53))
Out[62]: 9007199254740992.0

In [63]: np.float64((1<<53)) + 1.
Out[63]: 9007199254740992.0

In [64]: df2 = pd.DataFrame({'k': (1<<53) + np.delete(np.arange(5), -2), 'v2': np.arange(4)}); df2
Out[64]: 
                  k  v2
0  9007199254740992   0
1  9007199254740993   1
2  9007199254740994   2
3  9007199254740996   3

In [65]: df1 = pd.DataFrame({'k': (1<<53) + np.delete(np.arange(5), -1), 'v1': np.arange(4)}); df1
Out[65]: 
                  k  v1
0  9007199254740992   0
1  9007199254740993   1
2  9007199254740994   2
3  9007199254740995   3

In [66]: m = df1.merge(df2, how='outer'); m
Out[66]: 
              k  v1  v2
0  9.007199e+15   0   0
1  9.007199e+15   1   1
2  9.007199e+15   2   2
3  9.007199e+15   3 NaN
4  9.007199e+15 NaN   3

In [67]: m['k'].astype(np.int64)
Out[67]: 
0    9007199254740992
1    9007199254740992
2    9007199254740994
3    9007199254740996
4    9007199254740996
Name: k, dtype: int64

It is a far-fetched example, but I think this should be fixed (at least someday).

@jreback
Contributor

jreback commented Oct 23, 2014

yeah, I recall talking about this on an issue a while back.
I think the problem with the int -> float -> int conversion was that it's not easy to figure out when it's going to fail because of precision - well, it's easy, just not performant for the very few cases where this would actually happen.

what we need is a quick test for when this type of conversion is lossy (then maybe raise/warn) - maybe some sort of bit shift and sum or something
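
One cheap, vectorized form of such a test (a sketch, not anything pandas ships): round-trip the array through float64 and compare.

import numpy as np

def int64_survives_float64(arr):
    # True only if every value survives int64 -> float64 -> int64 unchanged.
    return bool((arr.astype(np.float64).astype(np.int64) == arr).all())

print(int64_survives_float64(np.array([1 << 52, 1 << 53])))  # True
print(int64_survives_float64(np.array([(1 << 53) + 1])))     # False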

@miketkelly
Author

Here's an implementation of _maybe_add_join_keys that also solves the int->float->int problems: https://github.com/mtkni/pandas/commit/e79b97815d3ec884261674e91af2c0a138debead. The performance seems comparable to the previous implementation (no material vbench differences), but that needs more study.

@jreback
Contributor

jreback commented Nov 29, 2014

@miketkelly want to do a PR on the branch above? see how it does passing travis (I haven't really looked through it yet though). If you can do this in the next few days we can squeeze it into 0.15.2. Please also post a perf summary (to show whether there is any change).

@jreback jreback modified the milestones: 0.16.0, 0.15.2 Dec 4, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@bstempi

bstempi commented Apr 12, 2016

Bump. I'm experiencing this when doing outer joins as well.

Does anyone have a work-around for the time being? E.g., an easy way to either convert the data types back or to prevent the conversion from happening in the first place? I don't want to end up in a position where going from int64 -> float64 causes any issues.

@jreback
Contributor

jreback commented Apr 12, 2016

no easy way around this. you can simply convert the dtypes for the keys after.
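
Using the frames from the top of the thread, that workaround is a one-liner; the guard is just a sanity check, since the cast is only safe while the key column really contains no NaN (and the keys stay below 2**53):

df = df1.merge(df2, how='outer')
# An outer merge on keys present in the inputs leaves no NaN in 'key',
# so casting back is lossless here:
if not df['key'].isnull().any():
    df['key'] = df['key'].astype('int64')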

@timseries

Presently having an issue with this. The loss of precision from int64 -> float64 is problematic for me, as I'm working with data that is stored as int64.

I'm not sure converting the dtypes back after the merge will work if precision has already been lost, if I'm understanding @jreback's solution correctly.

@abbradar

abbradar commented Apr 26, 2017

Should this be fixed in 0.19.2? I see this:

info = pd.read_csv(filename + '.info', sep=' ', index_col=0, names=('req_id', 'user_req_id', 'recv_ticks'), dtype={'req_id': np.uint32, 'user_req_id': np.uint32, 'recv_ticks': np.uint32}, header=None)
send = pd.read_csv(filename + '.send', sep=' ', index_col=0, names=('req_id', 'send_ts'), dtype={'req_id': np.uint32, 'send_ticks': 'datetime64[ns]'}, header=None)
transmit = pd.read_csv(filename + '.transmit', sep=' ', index_col=0, names=('req_id', 'ts_ticks'), dtype={'req_id': np.uint32, 'ts_ticks': np.uint32}, header=None)

data = info.join(send).join(transmit)

print(transmit['ts_ticks'].dtype, data['ts_ticks'].dtype)
# uint32 float64

EDIT: disregard that, it seems that some data wasn't available, so the column was converted to float64 to represent NAs.
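
A quick way to confirm that diagnosis, reusing the names from the snippet above, is to count the missing entries the joins introduced:

# A nonzero count means the joins introduced NAs, which is what
# forced the uint32 column up to float64:
print(data['ts_ticks'].isnull().sum())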

@lena-kuhn

I have the following issue: the dtype of a column other than the key changes from int to float after a left join with pd.merge(). Please tell me if you need more information to dig in!
merge2 = pd.DataFrame(pd.merge(merge1, appevents, how="left", on="event_id"))

@suvayu
Contributor

suvayu commented May 10, 2018

I have the same issue as @MagdalenaDeschner: the dtype of other columns changes to float. This is troublesome because in my case some of those columns store ids and counts. You can see the issue by varying the example in the changelog a bit.

>>> df1 = pd.DataFrame({'key': [1, 2], 'v1': [20, 30]})
>>> df2 = pd.DataFrame({'key': [2, 2], 'id': [1, 2], 'v2': [21, 31]})
>>> df1
   key  v1
0    1  20
1    2  30
>>> df2
   id  key  v2
0   1    2  21
1   2    2  31
>>> pd.merge(df1, df2, how='outer')                                          
   key  v1   id    v2
0    1  20  NaN   NaN
1    2  30  1.0  21.0
2    2  30  2.0  31.0

Interestingly, inner does not have the issue; it also seems to be happening only for columns from one of the dataframes (maybe the repeated key value in the second dataframe is why).

>>> pd.merge(df1, df2, how='inner')
   key  v1  id  v2
0    2  30   1  21
1    2  30   2  31

@jreback
Contributor

jreback commented May 10, 2018

@suvayu that is the expected outcome for outer; the NaNs you are introducing force the conversion

@suvayu
Contributor

suvayu commented May 10, 2018

Hmm, I guess that makes sense. Sorry for the noise.

@theholy7

theholy7 commented Mar 6, 2020

I definitely lose data on the int -> float conversion during the merge. When I cast back from float -> int, my ids stop working.

Any idea how this can be avoided, if at all?

@immerrr
Contributor

immerrr commented Mar 15, 2020

@theholy7 old-school numpy integers don't have NaN values, so the only option is to use a different dtype. Depending on the performance drop you can bear, your options are:

  • dtype='object' -- probably the worst possible case: all values are stored as garbage-collected PyObjects (consuming 28 bytes instead of 8) and no vectorization is possible
  • dtype='category' -- all values are stored as integer indexes into a separate numpy array, much cheaper memory-wise, but arithmetic operations might be tricky
  • dtype='Int64' -- the new kid on the block, should be the most efficient in both memory and arithmetic, but still marked as experimental API (see the sketch below)
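
A minimal sketch of the 'Int64' option, reusing the frames from the original report and assuming a pandas version that ships the nullable extension dtype (0.24+):

import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3, 4], 'val1': [1, 2, 3, 4]}).astype('Int64')
df2 = pd.DataFrame({'key': [1, 2, 3, 5], 'val2': [1, 2, 3, 4]}).astype('Int64')

df = df1.merge(df2, how='outer')
print(df.dtypes)
# key     Int64
# val1    Int64
# val2    Int64
# Missing entries stay missing inside the integer array (pd.NA in
# recent versions) instead of forcing a float64 upcast.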
