BUG: DataFrame outer merge changes key columns from int64 to float64 #8596
Comments
hmm, this should work. dig in if you can.
EDIT: no, it's not.
which means that frames are merged along axis=1, but this operation requires adding rows to both frames, which introduces NaNs and converts integral columns. We should probably concatenate the columns present in both frames along axis=0 (and reorder the resulting rows as necessary) beforehand in …
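A rough sketch of that idea (the helper below is illustrative, not the actual pandas internals): build the union of keys while they are still int64, then join each frame onto it, so the key column itself never passes through the NaN-introducing path.

```python
import pandas as pd

def outer_merge_keep_int_keys(left, right, key):
    # Hypothetical illustration: concatenate the key columns along axis=0
    # first, so the union of keys is computed as int64, then left-join both
    # frames onto that union. Only the value columns pick up NaNs.
    keys = (pd.concat([left[key], right[key]])
              .drop_duplicates()
              .sort_values()
              .reset_index(drop=True)
              .to_frame())
    return keys.merge(left, on=key, how='left').merge(right, on=key, how='left')
```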
I think it makes sense post-merge (and can do in …)
The missing values aren't filled in until … A quick look at performance didn't show anything bad.
If you think the approach is sound, I'll add a few more tests (e.g. for multi-key joins), make sure we have a vbench test that exercises the new code, and tidy things up a bit.
You may lose the lower digits of int64 numbers > 2**53 when converting int64 -> float64 -> int64:

In [62]: np.float64((1<<53))
Out[62]: 9007199254740992.0
In [63]: np.float64((1<<53)) + 1.
Out[63]: 9007199254740992.0
In [64]: df2 = pd.DataFrame({'k': (1<<53) + np.delete(np.arange(5), -2), 'v2': np.arange(4)}); df2
Out[64]:
k v2
0 9007199254740992 0
1 9007199254740993 1
2 9007199254740994 2
3 9007199254740996 3
In [65]: df1 = pd.DataFrame({'k': (1<<53) + np.delete(np.arange(5), -1), 'v1': np.arange(4)}); df1
Out[65]:
k v1
0 9007199254740992 0
1 9007199254740993 1
2 9007199254740994 2
3 9007199254740995 3
In [66]: m = df1.merge(df2, how='outer'); m
Out[66]:
k v1 v2
0 9.007199e+15 0 0
1 9.007199e+15 1 1
2 9.007199e+15 2 2
3 9.007199e+15 3 NaN
4 9.007199e+15 NaN 3
In [67]: m['k'].astype(np.int64)
Out[67]:
0 9007199254740992
1 9007199254740992
2 9007199254740994
3 9007199254740996
4 9007199254740996
Name: k, dtype: int64

It is a far-fetched example, but I think this should be fixed (at least someday).
yeah, I recall talking about this on an issue a while back. What we need is a quick test for when this type of conversion is lossy (then maybe raise/warn) - maybe some sort of bit shift and sum or something.
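A cheap check along those lines, just as a sketch (this is not the test pandas actually uses), is to round-trip the values and compare:

```python
import numpy as np

def int64_round_trips_through_float64(values):
    # True when int64 -> float64 -> int64 reproduces every value exactly,
    # i.e. the cast the merge performs would be lossless.
    values = np.asarray(values, dtype=np.int64)
    return bool((values.astype(np.float64).astype(np.int64) == values).all())

int64_round_trips_through_float64([1, 2, 3])        # True
int64_round_trips_through_float64([(1 << 53) + 1])  # False: exceeds float64's 53-bit mantissa
```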
Here's an implementation of …
@miketkelly want to do a PR on the branch above? See how it does passing Travis (I haven't really looked through it yet though). If you can do this in the next few days, we can squeeze this into 0.15.2. Please also post a perf summary (to show whether there is any change).
Bump. I'm experiencing this when doing outer joins as well. Does anyone have a work-around for the time being? E.g., an easy way to either convert the data types back or to prevent it from happening in the first place? I don't want to end up in a position where going from …
no easy way around this. you can simply convert the dtypes for the keys after.
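A minimal sketch of that workaround (the frames here are made up; it is only safe when the key values fit exactly in float64, i.e. stay below 2**53, per the precision caveat above):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'k': np.array([1, 2, 3], dtype=np.int64), 'v1': [10, 20, 30]})
df2 = pd.DataFrame({'k': np.array([2, 3, 4], dtype=np.int64), 'v2': [40, 50, 60]})

merged = df1.merge(df2, on='k', how='outer')   # 'k' may come back as float64 on affected versions
merged['k'] = merged['k'].astype(np.int64)     # cast the key back; the NaN-holding value columns stay float
```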
Presently having an issue with this. The loss of precision from int64 -> float64 is problematic for me, as I'm working with data that is stored as int64. I'm not sure that converting the dtypes after the merge operation will work if precision has already been lost, if I'm understanding @jreback's solution correctly.
closes pandas-dev#8596; xref pandas-dev#13169, as assignment of an Index of bools is not retaining dtype
Should this be fixed in 0.19.2? I see this: …
EDIT: disregard that, it seems that some data wasn't available, so the column was converted to …
I have the following issue: the dtype of a column other than the key changes from int to float after a left join with pd.merge(). Please tell me if you need more information to dig in!
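A minimal sketch of what is likely happening here (the frames are illustrative, not from the report): a left join keeps every left row, so left keys without a match get NaN in the right-hand columns, and those columns are upcast from int64 to float64 even though the key itself is untouched.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'a': [10, 20, 30]})
right = pd.DataFrame({'id': [1, 2], 'b': [100, 200]})

out = pd.merge(left, right, on='id', how='left')
# 'b' has no value for id=3, so it becomes float64 with a NaN there,
# while 'id' and 'a' stay int64.
print(out.dtypes)
```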
I have the same issue as @MagdalenaDeschner, …

Interestingly, …
@suvayu that is the expected outcome for outer; you are introducing NaNs, which force the conversion.
Hmm, I guess that makes sense. Sorry for the noise.
I definitely lose data on the int -> float conversion during the merge. When I cast back from float -> int, my IDs stop working. Any idea how this can be avoided, if at all?
@theholy7 old-school numpy integers don't have NaN values, so the only option is to use a different dtype. Depending on the performance drop you can bear, your options are: …
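The list of options did not survive above; as one illustration (not necessarily what the commenter suggested), pandas 0.24 and later ship a nullable Int64 extension dtype that can hold missing values without falling back to float:

```python
import pandas as pd

df1 = pd.DataFrame({'k': pd.array([1, 2, 3], dtype='Int64'), 'v1': pd.array([10, 20, 30], dtype='Int64')})
df2 = pd.DataFrame({'k': pd.array([2, 3, 4], dtype='Int64'), 'v2': pd.array([40, 50, 60], dtype='Int64')})

out = df1.merge(df2, on='k', how='outer')
# On recent pandas versions, missing entries become pd.NA instead of NaN,
# so every column keeps the Int64 dtype.
print(out.dtypes)
```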
Was expecting key to stay int64, since a merge can't introduce missing key values if none were present in the inputs.

Version 0.15.0-6-g403f38d
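A minimal sketch of the reported behaviour (the frames are illustrative, not taken from the report):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'key': np.array([1, 2, 3], dtype=np.int64), 'v1': [10, 20, 30]})
right = pd.DataFrame({'key': np.array([2, 3, 4], dtype=np.int64), 'v2': [200, 300, 400]})

merged = left.merge(right, on='key', how='outer')
# Every key in the result exists in one of the inputs, yet on affected pandas
# versions 'key' comes back as float64; only v1/v2 actually need NaNs.
print(merged.dtypes)
```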