API: Change default for Index.union sort #25007
pandas/core/indexes/base.py (outdated)
.. versionadded:: 0.24.0

.. versionchanged:: 0.24.0
Should be "0.24.1"
This looks good to me. We should then do the same for intersection and difference?
Not intersection. Probably difference and symmetric_difference.
Just to confirm for posterity, good for 0.24.1? If so I'll get back to it in ~4 hours to finish things off.
For me that would be the best option, but we might want confirmation of others.
Ah yes, I was thinking ahead to the deprecation. But, do we think that
.. versionadded:: 0.24.0

.. versionchanged:: 0.24.0

Changed the default `sort` to None, matching the
Why is this being changed? This is certainly not a regression at all. This was the default behavior.
To be clear: no behaviour is changed. It was indeed the default, it stays the default. It's only the value that encodes the default that is changed (True -> None), so that True can mean something else (=always sort).
OK, maybe that should be made clearer in the docstring.
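To make the distinction in this exchange concrete, here is a minimal pure-Python sketch of a three-valued `sort` flag. It models the semantics described above and is not pandas code: `union_values` is a hypothetical helper, and warning on the `sort=None` fallback is an assumption.

```python
import warnings


def union_values(left, right, sort=None):
    """Hypothetical model of a three-valued ``sort`` flag (not pandas code).

    sort=None  -> try to sort; fall back to unsorted (with a warning) if
                  the values cannot be compared.
    sort=False -> never sort; keep first-seen order.
    sort=True  -> always sort; let the TypeError propagate on failure.
    """
    # Order-preserving union: values from `left` first, then new ones from `right`.
    result = list(dict.fromkeys(list(left) + list(right)))

    if sort is False:
        return result
    try:
        return sorted(result)
    except TypeError:
        if sort is True:
            raise
        # Assumed fallback for sort=None: warn and return the unsorted values.
        warnings.warn("values are not comparable; result is unsorted", RuntimeWarning)
        return result


print(union_values(["b", "a"], ["c"]))              # default (None): sorted when possible
print(union_values(["b", "a"], ["c"], sort=False))  # first-seen order kept
```

With this encoding, existing callers that never passed `sort` keep the "sort if possible" behaviour, while `True` is free to mean "always sort".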
I don't see the try/except around sorting for Index.intersection.
In [11]: a.intersection(a[::-1], sort=True)
Traceback (most recent call last):
File "<ipython-input-11-2e1c550543d3>", line 1, in <module>
a.intersection(a[::-1], sort=True)
File "/Users/taugspurger/sandbox/pandas/pandas/core/indexes/base.py", line 2431, in intersection
taken = sorting.safe_sort(taken.values)
File "/Users/taugspurger/sandbox/pandas/pandas/core/sorting.py", line 459, in safe_sort
ordered = sort_mixed(values)
File "/Users/taugspurger/sandbox/pandas/pandas/core/sorting.py", line 452, in sort_mixed
nums = np.sort(values[~str_pos])
File "/Users/taugspurger/Envs/pandas-dev/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 934, in sort
a.sort(axis=axis, kind=kind, order=order)
File "pandas/_libs/tslibs/timestamps.pyx", line 258, in pandas._libs.tslibs.timestamps._Timestamp.__richcmp__
raise TypeError('Cannot compare type %r with type %r' %
TypeError: Cannot compare type 'Timestamp' with type 'int'
In [12]: a.intersection(a[::-1], sort=False)
Out[12]: Index([1, 2000-01-01 00:00:00], dtype='object')
I do see the special casing when the indexes are equal
Following up on
The first role is served equally well by
In [5]: a = pd.Index(['b', 'a'])
In [6]: a.intersection(a, sort=False)
Out[6]: Index(['b', 'a'], dtype='object')
The second, I don't think we should be adding that behavior where it wasn't previously. And since Index.intersection didn't previously sort, we don't need to add it.
Codecov Report
@@ Coverage Diff @@
## master #25007 +/- ##
===========================================
- Coverage 92.38% 42.88% -49.51%
===========================================
Files 166 166
Lines 52401 52409 +8
===========================================
- Hits 48410 22474 -25936
- Misses 3991 29935 +25944
Codecov Report
@@ Coverage Diff @@
## master #25007 +/- ##
==========================================
+ Coverage 92.37% 92.38% +<.01%
==========================================
Files 166 166
Lines 52397 52416 +19
==========================================
+ Hits 48404 48425 +21
+ Misses 3993 3991 -2
Agreed!
Sorry, not sure what I was looking at .. :) It does have the "directly return if equals" behaviour that needs to handle
Fully agreed a
doc/source/whatsnew/v0.24.1.rst (outdated)
When ``sort=True`` is provided to :meth:`Index.intersection`, the values are always sorted. In 0.24.0,
the values would not be sorted when ``self`` and ``other`` were identical. Pass ``sort=False`` to not
sort the values.
Maybe we should make it clearer that this was new behaviour in 0.24.0, and that no behaviour changed compared to what you could do on 0.23?
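The release-note text under review describes an "inputs are identical" shortcut that skipped sorting in 0.24.0. A rough pure-Python sketch of that interaction follows; `intersection_values` and its `sort_when_equal` switch are hypothetical names used only for illustration, not pandas internals.

```python
def intersection_values(left, right, sort=False, sort_when_equal=True):
    """Hypothetical model of intersection with an "inputs are equal" shortcut.

    sort_when_equal=False mimics the 0.24.0 behaviour described in the note
    (the shortcut skips sorting); True mimics the behaviour the note proposes.
    """
    left, right = list(left), list(right)
    if left == right:
        # Shortcut: identical inputs, so the intersection is `left` itself.
        if sort is True and sort_when_equal:
            return sorted(left)
        return left

    right_set = set(right)
    # Keep the order in which values appear in `left`, dropping duplicates.
    result = list(dict.fromkeys(v for v in left if v in right_set))
    return sorted(result) if sort is True else result


a = ["b", "a"]
print(intersection_values(a, a, sort=True, sort_when_equal=False))  # shortcut wins, unsorted
print(intersection_values(a, a, sort=True))                         # always sorted
print(intersection_values(a, a, sort=False))                        # never sorted
```

The point of the sketch is that an early return has to honour the `sort` flag too, otherwise `sort=True` gives different results depending on whether the two inputs happen to be equal.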
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When ``sort=True`` is provided to :meth:`Index.intersection`, the values are always sorted. In 0.24.0,
the values would not be sorted when ``self`` and ``other`` were identical. Pass ``sort=False`` to not
I am -1 on this change. We do NOT do this elsewhere, e.g. .reindex, so this is extra useless sorting (basically cases 1 and 2 above). I am not sure of the utility of 3 at all. We cannot guarantee sorting, and showing a warning is fine; this has been this way since pandas' inception. I don't see any utility in changing this.
@jreback can you clarify this comparison to reindex? Reindexing does not involve any sorting?
@jorisvandenbossche
Yes, but
not at all. this is the same. changing semantics like this is simply not warranted.
Changes like this need to sit in master. I am -1 on doing this for 0.24.x at all. There is no reason to change at the last minute like this.
Did the same change for symmdiff, so that
The same also needs to be done for
Sorry, yes...
I have not wavered on this and am -1. I see no reason to not simply do the change in 0.25
Because then we would need to deprecate the
@jreback can you please be more specific about what you are -1 on? Which of the following things do you object to:
(would you be able to come to gitter? that might be easier to try to come to an agreement)
No! From
I thought you were OK with it 😢
I really, really, really, think we should be doing this soon. FYI, MultiIndex.difference was different from Index.difference in two ways
(haven't fixed this yet).
this is the problem
For the record I am -1 on this change as is, but @jorisvandenbossche and @TomAugspurger are going ahead.
Since MultiIndex.difference didn't have the silent "don't sort if not possible" behavior, I haven't chosen to implement it yet.
# 0.23.4
In [25]: a = pd.MultiIndex.from_product([[1, pd.Timestamp('2000'), 0], [1, 2]])
In [26]: a.difference([(0, 1)])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-26-d5517c8a9130> in <module>
----> 1 a.difference([(0, 1)])
~/miniconda3/envs/pandas-0.24.0/lib/python3.7/site-packages/pandas/core/indexes/multi.py in difference(self, other, sort)
2984 difference = this.values.take(label_diff)
2985 if sort:
-> 2986 difference = sorted(difference)
2987
2988 if len(difference) == 0:
pandas/_libs/tslibs/timestamps.pyx in pandas._libs.tslibs.timestamps._Timestamp.__richcmp__()
TypeError: Cannot compare type 'Timestamp' with type 'int'
Do people have thoughts on that?
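The traceback above comes from `sorted(difference)` hitting entries that mix integers and timestamps. The same failure can be reproduced with the standard library alone, as a sketch that assumes `datetime` as a stand-in for `pd.Timestamp` and plain tuples as stand-ins for MultiIndex entries (`difference_entries` is a hypothetical helper, not pandas code):

```python
from datetime import datetime
from itertools import product


def difference_entries(entries, other, sort=True):
    """Hypothetical model: drop `other` from `entries`, then optionally sort."""
    other_set = set(other)
    result = [entry for entry in entries if entry not in other_set]
    # Mirrors the `if sort: difference = sorted(difference)` step in the
    # traceback: there is no try/except, so incomparable values raise.
    return sorted(result) if sort else result


# Same shape as MultiIndex.from_product([[1, Timestamp('2000'), 0], [1, 2]]).
entries = list(product([1, datetime(2000, 1, 1), 0], [1, 2]))

try:
    difference_entries(entries, [(0, 1)])
except TypeError as err:
    print("sorting failed:", err)

# Without sorting, the difference itself is well defined.
print(difference_entries(entries, [(0, 1)], sort=False))
```

The set difference is never the problem; only the unconditional sort step is, which is why the question is whether to add a silent fallback there.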
@jreback not if you're -1. Pandas operates by consensus.
@TomAugspurger if I am -1 then there is clearly not consensus, as my concerns as laid out have not been addressed. This is way too many changes, made too fast, for a minor release.
@jreback concrete proposal:
What do you think about this? It would allow us to go forward with releasing 0.24.1 today without needing to resolve the full discussion today. And if not OK, can you please be specific about what in there does not address your concerns?
If you’re -1 then we just aren’t doing it.
@jorisvandenbossche your proposal is fine.
@jorisvandenbossche do you have time to implement
@TomAugspurger yes, will do that
Opened #25151 for the rest.
Closes #24959
Haven't done MultiIndex yet, just opening for discussion on if we should do this for 0.24.1.