-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DOC/TST: is pd.unique and the order it returns API? #9346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I would like to suggest that Numpy actually specifies that the returned values are sorted: http://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html
|
iirc pd.unique has a much greater perf and keeps things in the same order as compared to np.unique mainly by using a hashable impl however I don't know the actual guarantees |
Just as a referece: In[11]: s = pd.Series([1, 3, 2, np.nan, 1,4])
In[12]: s.unique()
Out[12]: array([ 1., 3., 2., nan, 4.])
In[13]: np.unique(s)
Out[13]: array([ 1., 2., 3., 4., nan])
In[14]: s = pd.Series(["a", "c", "b", np.nan, "a", "d"])
In[15]: s.unique()
Out[15]: array(['a', 'c', 'b', nan, 'd'], dtype=object)
In[16]: np.unique(s)
Out[16]:
array(['a', 'b', 'c', 'd', 'nan'],
dtype='|S3') |
The underlying implementation is defined in hashtable.pyx, e.g.,: https://github.com/pydata/pandas/blob/3030bba1f151160f7e4cdfba8d3a00276b07be17/pandas/hashtable.pyx#L785 So it pretty clearly does not sort, just loops through the values (using a hashtable) and appends when not found. I like the non-sorting behavior of pd.unique for three reasons:
|
Looking briefly at @shoyer is right about the looping (was just going to post that), and I agree that non-sorting is desirable since it's more composable (and faster). |
Can someone give me an example of "more composable"? I buy the faster, but I haven't seen a case where a "sorted by appearance" sorting of unique is better than a sorted unique. |
@JanSchulz |
The problem I see here is that the current order is kind of random as it depends on the implementation of hashlib which drives unique. If that would change, unique would also change or, if your proposal would be chosen, would need a special sorting step which would be potentially costlier than simple sorting. I also don't see much value in having a "in order of appearance" order of unique. My datasets usually are sorted by some (arbitrary) id and so "order of appearance" makes no sense and I would usually have to sort that anyway. And I don't get why I should sort my complete dataframe just to get a sorted unique and thereby a sorted plot like it is implied in the original seaborn bug report. |
Just to be entirely clear, the order of unique here is entirely determined by the Right now, the order from unique this is technically an implementation detail, but it's one that is assuredly relied on implicitly in user code. The only way it would break it is if we changed the underlying implementation of unique, which seems unlikely. I think we should make it API so users can safely rely on it and predict exactly what it will do. @JanSchulz not quite sure what you're getting at there:
You don't need to sort your complete frame -- you can use |
Ok, just saw it using the klib hastable and didn't look up the actual implementation. The implemenation seems to work also in other hastyble implementation (as long as a "this value is already in the table" indicator exists), but what if a new version of hastable brings with it a Re the last thing: What I don't buy is that to get the unique values sorted in a special way (for plotting,...), you want/need to manipulate the dataframe. IMO that's a waste of CPU cycles, whether it is Making |
@JanSchulz If you want sorted unique values, you can use This works for every type other than categorical. If you want to add a
This may very well be possible but it seems pretty far-fetched to me. |
I'm not so much arguing for sorting but for making it clear that the order is undefined and can change. I do think that seaborn should sort, though. |
Here are my 2c.
so @shoyer top-of-pr is fine by me. Also ok with making the ordering 'undefined'. |
Looking at the above discussion, I think there are three possibilities for
I personally don't think that case 1) is a good idea, as people will always start relying on the output order in their code, even if we document that they shouldn't. So I would personally strive to have a well-defined order (and don't let users rely upon an implementation detail). So, if we think we can guarantee the output order with the Hashtable implementation, why not documenting this? So people can safely rely on this, as they already will do. |
Just for the reference: numpy sorts since 2006. Not sure how to get to the earlier unique implementation: https://github.com/numpy/numpy/blob/0ab9ae36e8484d06f2a8e70c6fe45004e73408b4/numpy/lib/arraysetops.py |
jumping in here...
API guarantees are not the same as documenting implementation details. confusing the two makes for bad maintainability and bad API design.
... and they can't make user code run faster by simply dropping in the faster algorithm as a result, because they made unique and sorted unique the same thing. exactly. EDIT: maybe it would be good to document that |
suggestion is to add a |
And I think we only need that `kind` keyword if we ever move to an
implementation that doesn't preserve order.
Then we can maintain backwards compat with the `kind` argument, and move
people to the new behavior with a FutureWarning.
|
you'll just bomb lots of users with useless false warnings. pandas should leave unique as it is, and ask (only) users relying on undocumented behavior to fix their code now and use the new method/keyword. after that if a better algo comes along you put it in without breaking any code and make things faster for all users of |
@shoyer this is just docs at this point (maybe some tests as well), right? |
@jreback Mostly yes, though I might also do some tweaks to pd.unique to ensure it works on all dtypes (e.g., on categorical) |
ok, i'll move to 0.16.1 (don't want to hold us up on this issue). as its not going to be an api change, just docs,tests and such. |
Looking at the API docs, neither the function
pd.unique
nor the order of the unique values fromunique
is mentioned. I would like to:pd.unique
to API > General functions > Data manipulationsSeries.unique()
andunique()
return values in order of appearanceunique
(untested directly AFAICT, though I'm sure it's relied on implicitly)Any objections? If not, I'll put together a PR.
This lack of documentation has caused some hesitation for relying on this functionality in Seaborn: mwaskom/seaborn#361
The text was updated successfully, but these errors were encountered: