BUG: Index set operations issues #13432
The issue is with the sorting:
In [8]: pd.Index([0, 1, 'A', 'B']).sort_values()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-cd271f2d641b> in <module>()
----> 1 pd.Index([0, 1, 'A', 'B']).sort_values()
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/indexes/base.py in sort_values(self, return_indexer, ascending)
1565 Return sorted copy of Index
1566 """
-> 1567 _as = self.argsort()
1568 if not ascending:
1569 _as = _as[::-1]
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/indexes/base.py in argsort(self, *args, **kwargs)
1639 if result is None:
1640 result = np.array(self)
-> 1641 return result.argsort(*args, **kwargs)
1642
1643 def __add__(self, other):
TypeError: unorderable types: str() > int()
We could try removing the sorting
|
These should be catching the |
I thought both difference functions could be rewritten in the style of
BTW, not directly related but still in a set function:
idx1 = pd.Index([1, 2], name='A')
idx2 = pd.Index([], name='B')
idx1.union(idx2)
Out[26]: Index([], dtype='object')
idx2.union(idx1)
Out[27]: Index([], dtype='object')
|
The consensus name: the names must agree or one of them is None; otherwise it is None.
Changing the sorting is a big deal; I suspect lots of things will break. Further, it provides a nice guarantee on the result set. These are essentially set operations into an ordered list.
|
Yes, this is what is written in the docstring. But it should affect only the name, not the values of the output. Am I right? |
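The naming rule discussed in the two comments above can be sketched in plain Python. This is only one reading of the rule as stated in this thread, not pandas' implementation: `consensus_name` is a hypothetical helper, and the choice to fall back to the non-None name when only one side is named is an assumption. Note that it looks at names only, so it cannot change the values of the result.

```python
def consensus_name(left_name, right_name):
    """Pick the name for the result of a set operation.

    Hypothetical helper illustrating the rule from this thread;
    it is not part of the pandas API.
    """
    if left_name == right_name:
        # Names agree (including both being None): keep that name.
        return left_name
    if left_name is None:
        # Assumption: only one side is named, so use that name.
        return right_name
    if right_name is None:
        return left_name
    # Both sides are named but the names differ: no consensus.
    return None


print(consensus_name('A', 'A'))
print(consensus_name('A', None))
print(consensus_name('A', 'B'))
```

Under this reading, the `idx1.union(idx2)` example with names 'A' and 'B' would be expected to lose its name while keeping the values 1 and 2.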
I should have mentioned that
df = pd.DataFrame([[0, 1, 2]], columns = ['A', 'B', 0])
df.groupby('A').sum()
Traceback (most recent call last):
File "/usr/share/python3.5/site-packages/IPython/core/interactiveshell.py", line 3066, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-225-c27b3941f9c8>", line 1, in <module>
df.groupby('A').sum()
File "/usr/local/lib64/python3.5/site-packages/pandas/core/groupby.py", line 97, in f
self._set_selection_from_grouper()
File "/usr/local/lib64/python3.5/site-packages/pandas/core/groupby.py", line 469, in _set_selection_from_grouper
self._group_selection = ax.difference(Index(groupers)).tolist()
File "/usr/local/lib64/python3.5/site-packages/pandas/indexes/base.py", line 1861, in difference
theDiff = sorted(set(self) - set(other))
TypeError: unorderable types: str() < int() |
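The last frame of that traceback can be reproduced without pandas. The following is a minimal sketch, assuming Python 3 and using plain lists in place of the Index objects; the variable names are illustrative only.

```python
# Column labels from the DataFrame above, and the label used for grouping.
columns = ['A', 'B', 0]
groupers = ['A']

# Same shape as the failing line: a set difference followed by sorted().
remaining = set(columns) - set(groupers)

try:
    result = sorted(remaining)
except TypeError as exc:
    # Python 3 refuses to order a str against an int.
    result = None
    print("sorting failed:", exc)
```

The set difference itself succeeds; only the sort of the mixed str/int result raises.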
Bugs in
|
@pijucha yeah these duplicates are prob not tested much if at all. |
I have an almost ready PR that fixes a part of this issue, and also #12044 and #12814. But before I submit it I have some questions:
idx1 = pd.Index([1, np.nan, 2])
idx2 = pd.Index([1, np.nan, 3])
idx1.union(idx2)
Out[262]: Float64Index([1.0, 2.0, 3.0, nan], dtype='float64')
idx1.intersection(idx2)
Out[263]: Float64Index([1.0, nan], dtype='float64')
idx1.difference(idx2)
Out[264]: Float64Index([nan, 2.0], dtype='float64')
idx1.symmetric_difference(idx2)
Out[265]: Float64Index([nan, 2.0, 3.0], dtype='float64')
Update For
|
1. Added an internal `safe_sort` to safely sort mixed-integer arrays in Python3.
2. Changed Index.difference and Index.symmetric_difference in order to:
   - sort mixed-int Indexes (pandas-dev#13432)
   - improve performance (pandas-dev#12044)
3. Fixed DataFrame.join which raised in Python3 with mixed-int non-unique indexes (issue with sorting mixed-ints, pandas-dev#12814)
4. Fixed Index.union returning an empty Index when one of arguments was a named empty Index (pandas-dev#13432)
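To illustrate the idea behind sorting mixed-integer values safely, here is a rough pure-Python sketch. It is not the `safe_sort` from the PR: `sort_mixed` is a hypothetical name, and ordering non-strings before strings is an assumption made for the example.

```python
def sort_mixed(values):
    """Sort values that may mix numbers and strings.

    Hypothetical helper: strings and non-strings are sorted separately,
    so a str is never compared against an int.
    """
    values = list(values)
    strings = sorted(v for v in values if isinstance(v, str))
    others = sorted(v for v in values if not isinstance(v, str))
    # Assumed ordering: non-strings first, then strings.
    return others + strings


print(sort_mixed(['B', 1, 'A', 0]))
print(sort_mixed({'A', 'B', 0} - {'A'}))
```

The second call mirrors a set difference on mixed labels, which is the case that raised in Python 3.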
Another example of inconsistent sorting. This output of
But python 2 is unpredictable with mixed type indexes:
In [3]: mixed = pd.Index([0, 'a', 1])
# not sorted
In [4]: mixed.union([1, 2])
Out[4]: Index([0, u'a', 1, 2], dtype='object')
# but this is sorted
In [5]: mixed.union([1])
Out[5]: Index([0, 1, u'a'], dtype='object')
In [6]: pd.show_versions()
...
python: 2.7.11.final.0
pandas: 0.18.1+218.g506520b
(Python 3 sorts neither of these.)
|
There's an interaction between non-uniqueness and NaNs: The NaNs are dropped from the intersection if there's more than one:
In [19]: a = pd.Index([1, 2, float('nan')])
In [20]: a & pd.Index([2, np.nan])
Out[20]: Float64Index([2.0, nan], dtype='float64')
In [21]: a & pd.Index([2, np.nan, np.nan])
Out[21]: Float64Index([2.0], dtype='float64')
It's any duplicates in the RHS, not just NaN:
In [25]: a = pd.Index([1, 2, 2, float('nan')])
In [26]: a & pd.Index([2, 2, np.nan])
Out[26]: Float64Index([2.0, 2.0], dtype='float64')
In [27]: a & pd.Index([2, np.nan])
Out[27]: Float64Index([2.0, 2.0, nan], dtype='float64')
|
Proposal for duplicate set ops
Intersection
The output of
>>> pd.Index(['0', '0', '1', '1']).intersection(pd.Index(['0', '1', '1', '2']))
Index(['0', '1', '1'])
Union
The output of
>>> pd.Index(['0', '0', '1', '1']).union(pd.Index(['0', '1', '1', '2']))
Index(['0', '0', '1', '1', '2'])
This matches the definitions on Wikipedia and in http://multiset.readthedocs.io/en/stable/
|
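The multiset semantics proposed here can be checked with the standard library. This is a small sketch using `collections.Counter`, whose `&` and `|` operators take the minimum and maximum multiplicity respectively; the helper names are made up for the example and nothing here is pandas code.

```python
from collections import Counter


def multiset_intersection(left, right):
    # Each element keeps the smaller of its two multiplicities.
    common = Counter(left) & Counter(right)
    return sorted(common.elements())


def multiset_union(left, right):
    # Each element keeps the larger of its two multiplicities.
    combined = Counter(left) | Counter(right)
    return sorted(combined.elements())


left = ['0', '0', '1', '1']
right = ['0', '1', '1', '2']
print(multiset_intersection(left, right))  # ['0', '1', '1']
print(multiset_union(left, right))         # ['0', '0', '1', '1', '2']
```

Both results agree with the outputs given in the proposal.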
Hello,
When using MultiIndex with Categorical levels, I found that the Categorical dtype is lost during set operations. The following code shows the effect:
import pandas as pd
pd.show_versions()
cat = pd.CategoricalDtype(categories=[1, 2, 3, 4])
df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 2, 1], 'b': [4, 3, 2, 3, 1, 2, 3]}, dtype=cat)
# flat categorical index, no MultiIndex
df_cat_ind = df.set_index('a', drop=False)
diff_ind = df_cat_ind.index.difference(df_cat_ind.head(3).index)
print(diff_ind.dtype) # -> category
diff_ind = df_cat_ind.index.union(df_cat_ind.head(3).index)
print(diff_ind.dtype) # -> category
# same set operations on MultiIndex, categorical gets lost
df_cat_multi_ind = df.set_index(['a', 'b'], drop=False)
# just make sure, we have a categorical in the first place
print(df_cat_multi_ind.index.get_level_values(0).dtype) # -> category
print(df_cat_multi_ind.index.get_level_values(1).dtype) # -> category
diff_multi_ind = df_cat_multi_ind.index.difference(df_cat_multi_ind.head(3).index)
print(diff_multi_ind.get_level_values(0).dtype) # -> int64
print(diff_multi_ind.get_level_values(1).dtype) # -> int64
diff_multi_ind = df_cat_multi_ind.index.union(df_cat_multi_ind.head(3).index)
print(diff_multi_ind.get_level_values(0).dtype) # -> int64
print(diff_multi_ind.get_level_values(1).dtype) # -> int64
So, the categorical dtype is lost, although the two sets are based on the identical categories (even the same instance).
EDIT (added version output):
|
Try on master; there has been a lot of work on this recently. Please open a new issue as well.
|
This looks to work on master now. Could use a test
|
Issues:
- Index.difference and symmetric_difference raise for mixed types (solved by BUG/PERF: Sort mixed-int in Py3, fix Index.difference #13514)
- Index._get_consensus_name incorrectly returns an empty index (solved by BUG/PERF: Sort mixed-int in Py3, fix Index.difference #13514)
- Index.union for non-unique indexes (see comment)
- Index.intersection for non-unique indexes (same comment)
Index.difference and symmetric_difference raise for mixed types
Code Sample, a copy-pastable example if possible
But union and intersection work:
Expected Output
output of pd.show_versions()