BUG: Index set operations issues #13432

Closed
2 of 5 tasks
pijucha opened this issue Jun 13, 2016 · 17 comments · Fixed by #41482
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@pijucha
Contributor

pijucha commented Jun 13, 2016

Issues:


Index.difference and symmetric_difference raise for mixed types

Code Sample, a copy-pastable example if possible

idx1 = pd.Index([0, 1, 'A', 'B'])
idx2 = pd.Index([0, 2, 'A', 'C'])

idx1.difference(idx2)
...
  File "/usr/local/lib64/python3.5/site-packages/pandas/indexes/base.py", line 1861, in difference
    theDiff = sorted(set(self) - set(other))
TypeError: unorderable types: str() < int()

idx1.symmetric_difference(idx2)
...
  File "/usr/local/lib64/python3.5/site-packages/pandas/indexes/base.py", line 1861, in difference
    theDiff = sorted(set(self) - set(other))
TypeError: unorderable types: str() < int()

But union and intersection work:

idx1.union(idx2)
Out[14]: Index([0, 1, 'A', 'B', 2, 'C'], dtype='object')
idx1.intersection(idx2)
Out[15]: Index([0, 'A'], dtype='object')

Expected Output

idx1.difference(idx2)
Out[]: Index([1, 'B'], dtype='object')

idx1.symmetric_difference(idx2)
Out[]: Index([1, 'B', 2, 'C'], dtype='object')

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: 4a6621fcaa9a3b98172b90a69de574ec94b108df
python: 3.5.1.final.0
python-bits: 64
OS: Linux
machine: x86_64
byteorder: little

pandas: 0.18.1
...
@sinhrks sinhrks added Bug Dtype Conversions Unexpected or buggy dtype conversions labels Jun 13, 2016
@max-sixty
Contributor

The issue is with the sorting:

In [8]: pd.Index([0, 1, 'A', 'B']).sort_values()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-cd271f2d641b> in <module>()
----> 1 pd.Index([0, 1, 'A', 'B']).sort_values()

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/indexes/base.py in sort_values(self, return_indexer, ascending)
   1565         Return sorted copy of Index
   1566         """
-> 1567         _as = self.argsort()
   1568         if not ascending:
   1569             _as = _as[::-1]

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/indexes/base.py in argsort(self, *args, **kwargs)
   1639         if result is None:
   1640             result = np.array(self)
-> 1641         return result.argsort(*args, **kwargs)
   1642 
   1643     def __add__(self, other):

TypeError: unorderable types: str() > int()

We could try removing the sorting

@jreback
Contributor

jreback commented Jun 13, 2016

These should be catching the TypeError as numpy doesn't handle this very well. You can slightly update the guarantee to sort if possible.

@pijucha
Contributor Author

pijucha commented Jun 13, 2016

I thought both difference functions could be rewritten in the style of union (sorting only if it is possible), especially since there is a performance issue, #12044.

BTW, not directly related but still in a set function.
There's an internal _get_consensus_name(), the purpose of which is not clear to me, and something is not quite right:

idx1 = pd.Index([1, 2], name='A')
idx2 = pd.Index([], name='B')
idx1.union(idx2)
Out[26]: Index([], dtype='object')
idx2.union(idx1)
Out[27]: Index([], dtype='object')

@jreback
Contributor

jreback commented Jun 13, 2016

Consensus means the names must agree, or one is None; otherwise the name is None.

changing sorting is a big deal, I suspect lots of things will break; further it provides a nice guarantee on the result set. These are essentially set operations into an ordered list.

@pijucha
Contributor Author

pijucha commented Jun 13, 2016

consensus

Yes, this is what is written in the docstring. But it should affect only the name, not the values of the output. Am I right?
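
To make the point concrete, here is a minimal pure-Python sketch in which name resolution and value computation are kept independent. `resolve_name` and `union_keep_values` are hypothetical helpers, not pandas functions, and the rule used (keep the name only when both sides agree) is a simplifying assumption about what the docstring intends.

```python
def resolve_name(left_name, right_name):
    # Assumed rule: keep the name only when both sides agree.
    return left_name if left_name == right_name else None


def union_keep_values(left, left_name, right, right_name):
    # Values are combined without looking at the names at all.
    values = list(left)
    seen = set(values)
    for item in right:
        if item not in seen:
            seen.add(item)
            values.append(item)
    return values, resolve_name(left_name, right_name)


values, name = union_keep_values([1, 2], "A", [], "B")
print(values, name)  # [1, 2] None
```

With this separation, a name mismatch can only change the resulting name; it can never empty the values.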

@pijucha
Contributor Author

pijucha commented Jun 13, 2016

I should have mentioned that groupby calls Index.difference internally and can trigger this exception:

df = pd.DataFrame([[0, 1, 2]], columns = ['A', 'B', 0])
df.groupby('A').sum()
Traceback (most recent call last):
  File "/usr/share/python3.5/site-packages/IPython/core/interactiveshell.py", line 3066, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-225-c27b3941f9c8>", line 1, in <module>
    df.groupby('A').sum()
  File "/usr/local/lib64/python3.5/site-packages/pandas/core/groupby.py", line 97, in f
    self._set_selection_from_grouper()
  File "/usr/local/lib64/python3.5/site-packages/pandas/core/groupby.py", line 469, in _set_selection_from_grouper
    self._group_selection = ax.difference(Index(groupers)).tolist()
  File "/usr/local/lib64/python3.5/site-packages/pandas/indexes/base.py", line 1861, in difference
    theDiff = sorted(set(self) - set(other))
TypeError: unorderable types: str() < int()

@jreback
Contributor

jreback commented Jun 14, 2016

@pijucha yeah, guess should guard a bit better against this. Note that this is a safe sorter that I wrote a while back here. Could/should just strip this out to a separate function (and use it here).
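
A safe sorter along these lines might look like the sketch below. It is not the pandas implementation: `safe_sort_sketch` is a hypothetical name, and placing non-strings before strings when a plain sort fails is an assumption chosen for illustration.

```python
def safe_sort_sketch(values):
    values = list(values)
    try:
        # Homogeneous input sorts normally.
        return sorted(values)
    except TypeError:
        # Mixed input in Python 3: sort strings and non-strings separately.
        # This can still raise if the non-strings are not mutually orderable.
        strings = sorted(v for v in values if isinstance(v, str))
        others = sorted(v for v in values if not isinstance(v, str))
        return others + strings


print(safe_sort_sketch([0, "A", 1, "B"]))  # [0, 1, 'A', 'B']
```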

@pijucha
Contributor Author

pijucha commented Jun 17, 2016

Bugs in union and intersection for non-unique indexes

# This is OK
pd.Index([1, 2, 2]).union(pd.Index([2, 3]))
Out[24]: Int64Index([1, 2, 2, 3], dtype='int64')

# but fails with a non-increasing rhs:
pd.Index([1, 2, 2]).union(pd.Index([3, 2]))
pandas.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

# lost 2
pd.Index([3, 2]).union(pd.Index([1, 2, 2]))
Out[26]: Int64Index([1, 2, 3], dtype='int64')

# too many 2's 
pd.Index([1, 2, 2]).union(pd.Index([2, 2, 3]))
Out[27]: Int64Index([1, 2, 2, 2, 3], dtype='int64')

# too many 2's
pd.Index([1, 2, 2]).intersection(pd.Index([2, 3]))
Out[30]: Int64Index([2, 2], dtype='int64')

# too many 2's
pd.Index([1, 2, 2]).intersection(pd.Index([2, 2, 3]))
Out[32]: Int64Index([2, 2, 2], dtype='int64')

I think these "too many 2's" are manifestations of a bug in the algorithms in pd.algos.*_join_indexer_*:

pd.algos.outer_join_indexer_int64(np.array([1, 2, 2]), np.array([2, 2, 3]))
Out[37]: 
(array([1, 2, 2, 2, 3]),
 array([ 0,  1,  1,  2, -1]),
 array([-1,  0,  1,  1,  2]))

@jreback
Contributor

jreback commented Jun 17, 2016

@pijucha yeah these duplicates are prob not tested much if at all.

@pijucha
Contributor Author

pijucha commented Jun 22, 2016

I have an almost ready PR that fixes a part of this issue, and also #12044 and #12814. But before I submit it I have some questions:

  1. Sortedness:
    If sorting is impossible, should Index.difference and symmetric_difference (a) return an unsorted output (keeping the original order), (b) return an unsorted output plus a warning, or (c) raise an exception?

    @jreback's previous comments indicate either (a) or (c), so I'm not sure.

  2. Uniqueness:
    Should Index.union and intersection behave like set or multi-set operations? I.e., should the output contain only unique values?

    If we want a multi-set behaviour then it might be significantly more work and I'd rather leave them unchanged (and buggy) for now.

  3. NaN's
    The current behaviour is as follows (all results contain nan):

idx1 = pd.Index([1, np.nan, 2])
idx2 = pd.Index([1, np.nan, 3])

idx1.union(idx2)
Out[262]: Float64Index([1.0, 2.0, 3.0, nan], dtype='float64')

idx1.intersection(idx2)
Out[263]: Float64Index([1.0, nan], dtype='float64')

idx1.difference(idx2)
Out[264]: Float64Index([nan, 2.0], dtype='float64')

idx1.symmetric_difference(idx2)
Out[265]: Float64Index([nan, 2.0, 3.0], dtype='float64')

I kept this unchanged. But if I had freedom then I'd probably remove nan either from the intersection (consistent with nan != nan) or from the differences (nan == nan).
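
The tension between the two conventions can be reproduced in plain Python, independent of pandas. In the sketch below, `is_nan` and `intersection_nan_equal` are hypothetical helpers written for illustration: built-in sets follow nan != nan for distinct NaN objects, while the helper treats every NaN as the same element.

```python
import math


def is_nan(value):
    return isinstance(value, float) and math.isnan(value)


def intersection_nan_equal(left, right):
    # Treat every NaN as the same element (the "nan == nan" convention).
    right_has_nan = any(is_nan(v) for v in right)
    right_values = {v for v in right if not is_nan(v)}
    result = []
    nan_added = False
    for v in left:
        if is_nan(v):
            if right_has_nan and not nan_added:
                result.append(v)
                nan_added = True
        elif v in right_values and v not in result:
            result.append(v)
    return result


a = [1.0, float("nan"), 2.0]
b = [1.0, float("nan"), 3.0]

# Plain sets follow nan != nan for distinct NaN objects, so NaN drops out.
print(set(a) & set(b))               # {1.0}
print(intersection_nan_equal(a, b))  # [1.0, nan]
```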


Update

For difference and symmetric_difference, the issues were solved as follows:

  1. Sort if possible, otherwise return an unsorted result.
  2. Always return unique Index.
  3. nan's are treated as any other elements (consistent with union and intersection).
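
The first two rules can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the code from the PR: the function names are hypothetical, plain lists stand in for Index objects, and NaN handling (rule 3) is left out.

```python
def _unique_in_order(items):
    # Rule 2: drop duplicates while keeping first-seen order.
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def _sort_if_possible(items):
    # Rule 1: sort when the elements are orderable, otherwise keep the order.
    try:
        return sorted(items)
    except TypeError:
        return items


def difference(left, right):
    right_set = set(right)
    return _sort_if_possible(_unique_in_order(x for x in left if x not in right_set))


def symmetric_difference(left, right):
    left_set, right_set = set(left), set(right)
    only_left = [x for x in left if x not in right_set]
    only_right = [x for x in right if x not in left_set]
    return _sort_if_possible(_unique_in_order(only_left + only_right))


print(difference([0, 1, "A", "B"], [0, 2, "A", "C"]))            # [1, 'B']
print(symmetric_difference([0, 1, "A", "B"], [0, 2, "A", "C"]))  # [1, 'B', 2, 'C']
```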

pijucha added a commit to pijucha/pandas that referenced this issue Jul 17, 2016
1. Added an internal `safe_sort` to safely sort mixed-integer
arrays in Python3.

2. Changed Index.difference and Index.symmetric_difference
in order to:
- sort mixed-int Indexes (pandas-dev#13432)
- improve performance (pandas-dev#12044)

3. Fixed DataFrame.join which raised in Python3 with mixed-int
non-unique indexes (issue with sorting mixed-ints, pandas-dev#12814)

4. Fixed Index.union returning an empty Index when one of
arguments was a named empty Index (pandas-dev#13432)
jreback pushed a commit that referenced this issue Jul 19, 2016
fixes some issues from #13432
closes #12044
closes #12814

Author: Piotr Jucha <pi.jucha@gmail.com>

Closes #13514 from pijucha/setop13432 and squashes the following commits:

3a96089 [Piotr Jucha] BUG/PERF: Sort mixed-int in Py3, fix Index.difference
@jreback
Contributor

jreback commented Jul 19, 2016

@pijucha so comment / update when you have a chance now that we merged #13514

@pijucha
Contributor Author

pijucha commented Jul 19, 2016

Another example of inconsistent sorting.

The output of union is usually sorted:

In [2]: pd.Index([0, 2, 1]).union([1])
Out[2]: Int64Index([0, 1, 2], dtype='int64')

But Python 2 is unpredictable with mixed-type indexes:

In [3]: mixed = pd.Index([0, 'a', 1])

# not sorted
In [4]: mixed.union([1, 2])          
Out[4]: Index([0, u'a', 1, 2], dtype='object')

# but this is sorted
In [5]: mixed.union([1])   
Out[5]: Index([0, 1, u'a'], dtype='object')

In [6]: pd.show_versions()
...
python: 2.7.11.final.0
pandas: 0.18.1+218.g506520b

(Python 3 sorts neither of these.)

@pijucha pijucha changed the title Index.difference and symmetric_difference raise for mixed types BUG: Index set operations issues Jul 19, 2016
@TomAugspurger
Contributor

There's an interaction between non-uniqueness and NaNs: The NaNs are dropped from the intersection if there's more than one:

In [19]: a = pd.Index([1, 2, float('nan')])

In [20]: a & pd.Index([2, np.nan])
Out[20]: Float64Index([2.0, nan], dtype='float64')

In [21]: a & pd.Index([2, np.nan, np.nan])
Out[21]: Float64Index([2.0], dtype='float64')

It's any duplicates in the RHS, not just NaN:

In [25]: a = pd.Index([1, 2, 2, float('nan')])

In [26]: a & pd.Index([2, 2, np.nan])
Out[26]: Float64Index([2.0, 2.0], dtype='float64')

In [27]: a & pd.Index([2, np.nan])
Out[27]: Float64Index([2.0, 2.0, nan], dtype='float64')

@TomAugspurger
Contributor

Proposal for duplicate set ops

intersection

In the output of A & B, each element should appear the minimum number of times it occurs in A and in B.

>>> pd.Index(['0', '0', '1', '1']).intersection(pd.Index(['0', '1', '1', '2']))
Index(['0', '1', '1'])

Union

In the output of A | B, each element should appear the maximum number of times it occurs in A or in B.

>>> pd.Index(['0', '0', '1', '1']).union(pd.Index(['0', '1', '1', '2']))
Index(['0', '0', '1', '1', '2'])

This matches the definitions on Wikipedia and in http://multiset.readthedocs.io/en/stable/
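
The same min/max-count semantics are available in the standard library through `collections.Counter`, which makes a convenient reference for the proposal. The sketch below uses plain lists rather than Index objects; `multiset_intersection` and `multiset_union` are hypothetical helper names, and sorting the result is an assumption made only to get a deterministic order.

```python
from collections import Counter


def multiset_intersection(a, b):
    # Counter & Counter keeps the minimum count of each element.
    return sorted((Counter(a) & Counter(b)).elements())


def multiset_union(a, b):
    # Counter | Counter keeps the maximum count of each element.
    return sorted((Counter(a) | Counter(b)).elements())


left = ["0", "0", "1", "1"]
right = ["0", "1", "1", "2"]

print(multiset_intersection(left, right))  # ['0', '1', '1']
print(multiset_union(left, right))         # ['0', '0', '1', '1', '2']
```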

@tscheburaschka

tscheburaschka commented Feb 26, 2021

Hello,
given the broad scope of the issue, I am not sure whether I should add to this or open a new ticket.
Since I'm already here, I'll add it.

When using MultiIndex with Categorical levels, I found that the Categorical dtype is lost during set operations like
difference, union, symmetric_difference or intersection. This does not happen with a flat categorical index.

The following code shows the effect:

import pandas as pd
pd.show_versions()

cat = pd.CategoricalDtype(categories=[1, 2, 3, 4])
df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 2, 1], 'b': [4, 3, 2, 3, 1, 2, 3]}, dtype=cat)

# flat categorical index, no MultiIndex
df_cat_ind = df.set_index('a', drop=False)
diff_ind = df_cat_ind.index.difference(df_cat_ind.head(3).index)
print(diff_ind.dtype)  # -> category

diff_ind = df_cat_ind.index.union(df_cat_ind.head(3).index)
print(diff_ind.dtype)  # -> category


# same set operations on MultiIndex, categorical gets lost
df_cat_multi_ind = df.set_index(['a', 'b'], drop=False)
# just make sure, we have a categorical in the first place
print(df_cat_multi_ind.index.get_level_values(0).dtype)  # -> category
print(df_cat_multi_ind.index.get_level_values(1).dtype)  # -> category


diff_multi_ind = df_cat_multi_ind.index.difference(df_cat_multi_ind.head(3).index)
print(diff_multi_ind.get_level_values(0).dtype)  # -> int64
print(diff_multi_ind.get_level_values(1).dtype)  # -> int64

diff_multi_ind = df_cat_multi_ind.index.union(df_cat_multi_ind.head(3).index)
print(diff_multi_ind.get_level_values(0).dtype)  # -> int64
print(diff_multi_ind.get_level_values(1).dtype)  # -> int64

So, the categorical dtype is lost, although the two sets are based on the identical categories (even the same instance).
Should this be fixed, or is the usage of categorical dtype in MultiIndex too far-fetched? I actually found it a very convenient design, since it allows for memory-efficient indexing of large data sets, and I expected the set operations to be particularly efficient on categorical dtypes.
For our usage scenario it would be important to know, whether we can look forward to a fix or if this design path is a dead end for the foreseeable future.
Thank you for the great work anyway!

EDIT (added version output):

INSTALLED VERSIONS
------------------
commit           : 7d32926db8f7541c356066dcadabf854487738de
python           : 3.8.7.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.14393
machine          : AMD64
processor        : Intel64 Family 6 Model 85 Stepping 0, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : de_DE.cp1252

pandas           : 1.2.2
numpy            : 1.20.1
pytz             : 2021.1
...

@jreback
Contributor

jreback commented Feb 26, 2021

Try on master; there has been a lot of work on this recently.

pls open a new issue as well

@mroeschke
Member

This looks to work on master now. Could use a test

In [33]: idx1 = pd.Index([0, 1, 'A', 'B'])
    ...: idx2 = pd.Index([0, 2, 'A', 'C'])
    ...:
    ...: idx1.difference(idx2)
Out[33]: Index([1, 'B'], dtype='object')

In [34]: idx1.symmetric_difference(idx2)
Out[34]: Index([1, 2, 'B', 'C'], dtype='object')

In [35]: pd.__version__
Out[35]: '1.3.0.dev0+1485.g6abb567cb1'

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Dtype Conversions Unexpected or buggy dtype conversions Index Related to the Index class or subclasses labels May 1, 2021
@mroeschke mroeschke modified the milestones: Contributions Welcome, 1.3 May 15, 2021
8 participants