ENH: Add sym_diff for index #6016

Merged
merged 1 commit into from
Feb 17, 2014

Conversation

TomAugspurger
Contributor

Close #5543

If there's any interest for this in 0.14, here's the code.

Thoughts on the method name? A regular difference is diff so I want to be consistent with that. I figured it would be a bit weird to truncate difference to diff but not truncate symmetric to sym (or symm) so I went with sym_diff (python sets use symmetric_difference).

Also when writing the tests, I discovered that NaNs are even weirder than I thought:

In [20]: idx = pd.Index([np.nan, 1, np.nan])

In [21]: sorted(set(idx))
Out[21]: [nan, 1.0, nan]

Seems like neither set nor sorted did anything. Anyway my PR should be consistent with this weirdness.
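The behaviour above is consistent with how Python treats NaN in general: NaN never compares equal (not even to itself), so separate NaN objects are kept apart by set, and every `<` comparison involving NaN is False, so sorted sees this input as already ordered. A minimal plain-Python sketch, assuming each NaN pulled out of the index is a separate float object (the variable names are illustrative only, and no pandas is involved):

```python
import math

# Two separate NaN objects, as you might get when iterating over an array.
nan_a = float("nan")
nan_b = float("nan")

# NaN never compares equal, not even to itself.
assert nan_a != nan_a

# set() checks identity first, then equality, so distinct NaN objects both survive...
distinct = {nan_a, 1.0, nan_b}
# ...while the very same object is stored only once.
same = {nan_a, 1.0, nan_a}

# Every "<" comparison with NaN is False, so sorted() leaves this order alone.
ordered = sorted([nan_a, 1.0, nan_b])
print(len(distinct), len(same), ordered)
```
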

@ghost

ghost commented Jan 29, 2014

Index has set algebra operations, including union and difference. We could do this.
At least diff does not assume other is an Index, so this shouldn't either.

@TomAugspurger
Contributor Author

Good call on not requiring other to be an Index. I've also updated the diff docstring to say that it accepts an array like object.
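As a rough illustration of what "accepts an array like object" can mean, here is a minimal sketch in plain Python. `sym_diff_sketch` is a hypothetical helper, not the code in this PR; it only assumes that both inputs are iterables of hashable items:

```python
def sym_diff_sketch(left, other):
    """Return the items found in exactly one of the two inputs.

    Hypothetical helper, not the pandas implementation: `other` may be
    any iterable (list, tuple, set, ...), mirroring the idea that the
    argument need not already be an Index.
    """
    left_items = list(left)
    other_items = list(other)
    left_set, other_set = set(left_items), set(other_items)
    # Keep the original order: first what is only in left, then only in other.
    only_left = [x for x in left_items if x not in other_set]
    only_other = [x for x in other_items if x not in left_set]
    return only_left + only_other


print(sym_diff_sketch([1, 2, 3, 4], (2, 3, 4, 5)))
```

Coercing the argument up front keeps the rest of the function independent of the caller's container type.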

Yahoo network errors are causing the failure. I'll retry later.

@jreback
Contributor

jreback commented Jan 29, 2014

@TomAugspurger if you rebase on master the network errors should go away...@y-p fixed this

@TomAugspurger
Contributor Author

I guess I forgot to fetch before rebasing. I'll throw in a release notes entry once the .14 cycle starts.

@jorisvandenbossche
Member

@TomAugspurger Two ideas for the documentation, from someone not so familiar with set terminology:

  • Could you add a short explanation of what the function does, apart from "(symmetric) set difference"? E.g. something like "the elements in the index that are not in other" and "the elements that are in either set but not in their intersection"
  • And maybe you could also add a small example with output in the docstring, so you can see at a glance what it does.
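For readers less familiar with the terminology, the two definitions can be checked with Python's built-in sets. This is only an illustration with made-up values, not the docstring example that was added to the PR:

```python
left = {1, 2, 3, 4}
other = {2, 3, 4, 5}

# Difference: elements of left that are not in other.
difference = left - other

# Symmetric difference: elements in either set, but not in both.
sym_difference = left.symmetric_difference(other)

# Two equivalent ways of writing the same thing.
assert sym_difference == (left | other) - (left & other)
assert sym_difference == (left - other) | (other - left)

print(difference, sym_difference)
```
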

@jreback
Contributor

jreback commented Jan 29, 2014

@TomAugspurger FYI, you should enable the operator '^'

e.g. s ^ t is a standard symmetric diff op

see here: http://docs.python.org/2/library/sets.html (you define __xor__ to do this)
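A minimal sketch of the idea, assuming a hypothetical `TinySet` container (not pandas code): defining `__xor__` lets `s ^ t` delegate to the named method, the same way built-in sets support both `^` and `symmetric_difference`.

```python
class TinySet:
    """Hypothetical stand-in for an index-like container."""

    def __init__(self, values):
        self.values = list(values)

    def sym_diff(self, other):
        # Accept another TinySet or any plain iterable of hashable items.
        other_values = other.values if isinstance(other, TinySet) else list(other)
        return TinySet(sorted(set(self.values) ^ set(other_values)))

    def __xor__(self, other):
        # `self ^ other` simply forwards to the named method.
        return self.sym_diff(other)


result = TinySet([1, 2, 3]) ^ [2, 3, 4]
print(result.values)
```
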

@TomAugspurger
Contributor Author

Thanks. @jorisvandenbossche added more to notes and an example.

@jreback I didn't know that python used ^ at all. Cool.

@jorisvandenbossche
Member

@TomAugspurger Thanks for the clarifications!

@jreback
Contributor

jreback commented Feb 16, 2014

@TomAugspurger looks good...pls add release notes / maybe add that you can do this in the indexing section (where the description of the Index internal methods is - near the bottom)

update release and docs
@TomAugspurger
Contributor Author

@jreback rebased and added notes and docs. Should be good to go.

jreback added a commit that referenced this pull request Feb 17, 2014
@jreback jreback merged commit 026ea8c into pandas-dev:master Feb 17, 2014
@jreback
Contributor

jreback commented Feb 17, 2014

gr8!

pls review the release notes/v0.14.0 after this is built (this is built when master processes this PR): http://pandas-docs.github.io/pandas-docs-travis/

thanks!

@TomAugspurger
Contributor Author

Will do. Thanks.

@jreback
Contributor

jreback commented Feb 17, 2014

@TomAugspurger

this test fails on python 3.4 (only tested on windows atm)

any ideas?

 ======================================================================
FAIL: test_symmetric_diff (pandas.tests.test_index.TestIndex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-3.4\pandas\tests\test_index.py", line 501, in test_symmetric_diff
    self.assert_(pd.isnull(result[nans]).all())
AssertionError: False is not true

----------------------------------------------------------------------
Ran 4771 tests in 139.262s

FAILED (SKIP=174, failures=1)

C:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win32-3.4 [master]> cd ..
C:\Users\Jeff Reback\Documents\GitHub\pandas\build [master]> cd ..
C:\Users\Jeff Reback\Documents\GitHub\pandas [master]> C:\python34-32\Scripts\nosetests.exe .\build\lib.win32-3.4\pandas\tests\test_index.py --pdb --pdb-failure
..........................................................> c:\python34-32\lib\unittest\case.py(651)assertTrue()
-> raise self.failureException(msg)
(Pdb) u
> c:\python34-32\lib\unittest\case.py(1287)deprecated_func()
-> return original_func(*args, **kwargs)
(Pdb) u
> c:\users\jeff reback\documents\github\pandas\build\lib.win32-3.4\pandas\tests\test_index.py(501)test_symmetric_diff()
-> self.assert_(pd.isnull(result[nans]).all())
(Pdb) l
496             idx1 = Index([1, 2, np.nan])
497             idx2 = Index([0, 1, np.nan])
498             result = idx1.sym_diff(idx2)
499             expected = Index([0.0, np.nan, 2.0, np.nan])  # oddness with nans
500             nans = pd.isnull(expected)
501  ->         self.assert_(pd.isnull(result[nans]).all())
502             self.assert_(tm.equalContents(result[~nans], expected[~nans]))
503
504             # other not an Index:
505             idx1 = Index([1, 2, 3, 4], name='idx1')
506             idx2 = np.array([2, 3, 4, 5])
(Pdb) p nans
array([False,  True, False,  True], dtype=bool)
(Pdb) p result
Float64Index([0.0, nan, nan, 2.0], dtype='object')
(Pdb) p result[nans]
Float64Index([nan, 2.0], dtype='object')
(Pdb) p idx1
Float64Index([1.0, 2.0, nan], dtype='object')
(Pdb) p idx2
Float64Index([0.0, 1.0, nan], dtype='object')
(Pdb) p idx1.sym_diff(idx2)
Float64Index([0.0, nan, nan, 2.0], dtype='object')
(Pdb)
Float64Index([0.0, nan, nan, 2.0], dtype='object')
(Pdb)
Float64Index([0.0, nan, nan, 2.0], dtype='object')
(Pdb)
Float64Index([0.0, nan, nan, 2.0], dtype='object')
(Pdb)
Float64Index([0.0, nan, nan, 2.0], dtype='object')
(Pdb) p idx1
Float64Index([1.0, 2.0, nan], dtype='object')
(Pdb) p idx2
Float64Index([0.0, 1.0, nan], dtype='object')
(Pdb) p idx1.sym_diff(idx2)
Float64Index([0.0, nan, nan, 2.0], dtype='object')
(Pdb)
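
One reading of the session above (not a confirmed diagnosis) is that the assertion assumes the NaNs in result sit at the same positions as in expected, while here they come back in a different order. The sketch below replays the printed values with plain Python lists and contrasts that positional check with an order-independent one; `split_nans` is a hypothetical helper, not part of the test suite:

```python
import math

# Values copied from the pdb session above.
result = [0.0, float("nan"), float("nan"), 2.0]
expected = [0.0, float("nan"), 2.0, float("nan")]

# Positional check, as in the failing assertion: look at result in expected's NaN slots.
nan_slots = [math.isnan(x) for x in expected]
picked = [r for r, is_nan in zip(result, nan_slots) if is_nan]
positional_ok = all(math.isnan(x) for x in picked)


def split_nans(values):
    """Return (number of NaNs, sorted non-NaN values)."""
    nan_count = sum(1 for x in values if math.isnan(x))
    rest = sorted(x for x in values if not math.isnan(x))
    return nan_count, rest


# Order-independent alternative: same NaN count, same non-NaN contents.
order_free_ok = split_nans(result) == split_nans(expected)
print(positional_ok, order_free_ok)
```
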

@TomAugspurger
Contributor Author

Strange. I'll get a 3.4 env setup and see what's going on.

@jreback
Contributor

jreback commented Feb 17, 2014

note...that we don't test on travis 3.4

Here's my setup of 34-32 (34-64 fails too)
this is just FYI...I think the key is numpy 1.8/3.4

C:\Builds>c:/python27-64/python.exe c:/Builds/check_and_build.py -b 34-32 -v
2014-02-17 14:46:53,487: 34-32 :
2014-02-17 14:46:53,487: 34-32 : INSTALLED VERSIONS
2014-02-17 14:46:53,489: 34-32 : ------------------
2014-02-17 14:46:53,489: 34-32 : commit: None
2014-02-17 14:46:53,489: 34-32 : python: 3.4.0.candidate.1
2014-02-17 14:46:53,489: 34-32 : python-bits: 32
2014-02-17 14:46:53,489: 34-32 : OS: Windows
2014-02-17 14:46:53,489: 34-32 : OS-release: 7
2014-02-17 14:46:53,489: 34-32 : machine: AMD64
2014-02-17 14:46:53,490: 34-32 : processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
2014-02-17 14:46:53,490: 34-32 : byteorder: little
2014-02-17 14:46:53,490: 34-32 : LC_ALL: None
2014-02-17 14:46:53,490: 34-32 : LANG: None
2014-02-17 14:46:53,490: 34-32 :
2014-02-17 14:46:53,490: 34-32 : pandas: None
2014-02-17 14:46:53,490: 34-32 : Cython: 0.20.1
2014-02-17 14:46:53,490: 34-32 : numpy: 1.8.0
2014-02-17 14:46:53,490: 34-32 : scipy: 0.13.3
2014-02-17 14:46:53,490: 34-32 : statsmodels: 0.5.0
2014-02-17 14:46:53,490: 34-32 : IPython: None
2014-02-17 14:46:53,490: 34-32 : sphinx: None
2014-02-17 14:46:53,490: 34-32 : patsy: 0.2.1
2014-02-17 14:46:53,490: 34-32 : scikits.timeseries: None
2014-02-17 14:46:53,490: 34-32 : dateutil: 2.2
2014-02-17 14:46:53,492: 34-32 : pytz: 2013.9
2014-02-17 14:46:53,492: 34-32 : bottleneck: 0.8.0
2014-02-17 14:46:53,492: 34-32 : tables: 3.1.0
2014-02-17 14:46:53,492: 34-32 : numexpr: 2.3
2014-02-17 14:46:53,492: 34-32 : matplotlib: 1.3.1
2014-02-17 14:46:53,492: 34-32 : openpyxl: 1.8.3
2014-02-17 14:46:53,492: 34-32 : xlrd: 0.9.2
2014-02-17 14:46:53,493: 34-32 : xlwt: None
2014-02-17 14:46:53,493: 34-32 : xlsxwriter: 0.5.2
2014-02-17 14:46:53,493: 34-32 : lxml: None
2014-02-17 14:46:53,493: 34-32 : bs4: None
2014-02-17 14:46:53,493: 34-32 : html5lib: None
2014-02-17 14:46:53,493: 34-32 : bq: None
2014-02-17 14:46:53,493: 34-32 : apiclient: None
2014-02-17 14:46:53,493: 34-32 : rpy2: None
2014-02-17 14:46:53,493: 34-32 : sqlalchemy: 0.9.2
2014-02-17 14:46:53,494: 34-32 : pymysql: None
2014-02-17 14:46:53,494: 34-32 : psycopg2: None

@sinhrks
Member

sinhrks commented Oct 31, 2015

@TomAugspurger Do you remember the background of adding the result_name kw? Other set ops don't have the kw.

Should be added to others if it's useful, otherwise deprecate?

@TomAugspurger
Contributor Author

Huh, I have no idea. I don't know that I've ever used it, so I'm not sure why I included that.

@TomAugspurger TomAugspurger deleted the symmetric-difference branch May 15, 2017 21:15
Labels
API Design, Enhancement, Indexing (related to indexing on series/frames, not to indexes themselves)
Development

Successfully merging this pull request may close these issues.

ENH: symmetric_difference for Index
4 participants