Index.difference performance #12044

Closed
Winand opened this issue Jan 15, 2016 · 6 comments
Labels
Indexing (related to indexing on series/frames, not to indexes themselves), Performance (memory or execution speed performance)
Comments

@Winand
Contributor

Winand commented Jan 15, 2016

I need to append several big Series to a big categorical Series.
While trying to update the categories quickly, I found that Index.difference uses Python's set, which is slow to build for large inputs (I have up to 500k categories and 1.3M values).
numpy's setdiff1d is more than an order of magnitude faster (measured on a datetime64 Categorical):

tmp_unique = tmp.unique()
new_cats = pd.Index(np.setdiff1d(tmp_unique[~pd.isnull(tmp_unique)], to.cat.categories))  # np is numpy; the pd.np alias was removed in pandas 2.0

Not so fast:

new_cats = pd.Index(tmp_unique[~pd.isnull(tmp_unique)]).difference(to.cat.categories)
@jreback
Contributor

jreback commented Jan 15, 2016

Can you show the creation of tmp?

@jreback
Contributor

jreback commented Jan 15, 2016

And of to?

@Winand
Contributor Author

Winand commented Jan 15, 2016

I tried to reproduce this with string data, and there Index.difference performed better. But setdiff1d is much faster with datetime64[ns]:

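import time

import numpy as np
import pandas as pd

# Categorical Series with 1M datetime64[ns] categories (the string variant is kept in the trailing comments)
to = pd.Series(pd.DatetimeIndex(range(1000000))).astype('category')  # pd.Series(("hello%d" % i for i in range(1000000))).astype('category')
cats = to.cat.categories.values
tmp = pd.Series(pd.DatetimeIndex(range(1000000, 1200000)))  # pd.Series(("bye%d" % i for i in range(200000)))

# Unique non-null values that are candidates for new categories
tmp_unique = tmp.unique()
tmp_unique = tmp_unique[~pd.isnull(tmp_unique)]

# time.perf_counter replaces time.clock, which was removed in Python 3.8
start = time.perf_counter()
new_cats = pd.Index(tmp_unique).difference(cats)
print("Index.difference: %.3fs" % (time.perf_counter() - start))

start = time.perf_counter()
new_cats = pd.Index(np.setdiff1d(tmp_unique, cats))
print("np.setdiff1d: %.3fs" % (time.perf_counter() - start))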

-----
>>>Index.difference: 1.976s
>>>np.setdiff1d: 0.104s
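
For anyone reproducing this, it may help to first confirm that the two approaches return the same values before timing them. Below is a minimal sketch on small inputs; the helper name `new_categories_setdiff1d` and the sizes are assumptions made for illustration, not part of pandas.

```python
import numpy as np
import pandas as pd


def new_categories_setdiff1d(values, categories):
    """Hypothetical helper: unique non-null values that are not yet categories."""
    values = np.asarray(values)
    values = values[~pd.isnull(values)]
    # np.setdiff1d returns the sorted unique values of the first array
    # that are not present in the second.
    return pd.Index(np.setdiff1d(values, np.asarray(categories)))


# Small datetime64 inputs so the check runs quickly.
cats = pd.date_range("2016-01-01", periods=1000, freq="D").values
tmp = pd.date_range("2016-01-01", periods=1200, freq="D").values

via_numpy = new_categories_setdiff1d(tmp, cats)
via_index = pd.Index(tmp).difference(pd.Index(cats))

# True when both approaches agree on the new values.
same_values = via_numpy.equals(via_index)
```

Once the results agree, the timing comparison above is comparing like with like.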

@jreback
Contributor

jreback commented Jan 15, 2016

Interesting, this is the first numpy setop that is actually fast.

OK, sure, pull requests are welcome (including an asv benchmark).

jreback added the Indexing, Performance, and Difficulty Intermediate labels Jan 15, 2016
jreback added this to the Next Major Release milestone Jan 15, 2016
@jreback
Contributor

jreback commented Jan 15, 2016

xref #11279

Note that you would have to use the base forms for some of the index types (e.g. .values).

So the tests might need updating for this (in fact, we should consolidate all of the test_difference* tests in test_index.py and move them to Base for generic testing).
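
A rough sketch of what "using the base forms" could look like for a naive DatetimeIndex. The function `difference_via_values` is hypothetical and is not the pandas implementation; it only illustrates computing on `.values` and re-wrapping the result.

```python
import numpy as np
import pandas as pd


def difference_via_values(left, right):
    """Hypothetical sketch: set difference computed on the underlying arrays."""
    # .values exposes the base ndarray (datetime64 for a naive
    # DatetimeIndex), so numpy never sees boxed Timestamp objects.
    diff = np.setdiff1d(left.values, pd.Index(right).values)
    # Re-wrap the result; pandas infers the index type from the dtype.
    return pd.Index(diff, name=left.name)


left = pd.DatetimeIndex(["2016-01-03", "2016-01-01", "2016-01-02"], name="when")
right = pd.DatetimeIndex(["2016-01-02"])
result = difference_via_values(left, right)
```

Other index types would need more care than this: a tz-aware DatetimeIndex, for example, loses its timezone when going through `.values`, so it would have to be restored when re-wrapping.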

@max-sixty
Contributor

@Winand I had a go at speeding this up, in the issue @jreback referenced. I didn't get it over the finish line, please do take the torch!

My understanding is that set is actually very fast. The slow part of the current implementation is the boxing and unboxing of values: for indexes that have to convert each element, list(self) performs that conversion one element at a time. So if you can delegate to .values, then set should be reasonable. numpy may still be faster, so it is worth comparing apples to apples.
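
To make the boxing point concrete, here is a small sketch that runs a plain Python set difference on the unboxed int64 representation of a naive DatetimeIndex. The function `difference_via_int_set` is a hypothetical illustration, not code from pandas.

```python
import numpy as np
import pandas as pd


def difference_via_int_set(left, right):
    """Hypothetical sketch: Python set difference on unboxed int64 values."""
    dtype = left.values.dtype  # e.g. datetime64[ns]
    # View the datetime64 data as int64 so the sets hold plain integers
    # instead of one boxed Timestamp per element.
    left_ints = left.values.view("i8").tolist()
    right_ints = right.values.astype(dtype).view("i8").tolist()
    kept = sorted(set(left_ints) - set(right_ints))
    # Reinterpret the surviving integers as datetimes again.
    return pd.DatetimeIndex(np.array(kept, dtype="i8").view(dtype))


left = pd.date_range("2016-01-01", periods=5, freq="D")
right = pd.date_range("2016-01-02", periods=2, freq="D")
result = difference_via_int_set(left, right)
expected = left.difference(right)
```

Timing this against np.setdiff1d on the same `.values` arrays would be the apples-to-apples comparison.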

jreback modified the milestones: 0.18.2, Next Major Release Jun 27, 2016
pijucha added a commit to pijucha/pandas that referenced this issue Jul 17, 2016
1. Added an internal `safe_sort` to safely sort mixed-integer
arrays in Python3.

2. Changed Index.difference and Index.symmetric_difference
in order to:
- sort mixed-int Indexes (pandas-dev#13432)
- improve performance (pandas-dev#12044)

3. Fixed DataFrame.join which raised in Python3 with mixed-int
non-unique indexes (issue with sorting mixed-ints, pandas-dev#12814)

4. Fixed Index.union returning an empty Index when one of
arguments was a named empty Index (pandas-dev#13432)
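
The mixed-integer sorting problem mentioned in item 1 can be illustrated with plain Python. This is only a sketch of the idea: `sort_mixed` is a hypothetical helper, not the internal `safe_sort` from the commit, and placing numbers before strings is an assumption of this example.

```python
def sort_mixed(values):
    """Hypothetical helper: sort a list that mixes numbers and strings."""
    try:
        # Works when all elements are mutually comparable.
        return sorted(values)
    except TypeError:
        # Python 3 refuses to compare int with str, so sort each group
        # separately and place the numbers before the strings.
        numbers = sorted(v for v in values if not isinstance(v, str))
        strings = sorted(v for v in values if isinstance(v, str))
        return numbers + strings


mixed = [3, "b", 1, "a", 2]
result = sort_mixed(mixed)
```

In Python 2 the plain `sorted(mixed)` call succeeded, which is why the failure only shows up under Python 3.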