Index.difference performance #12044
Comments
Can you show how `tmp` was created?
and to |
I've tried to implement this with string data and
Interesting, this is the first numpy setop that is actually fast. OK, sure, pull requests are welcome (including an asv benchmark).
xref #11279 note that you would have to use base forms for some of the index types (e.g. so might need to upgrade tests for this (in fact should consolidate all of the |
@Winand I had a go at speeding this up, in the issue @jreback referenced. I didn't get it over the finish line, please do take the torch! My understanding is that |
1. Added an internal `safe_sort` to safely sort mixed-integer arrays in Python3.
2. Changed Index.difference and Index.symmetric_difference in order to:
   - sort mixed-int Indexes (pandas-dev#13432)
   - improve performance (pandas-dev#12044)
3. Fixed DataFrame.join which raised in Python3 with mixed-int non-unique indexes (issue with sorting mixed-ints, pandas-dev#12814)
4. Fixed Index.union returning an empty Index when one of arguments was a named empty Index (pandas-dev#13432)
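To illustrate why mixed-integer sorting needs special handling in Python 3, here is a minimal sketch. The helper name `sort_mixed` and its "numbers first, then strings" ordering are assumptions made for illustration; this is not the `safe_sort` code from the commit.

```python
def sort_mixed(values):
    """Hypothetical helper: sort non-strings and strings separately,
    then place the non-strings first. Not the pandas implementation."""
    numbers = sorted(v for v in values if not isinstance(v, str))
    strings = sorted(v for v in values if isinstance(v, str))
    return numbers + strings


mixed = [3, "b", 1, "a", 2]

# In Python 3, comparing int with str raises TypeError, so a plain sort fails.
try:
    sorted(mixed)
except TypeError:
    plain_sort_fails = True
else:
    plain_sort_fails = False

ordered = sort_mixed(mixed)
print(plain_sort_fails, ordered)
```

Sorting each type group on its own avoids the cross-type comparison entirely, which is what makes a mixed-int result orderable at all.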
I need to append several big Series to a big categorical Series. Trying to update categories FAST, I've found out that `Index.difference` uses Python's `set`, which is slow when creating a LARGE set (I have up to 500k categories and 1.3M values). numpy's `setdiff1d` is more than an order of magnitude faster (for a datetime64 Categorical): Not so fast:
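As a minimal sketch of the comparison described above: the sizes, the integer data, and the variable names are assumptions, scaled down from the figures in the report, and this is not the original benchmark. It contrasts a Python `set` difference with `np.setdiff1d` on the same arrays and checks that both give the same result.

```python
import time

import numpy as np

rng = np.random.default_rng(0)

# Assumed, scaled-down stand-ins for the categories and the new values.
categories = np.arange(50_000)
values = rng.integers(0, 100_000, size=130_000)

# Set-based difference: builds two Python sets, then sorts the result.
t0 = time.perf_counter()
via_set = np.array(sorted(set(values.tolist()) - set(categories.tolist())))
t_set = time.perf_counter() - t0

# numpy difference: returns the sorted unique values not in `categories`.
t0 = time.perf_counter()
via_numpy = np.setdiff1d(values, categories)
t_numpy = time.perf_counter() - t0

print(f"set: {t_set:.4f}s  setdiff1d: {t_numpy:.4f}s")
print("same result:", np.array_equal(via_set, via_numpy))
```

The relative timings depend on dtype, array sizes, and library versions, so it is worth measuring on the real data (for example datetime64 values) before drawing conclusions.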