PERF: Slowdowns with .isin() on columns typed as np.uint64 #60098

Open
2 of 3 tasks
adrian17 opened this issue Oct 24, 2024 · 0 comments
Labels
isin (isin method), Performance (Memory or execution speed performance)

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd, numpy as np
data = pd.DataFrame({
    "uints": np.random.randint(10000, size=300000, dtype=np.uint64),
    "ints": np.random.randint(10000, size=300000, dtype=np.int64),
})

%timeit data["ints"].isin([1, 2]) # 3ms
%timeit data["ints"].isin(np.array([1, 2], dtype=np.int64)) # 3ms
%timeit data["ints"].isin(np.array([1, 2], dtype=np.uint64)) # 5ms
%timeit data["ints"].isin([np.int64(1), np.int64(2)]) # 3ms
%timeit data["ints"].isin([np.uint64(1), np.uint64(2)]) # 5ms

%timeit data["uints"].isin([1, 2]) # 14ms (!)
%timeit data["uints"].isin(np.array([1, 2], dtype=np.int64)) # 5ms
%timeit data["uints"].isin(np.array([1, 2], dtype=np.uint64)) # 3ms
%timeit data["uints"].isin([np.int64(1), np.int64(2)])  # 17ms (!)
%timeit data["uints"].isin([np.uint64(1), np.uint64(2)]) # 17ms (!)

With the older numpy==1.26.4 (the last release before 2.0), the last line is even worse: ~200ms.
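For anyone reproducing this outside IPython, where `%timeit` is not available, the same comparison can be sketched with the standard-library `timeit` module. The `bench` helper, the `candidates` labels, and the repeat counts are illustrative choices, not part of the original report, and absolute timings will vary by machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

N_ROWS = 300_000
data = pd.DataFrame({
    "uints": np.random.randint(10000, size=N_ROWS, dtype=np.uint64),
    "ints": np.random.randint(10000, size=N_ROWS, dtype=np.int64),
})

# The five kinds of `values` arguments compared in the report.
candidates = {
    "python ints": [1, 2],
    "int64 array": np.array([1, 2], dtype=np.int64),
    "uint64 array": np.array([1, 2], dtype=np.uint64),
    "list of np.int64": [np.int64(1), np.int64(2)],
    "list of np.uint64": [np.uint64(1), np.uint64(2)],
}


def bench(column, values, repeat=3, number=5):
    """Hypothetical helper: best observed time per isin() call, in ms."""
    timer = timeit.Timer(lambda: data[column].isin(values))
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1000


results = {
    (column, label): bench(column, values)
    for column in ("ints", "uints")
    for label, values in candidates.items()
}

for (column, label), ms in results.items():
    print(f"{column:>5} / {label:<18} {ms:8.2f} ms")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; it is not exactly what `%timeit` reports, so the numbers may differ slightly from the ones quoted above.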

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.10.12
python-bits : 64
OS : Linux
OS-release : 6.5.0-27-generic
Version : #28~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 15 10:51:06 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 2.1.2

Prior Performance

With pandas 1.4.4 and numpy 1.26.4, all the benchmarks above take 3-5ms (3ms when signedness matches, 5ms when it does not). So although upgrading numpy mitigates the worst (~200ms) case, this still looks like a 5x performance regression on the pandas side since 1.4.4.

I'm guessing the regression could be related to PR #46693, which landed in the 1.5.0 release.
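Until the regression is resolved, one possible workaround follows from the timings above: passing an ndarray whose dtype already matches the column was the fast case (3ms). The sketch below wraps that idea in a helper; `isin_matching_dtype` is a hypothetical name, not a pandas API, and it only casts when every value fits in the column's integer range, so negative values are never wrapped around into a uint64 column. Whether this avoids the slow path on other versions is an assumption that has not been verified beyond the numbers in this report.

```python
import numpy as np
import pandas as pd


def isin_matching_dtype(series: pd.Series, values) -> pd.Series:
    """Hypothetical helper: call isin() with values cast to the column's dtype.

    Only applies to plain NumPy integer columns and integer values;
    everything else is passed through to Series.isin unchanged.
    """
    arr = np.asarray(values)
    is_numpy_int_column = (
        isinstance(series.dtype, np.dtype) and series.dtype.kind in "iu"
    )
    if is_numpy_int_column and arr.dtype.kind in "iu" and arr.size > 0:
        info = np.iinfo(series.dtype)
        # Compare as Python ints to avoid mixed signed/unsigned promotion.
        if int(arr.min()) >= info.min and int(arr.max()) <= info.max:
            arr = arr.astype(series.dtype)
    return series.isin(arr)


uints = pd.Series(np.array([1, 2, 3, 10_000], dtype=np.uint64))
mask = isin_matching_dtype(uints, [1, 2])
print(mask.tolist())
```

The range check is the main design choice here: a blind `astype(np.uint64)` would turn `-1` into a very large positive number and could produce false matches, so out-of-range inputs are left for pandas to handle as usual.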

adrian17 added the Needs Triage and Performance labels on Oct 24, 2024.
rhshadrach added the isin label and removed the Needs Triage label on Oct 25, 2024.