Possible performance regression in Series.xs() and DataFrame.xs() with MultiIndex. #35188

Closed
taozuoqiao opened this issue Jul 9, 2020 · 4 comments · Fixed by #46233
Labels
Closing Candidate (May be closeable, needs more eyeballs) · Indexing (Related to indexing on series/frames, not to indexes themselves) · MultiIndex · Performance (Memory or execution speed performance) · Regression (Functionality that used to work in a prior pandas version)

Comments

@taozuoqiao

After upgrading pandas from v0.25.3 to v1.0.5, Series.xs() appears to have regressed significantly when selecting a key on the first level of a large MultiIndex. DataFrame.xs() shows the same slowdown.
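For readers less familiar with xs(), here is a minimal sketch of the call pattern being timed, on a deliberately tiny index so the selected rows are easy to inspect. The size n = 3 and the variable names are illustrative assumptions, not part of the original report.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the large index used in the report (illustrative size).
n = 3
index = pd.MultiIndex.from_product([np.arange(n), -np.arange(n)])
data = pd.Series(np.arange(n**2), index=index)

# Select every row whose first-level key is 1; the first level is dropped.
first = data.xs(1, level=0)

# Select every row whose second-level key is -1; the second level is dropped.
second = data.xs(-1, level=1)

# The same rows can be picked with a boolean mask on the level values,
# which is a convenient cross-check of what xs() returns.
mask = index.get_level_values(0) == 1
via_mask = data[mask].droplevel(0)

print(first)
print(second)
print(first.equals(via_mask))
```

The two calls differ only in which level is matched, which is why the report times them separately.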

code:

import numpy as np
import pandas as pd
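n = 1e4  # a float, so np.arange(n) yields float64 level values
data = pd.Series(np.arange(n**2), index=pd.MultiIndex.from_product([np.arange(n), -np.arange(n)]))
# %timeit is an IPython magic; run these lines in IPython or Jupyter
%timeit data.xs(100, level=0)   # first level
%timeit data.xs(-100, level=1)  # second level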

outcome:

## 0.25.3
%timeit data.xs(100, level=0)
571 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit data.xs(-100, level=1)
193 ms ± 947 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## 1.0.5
%timeit data.xs(100, level=0)
683 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  

%timeit data.xs(-100, level=1)
197 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
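The timings above rely on IPython's %timeit. As a rough sketch, the same comparison can be run from a plain Python script with the standard-library timeit module, which makes it easier to repeat under different pandas versions. The helper name best_time and the reduced size n = 300 are assumptions chosen so the script finishes quickly; absolute numbers will not match the ones reported here.

```python
import timeit

import numpy as np
import pandas as pd

# Much smaller than the n = 1e4 used in the report, so this runs quickly.
n = 300
index = pd.MultiIndex.from_product([np.arange(n), -np.arange(n)])
data = pd.Series(np.arange(n**2), index=index)


def best_time(func, repeat=3, number=5):
    """Return the best per-call time in seconds for a zero-argument callable."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


first_level = best_time(lambda: data.xs(100, level=0))
second_level = best_time(lambda: data.xs(-100, level=1))

print(f"pandas {pd.__version__}")
print(f"xs(100, level=0):  {first_level * 1e6:.1f} µs")
print(f"xs(-100, level=1): {second_level * 1e6:.1f} µs")
```

Printing the pandas version next to the timings keeps results from different environments distinguishable.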
simonjayhawkins added the Performance and Regression labels Jul 9, 2020
@simonjayhawkins
Member

It appears #29840 is responsible for the slowdown here.

9dd1b50 is the first bad commit
commit 9dd1b50
Author: jbrockmendel <jbrockmendel@gmail.com>
Date: Fri Nov 29 15:10:02 2019 -0800

DEPR: remove FrozenNDarray (#29840)

cc @jbrockmendel

jbrockmendel added the Indexing label Sep 21, 2020
@lukemanley
Member

I think this can be closed.

## main

%timeit data.xs(100, level=0)
198 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit data.xs(-100, level=1)
82.3 ms ± 3.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

lukemanley added the Closing Candidate label Mar 4, 2022
@jreback
Contributor

jreback commented Mar 4, 2022

Yep. Do we have ASVs covering these cases?

@lukemanley
Member

> do we have covering asvs for these cases?

ASVs were added in the linked PR.
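For context, a benchmark covering both cases could look roughly like the sketch below, which follows the usual asv conventions (a class with params, a setup method, and time_ methods). The class name MultiIndexXs, the sizes, and the method names are hypothetical and are not taken from the linked PR.

```python
import numpy as np
import pandas as pd


class MultiIndexXs:
    # Hypothetical benchmark class; the name and sizes are illustrative only.
    params = [0, 1]
    param_names = ["level"]

    def setup(self, level):
        n = 300
        index = pd.MultiIndex.from_product([np.arange(n), -np.arange(n)])
        self.ser = pd.Series(np.arange(n**2), index=index)
        self.df = self.ser.to_frame("value")
        # Key 100 exists on level 0, key -100 exists on level 1.
        self.key = 100 if level == 0 else -100

    def time_series_xs(self, level):
        self.ser.xs(self.key, level=level)

    def time_frame_xs(self, level):
        self.df.xs(self.key, level=level)
```

Parametrizing on level means a regression on only one level, like the one reported in this issue, shows up as a separate result.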

jreback closed this as completed Mar 4, 2022
jreback added this to the 1.5 milestone Mar 4, 2022
jreback reopened this Mar 4, 2022

5 participants