df.filter(like='col_name') 2.975X slower than basic column list comprehension #5657

Closed
dragoljub opened this issue Dec 6, 2013 · 2 comments
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance

Comments

@dragoljub

I have found that using the filter method to select columns matching a string pattern is ~3x slower than a basic list comprehension over df.columns. I'm not sure how it's implemented under the hood, but for simple 'in' checks on many columns this could slow you down, depending on how often you filter.

import pandas as pd
import numpy as np

# Generate Test DataFrame
NUM_ROWS = 2000
NUM_COLS = 1000
col_names = ['A' + str(i) for i in range(NUM_COLS)]
df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
df['TEST'] = 0
df['TEST2'] = 0

%timeit df.filter(like='TEST')
1000 loops, best of 3: 1.19 ms per loop

%timeit df[[col for col in df.columns if 'TEST' in col]]
1000 loops, best of 3: 400 µs per loop

%time df.filter(like='TEST')
Wall time: 1 ms
Out[4]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 2 columns):
TEST     2000  non-null values
TEST2    2000  non-null values
dtypes: int64(2)

%time df[[col for col in df.columns if 'TEST' in col]]
Wall time: 1 ms
Out[5]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 2 columns):
TEST     2000  non-null values
TEST2    2000  non-null values
dtypes: int64(2)

pd.__version__
Out[7]: '0.12.0'

np.__version__
Out[8]: '1.7.1'
@jreback
Contributor

jreback commented Dec 6, 2013

This is not a simple 'in' check; it is actually matching a regex, so please compare against that (you can match with 'in' using isin). Put up that timing too.
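For reference, a minimal sketch of what a regex-based 'like' match could look like (hypothetical; not the actual pandas internals). With the pattern escaped, it selects the same columns as the substring list comprehension, but pays the cost of the regex machinery:

```python
import re

import numpy as np
import pandas as pd

def filter_like_regex(df, like):
    # Hypothetical sketch: escape the pattern and regex-search each
    # column name, rather than using a plain substring 'in' check.
    rx = re.compile(re.escape(like))
    return df[[c for c in df.columns if rx.search(c)]]

df = pd.DataFrame(np.zeros((3, 4)), columns=['A0', 'TEST', 'A1', 'TEST2'])
print(list(filter_like_regex(df, 'TEST').columns))  # ['TEST', 'TEST2']
```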

@dragoljub
Author

df.columns.isin() requires you to explicitly enumerate the list of names you want to check against, and it's still ~20% slower than the list comprehension. From what I can tell, if you are doing a column select that checks for a substring like 'TEST', the list comprehension is both faster and easier. If you do many selections like this, you may also pay the cost of loading the re library.

I'm not sure what most people use when filtering columns, but for me it's usually a simple check for whether a column name contains a substring. In that case, a 3x speedup from a simple implementation is nice to see. I'm not sure whether there is a way to detect that the pattern being passed is not a regex and, if so, use the faster code path.

%timeit df[df.columns[df.columns.isin(['TEST','TEST2'])]]
1000 loops, best of 3: 468 µs per loop

%timeit df[[col for col in df.columns if 'TEST' in col]]
1000 loops, best of 3: 378 µs per loop
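The dispatch idea suggested above could be sketched roughly like this. This is a hypothetical helper, not a pandas API: if the pattern contains no regex metacharacters, it takes the fast substring path; otherwise it falls back to re.search:

```python
import re

import numpy as np
import pandas as pd

# Characters that signal a "real" regex rather than a plain substring.
_REGEX_CHARS = set('.^$*+?{}[]()|\\')

def smart_filter(df, pattern):
    # Hypothetical helper: fast substring check when the pattern has no
    # regex metacharacters, full regex search otherwise.
    if _REGEX_CHARS.isdisjoint(pattern):
        cols = [c for c in df.columns if pattern in c]
    else:
        rx = re.compile(pattern)
        cols = [c for c in df.columns if rx.search(c)]
    return df[cols]

df = pd.DataFrame(np.zeros((2, 4)), columns=['A0', 'TEST', 'A1', 'TEST2'])
print(list(smart_filter(df, 'TEST').columns))   # substring path: ['TEST', 'TEST2']
print(list(smart_filter(df, '^A0$').columns))   # regex path: ['A0']
```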

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 4, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@datapythonista datapythonista modified the milestones: Contributions Welcome, No action Jul 6, 2018