PERF: Performance regression (memory and time) on set_index method between 0.20.3 and 0.24.2 #26108
If you could provide a minimal code sample it would be easier to take a look: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
Sure!

```python
import numpy as np
import pandas as pd
from memory_profiler import memory_usage

# Function whose memory usage we record
def set_index_on_all_cols(data_frame: pd.DataFrame):
    return data_frame.set_index(data_frame.columns.tolist())

df_size = int(1e7)
unique_len = int(1e5)

# New data frame
df = pd.DataFrame(
    data={
        "col1": np.tile(np.arange(unique_len), df_size // unique_len),
        "col2": np.tile(np.arange(unique_len), df_size // unique_len),
        "col3": np.tile(np.arange(unique_len), df_size // unique_len),
        "col4": np.tile(np.arange(unique_len), df_size // unique_len),
    }
)

mem = memory_usage((set_index_on_all_cols, (df,)), interval=1e-7)
print("Min mem usage {0}".format(min(mem)))
print("Max mem usage {0}".format(max(mem)))
print("Diff mem usage {0}".format(max(mem) - min(mem)))
```

This gives me, for pandas 0.20.3: *(memory figures not reproduced here)*
And here is the code for tracking the time spent on indexing:

```python
import numpy as np
import pandas as pd

df_size = int(1e7)
unique_len = int(1e5)

# New data frame
df = pd.DataFrame(
    data={
        "col1": np.tile(np.arange(unique_len), df_size // unique_len),
        "col2": np.tile(np.arange(unique_len), df_size // unique_len),
        "col3": np.tile(np.arange(unique_len), df_size // unique_len),
        "col4": np.tile(np.arange(unique_len), df_size // unique_len),
    }
)

%timeit idf = df.set_index(df.columns.tolist())
```

For pandas 0.20.3: 3.25 s
After diving into the pandas code, I found that this large time and memory overhead happens when we access the `_engine` property of the pandas `Index` classes. Only the first access is affected, because the result is cached once it has been computed. The `set_index` method triggers this property, but we can avoid the call by creating the index with `MultiIndex.from_arrays` and assigning it to the data frame. So I changed my tests to separate index creation from index engine creation (the first access to `_engine`). For the memory usage diff:

```python
import numpy as np
import pandas as pd
from memory_profiler import memory_usage

# Functions whose memory usage we record
def set_index_on_all_cols(data_frame: pd.DataFrame):
    df = data_frame.copy()
    columns = df.columns.tolist()
    arrays = [df[col] for col in columns]
    index = pd.MultiIndex.from_arrays(arrays=arrays, names=columns)
    df.index = index
    return df.drop(labels=columns, axis=1)

def access_to_engine(data_frame: pd.DataFrame):
    engine = data_frame.index._engine
    return engine

df_size = int(1e7)
unique_len = int(1e5)

# New data frame
df = pd.DataFrame(
    data={
        "col1": np.tile(np.arange(unique_len), df_size // unique_len),
        "col2": np.tile(np.arange(unique_len), df_size // unique_len),
        "col3": np.tile(np.arange(unique_len), df_size // unique_len),
        "col4": np.tile(np.arange(unique_len), df_size // unique_len),
    }
)

print("INDEX CREATION")
mem1, res = memory_usage((set_index_on_all_cols, (df,)), interval=1e-8, retval=True)
print("Min mem usage {0} MB".format(min(mem1)))
print("Max mem usage {0} MB".format(max(mem1)))
print("Diff mem usage {0} MB".format(max(mem1) - min(mem1)))

print("INDEX ENGINE CREATION")
mem2, engine = memory_usage((access_to_engine, (res,)), interval=1e-8, retval=True)
print("Min mem usage {0} MB".format(min(mem2)))
print("Max mem usage {0} MB".format(max(mem2)))
print("Diff mem usage {0} MB".format(max(mem2) - min(mem2)))
```

Output for pandas 0.20.3 and 0.24.2: *(figures not reproduced here)*
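The caching behaviour described above can be checked directly. This is only illustrative: `_engine` is a private, version-dependent attribute, so the sketch below relies on internals that may change between releases.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10)})
idx = pd.MultiIndex.from_arrays([df["a"], df["b"]], names=["a", "b"])

# The engine is built lazily: the first access does all the work...
first = idx._engine
# ...and later accesses return the cached object unchanged.
second = idx._engine
assert first is second
```

This is why only the first `set_index` (or first lookup) on a given index pays the cost; once the engine exists, subsequent operations reuse it.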
And the same for time spent:

```python
import numpy as np
import pandas as pd

def set_index_on_all_cols(data_frame: pd.DataFrame):
    df = data_frame.copy()
    columns = df.columns.tolist()
    arrays = [df[col] for col in columns]
    index = pd.MultiIndex.from_arrays(arrays=arrays, names=columns)
    df.index = index
    return df.drop(labels=columns, axis=1)

df_size = int(1e7)
unique_len = int(1e5)

# New data frame
df = pd.DataFrame(
    data={
        "col1": np.tile(np.arange(unique_len), df_size // unique_len),
        "col2": np.tile(np.arange(unique_len), df_size // unique_len),
        "col3": np.tile(np.arange(unique_len), df_size // unique_len),
        "col4": np.tile(np.arange(unique_len), df_size // unique_len),
    }
)

print("INDEX CREATION")
%timeit idf = set_index_on_all_cols(df)
idf = set_index_on_all_cols(df)

print("INDEX ENGINE CREATION")
%time idf.index._engine
```

Output for pandas 0.20.3 and 0.24.2: *(figures not reproduced here)*
I've taken a quick look at this, and the "kink" in all those plots occurs when the engine overflows. The short-term fix would be to check whether you have any duplicate index levels. Any duplicate levels would be best left as columns in the current implementation (but I think we could be smart and de-dupe cheaply in the future).
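One cheap way to spot exactly-duplicated candidate levels before calling `set_index` is to compare column contents. This is a hypothetical helper sketch, not a pandas API:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [1, 2, 3, 4],    # exact duplicate of col1
    "col3": [10, 20, 10, 20],
})

# Keep only the first column of each group of identical columns;
# any duplicates found are better left as regular columns.
seen = {}
redundant = []
for col in df.columns:
    key = tuple(df[col])
    if key in seen:
        redundant.append(col)
    else:
        seen[key] = col

print(redundant)  # ['col2']
```

Columns listed in `redundant` add engine key space without adding any selectivity, so excluding them from the index avoids overflow for free.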
Thanks for your answer! Unfortunately, we do not have duplicate index levels in our data frame, but we do put a lot of columns into the index.
Hi @Ahrimanox - at 25 levels of depth, you can have at most 5 rows before you hit this performance/memory issue. Having a huge number of unique values is actually great, as you do want the index to be unique. However, once you have uniqueness, every extra level of the multiindex is just a burden. If there are levels you never select or join on, they should definitely be columns. |
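The "25 levels, at most 5 values" figure is consistent with 64-bit key packing: assuming each level contributes a factor equal to its number of distinct labels (my reading of the engine change in #19074, not a documented guarantee), the combined key space must fit in 2**64:

```python
# With k index levels of n distinct labels each, the packed key space
# is roughly n**k, which must fit in an unsigned 64-bit integer.
levels = 25

assert 5 ** levels < 2 ** 64   # 5 distinct values per level still fits
assert 6 ** levels > 2 ** 64   # 6 per level overflows the fast engine
```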
Given this issue is with an unsupported version of pandas now, going to close. |
Code Sample, a copy-pastable example if possible
Problem description
For a bit of context: I'm working on a mid-range ETL software built on pandas, and recently we decided to upgrade our pandas version to the latest one: 0.24.2.
We were working with version 0.20.3, and since we moved to the latest, the set_index method seems to be responsible for big performance losses in our software.
Setting an index now takes roughly twice as long as in the older version and uses more memory.
After researching indexing changes between these two versions, I found that indexing in pandas changed in version 0.23: #19074
I've made a bunch of tests to understand what could be the cause of this overhead.
One type of test I made shows a big difference between the two versions.
For this test, I used a data frame of 1e7 rows. The test consists of iteratively adding a new integer column to the data frame and then calling the set_index method with every column as an index column.
The code is shown above. I made some variations by changing the ratio of the number of unique values to the size of the column.
For each set_index call, I recorded min and max memory usage and the time spent on indexing.
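A scaled-down sketch of that procedure (sizes reduced from 1e7/1e5 so it runs in seconds; in the real test the timing came from %timeit and the memory from memory_profiler):

```python
import time
import numpy as np
import pandas as pd

df_size, unique_len = 10_000, 100  # scaled down from int(1e7) / int(1e5)

df = pd.DataFrame()
for i in range(1, 5):
    # Iteratively add a new integer column...
    df["col{0}".format(i)] = np.tile(np.arange(unique_len), df_size // unique_len)
    # ...then set the index on every column so far and time it.
    t0 = time.perf_counter()
    indexed = df.set_index(df.columns.tolist())
    elapsed = time.perf_counter() - t0
    print("{0} index cols: {1:.4f} s".format(i, elapsed))
```

At full scale, plotting `elapsed` against the number of index columns for both pandas versions produces the curves discussed below.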
First, I concluded that the time and memory usage difference between these two versions grows larger when we index on many columns.
Even if this test is non-representative, we clearly see that the difference increases with the number of columns we index on.
This may be also caused by :
If I reduce the number of unique values to 100, the memory usage peaks and time spent (for the two versions) are: *(figures not reproduced here)*
The above plot shows that, past some point, pandas seems to switch its indexing method. I'm not sure about that, but it looks that way.
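That switch is consistent with the engine falling back to a slower path once the packed key no longer fits in 64 bits. Here is a rough check based on my assumption about the internals (drawn from #19074, not a pandas API):

```python
import math

def fits_in_uint64(cardinalities):
    # Assumption: the MultiIndex engine packs one 64-bit key from the
    # per-level label codes, needing ceil(log2(c + 1)) bits per level.
    bits = sum(math.ceil(math.log2(c + 1)) for c in cardinalities)
    return bits <= 64

# Four levels of 1e5 unique values each need ~17 bits per level (68 total),
# which overflows; four levels of 100 unique values fit comfortably.
assert not fits_in_uint64([100_000] * 4)
assert fits_in_uint64([100] * 4)
```

This would explain why the kink appears only in the high-cardinality runs: with 100 unique values per column, the fast path is never abandoned.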
So, I tried a case more representative of our actual use cases (still with integer columns):
Same pattern for memory usage and time spent.
According to my tests, indexing in the newer pandas version takes a lot more memory than in the older version and takes more time to do the same thing. Even if recording memory could affect performance, the difference is still clearly visible.
This difference seems to grow with the number of unique values in each column.
Indexing is a feature we use a lot where I work: we often put 15-30 columns into the index and keep just a few (1-5) as regular columns. Some of the columns we index on may contain more than 10e4 unique values.
Maybe it is not good practice to index on so many columns, but it is really convenient for representing and explaining high-dimensional data, and selections on the data frame are also faster. Maybe we're doing it wrong?
Thanks in advance for your help and suggestions.
All tests were run in two identical environments, except for the pandas version.
I put the pd.show_versions() output of these two environments in the 'details' section.
For information, I used numpy+mkl 1.16.2, obtained from:
https://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS (0.24.2)
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 2008ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.2
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
None
INSTALLED VERSIONS (0.20.3)
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 2008ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.3
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: 1.1.0
xarray: None
IPython: 7.4.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.3.2
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None
None