PERF: Performance regression (memory and time) on set_index method between 0.20.3 and 0.24.2 #26108

Closed
Ahrimanox opened this issue Apr 16, 2019 · 9 comments
Labels
Performance Memory or execution speed performance

Comments

@Ahrimanox

Code Sample, a copy-pastable example if possible

# Imports assumed by all the examples in this issue; memory_usage comes from
# the memory_profiler package, and the %timeit/%time magics require IPython
import numpy as np
import pandas as pd
from memory_profiler import memory_usage

# Method used to set the index of a data frame (wrapped so that time and memory recording is possible)
def set_index(data_frame: pd.DataFrame, index_cols: list):
    time_res = %timeit -n 1 -r 1 -o data_frame.set_index(index_cols)
    return time_res

# Method that appends N integer columns and calls set_index while recording its memory usage
def insert_cols_and_index(data_frame: pd.DataFrame, col_number: int, values: np.ndarray):
    idf = data_frame.copy()
    index_cols = [None] * col_number
    for i in range(col_number):
        current_col_name = str(i)
        index_cols[i] = current_col_name
        idf[current_col_name] = values
    
    # Record memory usage and get spent time on indexing
    return memory_usage((set_index, (idf, index_cols)), interval=1e-7, retval=True)

# Values we'll use as the content of the new columns
df_size = int(1e7)
unique_len = int(1e7)
int_array = np.tile(np.arange(unique_len), df_size // unique_len)

# Create an empty pandas data frame
df = pd.DataFrame()

# Make iterative add of integer columns
# Set index on all data frame columns and record memory usage peaks
max_col_num = 14
min_mem_usages = np.zeros(max_col_num)
max_mem_usages = np.zeros(max_col_num)
time_usages = np.zeros(max_col_num)
for i in range(max_col_num):
    print("{0} iteration(s)".format(i+1))
    mem, time = insert_cols_and_index(df, i + 1, int_array)
    min_mem_usages[i] = min(mem)
    max_mem_usages[i] = max(mem)
    time_usages[i] = time.average

np.save("min_mem_usages_" + pd.__version__ + "_" + str(unique_len), min_mem_usages)
np.save("max_mem_usages_" + pd.__version__ + "_" + str(unique_len), max_mem_usages)
np.save("time_usages_" + pd.__version__ + "_" + str(unique_len), time_usages)

Problem description

For a bit of context: I'm working on a mid-range ETL software product based on pandas, and we recently decided to upgrade our pandas version to the latest one: 0.24.2.

We've been working with version 0.20.3, and since we moved to the latest one, the set_index method seems to be responsible for big performance losses in our software.

Setting the index now takes ~2 times as long compared with the older version and uses more memory.

After doing some research on indexing changes between these two versions, I found that indexing in pandas changed in version 0.23: #19074

I've run a bunch of tests to understand what could be the cause of this overhead.

One type of test I've run shows a big difference between the two versions.
For this test, I used a data frame of 1e7 rows. The test consists of iteratively adding a new integer column to the data frame and then calling the set_index method with every column as an index column.
The code is shown above. I've made some variations to it by changing the ratio: number of unique values / size of the column.
For each set_index call, I recorded the min and max memory usage and the time spent on indexing.

First, I concluded that the time and memory usage difference between these two versions becomes larger when we index on many columns.

df_size = int(1e7)
unique_len = int(1e7)
int_array = np.tile(np.arange(unique_len), df_size // unique_len)

[Plots: peak memory usage (mem_cmp) and indexing time (time_use) vs. number of index columns, 0.20.3 vs. 0.24.2]

Even if this test is not fully representative, we clearly see that the difference increases with the number of columns we index on.
This may also be driven by (see the sketch after this list):

  • the number of combinations of unique values across index levels (the product of the number of unique values in each column)
  • the number of unique values across index levels (the sum of the number of unique values in each column)
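
For reference, a minimal sketch (an addition for illustration, not part of the original report) of how these two quantities could be computed; df and index_cols are placeholders for your own data:

# Sketch: compute the two candidate complexity drivers for a set of index columns
import numpy as np
import pandas as pd

def index_complexity(df: pd.DataFrame, index_cols: list):
    uniques = [df[col].nunique() for col in index_cols]
    combinations = np.prod(uniques, dtype=np.float64)  # product across levels
    total_uniques = sum(uniques)                       # sum across levels
    return combinations, total_uniques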

If I reduce the number of unique values to 100, the memory usage peaks and time spent (for the two versions) are:

df_size = int(1e7)
unique_len = int(1e2)
int_array = np.tile(np.arange(unique_len), df_size // unique_len)

[Plots: mem_cmp and time_use for unique_len = 1e2]

The above plot suggests that after some point, pandas switches its indexing method. I'm not sure about that, but it looks like it.

So I've tried a more representative case for our use cases (but still with integer columns):

df_size = int(1e7)
unique_len = int(1e4)
int_array = np.tile(np.arange(unique_len), df_size // unique_len)

[Plots: mem_cmp and time_use for unique_len = 1e4]

Same pattern for memory usage and time spent.

According to my tests, indexing in the newer pandas version takes a lot more memory than in the older version, and takes more time to do the same thing. Even if recording memory could affect performance, the difference is still clearly visible.

This difference seems to grow with the number of unique values in each column.

Indexing is a feature we use a lot where I work: we often put 15-30 columns in the index and keep just a few (1-5) as regular columns. Some of the columns we index on may contain more than 10e4 unique values.
Maybe it is not good practice to index on so many columns, but it is really convenient for representing and explaining high-dimensional data, and selections in the data frame are also faster. Maybe we're doing it wrong?

Thanks in advance for your help or suggestions.
All tests were run in two identical environments, except for the pandas version.
I've put the pd.show_versions() output of these two environments in the 'details' section.
For information, I used numpy+mkl 1.16.2, obtained from:
https://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy

Output of pd.show_versions()

INSTALLED VERSIONS (0.24.2)

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 2008ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.2
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
None

INSTALLED VERSIONS (0.20.3)

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 2008ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: 1.1.0

xarray: None
IPython: 7.4.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.3.2
pymysql: None
psycopg2: None
jinja2: None
s3fs: None

pandas_gbq: None
pandas_datareader: None

None

@WillAyd
Member

WillAyd commented Apr 17, 2019

If you could provide a minimal code sample it would be easier to take a look:

http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@WillAyd added the "Needs Info" and "Performance" labels Apr 17, 2019
@Ahrimanox
Author

Ahrimanox commented Apr 18, 2019

Sure!
Here is a minimal example for tracking the memory usage difference.

# Function whose memory usage we record (imports as in the first example:
# numpy, pandas and memory_profiler.memory_usage)
def set_index_on_all_cols(data_frame: pd.DataFrame):
    return data_frame.set_index(data_frame.columns.tolist())

df_size = int(1e7)
unique_len = int(1e5)

# New data frame
df = pd.DataFrame(
    data={
        "col1": np.tile(np.arange(unique_len), df_size // unique_len),
        "col2": np.tile(np.arange(unique_len), df_size // unique_len),
        "col3": np.tile(np.arange(unique_len), df_size // unique_len),
        "col4": np.tile(np.arange(unique_len), df_size // unique_len)
    }
)

mem = memory_usage((set_index_on_all_cols, (df, )), interval=1e-7)
print("Min mem usage {0}".format(min(mem)))
print("Max mem usage {0}".format(max(mem)))
print("Diff mem usage {0}".format(max(mem) - min(mem)))

This gives me for pandas 0.20.3:
min = 1050 MB, max = 1569 MB and diff (max-min) = 519 MB
and with 0.24.2:
min = 1062 MB, max = 4852 MB and diff (max-min) = 3790 MB

@Ahrimanox
Author

Ahrimanox commented Apr 18, 2019

And here is the example for tracking the time spent on indexing:

df_size = int(1e7)
unique_len = int(1e5)

# New data frame
df = pd.DataFrame(
    data={
        "col1": np.tile(np.arange(unique_len), df_size // unique_len),
        "col2": np.tile(np.arange(unique_len), df_size // unique_len),
        "col3": np.tile(np.arange(unique_len), df_size // unique_len),
        "col4": np.tile(np.arange(unique_len), df_size // unique_len)
    }
)


%timeit idf = df.set_index(df.columns.tolist())

For pandas 0.20.3: 3.25 s
with 0.24.2: 7.35 s

@Ahrimanox
Author

Ahrimanox commented May 9, 2019

After diving into the pandas code, I've found that this huge time and memory overhead happens when we access the _engine property of the pandas Index classes. This only concerns the first access to the property, because its result is cached once it has been computed.

This property is used by the pandas set_index method, but we can avoid triggering it if we create our index with MultiIndex.from_arrays and assign it to the data frame.

So I've changed my tests to separate index creation from index engine creation (the first access to the _engine property).

For the memory usage difference:

# Functions whose memory usage we record
def set_index_on_all_cols(data_frame: pd.DataFrame):
    df = data_frame.copy()
    columns = df.columns.tolist()
    arrays = [df[col] for col in columns]
    
    index = pd.MultiIndex.from_arrays(arrays=arrays, names=columns)
    df.index = index
    
    return df.drop(labels=columns, axis=1)

def access_to_engine(data_frame: pd.DataFrame):
    engine = data_frame.index._engine
    return engine

df_size = int(1e7)
unique_len = int(1e5)
    
# New data frame
df = pd.DataFrame(
    data={
        "col1": np.tile(np.arange(unique_len), df_size // unique_len),
        "col2": np.tile(np.arange(unique_len), df_size // unique_len),
        "col3": np.tile(np.arange(unique_len), df_size // unique_len),
        "col4": np.tile(np.arange(unique_len), df_size // unique_len)
    }
)

print("INDEX CREATION")
mem1, res = memory_usage((set_index_on_all_cols, (df, )), interval=1e-8, retval=True)
print("Min mem usage {0} MB".format(min(mem1)))
print("Max mem usage {0} MB".format(max(mem1)))
print("Diff mem usage {0} MB".format(max(mem1) - min(mem1)))

print("INDEX ENGINE CREATION")
mem2, engine = memory_usage((access_to_engine, (res, )), interval=1e-8, retval=True)
print("Min mem usage {0} MB".format(min(mem2)))
print("Max mem usage {0} MB".format(max(mem2)))
print("Diff mem usage {0} MB".format(max(mem2) - min(mem2)))

Output for pandas 0.20.3:
INDEX CREATION
Min mem usage 247.7890625 MB
Max mem usage 876.41796875 MB
Diff mem usage 628.62890625 MB
INDEX ENGINE CREATION
Min mem usage 413.9609375 MB
Max mem usage 413.9609375 MB
Diff mem usage 0.0 MB

Output for pandas 0.24.2:
INDEX CREATION
Min mem usage 251.66796875 MB
Max mem usage 759.203125 MB
Diff mem usage 507.53515625 MB
INDEX ENGINE CREATION
Min mem usage 418.99609375 MB
Max mem usage 3971.16015625 MB
Diff mem usage 3552.1640625 MB

@Ahrimanox
Author

And the same for the time spent:

def set_index_on_all_cols(data_frame: pd.DataFrame):
    df = data_frame.copy()
    columns = df.columns.tolist()
    arrays = [df[col] for col in columns]
    
    index = pd.MultiIndex.from_arrays(arrays=arrays, names=columns)
    df.index = index
    
    return df.drop(labels=columns, axis=1)

df_size = int(1e7)
unique_len = int(1e5)

# New data frame
df = pd.DataFrame(
    data={
        "col1": np.tile(np.arange(unique_len), df_size // unique_len),
        "col2": np.tile(np.arange(unique_len), df_size // unique_len),
        "col3": np.tile(np.arange(unique_len), df_size // unique_len),
        "col4": np.tile(np.arange(unique_len), df_size // unique_len)
    }
)
print("INDEX CREATION")
%timeit idf = set_index_on_all_cols(df)

idf = set_index_on_all_cols(df)

print("INDEX ENGINE CREATION")
%time idf.index._engine

Output for pandas 0.20.3:
INDEX CREATION
2.36 s ± 36.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
INDEX ENGINE CREATION
Wall time: 0 ns

Output for pandas 0.24.2:
INDEX CREATION
1.96 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
INDEX ENGINE CREATION
Wall time: 5.12 s

@qwhelan
Contributor

qwhelan commented Jul 21, 2019

I've taken a quick look at this and the "kink" in all those plots occurs when the engine overflows uint64 and switches to pure Python ints, giving correctness at the cost of time and memory. My strong suspicion is that the 0.20 behavior is potentially incorrect due to such overflows.
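
To make the overflow point concrete, here is a rough sketch (an addition for illustration; the bit-packing rule is an approximation of what the engine does, not pandas API) of estimating whether a set of index levels still fits in a uint64:

# Sketch: estimate whether per-level integer codes can be packed into 64 bits,
# assuming roughly ceil(log2(n_uniques + 1)) bits per level
import math
import pandas as pd

def fits_in_uint64(df: pd.DataFrame, index_cols: list) -> bool:
    bits = sum(math.ceil(math.log2(df[col].nunique() + 1)) for col in index_cols)
    return bits <= 64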

The short-term fix would be to check whether you have any duplicative index levels. Any duplicate levels would be best left as a column in the current implementation (but I think we could be smart and de-dupe for cheap in the future); one way to detect them is sketched below.
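
A minimal sketch (an addition for illustration; "duplicative" is interpreted here as two columns carrying identical groupings) of spotting such levels by comparing factorized codes:

# Sketch: find pairs of index columns whose factorized codes are identical,
# i.e. levels that duplicate each other's grouping information
import itertools
import pandas as pd

def duplicate_levels(df: pd.DataFrame, index_cols: list):
    codes = {col: pd.factorize(df[col])[0] for col in index_cols}
    return [(a, b) for a, b in itertools.combinations(index_cols, 2)
            if (codes[a] == codes[b]).all()]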

@Ahrimanox
Author

Thanks for your answer!

Unfortunately, we do not have duplicated index levels in our data frame, but we do put a lot of columns into the data frame index: between 10 and 25 index levels in the average case. Some of these levels have a huge number of unique values.

@qwhelan
Contributor

qwhelan commented Aug 28, 2019

Hi @Ahrimanox - at 25 levels of depth, you can have at most 5 rows before you hit this performance/memory issue.

Having a huge number of unique values is actually great, as you do want the index to be unique. However, once you have uniqueness, every extra level of the MultiIndex is just a burden. If there are levels you never select or join on, they should definitely be columns (see the sketch below).
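
A minimal sketch (an addition for illustration, with placeholder column names) of demoting index levels that are never selected on back to regular columns with reset_index:

# Sketch: "a" alone already makes the index unique, so "b" and "c" add
# engine cost without helping lookups; move them back to columns
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(6), "b": [0, 1] * 3, "c": [0, 1, 2] * 2})
idf = df.set_index(["a", "b", "c"])
slim = idf.reset_index(level=["b", "c"])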

@mroeschke removed the "Needs Info" label Mar 8, 2020
@mroeschke
Member

Given this issue is with an unsupported version of pandas now, going to close.
