PERF: faster pd.concat when same concat float dtype but misaligned axis #51419

Closed
topper-123 wants to merge 1 commit

Conversation

topper-123
Contributor

Faster concatenation for misaligned float DataFrames.
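
As a minimal sketch of the "misaligned" case this change targets (an illustration, not code from the PR): the frames share a single float dtype, but each is missing a column the other has, so pd.concat must reindex the column axis instead of stacking the blocks directly:

>>> import numpy as np
>>> import pandas as pd
>>> a = pd.DataFrame(np.ones((2, 2)), columns=["x", "y"], dtype="float64")
>>> b = pd.DataFrame(np.ones((2, 2)), columns=["y", "z"], dtype="float64")
>>> pd.concat([a, b])  # columns become the union; missing cells are NaN
     x    y    z
0  1.0  1.0  NaN
1  1.0  1.0  NaN
0  NaN  1.0  1.0
1  NaN  1.0  1.0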

>>> from itertools import product
>>> import numpy as np
>>> import pandas as pd
>>>
>>> def manual_concat(df_list: list[pd.DataFrame]) -> pd.DataFrame:
...     columns = [col for df in df_list for col in df.columns]
...     columns = list(dict.fromkeys(columns))
...     index = np.hstack([df.index.values for df in df_list])
...     df_list = [df.reindex(columns=columns) for df in df_list]
...     values = np.vstack([df.values for df in df_list])
...     # .dtypes.iloc[0] avoids deprecated positional access via dtypes[0]
...     return pd.DataFrame(values, index=index, columns=columns, dtype=df_list[0].dtypes.iloc[0])
>>>
>>> def compare_frames(df_list: list[pd.DataFrame]) -> None:
...     concat_df = pd.concat(df_list)
...     manual_df = manual_concat(df_list)
...     if not concat_df.equals(manual_df):
...         raise ValueError("different concatenations!")
>>>
>>> def make_dataframes(num_dfs, num_idx, num_cols, dtype=pd.Float32Dtype(), drop_column=False) -> list[pd.DataFrame]:
...     values = np.random.randint(-100, 100, size=[num_idx, num_cols])
...     index = [f"i{i}" for i in range(num_idx)]
...     columns = np.random.choice([f"c{i}" for i in range(num_cols)], num_cols, replace=False)
...     df = pd.DataFrame(values, index=index, columns=columns, dtype=dtype)
...
...     df_list = []
...     for i in range(num_dfs):
...         new_df = df.copy()
...         if drop_column:
...             label = new_df.columns[i]
...             new_df = new_df.drop(label, axis=1)
...         df_list.append(new_df)
...     return df_list
>>>
>>> test_data = [  # num_idx, num_cols, num_dfs
...     [100, 1_000, 3],
...     ]
>>> for i, (num_idx, num_cols, num_dfs) in enumerate(test_data):
...     print(f"\n{i}: {num_dfs=}, {num_idx=}, {num_cols=}")
...     df_list = make_dataframes(num_dfs, num_idx, num_cols, drop_column=False)
...     df_list_dropped = make_dataframes(num_dfs, num_idx, num_cols, drop_column=True)
...     print("manual:")
...     %timeit manual_concat(df_list)
...     compare_frames(df_list)
...     for use_dropped in [False, True]:
...         print(f"pd.concat: {use_dropped=}")
...         this_df_list = df_list if not use_dropped else df_list_dropped
...         %timeit pd.concat(this_df_list)
0: num_dfs=3, num_idx=100, num_cols=1000
manual:
746 µs ± 64.5 µs per loop  # main
714 µs ± 21.6 µs per loop  # this PR
pd.concat: use_dropped=False
290 µs ± 1.6 µs per loop  # main
287 µs ± 603 ns per loop  # this PR
pd.concat: use_dropped=True
23.2 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # main
1.11 ms ± 13.2 µs per loop  # this PR  <- this is the performance boost
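
For readers timing this outside IPython (not part of the original comment), a plain-timeit equivalent of the %timeit cells above, assuming the df_list_dropped built earlier in the snippet:

>>> import timeit
>>> n = 10
>>> per_loop = timeit.timeit(lambda: pd.concat(df_list_dropped), number=n) / n
>>> print(f"{per_loop * 1e3:.2f} ms per loop")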

simonjayhawkins added the Performance (Memory or execution speed performance) and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels on Feb 22, 2023
@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Labels
Performance, Reshaping, Stale

Development
Successfully merging this pull request may close these issues.

PERF: concat slow, manual concat through reindexing enhances performance