PERF: faster pd.concat when same concat float dtype but misaligned axis #51419

Closed
topper-123 wants to merge 1 commit

Conversation

topper-123
Contributor

Faster concatenation for misaligned float DataFrames.
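
As a minimal sketch of the "misaligned" case this change targets (an illustration, not code from the PR): the frames share a single float dtype, but each is missing a column the other has, so pd.concat must reindex the column axis instead of stacking the blocks directly:

>>> import numpy as np
>>> import pandas as pd
>>> a = pd.DataFrame(np.ones((2, 2)), columns=["x", "y"], dtype="float64")
>>> b = pd.DataFrame(np.ones((2, 2)), columns=["y", "z"], dtype="float64")
>>> pd.concat([a, b])  # columns become the union; missing cells are NaN
     x    y    z
0  1.0  1.0  NaN
1  1.0  1.0  NaN
0  NaN  1.0  1.0
1  NaN  1.0  1.0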

>>> from itertools import product
>>> import numpy as np
>>> import pandas as pd
>>>
>>> def manual_concat(df_list: list[pd.DataFrame]) -> pd.DataFrame:
...     columns = [col for df in df_list for col in df.columns]
...     columns = list(dict.fromkeys(columns))
...     index = np.hstack([df.index.values for df in df_list])
...     df_list = [df.reindex(columns=columns) for df in df_list]
...     values = np.vstack([df.values for df in df_list])
...     # .dtypes.iloc[0] avoids deprecated positional access via dtypes[0]
...     return pd.DataFrame(values, index=index, columns=columns, dtype=df_list[0].dtypes.iloc[0])
>>>
>>> def compare_frames(df_list: list[pd.DataFrame]) -> None:
...     concat_df = pd.concat(df_list)
...     manual_df = manual_concat(df_list)
...     if not concat_df.equals(manual_df):
...         raise ValueError("different concatenations!")
>>>
>>> def make_dataframes(num_dfs, num_idx, num_cols, dtype=pd.Float32Dtype(), drop_column=False) -> list[pd.DataFrame]:
...     values = np.random.randint(-100, 100, size=[num_idx, num_cols])
...     index = [f"i{i}" for i in range(num_idx)]
...     columns = np.random.choice([f"c{i}" for i in range(num_cols)], num_cols, replace=False)
...     df = pd.DataFrame(values, index=index, columns=columns, dtype=dtype)
...
...     df_list = []
...     for i in range(num_dfs):
...         new_df = df.copy()
...         if drop_column:
...             label = new_df.columns[i]
...             new_df = new_df.drop(label, axis=1)
...         df_list.append(new_df)
...     return df_list
>>>
>>> test_data = [  # num_idx, num_cols, num_dfs
...     [100, 1_000, 3],
...     ]
>>> for i, (num_idx, num_cols, num_dfs) in enumerate(test_data):
...     print(f"\n{i}: {num_dfs=}, {num_idx=}, {num_cols=}")
...     df_list = make_dataframes(num_dfs, num_idx, num_cols, drop_column=False)
...     df_list_dropped = make_dataframes(num_dfs, num_idx, num_cols, drop_column=True)
...     print("manual:")
...     %timeit manual_concat(df_list)
...     compare_frames(df_list)
...     for use_dropped in [False, True]:
...         print(f"pd.concat: {use_dropped=}")
...         this_df_list = df_list if not use_dropped else df_list_dropped
...         %timeit pd.concat(this_df_list)
0: num_dfs=3, num_idx=100, num_cols=1000
manual:
746 µs ± 64.5 µs per loop  # main
714 µs ± 21.6 µs per loop  # this PR
pd.concat: use_dropped=False
290 µs ± 1.6 µs per loop  # main
287 µs ± 603 ns per loop  # this PR
pd.concat: use_dropped=True
23.2 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # main
1.11 ms ± 13.2 µs per loop  # this PR  <- this is the performance boost
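
For readers timing this outside IPython (not part of the original comment), a plain-timeit equivalent of the %timeit cells above, assuming the df_list_dropped built earlier in the snippet:

>>> import timeit
>>> n = 10
>>> per_loop = timeit.timeit(lambda: pd.concat(df_list_dropped), number=n) / n
>>> print(f"{per_loop * 1e3:.2f} ms per loop")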

simonjayhawkins added the Performance (Memory or execution speed performance) and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels on Feb 22, 2023
@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Labels
Performance, Reshaping, Stale

Development
Successfully merging this pull request may close these issues.

PERF: concat slow, manual concat through reindexing enhances performance