
ENH: add efficient groupby for sorted dataframe #43011


Open
AdeBC opened this issue Aug 13, 2021 · 1 comment
Labels
Enhancement · Groupby · Needs Discussion (Requires discussion from core team before further action) · Performance (Memory or execution speed performance)

Comments


AdeBC commented Aug 13, 2021

Feature description

I wish I could use pandas to efficiently group a large dataframe by a sorted column. Sorting typically costs O(n log n), while splitting a sorted column at its runs of consecutive equal values costs only O(n). Given that the current pandas groupby has O(n^2) time complexity, it would be better if we could group by an already-sorted column by splitting the dataframe wherever the sorted column's value changes. The groupby step itself then runs in O(n); together with the sort, the total cost is O(n log n).
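The O(n) split itself can be seen in a minimal numpy sketch (hypothetical data; assumes the key array is already sorted, so each group is one contiguous run):

```python
import numpy as np

key = np.array(["a", "a", "b", "b", "b", "c"])  # already sorted

# positions where the value changes from one element to the next;
# one comparison per element, so this pass is O(n)
boundaries = np.flatnonzero(key[1:] != key[:-1]) + 1
starts = np.concatenate(([0], boundaries))
ends = np.concatenate((boundaries, [len(key)]))

# each group is just a (start, end) slice of the sorted array
groups = {key[s]: (s, e) for s, e in zip(starts, ends)}
# groups == {'a': (0, 2), 'b': (2, 5), 'c': (5, 6)}
```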

Here is my solution

DataFrame.groupby should get a new parameter, sorted or something similarly intuitive, to perform the grouping in O(n) time (see the code below, though the returned object should ultimately be a DataFrameGroupBy, not a dict).

Existing similar implementation

see here.

import numpy as np
import pandas as pd
from tqdm import tqdm


def perGroupSize(col):
    """Sketch of the helper linked above: sizes of the consecutive runs of
    equal values in `col`, keyed by value, computed in a single O(n) pass."""
    values = col.to_numpy()
    # indices where the value changes from one row to the next
    boundaries = np.flatnonzero(values[1:] != values[:-1]) + 1
    edges = np.concatenate(([0], boundaries, [len(values)]))
    return pd.Series(np.diff(edges), index=values[edges[:-1]])


def groupby(dataframe, by):
    """
    Group `dataframe` by the values of the column `by`.
    Assumes the dataframe is already sorted by `by`, so that each group
    occupies one contiguous block of rows.
    :return: dict mapping each group name to its sub-dataframe.
    """
    df = dataframe[~dataframe[by].isna()]
    col = df[by]
    group_sizes = pd.Series(perGroupSize(col))  # one size per run of consecutive equal values, O(n)
    ends = group_sizes.cumsum()
    begins = ends.shift(1).fillna(0).astype(int)
    group_names = tqdm(group_sizes.index,
                       desc='Generating groups from original dataframe, progress: ',
                       total=group_sizes.shape[0])
    groups = {name: df.iloc[begins[name]:ends[name], :].reset_index(drop=True) for name in group_names}
    return groups
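As a quick sanity check of the approach (a self-contained sketch rather than the function above), slicing a sorted frame at its run boundaries reproduces exactly the groups that pandas' own groupby produces:

```python
import numpy as np
import pandas as pd

# hypothetical frame, already sorted by "key"
df = pd.DataFrame({"key": ["a", "a", "b", "b", "b", "c"],
                   "val": [10, 20, 30, 40, 50, 60]})

col = df["key"].to_numpy()
bounds = np.flatnonzero(col[1:] != col[:-1]) + 1  # O(n) change detection
starts = np.concatenate(([0], bounds))
ends = np.concatenate((bounds, [len(col)]))
fast_groups = {col[s]: df.iloc[s:e].reset_index(drop=True)
               for s, e in zip(starts, ends)}

# agrees with pandas' own groupby on the same frame
for name, grp in df.groupby("key"):
    assert grp.reset_index(drop=True).equals(fast_groups[name])
```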

Does this make sense? If so, I can implement this in Pandas and create a PR.

@AdeBC added the Enhancement and Needs Triage labels on Aug 13, 2021
@mroeschke added the Groupby, Needs Discussion, and Performance labels and removed the Needs Triage label on Aug 21, 2021
@shumpohl

Related: #5494
