
ENH: add efficient groupby for sorted dataframe #43011


Open
AdeBC opened this issue Aug 13, 2021 · 1 comment
Labels
Enhancement · Groupby · Needs Discussion (Requires discussion from core team before further action) · Performance (Memory or execution speed performance)

Comments


AdeBC commented Aug 13, 2021

Feature description

I wish I could use pandas to efficiently group a large dataframe by a sorted column. Sorting typically costs O(n log n), while splitting a sorted column at its runs of consecutive equal values costs only O(n). Given that the current pandas groupby has O(n^2) time complexity, it would be better if we could group by an already-sorted column by splitting the dataframe wherever the sorted column's value changes. The groupby step itself then runs in O(n); together with the sort, the total cost is O(n log n).
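The O(n) split itself can be seen in a minimal numpy sketch (hypothetical data; assumes the key array is already sorted, so each group is one contiguous run):

```python
import numpy as np

key = np.array(["a", "a", "b", "b", "b", "c"])  # already sorted

# positions where the value changes from one element to the next;
# one comparison per element, so this pass is O(n)
boundaries = np.flatnonzero(key[1:] != key[:-1]) + 1
starts = np.concatenate(([0], boundaries))
ends = np.concatenate((boundaries, [len(key)]))

# each group is just a (start, end) slice of the sorted array
groups = {key[s]: (s, e) for s, e in zip(starts, ends)}
# groups == {'a': (0, 2), 'b': (2, 5), 'c': (5, 6)}
```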

Here is my solution

DataFrame.groupby should get a new parameter, sorted or something similarly intuitive, to perform the grouping in O(n) time (see the code below, though the returned object should ultimately be a DataFrameGroupBy, not a dict).

Existing similar implementation

see here.

import numpy as np
import pandas as pd
from tqdm import tqdm


def perGroupSize(col):
    """Sketch of the helper linked above: sizes of the consecutive runs of
    equal values in `col`, keyed by value, computed in a single O(n) pass."""
    values = col.to_numpy()
    # indices where the value changes from one row to the next
    boundaries = np.flatnonzero(values[1:] != values[:-1]) + 1
    edges = np.concatenate(([0], boundaries, [len(values)]))
    return pd.Series(np.diff(edges), index=values[edges[:-1]])


def groupby(dataframe, by):
    """
    Group `dataframe` by the values of the column `by`.
    Assumes the dataframe is already sorted by `by`, so that each group
    occupies one contiguous block of rows.
    :return: dict mapping each group name to its sub-dataframe.
    """
    df = dataframe[~dataframe[by].isna()]
    col = df[by]
    group_sizes = pd.Series(perGroupSize(col))  # one size per run of consecutive equal values, O(n)
    ends = group_sizes.cumsum()
    begins = ends.shift(1).fillna(0).astype(int)
    group_names = tqdm(group_sizes.index,
                       desc='Generating groups from original dataframe, progress: ',
                       total=group_sizes.shape[0])
    groups = {name: df.iloc[begins[name]:ends[name], :].reset_index(drop=True) for name in group_names}
    return groups
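As a quick sanity check of the approach (a self-contained sketch rather than the function above), slicing a sorted frame at its run boundaries reproduces exactly the groups that pandas' own groupby produces:

```python
import numpy as np
import pandas as pd

# hypothetical frame, already sorted by "key"
df = pd.DataFrame({"key": ["a", "a", "b", "b", "b", "c"],
                   "val": [10, 20, 30, 40, 50, 60]})

col = df["key"].to_numpy()
bounds = np.flatnonzero(col[1:] != col[:-1]) + 1  # O(n) change detection
starts = np.concatenate(([0], bounds))
ends = np.concatenate((bounds, [len(col)]))
fast_groups = {col[s]: df.iloc[s:e].reset_index(drop=True)
               for s, e in zip(starts, ends)}

# agrees with pandas' own groupby on the same frame
for name, grp in df.groupby("key"):
    assert grp.reset_index(drop=True).equals(fast_groups[name])
```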

Does this make sense? If so, I can implement this in Pandas and create a PR.

@AdeBC added the Enhancement and Needs Triage labels on Aug 13, 2021
@mroeschke added the Groupby, Needs Discussion, and Performance labels and removed the Needs Triage label on Aug 21, 2021
@shumpohl

Related: #5494
