ENH: add efficient groupby for sorted dataframe #43011
Labels: Enhancement, Groupby, Needs Discussion (requires discussion from core team before further action), Performance (memory or execution speed performance)
Feature description
I wish I could use Pandas to efficiently group a large data frame by a sorted column. Many sorting algorithms run in O(n log(n)) time, and splitting a column into runs of consecutive equal values takes only O(n). Given that the current `groupby` in Pandas has, as far as I can tell, a time complexity of O(n^2), it would be better if `dataframe.groupby` on a sorted column could simply split the data frame wherever the value in that column changes. The groupby step would then take O(n) time, and together with the sort the total time complexity would be O(n log(n)).
Here is my solution
`DataFrame.groupby` should get a new parameter `sorted` (or something similarly intuitive) to perform `groupby` with O(n) time complexity (see the code below, though the returned object should be a DataFrameGroupBy object, not a dict).
Existing similar implementation
see here.
Does this make sense? If so, I can implement this in Pandas and create a PR.
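A minimal sketch of the consecutive-value split described above, assuming the frame is already sorted by the key column. The helper name `split_sorted` is hypothetical and not part of Pandas. It returns a plain dict of slices rather than a DataFrameGroupBy object.

```python
import numpy as np
import pandas as pd


def split_sorted(df: pd.DataFrame, by: str) -> dict:
    """Split `df` into runs of consecutive equal values in column `by`.

    Hypothetical helper, not a Pandas API. Assumes `df` is already sorted
    by `by`; on unsorted input a later run with the same key would
    overwrite an earlier one in the returned dict.
    """
    keys = df[by].to_numpy()
    if len(keys) == 0:
        return {}
    # A row starts a new group when its key differs from the previous row's key.
    starts = np.flatnonzero(keys[1:] != keys[:-1]) + 1
    bounds = np.concatenate(([0], starts, [len(keys)]))
    # Single pass over the boundaries; each group is a positional slice.
    return {keys[lo]: df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])}


df = pd.DataFrame({"key": [1, 1, 2, 3, 3, 3], "val": [10, 20, 30, 40, 50, 60]})
groups = split_sorted(df, "key")
```

On sorted input this should yield the same row groupings as `df.groupby("key")`. Finding the boundaries takes one linear scan over the key column.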