Skip to content

ASV: reduce overall run time for GroupByMethods benchmarks #44604

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 25, 2021

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Nov 24, 2021

See #44450 (comment)

This reduces 1) the data size for describe, mad and skew (all the slowest benchmarks that take more than a second is for one of those three), and 2) reduces the parametrization for ncols (this was added in #42841 to also benchmark the case of multiple columns (block-wise), but so I think by keeping two options (single col / multiple col) that aspect should still be captured adequately).

This reduces the total runtime for this single class of benchmarks from 18min to 3min (and the --quick runtime (as used on CI) from 4min to 10s. Note this is only the sum of the actual (repeated) timings, so not including the setup and other overhead of running the benchmarks).

@jorisvandenbossche jorisvandenbossche added the Benchmark Performance (ASV) benchmarks label Nov 24, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.4 milestone Nov 24, 2021
@@ -464,7 +465,12 @@ def setup(self, dtype, method, application, ncols):
# DataFrameGroupBy doesn't have these methods
raise NotImplementedError

ngroups = 1000
if method == "describe" and ncols == 5:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could do this without the ncols check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im imagining looking at results and seeing the ncols==5 case being faster than the ncols==1 case (or just much less than 5x slower) and being confused until i remember this special case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree or add a comment here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, certainly (I initially used ngroups = 100 for each of the three methods for both ncols parameters, but then only describe for ncols=5 was still on the slow side.
Using the smaller ngroups for both cases of describe will make the ncols=1 still faster, but that can't hurt I suppose.

@jreback jreback merged commit c8a0804 into pandas-dev:master Nov 25, 2021
@jorisvandenbossche jorisvandenbossche deleted the asv-reduce-groupby branch November 26, 2021 06:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Benchmark Performance (ASV) benchmarks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants