Add `/benchmark` GitHub command to run a comparison benchmark between the base and PR commits
#9461
Conversation
Thank you @gruuya -- I think the basic pattern of this PR looks good to me. I wonder how we could test it... Maybe I could merge it to my fork and run it there 🤔
The basic idea looks great -- thank you. The major concern I have is that this PR seems to run the benchmark on GitHub runners, as I understand it.
The downside of using GitHub runners is that I think they do not have stable base performance. I think they are shared, and they also may vary from one run to the next (e.g. a different processor). If the idea is to benchmark just the code change, keeping the runtime environment the same is, I think, important. For example, if the tests report that performance slows down on a PR but the problem really was that the GitHub runner was overloaded, that will be super confusing.
I think I had in my mind that we would have a dedicated benchmarking machine / VM running somewhere and run the benchmarks on that VM. I am pretty sure InfluxData would be willing to cover the cost of this VM (or we can make a shared collection).
The benefit of this approach would be that the benchmark environment would be more controlled (the only change would be the software change), though it has a monetary cost and won't scale up/down with our jobs.
True, that is correct. My assumption was that any instabilities in the base performance would not vary greatly during a single run, as both benchmarks are run within the same job in a relatively short time interval, but I guess this is not a given. In addition, for this type of benchmarking (PR-vs-main) we're only interested in relative comparisons, so the longitudinal variance component, which is undoubtedly large, wouldn't come into play (unlike for tracking main's performance across time). That said, the present workflows should be easily extendable to use self-hosted runners once those become available, I believe. This would also bring additional benefits, such as shorter benchmark runs, both in terms of persisting the test data (e.g. downloading hits.parquet takes about 12 minutes in CI) and in terms of the runs themselves (meaning we could use beefier SFs, i.e. 10 or even 100).
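As a rough illustration of the self-hosted extension mentioned above, here is a minimal workflow sketch; the runner labels and the `bench.sh` invocation are assumptions for the example, not anything in this PR:

```yaml
# Sketch only: switching the benchmark job to a dedicated self-hosted runner
# would mostly be a matter of changing `runs-on` once such a machine is
# registered with the repository.
name: Benchmarks (self-hosted sketch)

on:
  workflow_dispatch:

jobs:
  benchmark:
    # The label list must match whatever labels the registered runner advertises
    # (these are placeholder labels).
    runs-on: [self-hosted, linux, x64]
    steps:
      - uses: actions/checkout@v4
      # A dedicated machine can keep the benchmark data (e.g. hits.parquet)
      # on disk between runs, avoiding the ~12 minute download seen in CI.
      - name: Run TPC-H benchmark
        run: ./benchmarks/bench.sh run tpch   # assumed invocation of the repo's bench script
```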
Oh that's a good idea, I believe I can test it out on our fork and provide the details here, thanks!
Ok, I went ahead and tested it on our fork: splitgraph#1
Here it is catching a deliberate regression I introduced: splitgraph#1 (comment)
Here it is after removing the regression (and polishing the comment message a bit): splitgraph#1 (comment)
There are a couple of false positives in that last one, though it is running SF1, so with shorter run times it is more sensitive to any variations (though I believe such variations can be seen when running locally as well).
Thank you @gruuya -- I think this is a really neat step forward and we can refine this functionality over time.
I think we should also add a section in the benchmark documentation about this feature to make it easier to find.
https://github.com/apache/arrow-datafusion/tree/main/benchmarks#datafusion-benchmarks
We could do that as a follow-on PR.
Ok I went ahead and tested it on our fork: splitgraph#1
Here it is catching a deliberate regression I introduced: splitgraph#1 (comment)
Here it is after removing the regression (and polishing the comment message a bit): splitgraph#1 (comment)
Those results are very cool
There are a couple of false positives in that last one, though it is running SF1, so with shorter run times it is more sensitive to any variations (though I believe such variations can be seen when running locally as well).
Yes, you are correct, I have also seen similar variations locally.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Yup, makes sense! I'd also like to bump SF to 10 there to reduce the noise a bit, and perhaps add parquet/sorting benches as well (assuming those things don't take too long). Eventually, when we have a self-hosted runner, we can add a selection of ClickBench queries too. Thanks!
I think ClickBench would be a good choice too (it doesn't take too long and is an excellent benchmark for aggregation / filtering performance). So how about we merge this PR once CI has passed and then file follow-on tickets for the remaining tasks?
Sounds good to me!
This is awesome, thank you @gruuya !
FYI, this code was removed in #11165.
Which issue does this PR close?
Partially progresses #5504.
Rationale for this change
Enable automated comparative PR benchmarking, and thus make performance regressions less likely to be introduced into the code.
What changes are included in this PR?
Add two new workflows (a rough sketch of each follows below):
- A `Benchmarks` workflow, which is triggered by a `/benchmark` PR comment and runs the standard repo benchmarks against the base and head commits.
- A `PR Comment` workflow, which in this case should post a message with the benchmark comparison results.
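As a rough illustration only, here is a minimal sketch of what an `issue_comment`-triggered `Benchmarks` workflow along these lines could look like; the `bench.sh` invocations, the artifact name, and the results path are assumptions for the example, not the exact contents of this PR:

```yaml
# Sketch: react to a "/benchmark" comment on a PR, benchmark base and head,
# and upload the results for a separate commenting workflow to pick up.
name: Benchmarks

on:
  issue_comment:
    types: [created]

jobs:
  benchmark:
    # Only react to a /benchmark comment that was left on a pull request.
    if: github.event.issue.pull_request && contains(github.event.comment.body, '/benchmark')
    runs-on: ubuntu-latest
    steps:
      # Fetch full history so both the PR head and the base branch are available.
      - uses: actions/checkout@v4
        with:
          ref: refs/pull/${{ github.event.issue.number }}/head
          fetch-depth: 0

      - name: Remember the PR head commit
        run: echo "HEAD_SHA=$(git rev-parse HEAD)" >> "$GITHUB_ENV"

      - name: Benchmark the base commit
        run: |
          # The bench.sh invocations are assumptions about the repo's benchmark script.
          git checkout origin/${{ github.event.repository.default_branch }}
          ./benchmarks/bench.sh data tpch
          ./benchmarks/bench.sh run tpch

      - name: Benchmark the PR head commit
        run: |
          git checkout "$HEAD_SHA"
          ./benchmarks/bench.sh run tpch

      # The comparison results are uploaded as an artifact for the commenting
      # workflow (sketched further below) to download and post on the PR.
      - uses: actions/upload-artifact@v4
        with:
          name: benchmark-results       # artifact name is an assumption
          path: benchmarks/results/     # results path is an assumption
```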
Are these changes tested?
Sort of. There is a big catch-22 at play here:
- The `pull_request` event to run the benchmarks, as is officially recommended: the workflow fails just prior to posting a comment, because in the case of `pull_request` events there are insufficient permissions to write on the PR (trying to override the permissions of the job itself doesn't work).
- A `workflow_call` workflow: the problem there is that this type of workflow inherits permissions from the caller workflow, so again this isn't sufficient.
- `pull_request_target`, which does have sufficient permissions and can call a `workflow_call` workflow: it suffers from another problem, namely that it must be present on the default branch to be run, so in this case it won't be executed until the PR is merged.
- Likewise a `workflow_run` event that gets triggered by the benchmark workflow: the same problem appears here (it can't run until merged).

In the end I decided to go with an `issue_comment`-based benchmarking workflow (this event covers pull requests too, and the workflow also needs to be on the main branch to be run), and then make that trigger the PR commenting workflow via a `workflow_run` event. I figured that since there has to be a two-step motion to test this out anyway, we might as well try not to spam and add noise between steps (plus I envision the PR benchmarking to be triggered by a comment anyway).

Are there any user-facing changes?
None (there will be dev-facing changes with the follow-up).
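To make the two-step permission workaround above concrete, here is a minimal sketch of what the `workflow_run`-triggered `PR Comment` workflow could look like; because it runs from the default branch, it can be granted `pull-requests: write`. The artifact and file names are assumptions for the example, not the exact contents of this PR:

```yaml
# Sketch: runs after the Benchmarks workflow finishes, downloads its results
# artifact, and posts the comparison back to the originating PR.
name: PR Comment

on:
  workflow_run:
    workflows: ["Benchmarks"]
    types: [completed]

permissions:
  actions: read          # needed to download artifacts from another run
  pull-requests: write   # needed to comment on the PR

jobs:
  comment:
    if: github.event.workflow_run.conclusion == 'success'
    runs-on: ubuntu-latest
    steps:
      # Cross-run artifact download requires the run id and a token.
      - name: Download benchmark results
        uses: actions/download-artifact@v4
        with:
          name: benchmark-results                        # artifact name is an assumption
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ secrets.GITHUB_TOKEN }}

      - name: Comment on the PR
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # The PR number and the comparison text are assumed to have been
          # placed inside the artifact by the benchmark workflow.
          gh pr comment "$(cat pr_number.txt)" \
            --repo "${{ github.repository }}" \
            --body-file comparison.md
```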