ci: Releasing benchmarks and benchmarking PR #2432
Conversation
Apologies for opening a new PR; the old one got cluttered because of testing, but here it is. Two major comments on the previous PR for context:
Codecov Report: Base: 53.35% // Head: 53.30% // Decreases project coverage by -0.06%.
Additional details and impacted files:
@@ Coverage Diff @@
## master #2432 +/- ##
==========================================
- Coverage 53.35% 53.30% -0.06%
==========================================
Files 116 116
Lines 10270 10270
==========================================
- Hits 5480 5474 -6
- Misses 4370 4374 +4
- Partials 420 422 +2
Flags with carried forward coverage won't be shown.
☔ View full report at Codecov.
@sozercan The GitHub API returns JSON data with unescaped character sequences. Hence, a parser like …
.github/workflows/workflow.yaml
Outdated
export GOPATH="$HOME/go"
PATH="$GOPATH/bin:$PATH"
go install golang.org/x/perf/cmd/benchstat@latest
benchstat release_benchmarks.txt pr_benchmarks.txt > benchstat.txt
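For anyone reproducing this step locally, a minimal dry run might look like the following sketch (the release tag and benchmark name are illustrative, not taken from this PR; the file names match the workflow step above):
# generate the two inputs locally, then compare them
git checkout v3.10.0                                                    # illustrative release tag
go test ./pkg/... -bench . -run="^#" -count 5 > release_benchmarks.txt
git checkout -                                                          # back to the PR branch
go test ./pkg/... -bench . -run="^#" -count 5 > pr_benchmarks.txt
go install golang.org/x/perf/cmd/benchstat@latest
benchstat release_benchmarks.txt pr_benchmarks.txt
# output rows look roughly like:
#   name            old time/op    new time/op    delta
#   BenchmarkFoo-2  1.20ms ± 2%    1.22ms ± 3%    ~     (p=0.413 n=5+5)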
Do we know if the release benchmark was performed on an identical machine/load profile? If not, how much does that impact the accuracy of the comparison? Are we measuring differences in code performance or machine performance?
Yes. Since the release job also runs on ubuntu-latest, the same standard GitHub runner as the benchmark job, we can say we are comparing the code performance of the update (the PR in question) against what is released in the latest version.
I'm not seeing anything about CPU architecture. Are they guaranteed to run on a fixed generation?
The easiest way to eliminate most infra variables would be to run the old and new code next to each other on the same machine (maybe pulling the current build from Dockerhub to save time), though I'm open to other methods.
Maybe we could run the benchmark multiple times (separate runs, so on different machines) to get a sense of the variance and store that?
Maybe we could run the benchmark multiple times (separate runs, so on different machines) to get a sense of the variance and store that?

We are doing this with count=5, but it's on the same machine. Running on different machines would require multiple runners, and then we'd need another runner to combine the results; that seems a bit too complex?
https://github.com/open-policy-agent/gatekeeper/pull/2432/files#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52R110
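For what it's worth, with repeated runs on a single machine benchstat already reports the spread in its ± column; a quick way to see it locally (sample row is illustrative):
go test ./pkg/... -bench . -run="^#" -count 5 > pr_benchmarks.txt
benchstat pr_benchmarks.txt
# name            time/op
# BenchmarkFoo-2  1.21ms ± 3%    <- the ±3% is the variation across the 5 runs on this machine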
We need to have some kind of signal that the performance numbers are apples-to-apples, otherwise we'll spend time trying to diagnose a regression that's really just an architecture difference.
I'm open to whatever gives us confidence this is the case. Unfortunately thread count alone probably isn't sufficient, since not all threads are created equal.
TBH, my preferred order (of alternatives I can think of) would be:
- Run old/new code on the same machine for each test run (I think OPA does this)
- Run old/new code on different machines, with both run several times to get some measure of per-machine variance (an attempt to quantify the likelihood of processor skew)
- Assurances that the hardware/software stack is identical across runs (might be impossible to guarantee, since upgrades are likely/expected, so there will always be the possibility that maintenance has been performed between runs)
I might be missing some alternatives?
Or are we not looking to make conclusions by doing comparative analysis? If not, what are we hoping to use this data for?
One drawback, though: since we are running two benchmarking processes, the time to complete the workflow increases by about 10 minutes on average.
Tests seem to be taking 20+ minutes now, which makes e2e take over twice as long.
@acpana should we decrease count to 5? Anything else we can do?
Perhaps this can be a manual command (like /benchmark)?
+1, we could hide the benchmark job behind a manual command
Definitely running benchmarks less frequently (or manually) could be a good solution here. It will at least let us know if something went wrong.
Done. After this PR is merged, we can run benchmarks on a PR by commenting /benchmark.
i really like this PR! I think it's important to know how our changes impact perf.
i left a couple of call outs and questions 💯
.github/workflows/workflow.yaml
Outdated
with:
  issue-number: ${{ github.event.pull_request.number }}
  body: |
    This PR compares its performance to the latest released version. If it performs significantly lower, consider optimizing your changes to improve the performance.
can we define what significantly lower means? e.g. 5% ns/op delta, etc
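If we do want an automated gate, one option would be to parse benchstat's delta column and fail the step when it exceeds a threshold. A rough sketch, not part of this PR; the 5% value mirrors the example above and assumes the default benchstat text output where regressions show up as a +N.NN% delta:
# fail when any benchmark's time/op grew by more than the threshold
THRESHOLD=5.0   # percent; illustrative value
regressions=$(awk -v t="$THRESHOLD" '
  /%/ {                                   # benchstat rows with a delta contain a percentage
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^\+[0-9.]+%$/) {          # a positive time/op delta means the PR got slower
        d = $i; gsub(/[+%]/, "", d)
        if (d + 0 > t) print $1
      }
    }
  }' benchstat.txt)
if [ -n "$regressions" ]; then
  echo "Benchmarks regressed by more than ${THRESHOLD}%:"
  echo "$regressions"
  exit 1
fi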
And on further thought, could we enhance this action to do some processing of the results? Like showing which benches have changed meaningfully?
As I read this right now, do we post the whole benchmark output as a comment? That may be a bit too much.
Let me look into this and the capabilities of benchstat to see if this is possible or not.
Good point on defining delta. Never gave that a thought! Good catch
What do you think the delta should be? Do you know any tools that could process the benchstat output to prettify it? As of now, we are posting the complete result of benchstat on PR.
We can pick a value to start with and then adjust as needed. But if we are only running this job through the comment command /benchmark, then maybe we don't need to set a threshold and can just rely on one-off analyses from the PR authors & reviewers.
Makefile
Outdated
@@ -103,6 +105,10 @@ all: lint test manager
native-test:
	GO111MODULE=on go test -mod vendor ./pkg/... ./apis/... -race -bench . -coverprofile cover.out

.PHONY: benchmark-test
benchmark-test:
	go test ./pkg/... -bench . -run="^#" -count 5 > ${BENCHMARK_FILE_NAME}
3 callouts here:
- From my experience benchmarking, count=5 can yield statistically insignificant results; I usually use count=10,20 (some multiple of 10).
- The default timeout here is 10 min but the GH action sets a timeout of 30 minutes; for debugging this may cause some confusion. Let's use the same timeout value in both places if that makes sense.
- I think it's best practice when benchmarking to set GOMAXPROCS here to decrease hardware variance. This is along the lines of Max's thoughts above. In particular for this cmd, let's set this to GOMAXPROCS=1 go test ..., or some env var that we control, but let's set it so we have more consistent results across hardware (see the sketch below).
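A sketch of the resulting command under those suggestions (the count and timeout values are just the ones mentioned above, not a final choice):
# pin GOMAXPROCS so results are less sensitive to the runner's core count,
# and keep the go test timeout in sync with the GH action's 30-minute job timeout
GOMAXPROCS=1 go test ./pkg/... -bench . -run="^#" -count 10 -timeout 30m > "${BENCHMARK_FILE_NAME}"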
Ohh thanks for this input!
.github/workflows/workflow.yaml
Outdated
@@ -466,6 +530,7 @@ jobs:
files: |
  _dist/sha256sums.txt
  _dist/*.tar.gz
  _dist/release_benchmarks.txt
Do we really want to publish benchmark results for the release now? I think this could be useful, but per Max's comments above, if we only compare the benchmark run against the tip of main / base of the branch, would this actually be used?
Yeah, if we are considering running a benchmark for head every time, then we don't need to run/publish benchmarks in the release.
.github/workflows/workflow.yaml
Outdated
export GOPATH="$HOME/go"
PATH="$GOPATH/bin:$PATH"
go install golang.org/x/perf/cmd/benchstat@latest
benchstat release_benchmarks.txt pr_benchmarks.txt > benchstat.txt
(Chiming in to throw in my 2 cents.) I think benchmarking against the head of the branch would be more incrementally useful as we iterate over a change.
i.e. "does my latest commit change the performance in this branch" vs "does my latest commit OR something that was checked into main change the performance in this branch"?
.github/workflows/workflow.yaml
Outdated
run: |
  export GOPATH="$HOME/go"
  PATH="$GOPATH/bin:$PATH"
  go install golang.org/x/perf/cmd/benchstat@latest
can we pin the benchstat version and/or expose it as an env var?
i don't think there are tagged versions. can we specify a digest here? https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
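Go modules can't install by an image-style digest, but a specific commit can be pinned via its pseudo-version; a sketch of that approach (the pseudo-version below is a placeholder, not a real one, and the BENCHSTAT_VERSION variable name is made up):
# resolve a commit of golang.org/x/perf to a pseudo-version once, e.g.:
#   go list -m golang.org/x/perf@<commit-sha>
# then pin the install to it and surface it as a variable:
BENCHSTAT_VERSION="v0.0.0-20230101000000-abcdef123456"   # placeholder, not a real version
go install "golang.org/x/perf/cmd/benchstat@${BENCHSTAT_VERSION}"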
.github/workflows/workflow.yaml
Outdated
@@ -317,6 +317,60 @@ jobs:
        done
      done

  benchmark:
how did you test this end to end? (i ask bc I am writing a GH wf too hehe)
I tested this job on my fork by manually creating releases and generating the needed artifacts, then testing the job against two branches in my fork.
i've been using act to test locally too. it helps w/ the development iteration: https://github.com/nektos/act
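For reference, a typical act invocation to exercise just one job locally might look like this (the job name and event are illustrative, not taken from this workflow):
# simulate a pull_request event and run only the benchmark job from the workflow file
act pull_request -j benchmark -W .github/workflows/workflow.yaml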
.github/workflows/workflow.yaml
Outdated
- name: Download release benchmark file
  uses: robinraju/release-downloader@v1.6
  id: get-latest-benchmark
  continue-on-error: true
also, could we actually make this wf fail? what were the arguments for this again?
I don't want us to be tricked by the ✅ in the checks summary only to find out that we can't actually download the data that we need.
https://github.com/open-policy-agent/gatekeeper/actions/runs/3680438765/jobs/6226037201
At least an ❌ signals to the author to investigate. Wdyt?
The argument was that missing release benchmark data alone is not enough reason to fail the workflow. But yeah, I think somehow outputting the error data, or indicating "there was an error, please look at the result of the job", would be useful. We could post that message on the PR the same way we are outputting the benchstat data, to indicate the error in retrieving past data.
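One possible shape for that (a sketch only; the step id matches the continue-on-error step above, while the message and file name are illustrative):
# a later run: step can branch on the recorded outcome of the download step
if [ "${{ steps.get-latest-benchmark.outcome }}" != "success" ]; then
  echo "Could not download release benchmark data; showing PR benchmarks only." > benchstat.txt
fi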
.github/workflows/workflow.yaml
Outdated
- name: Run benchmark with incoming changes
  run: |
    curl -L -O "https://github.com/kubernetes-sigs/kubebuilder/releases/download/v${KUBEBUILDER_VERSION}/kubebuilder_${KUBEBUILDER_VERSION}_linux_amd64.tar.gz" &&\
we can combine these i think and install once
Done
Makefile
Outdated
@@ -103,6 +105,10 @@ all: lint test manager
native-test:
	GO111MODULE=on go test -mod vendor ./pkg/... ./apis/... -race -bench . -coverprofile cover.out

.PHONY: benchmark-test
benchmark-test:
	go test ./pkg/... -bench . -run="^#" -count 10 > ${BENCHMARK_FILE_NAME}
looks like this is not used anymore?
Yes! I kept it since it's still useful if a dev wants to run it locally.
can we add GOMAXPROCS=1 here and use this in the github action
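A sketch of how that might look, so the GitHub Action can simply reuse the target instead of duplicating the go test flags (values illustrative, not the final wording of the change):
# Makefile recipe with GOMAXPROCS pinned inside the target:
#   benchmark-test:
#   	GOMAXPROCS=1 go test ./pkg/... -bench . -run="^#" -count 10 > ${BENCHMARK_FILE_NAME}
# the workflow step can then just run:
make benchmark-test BENCHMARK_FILE_NAME=pr_benchmarks.txt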
Thanks for making all the changes and for your patience. This overall LGTM w a few questions/ open threads!
.github/workflows/benchmark.yaml
Outdated
  contents: write
  pull-requests: write
steps:
  - uses: izhangzhihao/delete-comment@master
Ah I can't recall, why are we removing the comments from the bot here?
Consider this situation: I ran the benchmark and got the result on the PR, then changed something, ran the benchmark again, and got another result on the PR. Since the benchmark result is really long, the PR can get cluttered easily. And in any case, we only care about the most recent benchmarks, so we aren't losing anything by deleting previous comments from the bot.
Would it be good to have the history to see the impact of various changes?
Also, are older comments collapsed-by-default once the threads get large?
I don't have any strong opinions about keeping or not keeping the older stats in the comments.
While testing multiple benchmarks on my fork, I got frustrated by the multiple results and the list of /benchmark comments, so I added the part that deletes previous comments by the bot.
I am fine with either merging this PR as is and following up afterward if we feel we want all the benchmark results on the PR, or updating this PR to keep all the bot comments and then following up with deleting comments later if need be.
How do folks feel about this?
Either works for me.
Okay then! Let's merge this and will follow up if needed.
LGTM, thanks for doing this!
.github/workflows/benchmark.yaml
Outdated
  contents: write
  pull-requests: write
steps:
  - uses: izhangzhihao/delete-comment@master
can we pin to a digest?
Done
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
What this PR does / why we need it:
Releasing benchmark stats with each release and comparing benchmark data from each PR against the latest released benchmark stats.
Which issue(s) this PR fixes (optional, using fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when the PR gets merged):
Fixes #692
Special notes for your reviewer: