
ci: Releasing benchmarks and benchmarking PR #2432

Merged · 6 commits · Jan 6, 2023

Conversation

JaydipGabani (Contributor):

What this PR does / why we need it:
Releasing benchmark stats with each release and comparing benchmark data from each PR against latest released benchmark stats.

Which issue(s) this PR fixes (optional, using fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when the PR gets merged):
Fixes #692

Special notes for your reviewer:

@JaydipGabani (Contributor, Author):

Apologies for opening a new PR, the old PR got cluttered because of testing, but here it is.

Two major comments from the previous PR, for context:

  1. We continue on error because that lets the job conclude without an error, but then the remaining steps of the benchmark job are not actually executed.
  2. If release_benchmark.txt doesn't exist, steps.get-latest-benchmark.outcome results in failure instead of success, and hence we skip the rest of the steps.

@JaydipGabani force-pushed the master branch 3 times, most recently from 270cd88 to 874a845 on December 7, 2022 at 18:46
@codecov-commenter commented Dec 7, 2022:

Codecov Report

Base: 53.35% // Head: 53.30% // Decreases project coverage by -0.05% ⚠️

Coverage data is based on head (7e7e4f9) compared to base (7bdb64f).
Patch has no changes to coverable lines.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2432      +/-   ##
==========================================
- Coverage   53.35%   53.30%   -0.06%     
==========================================
  Files         116      116              
  Lines       10270    10270              
==========================================
- Hits         5480     5474       -6     
- Misses       4370     4374       +4     
- Partials      420      422       +2     
Flag Coverage Δ
unittests 53.30% <ø> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...onstrainttemplate/constrainttemplate_controller.go 56.69% <0.00%> (-3.12%) ⬇️
pkg/controller/config/config_controller.go 65.10% <0.00%> (+1.27%) ⬆️
pkg/watch/replay.go 81.25% <0.00%> (+2.27%) ⬆️


@JaydipGabani force-pushed the master branch 7 times, most recently from 5114ccc to 4f46bdd on December 7, 2022 at 22:15
@JaydipGabani (Contributor, Author):

@sozercan The GitHub API returns JSON data with unescaped character sequences, so a parser like jq fails to interpret the JSON object the API returns. Extracting the latest semver from the release data is turning out to be a little more difficult than I expected. Let's go ahead with getting the release_benchmark.txt from the latest tagged release for now, and we can come back to this later if required.
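
For context, one jq-free way the latest-tag lookup could be approximated; this is only a sketch, not what the workflow ended up doing, and the grep/sed extraction and asset name are assumptions:

# Fetch the latest release tag without jq, then download the benchmark asset.
TAG=$(curl -sL https://api.github.com/repos/open-policy-agent/gatekeeper/releases/latest \
  | grep -m1 '"tag_name"' | sed -E 's/.*"tag_name": *"([^"]+)".*/\1/')
curl -sL -o release_benchmarks.txt \
  "https://github.com/open-policy-agent/gatekeeper/releases/download/${TAG}/release_benchmarks.txt"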

@JaydipGabani force-pushed the master branch 2 times, most recently from a45f0b7 to 51f48fc on December 9, 2022 at 22:43
export GOPATH="$HOME/go"
PATH="$GOPATH/bin:$PATH"
go install golang.org/x/perf/cmd/benchstat@latest
benchstat release_benchmarks.txt pr_benchmarks.txt > benchstat.txt
Contributor:

Do we know if the release benchmark was performed on an identical machine/load profile? If not, how much does that impact the accuracy of the comparison? Are we measuring differences in code performance or machine performance?

Contributor (Author):

Yes. Since the release job also runs on ubuntu-latest, the same standard GitHub runner as the benchmark-comparison job, we can reasonably say we are comparing the code performance of the update (the PR in question) against what is released in the latest version.

Contributor:

I'm not seeing anything about CPU architecture. Are they guaranteed to run on a fixed generation?

Easiest way to eliminate most infra variables would be to run both PRs next to each other on the same machine (maybe pulling the current build from Dockerhub to save time), though I'm open to other methods.

Maybe we could run the benchmark multiple times (separate runs, so on different machines) to get a sense of the variance, and store that?

@sozercan (Member), Dec 13, 2022:

> Maybe we could run the benchmark multiple times (separate runs, so on different machines) to get a sense of the variance, and store that?

We are doing this with count=5, but it's on the same machine. Running on different machines would require multiple runners, and then we'd need another runner to combine the results; that seems a bit too complex?
https://github.com/open-policy-agent/gatekeeper/pull/2432/files#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52R110

Contributor:

We need to have some kind of signal that the performance numbers are apples-to-apples, otherwise we'll spend time trying to diagnose a regression that's really just an architecture difference.

I'm open to whatever gives us confidence this is the case. Unfortunately thread count alone probably isn't sufficient, since not all threads are created equal.

TBH, my preferred order (of alternatives I can think of) would be:

  • run old/new code on the same machine for each test run (I think OPA does this)
  • Running old/new code on different machines, with both run several times to get some measure of per-machine variance (an attempt to quantify the likelihood of processor skew)
  • Assurances that the hardware/software stack is identical across runs (it might be impossible to guarantee this, since upgrades are likely/expected, so there will always be the possibility that maintenance has been performed between runs)

I might be missing some alternatives?

Or are we not looking to make conclusions by doing comparative analysis? If not, what are we hoping to use this data for?

Contributor (Author):

One drawback, though: running two benchmarking processes increases the workflow completion time by 10 minutes on average.

Member:

tests seem to be taking 20+ mins now, which makes e2e take over twice as long

@acpana should we decrease count to 5? anything else we can do?

perhaps this can be a manual command (like /benchmark)?

Contributor (Author):

+1, we could hide the benchmark job behind a manual command

Contributor:

Definitely running benchmarks less frequently (or manually) could be a good solution here. It will at least let us know if something went wrong.

Contributor (Author):

Done. After this PR is merged, we can run benchmarks on a PR by commenting /benchmark.

@acpana (Contributor) left a comment:

i really like this PR! I think it's important to know how our changes impact perf.

i left a couple of call outs and questions 💯

with:
issue-number: ${{ github.event.pull_request.number }}
body: |
This PR compares its performance to the latest released version. If it performs significantly lower, consider optimizing your changes to improve the performance.
Contributor:

can we define what significantly lower means? e.g. a 5% ns/op delta, etc.

Contributor:

And on further thought, could we actually enhance this action to do some processing of the results? Like showing which benches have changed meaningfully?

As I read this right now, do we post the whole benchmark output as a comment? That may be a bit too much.

Contributor (Author):

Let me look into this and the capabilities of benchstat to see if this is possible or not.

Contributor (Author):

Good point on defining the delta. Never gave that a thought! Good catch.

Contributor (Author):

What do you think the delta should be? Do you know any tools that could process the benchstat output to prettify it? As of now, we are posting the complete result of benchstat on PR.

Contributor:

We can pick a value to start with and then adjust as needed. But if we are only running this job through the comment command /benchmark, then maybe we don't need to set a threshold and can just rely on one-off analyses from the PR authors and reviewers.
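
For reference, a rough sketch of how such a threshold could be applied to benchstat's tabular output, where deltas appear as signed percentage tokens; the 5% cutoff and the field-scanning logic are placeholders, not something this PR implements:

# Flag any benchmark whose delta token (e.g. "+5.69%") exceeds +5%.
awk '{ for (i = 1; i <= NF; i++)
         if ($i ~ /^[+-][0-9.]+%$/) {
           d = $i; gsub(/[+%]/, "", d)
           if (d + 0 > 5) print "possible regression:", $0
         } }' benchstat.txt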

Makefile Outdated
@@ -103,6 +105,10 @@ all: lint test manager
native-test:
GO111MODULE=on go test -mod vendor ./pkg/... ./apis/... -race -bench . -coverprofile cover.out

.PHONY: benchmark-test
benchmark-test:
go test ./pkg/... -bench . -run="^#" -count 5 > ${BENCHMARK_FILE_NAME}
Contributor:

3 callouts here

  • from my experience, benchmarking with count=5 can yield statistically insignificant results; I usually use count=10 or 20 (some multiple of 10),

  • the default timeout here is 10 min, but the GH action sets a timeout of 30 minutes; for debugging this may cause some confusion. Let's use the same timeout value in both places if that makes sense.

  • I think it's best practice when benchmarking to set GOMAXPROCS here to decrease hardware variance. This is along the lines of Max's thoughts above. In particular for this cmd, let's set GOMAXPROCS=1 go test ... (or some env var that we control), so we get more consistent results across hardware; see the sketch after this list.
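
Taken together, a sketch of the adjusted command with all three callouts applied (the count and timeout values are illustrative, not settled):

# Pin GOMAXPROCS for consistency and match the GH action's 30-minute timeout.
GOMAXPROCS=1 go test ./pkg/... -bench . -run="^#" -count 10 -timeout 30m > "${BENCHMARK_FILE_NAME}"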

Contributor (Author):

Ohh thanks for this input!

@@ -466,6 +530,7 @@ jobs:
files: |
_dist/sha256sums.txt
_dist/*.tar.gz
_dist/release_benchmarks.txt
Contributor:

do we really want to publish benchmark results for the release now? I think this could be useful, but per Max's comments above, if we only compare the benchmark run against the tip of main / base of the branch, then would this actually be used?

Contributor (Author):

Yeah, if we are considering running a benchmark for head every time, then we don't need to run/publish a benchmark in the release.

export GOPATH="$HOME/go"
PATH="$GOPATH/bin:$PATH"
go install golang.org/x/perf/cmd/benchstat@latest
benchstat release_benchmarks.txt pr_benchmarks.txt > benchstat.txt
Contributor:

(chiming in to throw in my 2 cents) I think benchmarking against the head of the branch would be more incrementally useful as we iterate over a change.

i.e. "does my latest commit change the performance in this branch" vs "does my latest commit OR something that was checked into main change the performance in this branch" ?

run: |
export GOPATH="$HOME/go"
PATH="$GOPATH/bin:$PATH"
go install golang.org/x/perf/cmd/benchstat@latest
Contributor:

can we pin the benchstat version and/or expose it as an env var?

Member:

i don't think there are tagged versions. can we specify a digest here? https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
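
For what it's worth, go install can pin an untagged module to a commit hash, which Go resolves to a pseudo-version, so the install is reproducible even without tagged releases (the hash below is illustrative only, not a real pin):

# Pinning benchstat to a specific commit instead of @latest.
go install golang.org/x/perf/cmd/benchstat@9c9101da8316   # illustrative hash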

@@ -317,6 +317,60 @@ jobs:
done
done

benchmark:
Contributor:

how did you test this end to end? (i ask bc I am writing a GH wf too hehe)

Contributor (Author):

I tested this job on my fork by manually creating releases and generating the needed artifacts, then testing the job against two branches in my fork.

Contributor:

i've been using act to test locally too. it helps with the development iteration: https://github.com/nektos/act

- name: Download release benchmark file
uses: robinraju/release-downloader@v1.6
id: get-latest-benchmark
continue-on-error: true
Contributor:

also, could we actually make this wf fail? what were the arguments for this again?

I don't want us to be tricked by the ✅ in the checks summary only to find out that we can't actually download the data that we need.

https://github.com/open-policy-agent/gatekeeper/actions/runs/3680438765/jobs/6226037201

At least an ❌ signals to the author to investigate. Wdyt?

Contributor (Author):

The argument was that merely missing release benchmark data is not enough reason to fail the workflow. But yeah, I think somehow outputting the error data, or indicating "there was an error, please look at the result of the job", would be useful. I think we could post a message on the PR, the same way we output the benchstat data, to indicate the error retrieving past data.
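
As a sketch of that idea (the step id comes from this workflow, but the message and the gh wiring are assumptions; gh needs GH_TOKEN in the environment):

# Post a notice on the PR if the download step failed, instead of silently skipping.
if [ "${{ steps.get-latest-benchmark.outcome }}" = "failure" ]; then
  gh pr comment "${{ github.event.pull_request.number }}" \
    --body "Could not download release benchmark data; benchmark comparison was skipped."
fi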

@JaydipGabani force-pushed the master branch 5 times, most recently from a069f31 to ed53110 on December 15, 2022 at 22:32

- name: Run benchmark with incoming changes
run: |
curl -L -O "https://github.com/kubernetes-sigs/kubebuilder/releases/download/v${KUBEBUILDER_VERSION}/kubebuilder_${KUBEBUILDER_VERSION}_linux_amd64.tar.gz" &&\
@sozercan (Member), Dec 15, 2022:

we can combine these i think and install once

Contributor (Author):

Done

@JaydipGabani force-pushed the master branch 4 times, most recently from 204f507 to cb9fccc on December 16, 2022 at 08:34
@JaydipGabani force-pushed the master branch 2 times, most recently from e0c31de to b88efa4 on December 19, 2022 at 18:03
@JaydipGabani requested review from sozercan and acpana and removed the previous requests for acpana and sozercan on December 19, 2022 at 21:59
Makefile Outdated
@@ -103,6 +105,10 @@ all: lint test manager
native-test:
GO111MODULE=on go test -mod vendor ./pkg/... ./apis/... -race -bench . -coverprofile cover.out

.PHONY: benchmark-test
benchmark-test:
go test ./pkg/... -bench . -run="^#" -count 10 > ${BENCHMARK_FILE_NAME}
Member:

looks like this is not used anymore?

@JaydipGabani (Contributor, Author), Dec 21, 2022:

Yes! I kept it since it's still useful if a dev wants to run it locally.

Member:

can we add GOMAXPROCS=1 here and use this in the GitHub action?

@JaydipGabani force-pushed the master branch 3 times, most recently from 6c393e6 to aefe465 on December 27, 2022 at 23:49
@acpana (Contributor) left a comment:

Thanks for making all the changes and for your patience. Overall this LGTM, with a few questions/open threads!

contents: write
pull-requests: write
steps:
- uses: izhangzhihao/delete-comment@master
Contributor:

Ah I can't recall, why are we removing the comments from the bot here?

Contributor (Author):

Consider this situation: I run the benchmark and get the result on the PR; then I change something, run the benchmark again, and get another result on the PR. Since the benchmark result is really long, the PR can get cluttered easily. And in any case, we only care about the most recent benchmarks, so we aren't losing anything by deleting previous comments from the bot.
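
For the record, the same cleanup could be sketched with the GitHub CLI instead of the pinned action; the marker text used to spot bot comments and the PR_NUMBER variable are assumptions:

# Delete previous bot comments containing the benchmark marker text.
for id in $(gh api "repos/${GITHUB_REPOSITORY}/issues/${PR_NUMBER}/comments" \
    --jq '.[] | select(.body | contains("benchstat")) | .id'); do
  gh api -X DELETE "repos/${GITHUB_REPOSITORY}/issues/comments/${id}"
done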

Contributor:

Would it be good to have the history to see the impact of various changes?

Also, are older comments collapsed-by-default once the threads get large?

Contributor (Author):

I don't have any strong opinions about keeping or not keeping the older stats in the comments.

While testing multiple benchmarks on my fork, I got frustrated by the multiple results and the list of /benchmark comments, so I added the part that deletes previous comments by the bot.

I am fine with either merging this PR as is and following up if we feel we want all the benchmark results on the PR, or updating this PR to keep all the bot comments and following up with deleting comments if need be in the future.

How do folks feel about this?

Contributor:

Either works for me.

@JaydipGabani (Contributor, Author), Jan 6, 2023:

Okay then! Let's merge this, and we'll follow up if needed.

@maxsmythe (Contributor) left a comment:

LGTM, thanks for doing this!

@JaydipGabani force-pushed the master branch 2 times, most recently from 3db14ae to 2e9a237 on January 4, 2023 at 20:40
contents: write
pull-requests: write
steps:
- uses: izhangzhihao/delete-comment@master
Member:

can we pin to a digest?

Contributor (Author):

Done

@JaydipGabani force-pushed the master branch 2 times, most recently from 6dc378e to cf833a9 on January 5, 2023 at 01:19
@JaydipGabani requested a review from sozercan on January 5, 2023 at 18:19
sozercan and others added 6 commits January 6, 2023 18:52
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
Merging this pull request closes: Show benchmark delta for each PR (#692).