ci: Releasing benchmarks and benchmarking PR #2432
Conversation
Apologies for opening a new PR; the old one got cluttered because of testing, but here it is. Two major comments on the previous PR for context:
Codecov Report: Base: 53.35% // Head: 53.30% // Decreases project coverage by -0.06%.
Additional details and impacted files:
@@ Coverage Diff @@
## master #2432 +/- ##
==========================================
- Coverage 53.35% 53.30% -0.06%
==========================================
Files 116 116
Lines 10270 10270
==========================================
- Hits 5480 5474 -6
- Misses 4370 4374 +4
- Partials 420 422 +2
Flags with carried forward coverage won't be shown.
☔ View full report at Codecov.
@sozercan The GitHub API returns JSON data with unescaped character sequences. Hence, a parser like …
.github/workflows/workflow.yaml
Outdated
export GOPATH="$HOME/go"
PATH="$GOPATH/bin:$PATH"
go install golang.org/x/perf/cmd/benchstat@latest
benchstat release_benchmarks.txt pr_benchmarks.txt > benchstat.txt
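For anyone reproducing this step locally, a minimal dry run might look like the following sketch (the release tag and benchmark name are illustrative, not taken from this PR; the file names match the workflow step above):
# generate the two inputs locally, then compare them
git checkout v3.10.0                                                    # illustrative release tag
go test ./pkg/... -bench . -run="^#" -count 5 > release_benchmarks.txt
git checkout -                                                          # back to the PR branch
go test ./pkg/... -bench . -run="^#" -count 5 > pr_benchmarks.txt
go install golang.org/x/perf/cmd/benchstat@latest
benchstat release_benchmarks.txt pr_benchmarks.txt
# output rows look roughly like:
#   name            old time/op    new time/op    delta
#   BenchmarkFoo-2  1.20ms ± 2%    1.22ms ± 3%    ~     (p=0.413 n=5+5)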
Do we know if the release benchmark was performed on an identical machine/load profile? If not, how much does that impact the accuracy of the comparison? Are we measuring differences in code performance or machine performance?
Yes. Since the release job also runs on ubuntu-latest, the same standard GitHub runner as the benchmark job, we can say we are comparing the code performance of the update (the PR in question) against what is released in the latest version.
I'm not seeing anything about CPU architecture. Are they guaranteed to run on a fixed generation?
The easiest way to eliminate most infra variables would be to run the old and new code next to each other on the same machine (maybe pulling the current build from Dockerhub to save time), though I'm open to other methods.
Maybe we could run the benchmark multiple times (separate runs, so on different machines) to get a sense of the variance and store that?
Maybe we could run the benchmark multiple times (separate runs, so on different machines) to get a sense of the variance and store that?

We are doing this with count=5, but it's on the same machine. Running on different machines would require multiple runners, and then we'd need another runner to combine the results; that seems a bit too complex?
https://github.com/open-policy-agent/gatekeeper/pull/2432/files#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52R110
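For what it's worth, with repeated runs on a single machine benchstat already reports the spread in its ± column; a quick way to see it locally (sample row is illustrative):
go test ./pkg/... -bench . -run="^#" -count 5 > pr_benchmarks.txt
benchstat pr_benchmarks.txt
# name            time/op
# BenchmarkFoo-2  1.21ms ± 3%    <- the ±3% is the variation across the 5 runs on this machine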
We need to have some kind of signal that the performance numbers are apples-to-apples, otherwise we'll spend time trying to diagnose a regression that's really just an architecture difference.
I'm open to whatever gives us confidence this is the case. Unfortunately thread count alone probably isn't sufficient, since not all threads are created equal.
TBH, my preferred order (of alternatives I can think of) would be:
- Run old/new code on the same machine for each test run (I think OPA does this)
- Run old/new code on different machines, with both run several times to get some measure of per-machine variance (an attempt to quantify the likelihood of processor skew)
- Assurances that the hardware/software stack is identical across runs (might be impossible to guarantee, since upgrades are likely/expected, so there will always be the possibility that maintenance has been performed between runs)
I might be missing some alternatives?
Or are we not looking to make conclusions by doing comparative analysis? If not, what are we hoping to use this data for?
One drawback, though: since we are running two benchmarking processes, the time to complete the workflow increases by about 10 minutes on average.
Tests seem to be taking 20+ minutes now, which makes e2e take over twice as long.
@acpana should we decrease count to 5? Anything else we can do?
Perhaps this can be a manual command (like /benchmark)?
+1, we could hide the benchmark job behind a manual command
Definitely running benchmarks less frequently (or manually) could be a good solution here. It will at least let us know if something went wrong.
Done. After this PR is merged, we can run benchmarks on a PR by commenting /benchmark.
i really like this PR! I think it's important to know how our changes impact perf.
i left a couple of call outs and questions 💯
.github/workflows/workflow.yaml
Outdated
with:
  issue-number: ${{ github.event.pull_request.number }}
  body: |
    This PR compares its performance to the latest released version. If it performs significantly lower, consider optimizing your changes to improve the performance.
can we define what significantly lower means? e.g. 5% ns/op delta, etc
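If we do want an automated gate, one option would be to parse benchstat's delta column and fail the step when it exceeds a threshold. A rough sketch, not part of this PR; the 5% value mirrors the example above and assumes the default benchstat text output where regressions show up as a +N.NN% delta:
# fail when any benchmark's time/op grew by more than the threshold
THRESHOLD=5.0   # percent; illustrative value
regressions=$(awk -v t="$THRESHOLD" '
  /%/ {                                   # benchstat rows with a delta contain a percentage
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^\+[0-9.]+%$/) {          # a positive time/op delta means the PR got slower
        d = $i; gsub(/[+%]/, "", d)
        if (d + 0 > t) print $1
      }
    }
  }' benchstat.txt)
if [ -n "$regressions" ]; then
  echo "Benchmarks regressed by more than ${THRESHOLD}%:"
  echo "$regressions"
  exit 1
fi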
And on further thought, could we enhance this action to do some processing of the results? Like showing which benches have changed meaningfully?
As I read this right now, do we post the whole benchmark output as a comment? That may be a bit too much.
Let me look into this and the capabilities of benchstat to see if this is possible or not.
Good point on defining delta. Never gave that a thought! Good catch
What do you think the delta should be? Do you know any tools that could process the benchstat output to prettify it? As of now, we are posting the complete result of benchstat on PR.
We can pick a value to start with and then adjust as needed. But if we are only running this job through the comment command /benchmark, then maybe we don't need to set a threshold and can just rely on one-off analyses from the PR authors & reviewers.
Makefile
Outdated
@@ -103,6 +105,10 @@ all: lint test manager
native-test:
	GO111MODULE=on go test -mod vendor ./pkg/... ./apis/... -race -bench . -coverprofile cover.out

.PHONY: benchmark-test
benchmark-test:
	go test ./pkg/... -bench . -run="^#" -count 5 > ${BENCHMARK_FILE_NAME}
3 callouts here:
- From my experience benchmarking, count=5 can yield statistically insignificant results; I usually use count=10,20 (some multiple of 10).
- The default timeout here is 10 min but the GH action sets a timeout of 30 minutes; for debugging this may cause some confusion. Let's use the same timeout value in both places if that makes sense.
- I think it's best practice when benchmarking to set GOMAXPROCS here to decrease hardware variance. This is along the lines of Max's thoughts above. In particular for this cmd, let's set this to GOMAXPROCS=1 go test ..., or some env var that we control, but let's set it so we have more consistent results across hardware (see the sketch below).
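A sketch of the resulting command under those suggestions (the count and timeout values are just the ones mentioned above, not a final choice):
# pin GOMAXPROCS so results are less sensitive to the runner's core count,
# and keep the go test timeout in sync with the GH action's 30-minute job timeout
GOMAXPROCS=1 go test ./pkg/... -bench . -run="^#" -count 10 -timeout 30m > "${BENCHMARK_FILE_NAME}"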
Ohh thanks for this input!
.github/workflows/workflow.yaml
Outdated
@@ -466,6 +530,7 @@ jobs:
files: |
  _dist/sha256sums.txt
  _dist/*.tar.gz
  _dist/release_benchmarks.txt
Do we really want to publish benchmark results for the release now? I think this could be useful, but per Max's comments above, if we only compare the benchmark run against the tip of main / base of the branch, would this actually be used?
Yeah, if we are considering running a benchmark for head every time, then we don't need to run/publish benchmarks in the release.
.github/workflows/workflow.yaml
Outdated
export GOPATH="$HOME/go"
PATH="$GOPATH/bin:$PATH"
go install golang.org/x/perf/cmd/benchstat@latest
benchstat release_benchmarks.txt pr_benchmarks.txt > benchstat.txt
(Chiming in to throw in my 2 cents.) I think benchmarking against the head of the branch would be more incrementally useful as we iterate over a change.
i.e. "does my latest commit change the performance in this branch" vs "does my latest commit OR something that was checked into main change the performance in this branch"?
.github/workflows/workflow.yaml
Outdated
run: |
  export GOPATH="$HOME/go"
  PATH="$GOPATH/bin:$PATH"
  go install golang.org/x/perf/cmd/benchstat@latest
can we pin the benchstat version and/or expose it as an env var?
i don't think there are tagged versions. can we specify a digest here? https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
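Go modules can't install by an image-style digest, but a specific commit can be pinned via its pseudo-version; a sketch of that approach (the pseudo-version below is a placeholder, not a real one, and the BENCHSTAT_VERSION variable name is made up):
# resolve a commit of golang.org/x/perf to a pseudo-version once, e.g.:
#   go list -m golang.org/x/perf@<commit-sha>
# then pin the install to it and surface it as a variable:
BENCHSTAT_VERSION="v0.0.0-20230101000000-abcdef123456"   # placeholder, not a real version
go install "golang.org/x/perf/cmd/benchstat@${BENCHSTAT_VERSION}"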
.github/workflows/workflow.yaml
Outdated
@@ -317,6 +317,60 @@ jobs:
        done
      done

  benchmark:
how did you test this end to end? (i ask bc I am writing a GH wf too hehe)
I tested this job on my fork by manually creating releases and generating the needed artifacts, then testing the job against two branches in my fork.
i've been using act to test locally too. it helps w/ the development iteration: https://github.com/nektos/act
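For reference, a typical act invocation to exercise just one job locally might look like this (the job name and event are illustrative, not taken from this workflow):
# simulate a pull_request event and run only the benchmark job from the workflow file
act pull_request -j benchmark -W .github/workflows/workflow.yaml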
.github/workflows/workflow.yaml
Outdated
- name: Download release benchmark file
  uses: robinraju/release-downloader@v1.6
  id: get-latest-benchmark
  continue-on-error: true
also, could we actually make this wf fail? what were the arguments for this again?
I don't want us to be tricked by the ✅ in the checks summary only to find out that we can't actually download the data that we need.
https://github.com/open-policy-agent/gatekeeper/actions/runs/3680438765/jobs/6226037201
At least an ❌ signals to the author to investigate. Wdyt?
The argument was that missing release benchmark data alone is not enough reason to fail the workflow. But yeah, I think somehow outputting the error data, or indicating "there was an error, please look at the result of the job", would be useful. We could post that message on the PR the same way we are outputting the benchstat data, to indicate the error in retrieving past data.
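One possible shape for that (a sketch only; the step id matches the continue-on-error step above, while the message and file name are illustrative):
# a later run: step can branch on the recorded outcome of the download step
if [ "${{ steps.get-latest-benchmark.outcome }}" != "success" ]; then
  echo "Could not download release benchmark data; showing PR benchmarks only." > benchstat.txt
fi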
.github/workflows/workflow.yaml
Outdated
- name: Run benchmark with incoming changes
  run: |
    curl -L -O "https://github.com/kubernetes-sigs/kubebuilder/releases/download/v${KUBEBUILDER_VERSION}/kubebuilder_${KUBEBUILDER_VERSION}_linux_amd64.tar.gz" &&\
we can combine these i think and install once
Done
Makefile
Outdated
@@ -103,6 +105,10 @@ all: lint test manager
native-test:
	GO111MODULE=on go test -mod vendor ./pkg/... ./apis/... -race -bench . -coverprofile cover.out

.PHONY: benchmark-test
benchmark-test:
	go test ./pkg/... -bench . -run="^#" -count 10 > ${BENCHMARK_FILE_NAME}
looks like this is not used anymore?
Yes! I kept it since it's still useful if a dev wants to run it locally.
can we add GOMAXPROCS=1 here and use this in the github action
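A sketch of how that might look, so the GitHub Action can simply reuse the target instead of duplicating the go test flags (values illustrative, not the final wording of the change):
# Makefile recipe with GOMAXPROCS pinned inside the target:
#   benchmark-test:
#   	GOMAXPROCS=1 go test ./pkg/... -bench . -run="^#" -count 10 > ${BENCHMARK_FILE_NAME}
# the workflow step can then just run:
make benchmark-test BENCHMARK_FILE_NAME=pr_benchmarks.txt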
Thanks for making all the changes and for your patience. This overall LGTM w a few questions/ open threads!
.github/workflows/benchmark.yaml
Outdated
  contents: write
  pull-requests: write
steps:
  - uses: izhangzhihao/delete-comment@master
Ah I can't recall, why are we removing the comments from the bot here?
Consider this situation: I ran the benchmark and got the result on the PR, then changed something, ran the benchmark again, and got another result on the PR. Since the benchmark result is really long, the PR can get cluttered easily. And in any case, we only care about the most recent benchmarks, so we aren't losing anything by deleting previous comments from the bot.
Would it be good to have the history to see the impact of various changes?
Also, are older comments collapsed-by-default once the threads get large?
I don't have any strong opinions about keeping or not keeping the older stats in the comments.
While testing multiple benchmarks on my fork, I got frustrated by the multiple results and the list of /benchmark comments, so I added the part that deletes previous comments by the bot.
I am fine with either merging this PR as is and following up afterward if we feel we want all the benchmark results on the PR, or updating this PR to keep all the bot comments and then following up with deleting comments later if need be.
How do folks feel about this?
Either works for me.
Okay then! Let's merge this and will follow up if needed.
LGTM, thanks for doing this!
.github/workflows/benchmark.yaml
Outdated
  contents: write
  pull-requests: write
steps:
  - uses: izhangzhihao/delete-comment@master
can we pin to a digest?
Done
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
What this PR does / why we need it:
Releasing benchmark stats with each release and comparing benchmark data from each PR against the latest released benchmark stats.
Which issue(s) this PR fixes (optional, using fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when the PR gets merged):
Fixes #692
Special notes for your reviewer: