As mentioned in the charter, SIG Scalability has the right to block all PRs from merging into the relevant repos. This document describes the underlying "rules of engagement" of this process and the rationale for why it is needed.
The rules of engagement for blocking merges are as follows:
- Observe a scalability regression on one of the release-blocking test suites (defined as a green-to-red transition - if the tests were already failing, we do not have the right to declare a regression).
- Block merges of all PRs to the relevant repos in the affected branch, declaring which repos those are and why.
- Identify the PR which caused the regression:
- this can be done by reading code changes, bisecting (see the illustrative sketch below), debugging based on metrics and/or logs, etc.
- we consider a PR identified as the cause once we are reasonably confident that it indeed caused the regression, even if the mechanism is not fully understood; this keeps the window during which merges are blocked as short as possible
- Mitigate the regression. This may mean, for example:
- reverting the PR
- switching a feature off (preferably by default; as a last resort, only in tests)
- fixing the problem (if it's easy and quick to fix)
- Unblock merges of all PRs to the relevant repos in the affected branch.
The exact technical mechanisms for blocking and unblocking merges are out of scope for this document.
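As a rough illustration of the bisection approach mentioned above, the sketch below binary-searches a chronologically ordered list of merge commits for the first one at which the suite turns red. The commit list, the `run-scalability-suite.sh` entry point, and the surrounding setup are hypothetical placeholders, not part of this process; in practice each scalability run is expensive, so the search is usually narrowed first by reading code changes, metrics, or logs.

```python
import subprocess

# Hypothetical command that exits 0 when the scalability suite passes at the
# currently checked-out commit; substitute the real job or test entry point.
TEST_CMD = ["./run-scalability-suite.sh"]


def passes(commit: str) -> bool:
    """Check out `commit` and run the (assumed) scalability test command."""
    subprocess.run(["git", "checkout", "--quiet", commit], check=True)
    return subprocess.run(TEST_CMD).returncode == 0


def find_culprit(commits: list[str]) -> str:
    """Binary-search a chronologically ordered list of merge commits.

    Assumes commits[0] is known-good (green) and commits[-1] is known-bad
    (red); returns the first commit at which the suite starts failing.
    """
    lo, hi = 0, len(commits) - 1  # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]
```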
The process described above is quite drastic, but we believe it is justified if we want Kubernetes to maintain its scalability SLOs. The reasoning is:
- reliably testing for regressions takes a lot of time:
- key scalability e2e tests take too long to execute to be a prerequisite for merging all PRs; this is an inherent characteristic of testing at scale,
- end-to-end tests are flaky (even when not run at scale), requiring retries,
- we need to prevent regression pile-ups:
- once a regression is merged, and no other action is taken, it is only a matter of time until another regression is merged on top of it,
- debugging the cause of two simultaneous (piled-up) regressions is exponentially harder; see issue 53255, which links to past experience
- we need to keep flakiness of merge-blocking jobs very low:
- regarding benchmarks: several scalability issues in the past were caught by (costly) large-scale e2e tests, but could have been caught and fixed earlier, and with far less human effort, if benchmark-like tests had been in place. Examples include:
- scheduler anti-affinity affecting kube-dns,
- kubelet network plugin increasing pod-startup latency,
- large responses from apiserver violating gRPC MTU.
As explained in detail in an issue, not being able to maintain passing scalability tests adversely affects:
- release quality
- release schedule
- engineer productivity