diff --git a/sig-scalability/block_merges.md b/sig-scalability/block_merges.md new file mode 100644 index 00000000000..d57c68a1703 --- /dev/null +++ b/sig-scalability/block_merges.md @@ -0,0 +1,51 @@ +# Blocking PR merges in the event of regression. + +As mentioned in the charter, SIG scalability has a right to block all PRs +from merging into the relevant repos. This document describes the underlying +"Rules of engagement" of this process and the rationale why this is needed. + +### Rules of engagement. +The rules of engagement for blocking merges are as following: + +- Observe as scalability regression on one of release-blocking test suites. +- Block merges of all PRs. +- Identify the PR which caused the regression: + - this can be done by reading code changes, bisecting, debugging based on + metrics and/or logs, etc. + - we say a PR is identified as the cause when we are reasonably confident + that it indeed caused a regression, even if the mechanism is not 100% + understood to minimize the time when merges are blocked +- Mitigate the regression. This may mean e.g.: + - reverting the PR + - switching a feature off (preferably by default, as last resort only in tests) + - fixing the problem (if it's easy and quick to fix) +- Unblock PR merged. + +The exact technical mechanisms for it are out of scope for this document. + +### Rationale +The process described above is quite drastic, but we believe it is justified +if we want kubernetes to maintain scalability SLOs. The reasoning is: +- reliably testing for regressions takes a lot of time: + - key scalability e2e tests take too long to execute to be a prerequisite + for merging all PRs, this is an inherent characteristic of testing at scale, + - end-to-end tests are flaky (even when not at scale) requiring retries, +- we need to prevent regression pile-ups: + - once a regression is merged, and no other action is taken, it is only + a matter of time until another regression is merged on top of it, + - debugging the cause of two simultaneous (piled-up) regressions is + exponentially harder, see issue 53255 which links to past experience +- we need to keep flakiness of merge-blocking jobs very low: +- regarding benchmarks, there were several scalability issues in the past + caught by (costly) large-scale e2e tests, which could have been caught and + fixed earlier and with far less human effort if we had benchmark-like + tests. Examples include: + - scheduler anti-affinity affecting kube-dns, + - kubelet network plugin increasing pod-startup latency, + - large responses from apiserver violating gRPC MTU. + +As explained in detail in an issue, not being able to maintain passing scalability +tests adversely affect: +- release quality +- release schedule +- engineer productivity diff --git a/sig-scalability/charter.md b/sig-scalability/charter.md new file mode 100644 index 00000000000..acb051d3f12 --- /dev/null +++ b/sig-scalability/charter.md @@ -0,0 +1,98 @@ +# SIG Scalability Charter + +This charter adheres to the conventions described in the [Kubernetes Charter README] +and uses the Roles and Organization Management outlined in [sig-governance]. + +[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md +[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md + +## Scope + +SIG Scalability's primary responsibilities are to define and drive scalability +goals for Kubernetes. This involves defining, testing and measuring performance and +scalability related Service Level Indicators (SLIs) and ensuring that every +Kubernetes release meets Service Level Objectives (SLOs) built on top of those +SLIs. + +We also coordinate and contribute to general system-wide scalability and +performance improvements (that don't fall into the charter of another individual +SIG) by driving large architectural changes and finding bottlenecks, as well as +provide consultations about any scalability and performance related aspects of +Kubernetes. + +### In Scope + +#### Code, Binaries and Services: + +- Scalability and performance testing frameworks. Examples include: + - [Cluster loader](https://github.com/kubernetes/perf-tests/tree/master/clusterloader2) + - [Kubemark](https://github.com/kubernetes/kubernetes/tree/master/cmd/kubemark) +- Scalability and performance tests: + - [Tests](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/) + - [Jobs running those](https://github.com/kubernetes/test-infra/tree/master/config/jobs/kubernetes/sig-scalability) + +#### Cross-cutting and Externally Facing Processes + +- Defining what does “Kubernetes scales” mean by defining (or approving) +individual performance SLIs/SLOs, ensuring they are all oriented on user +experience and consistent with each other: + - [SLIs/SLOs](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md) +- Ensuring that each official Kubernetes release satisfies all scalability and +performance related requirements, as stated in "Kubernetes scalability" definition. +- Establishing and documenting best practises on how to design and/or implement +Kubernetes features in scalable and performant way. Educating contributors and +consulting individual designs/implementations to ensure that those are widely used. +Example artifacts: + - [Scalability governance](https://github.com/kubernetes/community/blob/master/sig-scalability/governance) +- Finding system bottlenecks and coordinating improvement on cross-cutting +architectural changes. + +### Out of scope + +- Improving performance/scalability of features falling into charters of +individual SIGs. + + +## What can we do/require from other SIGs +Scalability and performance are horizontal aspects of the system - changes in a +single place of Kubernetes may affect the whole system. As a result, to +effectively ensure Kubernetes scales, we need a special cross-SIG privileges. + +- We can rollback any merged PR if it has been identified as a cause of any + [performance/scalability SLOs] regression (identified by the set of release + blocking scalability/performance tests). The offending PR should only be + merged again after proving to pass tests at scale. +- In the even of a performance regression, we can block all PRs from being + merged into the relevant repos until the cause of the regression is + identified and mitigated. + The “Rules of engagement” of pausing merge-queue and rationale for + necessity of its introduce are explained in [a separate doc](./block_merges.md). +- We require significant changes (in terms of impact, such as: update of etcd, + update of Go version, major architectural changes, etc.) may only be merged: + - with an explicit approval from a SIG-scalability tech lead and + - after having passed performance testing on biggest supported clusters (unless + found unnecessary by approver) +- We can block a feature from transitioning: + - to Beta status, if (when turned on) it causes violation of already existing + performance/scalability SLOs; + - to GA status, when it can be used scale. That means: + - in rare cases, introducing a new SLI and SLO and ensuring it is met at scale + - in most of cases, extending scalability tests to use it and ensuring that + existing SLOs are still met +- We can require a SIG to introduce a regression-catching benchmark test for a + scalability-critical functionality. + +[performance/scalability SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md + +## Roles and Organization Management + +This sig follows adheres to the Roles and Organization Management outlined in +[sig-governance] and opts-in to updates and modifications to [sig-governance]. + +[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md + +### Subproject Creation + +SIG Scalability delegates subproject approval to Technical Leads. See [Subproject creation - Option 1]. + +[Subproject creation - Option 1]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md#subproject-creation