-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SIG Scalability Charter #2149
SIG Scalability Charter #2149
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
# Blocking PR merges in the event of regression. | ||
|
||
As mentioned in the charter, SIG scalability has a right to block all PRs | ||
from merging into the relevant repos. This document describes the underlying | ||
"Rules of engagement" of this process and the rationale why this is needed. | ||
|
||
### Rules of engagement. | ||
The rules of engagement for blocking merges are as following: | ||
|
||
- Observe as scalability regression on one of release-blocking test suites. | ||
- Block merges of all PRs. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Block merges of all PRs to the relevant repos, declaring which repos those are and why There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. send out a PR with that |
||
- Identify the PR which caused the regression: | ||
- this can be done by reading code changes, bisecting, debugging based on | ||
metrics and/or logs, etc. | ||
- we say a PR is identified as the cause when we are reasonably confident | ||
that it indeed caused a regression, even if the mechanism is not 100% | ||
understood to minimize the time when merges are blocked | ||
- Mitigate the regression. This may mean e.g.: | ||
- reverting the PR | ||
- switching a feature off (preferably by default, as last resort only in tests) | ||
- fixing the problem (if it's easy and quick to fix) | ||
- Unblock PR merged. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: "Unblock merges of all PRs to the relevant repos" There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. send out a PR with that. |
||
|
||
The exact technical mechanisms for it are out of scope for this document. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If not here then where are these mechanisms documented? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should be hidden in the test-infrastructure (in the past we were talking about opening a bug that is appropriate labeled that merge-bot is looking into). But no matter what the mechanism will be, this is implementation detail and shouldn't be part of this high-level doc. |
||
|
||
### Rationale | ||
The process described above is quite drastic, but we believe it is justified | ||
if we want kubernetes to maintain scalability SLOs. The reasoning is: | ||
- reliably testing for regressions takes a lot of time: | ||
- key scalability e2e tests take too long to execute to be a prerequisite | ||
for merging all PRs, this is an inherent characteristic of testing at scale, | ||
- end-to-end tests are flaky (even when not at scale) requiring retries, | ||
- we need to prevent regression pile-ups: | ||
- once a regression is merged, and no other action is taken, it is only | ||
a matter of time until another regression is merged on top of it, | ||
- debugging the cause of two simultaneous (piled-up) regressions is | ||
exponentially harder, see issue 53255 which links to past experience | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: link to kubernetes/kubernetes#53255 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. send out a PR for that |
||
- we need to keep flakiness of merge-blocking jobs very low: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. what is the threshold? I suspect we are not meeting this today, so I would like to quantify how we're approaching this delta from actual to intended |
||
- regarding benchmarks, there were several scalability issues in the past | ||
caught by (costly) large-scale e2e tests, which could have been caught and | ||
fixed earlier and with far less human effort if we had benchmark-like | ||
tests. Examples include: | ||
- scheduler anti-affinity affecting kube-dns, | ||
- kubelet network plugin increasing pod-startup latency, | ||
- large responses from apiserver violating gRPC MTU. | ||
|
||
As explained in detail in an issue, not being able to maintain passing scalability | ||
tests adversely affect: | ||
- release quality | ||
- release schedule | ||
- engineer productivity |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,98 @@ | ||
# SIG Scalability Charter | ||
|
||
This charter adheres to the conventions described in the [Kubernetes Charter README] | ||
and uses the Roles and Organization Management outlined in [sig-governance]. | ||
|
||
[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md | ||
[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md | ||
|
||
## Scope | ||
|
||
SIG Scalability's primary responsibilities are to define and drive scalability | ||
goals for Kubernetes. This involves defining, testing and measuring performance and | ||
scalability related Service Level Indicators (SLIs) and ensuring that every | ||
Kubernetes release meets Service Level Objectives (SLOs) built on top of those | ||
SLIs. | ||
|
||
We also coordinate and contribute to general system-wide scalability and | ||
performance improvements (that don't fall into the charter of another individual | ||
SIG) by driving large architectural changes and finding bottlenecks, as well as | ||
provide consultations about any scalability and performance related aspects of | ||
Kubernetes. | ||
|
||
### In Scope | ||
|
||
#### Code, Binaries and Services: | ||
|
||
- Scalability and performance testing frameworks. Examples include: | ||
- [Cluster loader](https://github.com/kubernetes/perf-tests/tree/master/clusterloader2) | ||
- [Kubemark](https://github.com/kubernetes/kubernetes/tree/master/cmd/kubemark) | ||
- Scalability and performance tests: | ||
- [Tests](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You also own the jobs that run these tests: https://github.com/kubernetes/test-infra/tree/master/config/jobs/kubernetes/sig-scalability There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
- [Jobs running those](https://github.com/kubernetes/test-infra/tree/master/config/jobs/kubernetes/sig-scalability) | ||
|
||
#### Cross-cutting and Externally Facing Processes | ||
|
||
- Defining what does “Kubernetes scales” mean by defining (or approving) | ||
individual performance SLIs/SLOs, ensuring they are all oriented on user | ||
experience and consistent with each other: | ||
- [SLIs/SLOs](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md) | ||
- Ensuring that each official Kubernetes release satisfies all scalability and | ||
performance related requirements, as stated in "Kubernetes scalability" definition. | ||
- Establishing and documenting best practises on how to design and/or implement | ||
Kubernetes features in scalable and performant way. Educating contributors and | ||
consulting individual designs/implementations to ensure that those are widely used. | ||
Example artifacts: | ||
- [Scalability governance](https://github.com/kubernetes/community/blob/master/sig-scalability/governance) | ||
- Finding system bottlenecks and coordinating improvement on cross-cutting | ||
architectural changes. | ||
|
||
### Out of scope | ||
|
||
- Improving performance/scalability of features falling into charters of | ||
individual SIGs. | ||
|
||
|
||
## What can we do/require from other SIGs | ||
Scalability and performance are horizontal aspects of the system - changes in a | ||
single place of Kubernetes may affect the whole system. As a result, to | ||
effectively ensure Kubernetes scales, we need a special cross-SIG privileges. | ||
|
||
- We can rollback any merged PR if it has been identified as a cause of any | ||
[performance/scalability SLOs] regression (identified by the set of release | ||
blocking scalability/performance tests). The offending PR should only be | ||
merged again after proving to pass tests at scale. | ||
- In the even of a performance regression, we can block all PRs from being | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: event There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed in #2617 (thanks) |
||
merged into the relevant repos until the cause of the regression is | ||
identified and mitigated. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: whitespace There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed in #2617 - thanks |
||
The “Rules of engagement” of pausing merge-queue and rationale for | ||
necessity of its introduce are explained in [a separate doc](./block_merges.md). | ||
- We require significant changes (in terms of impact, such as: update of etcd, | ||
update of Go version, major architectural changes, etc.) may only be merged: | ||
- with an explicit approval from a SIG-scalability tech lead and | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't see your tech leads in sigs.yaml There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Adding now. Following up on that - expect a PR in the next few days. |
||
- after having passed performance testing on biggest supported clusters (unless | ||
found unnecessary by approver) | ||
- We can block a feature from transitioning: | ||
- to Beta status, if (when turned on) it causes violation of already existing | ||
performance/scalability SLOs; | ||
- to GA status, when it can be used scale. That means: | ||
- in rare cases, introducing a new SLI and SLO and ensuring it is met at scale | ||
- in most of cases, extending scalability tests to use it and ensuring that | ||
existing SLOs are still met | ||
- We can require a SIG to introduce a regression-catching benchmark test for a | ||
scalability-critical functionality. | ||
|
||
[performance/scalability SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md | ||
|
||
## Roles and Organization Management | ||
|
||
This sig follows adheres to the Roles and Organization Management outlined in | ||
[sig-governance] and opts-in to updates and modifications to [sig-governance]. | ||
|
||
[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md | ||
|
||
### Subproject Creation | ||
|
||
SIG Scalability delegates subproject approval to Technical Leads. See [Subproject creation - Option 1]. | ||
|
||
[Subproject creation - Option 1]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md#subproject-creation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which suites specifically?
How is a regression observed? The suites have been continually failing, so it doesn't seem like it's a green-to-red transition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's "green-red transition".
The fact that tests are red now and it's not possible, shouldn simply mean that we don't have the right.
We need to transition from green to red to declare a regression.
Clarifying that.