From cc370d2d87a65d2e80b4ab2caec76c938005aa94 Mon Sep 17 00:00:00 2001
From: wojtekt
Date: Mon, 14 May 2018 13:21:21 +0200
Subject: [PATCH 1/2] SIG Scalability Charter

---
 sig-scalability/charter.md | 170 +++++++++++++++++++++++++++++++++++++
 1 file changed, 170 insertions(+)
 create mode 100644 sig-scalability/charter.md

diff --git a/sig-scalability/charter.md b/sig-scalability/charter.md
new file mode 100644
index 00000000000..fabd7ab6972
--- /dev/null
+++ b/sig-scalability/charter.md
@@ -0,0 +1,170 @@
+# SIG Scalability Charter
+
+## Mission
+The SIG Scalability helps to define scalability goals for Kubernetes, ensures
+that they all play well together and ensures that every Kubernetes release meets
+them by measuring performance and scalability indicators and publishing the
+results.
+
+We also coordinate and contribute to general system-wide scalability and
+performance improvements (that don’t fall into the charter of another individual
+SIG) as well as provide consultations about any scalability and performance
+related aspects of Kubernetes.
+
+## What can we do/require from other SIGs
+Scalability and performance are horizontal aspects of the system - changes in a
+single place of Kubernetes may affect the whole system. As a result, to
+effectively ensure Kubernetes scales, we need a special cross-SIG privileges.
+
+- We can rollback any merged PR if it has been identified as a cause of any
+  [performance/scalability SLOs] regression. The offending PR should only be
+  merged again after proving to pass tests at scale.
+- We can pause the merge queue in case of a regression observed until a particular
+  PR has been identified as cause of the regression and regression has been
+  mitigated. The “Rules of engagement” of pausing merge-queue and rationale for
+  necessity of its introduce are explained in a separate doc.
+  TODO(wojtek-t, shyamjvs): Write it down and link here.
+- We require significant changes (in terms of impact, such as: update of etcd,
+  update of Go version, major architectural changes, etc.) may only be merged:
+  - with an explicit approval from a [SIG-scalability approver](#sig-scalability-approvers)
+    and
+  - after having passed performance testing on biggest supported clusters (unless
+    found unnecessary by scalability approver)
+- We can block a feature from transitioning to Beta status if (when turned on) it
+  causes a significant degradation of overall Kubernetes scalability/performance.
+  (Ideally it would be “SLI degradation of more than X%” or “breaking SLO”, but
+  initially it may also be SIG-scalability decision based on public test results).
+- We can block a feature from transitioning to GA status if it cannot be used at
+  scale.
+- We can require a SIG to introduce a regression-catching benchmark test for a
+  scalability-critical functionality.
+
+For the record, by regression above we mean a regression identified by the set
+of release-blocking scalability/performance tests (as defined by
+sig-release-master-blocking group of test suites).
+
+[performance/scalability SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md
+
+## SIG Values
+
+- We are NOT firefighters, we are fire-prevention specialists.
+- We promote deep technical understanding of the Kubernetes system and our tools.
+- We strive to eliminate toil.
+- We work towards building a scalable Kubernetes even in face of superlinear growth
+  of number of contributors.
+
+## Scope and subprojects
+The scope of SIG Scalability covers all aspects of Kubernetes scalability and
+performance. However, all issues that fully fall under a single SIG are implicitly
+delegated to that SIG.
+
+SIG scalability subprojects are as follows.
+
+| Subproject | Description | Example Artifacts | OWNERS |
+| --- | --- | --- | --- |
+| Kubernetes scalability | Defining what does it mean that “Kubernetes scales”. This includes defining (or approving) individual performance SLIs/SLOs, ensuring they are all oriented on user experience and consistent with each other. | [SLIs/SLOs] | [OWNERS](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/OWNERS) |
+| Kubernetes performance validation | Ensuring that each official Kubernetes release satisfies all scalability and performance related requirements, as state in “Kubernetes scalability” definition | [1.9 validation report] | TODO |
+| Scalability testing frameworks | Designing and creating frameworks to make scalability and performance testing of Kubernetes easy and available for all contributors. Different frameworks may help in different aspect of scalability testing enabling making conscious tradeoffs, e.g. cost of accuracy or real life vs more generalized benchmarking scenarios. | [Cluster loader] | [OWNERS](https://github.com/kubernetes/perf-tests/blob/master/OWNERS) [OWNERS](https://github.com/kubernetes/kubernetes/blob/master/test/kubemark/OWNERS) |
+| Scalability and performance tests | Ensuring that all tests necessary to validate Kubernetes scalability and performance exist (ideally by providing easy-to-use framework and working with SIGs to provide them) have the environment and resources to run on and are being executed according to calendar enabling release validation. | [Scalability e2e tests] | [OWNERS](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/OWNERS) |
+| Scalability governance | Establishing and documenting best practises on how to design and/or implement Kubernetes features in scalable and performant way. Educating contributors and ensuring those are widely used. | [Regressions case study] | [OWNERS](https://github.com/kubernetes/community/blob/master/sig-scalability/governance/OWNERS) |
+
+TODO: Figure out if we need subproject for finding bottlenecks, coordinating
+improvements and architectural changes, etc.
+
+[SLIs/SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md
+[1.9 validation report]: https://github.com/kubernetes/sig-release/blob/master/releases/release-1.9/scalability_validation_report.md
+[Cluster loader]: https://github.com/kubernetes/perf-tests/tree/master/clusterloader
+[Scalability e2e tests]: https://github.com/kubernetes/kubernetes/tree/master/test/e2e/scalability
+[Regressions case study]: https://github.com/kubernetes/community/blob/master/sig-scalability/blogs/scalability-regressions-case-studies.md
+
+## Roles
+The following roles are required for the SIG to properly function.
+In the event that any role is unfilled, the SIG will make a best effort
+to fill it and any decisions reliant on a missing role will be postponed
+until the role is filled.
+
+### Chair
+- Number: 2-3
+- Run operations and processes governing the SIG
+- A majority of chairs cannot be from a single company.
+- An initial set of chairs was established at the time the SIG was founded as:
+  Wojciech Tyczynski and Bob Wise.
+- Chairs may decide to step down and propose a replacement, who must be approved
+  by all other chairs.
+- Chairs may select additional chairs by consensus.
+- Chairs may be removed by consensus of other Chairs and Technical Leads if not
+  proactively working with other Chairs to fulfill responsibilities.
+
+### Technical Lead
+- Number: 2-3
+- Establish new subprojects and retire existing ones
+- Resolve cross-subprojects technical issues and decisions and escalations from
+  subprojects.
+- Decision making must be by consensus.
+- An initial set of technical leads was set to long-standing group of SIG leads:
+  Wojciech Tyczynski and Bob Wise.
+- Technical leads must have demonstrated deep understanding of the whole system
+  that is sufficient to assess impact of different changes on Kubernetes scalability.
+- Technical leads must remain active in the role and are automatically removed
+  from the position if they are unresponsive for >3 months.
+- Technical leads may decide to step down at anytime and propose a replacement,
+  who must be approved by all of the other technical leads.
+- TODO: Diversity across companies?
+
+### Subproject owners
+- Number: at least 2
+- The initial owners should be established at subproject founding from relevant
+  OWNERS file wherever possible.
+- Owners must be an escalation point for technical discussions and decisions within
+  the subproject.
+- Owners must set milestone priorities for their subprojects.
+- Owners must remain active in the role and are automatically removed from the
+  position if they are unresponsive for >3 months and may be removed by consensus
+  of the other subproject owners and all of the Technical loeads if not proactively
+  working to fulfill responsibilities.
+- Owners may decide to step down at any time and propose replacement. Accepting
+  replacement will be done by lazy-consensus from other subproject owners.
+- Owners may select additional subproject owners through a super-majority vote
+  amongst subproject owners.
+
+### SIG Scalability approvers
+- Number: at least 3
+- Approve significant changes (in terms of potential impact, e.g. major architectural
+  changes, upgrades of etcd or Go version) from scalability perspective.
+- An initial set of approvers was set to:
+  - Bob Wise
+  - Clayton Coleman
+  - Jordan Liggitt
+  - Shyam Jeedigunta
+  - Wojciech Tyczynski
+
+## Organizational management
+- Six months after this charter is first ratified, it must be reviewed and
+  re-approved by the SIG in order to evaluate the assumptions made in its initial
+  drafting.
+- SIG meets bi-weekly on zoom with agenda in meeting nodes and should be
+  facilitated by chair unless delegated.
+
+## Project management
+
+### Subproject creation
+The initial set of subprojects owned by the SIG is defined above.
+- New subprojects must be approved by consensus of SIG Technical Leads.
+
+### Subproject retirement
+Subprojects may be retired, when they are no longer supported based on the
+following criteria:
+- A subproject is no longer supported when there are no active owners with
+  activity on the project:
+  - for >3 months for subprojects with no known users
+  - for >6 months for subprojects with known users after providing at least
+    6 months notification
+- Consensus amongst Technical Leads should be done to decide about retirement.
+
+### Technical processes
+- Decisions within the scope of individual subprojects should be made by lazy
+  consensus by subproject owners; if a decision can’t be made, it should be
+  escalated to the SIG Technical leads.
+- Issues impacting multiple subprojects in the SIG should be resolved by
+  consensus of the owners of the involved subprojects; if a decision can’t be
+  made, it should be escalated to the SIG Technical leads.

From b6ea6539848054d838b4d9058565e0b34337133a Mon Sep 17 00:00:00 2001
From: wojtekt
Date: Thu, 23 Aug 2018 12:07:52 +0200
Subject: [PATCH 2/2] Apply template

---
 sig-scalability/block_merges.md |  51 +++++++
 sig-scalability/charter.md      | 228 +++++++++++---------------------
 2 files changed, 129 insertions(+), 150 deletions(-)
 create mode 100644 sig-scalability/block_merges.md

diff --git a/sig-scalability/block_merges.md b/sig-scalability/block_merges.md
new file mode 100644
index 00000000000..d57c68a1703
--- /dev/null
+++ b/sig-scalability/block_merges.md
@@ -0,0 +1,51 @@
+# Blocking PR merges in the event of regression.
+
+As mentioned in the charter, SIG scalability has a right to block all PRs
+from merging into the relevant repos. This document describes the underlying
+"Rules of engagement" of this process and the rationale for why it is needed.
+
+### Rules of engagement.
+The rules of engagement for blocking merges are as follows:
+
+- Observe a scalability regression on one of the release-blocking test suites.
+- Block merges of all PRs.
+- Identify the PR which caused the regression:
+  - this can be done by reading code changes, bisecting, debugging based on
+    metrics and/or logs, etc.
+  - we say a PR is identified as the cause when we are reasonably confident
+    that it indeed caused a regression, even if the mechanism is not 100%
+    understood, to minimize the time when merges are blocked
+- Mitigate the regression. This may mean e.g.:
+  - reverting the PR
+  - switching a feature off (preferably by default, as last resort only in tests)
+  - fixing the problem (if it's easy and quick to fix)
+- Unblock PR merges.
+
+The exact technical mechanisms for it are out of scope for this document.
+
+### Rationale
+The process described above is quite drastic, but we believe it is justified
+if we want Kubernetes to maintain scalability SLOs. The reasoning is:
+- reliably testing for regressions takes a lot of time:
+  - key scalability e2e tests take too long to execute to be a prerequisite
+    for merging all PRs; this is an inherent characteristic of testing at scale,
+  - end-to-end tests are flaky (even when not at scale), requiring retries,
+- we need to prevent regression pile-ups:
+  - once a regression is merged, and no other action is taken, it is only
+    a matter of time until another regression is merged on top of it,
+  - debugging the cause of two simultaneous (piled-up) regressions is
+    exponentially harder, see issue 53255 which links to past experience
+- we need to keep flakiness of merge-blocking jobs very low:
+- regarding benchmarks, there were several scalability issues in the past
+  caught by (costly) large-scale e2e tests, which could have been caught and
+  fixed earlier and with far less human effort if we had benchmark-like
+  tests. Examples include:
+  - scheduler anti-affinity affecting kube-dns,
+  - kubelet network plugin increasing pod-startup latency,
+  - large responses from apiserver violating gRPC MTU.
+
+As explained in detail in an issue, not being able to maintain passing scalability
+tests adversely affects:
+- release quality
+- release schedule
+- engineer productivity
diff --git a/sig-scalability/charter.md b/sig-scalability/charter.md
index fabd7ab6972..acb051d3f12 100644
--- a/sig-scalability/charter.md
+++ b/sig-scalability/charter.md
@@ -1,15 +1,57 @@
 # SIG Scalability Charter
 
-## Mission
-The SIG Scalability helps to define scalability goals for Kubernetes, ensures
-that they all play well together and ensures that every Kubernetes release meets
-them by measuring performance and scalability indicators and publishing the
-results.
+This charter adheres to the conventions described in the [Kubernetes Charter README]
+and uses the Roles and Organization Management outlined in [sig-governance].
+
+[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md
+[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md
+
+## Scope
+
+SIG Scalability's primary responsibilities are to define and drive scalability
+goals for Kubernetes. This involves defining, testing and measuring performance and
+scalability related Service Level Indicators (SLIs) and ensuring that every
+Kubernetes release meets Service Level Objectives (SLOs) built on top of those
+SLIs.
 
 We also coordinate and contribute to general system-wide scalability and
-performance improvements (that don’t fall into the charter of another individual
-SIG) as well as provide consultations about any scalability and performance
-related aspects of Kubernetes.
+performance improvements (that don't fall into the charter of another individual
+SIG) by driving large architectural changes and finding bottlenecks, as well as
+provide consultations about any scalability and performance related aspects of
+Kubernetes.
+
+### In Scope
+
+#### Code, Binaries and Services:
+
+- Scalability and performance testing frameworks. Examples include:
+  - [Cluster loader](https://github.com/kubernetes/perf-tests/tree/master/clusterloader2)
+  - [Kubemark](https://github.com/kubernetes/kubernetes/tree/master/cmd/kubemark)
+- Scalability and performance tests:
+  - [Tests](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/)
+  - [Jobs running those](https://github.com/kubernetes/test-infra/tree/master/config/jobs/kubernetes/sig-scalability)
+
+#### Cross-cutting and Externally Facing Processes
+
+- Defining what “Kubernetes scales” means by defining (or approving)
+individual performance SLIs/SLOs, ensuring they are all oriented toward user
+experience and consistent with each other:
+  - [SLIs/SLOs](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md)
+- Ensuring that each official Kubernetes release satisfies all scalability and
+performance related requirements, as stated in the "Kubernetes scalability" definition.
+- Establishing and documenting best practices on how to design and/or implement
+Kubernetes features in a scalable and performant way. Educating contributors and
+consulting on individual designs/implementations to ensure that those are widely used.
+Example artifacts:
+  - [Scalability governance](https://github.com/kubernetes/community/blob/master/sig-scalability/governance)
+- Finding system bottlenecks and coordinating improvements on cross-cutting
+architectural changes.
+
+### Out of scope
+
+- Improving performance/scalability of features falling into charters of
+individual SIGs.
+
 
 ## What can we do/require from other SIGs
 Scalability and performance are horizontal aspects of the system - changes in a
@@ -17,154 +59,40 @@ single place of Kubernetes may affect the whole system. As a result, to
 effectively ensure Kubernetes scales, we need a special cross-SIG privileges.
 
 - We can rollback any merged PR if it has been identified as a cause of any
-  [performance/scalability SLOs] regression. The offending PR should only be
+  [performance/scalability SLOs] regression (identified by the set of
+  release-blocking scalability/performance tests). The offending PR should only be
   merged again after proving to pass tests at scale.
-- We can pause the merge queue in case of a regression observed until a particular
-  PR has been identified as cause of the regression and regression has been
-  mitigated. The “Rules of engagement” of pausing merge-queue and rationale for
-  necessity of its introduce are explained in a separate doc.
-  TODO(wojtek-t, shyamjvs): Write it down and link here.
+- In the event of a performance regression, we can block all PRs from being
+  merged into the relevant repos until the cause of the regression is
+  identified and mitigated.
+  The “Rules of engagement” for pausing the merge queue and the rationale for
+  introducing it are explained in [a separate doc](./block_merges.md).
 - We require significant changes (in terms of impact, such as: update of etcd,
   update of Go version, major architectural changes, etc.) may only be merged:
-  - with an explicit approval from a [SIG-scalability approver](#sig-scalability-approvers)
-    and
+  - with an explicit approval from a SIG-scalability tech lead and
   - after having passed performance testing on biggest supported clusters (unless
-    found unnecessary by scalability approver)
-- We can block a feature from transitioning to Beta status if (when turned on) it
-  causes a significant degradation of overall Kubernetes scalability/performance.
-  (Ideally it would be “SLI degradation of more than X%” or “breaking SLO”, but
-  initially it may also be SIG-scalability decision based on public test results).
-- We can block a feature from transitioning to GA status if it cannot be used at
-  scale.
+    found unnecessary by approver)
+- We can block a feature from transitioning:
+  - to Beta status, if (when turned on) it causes a violation of existing
+    performance/scalability SLOs;
+  - to GA status, unless it can be used at scale. That means:
+    - in rare cases, introducing a new SLI and SLO and ensuring it is met at scale
+    - in most cases, extending scalability tests to use it and ensuring that
+      existing SLOs are still met
 - We can require a SIG to introduce a regression-catching benchmark test for a
   scalability-critical functionality.
 
-For the record, by regression above we mean a regression identified by the set
-of release-blocking scalability/performance tests (as defined by
-sig-release-master-blocking group of test suites).
-
 [performance/scalability SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md
 
-## SIG Values
-
-- We are NOT firefighters, we are fire-prevention specialists.
-- We promote deep technical understanding of the Kubernetes system and our tools.
-- We strive to eliminate toil.
-- We work towards building a scalable Kubernetes even in face of superlinear growth
-  of number of contributors.
-
-## Scope and subprojects
-The scope of SIG Scalability covers all aspects of Kubernetes scalability and
-performance. However, all issues that fully fall under a single SIG are implicitly
-delegated to that SIG.
-
-SIG scalability subprojects are as follows.
-
-| Subproject | Description | Example Artifacts | OWNERS |
-| --- | --- | --- | --- |
-| Kubernetes scalability | Defining what does it mean that “Kubernetes scales”. This includes defining (or approving) individual performance SLIs/SLOs, ensuring they are all oriented on user experience and consistent with each other. | [SLIs/SLOs] | [OWNERS](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/OWNERS) |
-| Kubernetes performance validation | Ensuring that each official Kubernetes release satisfies all scalability and performance related requirements, as state in “Kubernetes scalability” definition | [1.9 validation report] | TODO |
-| Scalability testing frameworks | Designing and creating frameworks to make scalability and performance testing of Kubernetes easy and available for all contributors. Different frameworks may help in different aspect of scalability testing enabling making conscious tradeoffs, e.g. cost of accuracy or real life vs more generalized benchmarking scenarios. | [Cluster loader] | [OWNERS](https://github.com/kubernetes/perf-tests/blob/master/OWNERS) [OWNERS](https://github.com/kubernetes/kubernetes/blob/master/test/kubemark/OWNERS) |
-| Scalability and performance tests | Ensuring that all tests necessary to validate Kubernetes scalability and performance exist (ideally by providing easy-to-use framework and working with SIGs to provide them) have the environment and resources to run on and are being executed according to calendar enabling release validation. | [Scalability e2e tests] | [OWNERS](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/OWNERS) |
-| Scalability governance | Establishing and documenting best practises on how to design and/or implement Kubernetes features in scalable and performant way. Educating contributors and ensuring those are widely used. | [Regressions case study] | [OWNERS](https://github.com/kubernetes/community/blob/master/sig-scalability/governance/OWNERS) |
-
-TODO: Figure out if we need subproject for finding bottlenecks, coordinating
-improvements and architectural changes, etc.
-
-[SLIs/SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md
-[1.9 validation report]: https://github.com/kubernetes/sig-release/blob/master/releases/release-1.9/scalability_validation_report.md
-[Cluster loader]: https://github.com/kubernetes/perf-tests/tree/master/clusterloader
-[Scalability e2e tests]: https://github.com/kubernetes/kubernetes/tree/master/test/e2e/scalability
-[Regressions case study]: https://github.com/kubernetes/community/blob/master/sig-scalability/blogs/scalability-regressions-case-studies.md
-
-## Roles
-The following roles are required for the SIG to properly function.
-In the event that any role is unfilled, the SIG will make a best effort
-to fill it and any decisions reliant on a missing role will be postponed
-until the role is filled.
-
-### Chair
-- Number: 2-3
-- Run operations and processes governing the SIG
-- A majority of chairs cannot be from a single company.
-- An initial set of chairs was established at the time the SIG was founded as:
-  Wojciech Tyczynski and Bob Wise.
-- Chairs may decide to step down and propose a replacement, who must be approved
-  by all other chairs.
-- Chairs may select additional chairs by consensus.
-- Chairs may be removed by consensus of other Chairs and Technical Leads if not
-  proactively working with other Chairs to fulfill responsibilities.
-
-### Technical Lead
-- Number: 2-3
-- Establish new subprojects and retire existing ones
-- Resolve cross-subprojects technical issues and decisions and escalations from
-  subprojects.
-- Decision making must be by consensus.
-- An initial set of technical leads was set to long-standing group of SIG leads:
-  Wojciech Tyczynski and Bob Wise.
-- Technical leads must have demonstrated deep understanding of the whole system
-  that is sufficient to assess impact of different changes on Kubernetes scalability.
-- Technical leads must remain active in the role and are automatically removed
-  from the position if they are unresponsive for >3 months.
-- Technical leads may decide to step down at anytime and propose a replacement,
-  who must be approved by all of the other technical leads.
-- TODO: Diversity across companies?
-
-### Subproject owners
-- Number: at least 2
-- The initial owners should be established at subproject founding from relevant
-  OWNERS file wherever possible.
-- Owners must be an escalation point for technical discussions and decisions within
-  the subproject.
-- Owners must set milestone priorities for their subprojects.
-- Owners must remain active in the role and are automatically removed from the
-  position if they are unresponsive for >3 months and may be removed by consensus
-  of the other subproject owners and all of the Technical loeads if not proactively
-  working to fulfill responsibilities.
-- Owners may decide to step down at any time and propose replacement. Accepting
-  replacement will be done by lazy-consensus from other subproject owners.
-- Owners may select additional subproject owners through a super-majority vote
-  amongst subproject owners.
-
-### SIG Scalability approvers
-- Number: at least 3
-- Approve significant changes (in terms of potential impact, e.g. major architectural
-  changes, upgrades of etcd or Go version) from scalability perspective.
-- An initial set of approvers was set to:
-  - Bob Wise
-  - Clayton Coleman
-  - Jordan Liggitt
-  - Shyam Jeedigunta
-  - Wojciech Tyczynski
-
-## Organizational management
-- Six months after this charter is first ratified, it must be reviewed and
-  re-approved by the SIG in order to evaluate the assumptions made in its initial
-  drafting.
-- SIG meets bi-weekly on zoom with agenda in meeting nodes and should be
-  facilitated by chair unless delegated.
-
-## Project management
-
-### Subproject creation
-The initial set of subprojects owned by the SIG is defined above.
-- New subprojects must be approved by consensus of SIG Technical Leads.
-
-### Subproject retirement
-Subprojects may be retired, when they are no longer supported based on the
-following criteria:
-- A subproject is no longer supported when there are no active owners with
-  activity on the project:
-  - for >3 months for subprojects with no known users
-  - for >6 months for subprojects with known users after providing at least
-    6 months notification
-- Consensus amongst Technical Leads should be done to decide about retirement.
-
-### Technical processes
-- Decisions within the scope of individual subprojects should be made by lazy
-  consensus by subproject owners; if a decision can’t be made, it should be
-  escalated to the SIG Technical leads.
-- Issues impacting multiple subprojects in the SIG should be resolved by
-  consensus of the owners of the involved subprojects; if a decision can’t be
-  made, it should be escalated to the SIG Technical leads.
+## Roles and Organization Management
+
+This SIG adheres to the Roles and Organization Management outlined in
+[sig-governance] and opts in to updates and modifications to [sig-governance].
+
+[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md
+
+### Subproject Creation
+
+SIG Scalability delegates subproject approval to Technical Leads. See [Subproject creation - Option 1].
+
+[Subproject creation - Option 1]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md#subproject-creation