Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIG Scalability Charter #2149

Merged
merged 2 commits into from
Aug 31, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
51 changes: 51 additions & 0 deletions sig-scalability/block_merges.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# Blocking PR merges in the event of regression.

As mentioned in the charter, SIG scalability has a right to block all PRs
from merging into the relevant repos. This document describes the underlying
"Rules of engagement" of this process and the rationale why this is needed.

### Rules of engagement.
The rules of engagement for blocking merges are as following:

- Observe as scalability regression on one of release-blocking test suites.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which suites specifically?

How is a regression observed? The suites have been continually failing, so it doesn't seem like it's a green-to-red transition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's "green-red transition".
The fact that tests are red now and it's not possible, shouldn simply mean that we don't have the right.
We need to transition from green to red to declare a regression.

Clarifying that.

- Block merges of all PRs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Block merges of all PRs to the relevant repos, declaring which repos those are and why

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

send out a PR with that

- Identify the PR which caused the regression:
- this can be done by reading code changes, bisecting, debugging based on
metrics and/or logs, etc.
- we say a PR is identified as the cause when we are reasonably confident
that it indeed caused a regression, even if the mechanism is not 100%
understood to minimize the time when merges are blocked
- Mitigate the regression. This may mean e.g.:
- reverting the PR
- switching a feature off (preferably by default, as last resort only in tests)
- fixing the problem (if it's easy and quick to fix)
- Unblock PR merged.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "Unblock merges of all PRs to the relevant repos"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

send out a PR with that.


The exact technical mechanisms for it are out of scope for this document.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not here then where are these mechanisms documented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be hidden in the test-infrastructure (in the past we were talking about opening a bug that is appropriate labeled that merge-bot is looking into). But no matter what the mechanism will be, this is implementation detail and shouldn't be part of this high-level doc.


### Rationale
The process described above is quite drastic, but we believe it is justified
if we want kubernetes to maintain scalability SLOs. The reasoning is:
- reliably testing for regressions takes a lot of time:
- key scalability e2e tests take too long to execute to be a prerequisite
for merging all PRs, this is an inherent characteristic of testing at scale,
- end-to-end tests are flaky (even when not at scale) requiring retries,
- we need to prevent regression pile-ups:
- once a regression is merged, and no other action is taken, it is only
a matter of time until another regression is merged on top of it,
- debugging the cause of two simultaneous (piled-up) regressions is
exponentially harder, see issue 53255 which links to past experience
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

send out a PR for that

- we need to keep flakiness of merge-blocking jobs very low:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the threshold? I suspect we are not meeting this today, so I would like to quantify how we're approaching this delta from actual to intended

- regarding benchmarks, there were several scalability issues in the past
caught by (costly) large-scale e2e tests, which could have been caught and
fixed earlier and with far less human effort if we had benchmark-like
tests. Examples include:
- scheduler anti-affinity affecting kube-dns,
- kubelet network plugin increasing pod-startup latency,
- large responses from apiserver violating gRPC MTU.

As explained in detail in an issue, not being able to maintain passing scalability
tests adversely affect:
- release quality
- release schedule
- engineer productivity
98 changes: 98 additions & 0 deletions sig-scalability/charter.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
# SIG Scalability Charter

This charter adheres to the conventions described in the [Kubernetes Charter README]
and uses the Roles and Organization Management outlined in [sig-governance].

[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md
[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md

## Scope

SIG Scalability's primary responsibilities are to define and drive scalability
goals for Kubernetes. This involves defining, testing and measuring performance and
scalability related Service Level Indicators (SLIs) and ensuring that every
Kubernetes release meets Service Level Objectives (SLOs) built on top of those
SLIs.

We also coordinate and contribute to general system-wide scalability and
performance improvements (that don't fall into the charter of another individual
SIG) by driving large architectural changes and finding bottlenecks, as well as
provide consultations about any scalability and performance related aspects of
Kubernetes.

### In Scope

#### Code, Binaries and Services:

- Scalability and performance testing frameworks. Examples include:
- [Cluster loader](https://github.com/kubernetes/perf-tests/tree/master/clusterloader2)
- [Kubemark](https://github.com/kubernetes/kubernetes/tree/master/cmd/kubemark)
- Scalability and performance tests:
- [Tests](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

- [Jobs running those](https://github.com/kubernetes/test-infra/tree/master/config/jobs/kubernetes/sig-scalability)

#### Cross-cutting and Externally Facing Processes

- Defining what does “Kubernetes scales” mean by defining (or approving)
individual performance SLIs/SLOs, ensuring they are all oriented on user
experience and consistent with each other:
- [SLIs/SLOs](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md)
- Ensuring that each official Kubernetes release satisfies all scalability and
performance related requirements, as stated in "Kubernetes scalability" definition.
- Establishing and documenting best practises on how to design and/or implement
Kubernetes features in scalable and performant way. Educating contributors and
consulting individual designs/implementations to ensure that those are widely used.
Example artifacts:
- [Scalability governance](https://github.com/kubernetes/community/blob/master/sig-scalability/governance)
- Finding system bottlenecks and coordinating improvement on cross-cutting
architectural changes.

### Out of scope

- Improving performance/scalability of features falling into charters of
individual SIGs.


## What can we do/require from other SIGs
Scalability and performance are horizontal aspects of the system - changes in a
single place of Kubernetes may affect the whole system. As a result, to
effectively ensure Kubernetes scales, we need a special cross-SIG privileges.

- We can rollback any merged PR if it has been identified as a cause of any
[performance/scalability SLOs] regression (identified by the set of release
blocking scalability/performance tests). The offending PR should only be
merged again after proving to pass tests at scale.
- In the even of a performance regression, we can block all PRs from being
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: event

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #2617 (thanks)

merged into the relevant repos until the cause of the regression is
identified and mitigated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: whitespace

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #2617 - thanks

The “Rules of engagement” of pausing merge-queue and rationale for
necessity of its introduce are explained in [a separate doc](./block_merges.md).
- We require significant changes (in terms of impact, such as: update of etcd,
update of Go version, major architectural changes, etc.) may only be merged:
- with an explicit approval from a SIG-scalability tech lead and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see your tech leads in sigs.yaml

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding now. Following up on that - expect a PR in the next few days.

- after having passed performance testing on biggest supported clusters (unless
found unnecessary by approver)
- We can block a feature from transitioning:
- to Beta status, if (when turned on) it causes violation of already existing
performance/scalability SLOs;
- to GA status, when it can be used scale. That means:
- in rare cases, introducing a new SLI and SLO and ensuring it is met at scale
- in most of cases, extending scalability tests to use it and ensuring that
existing SLOs are still met
- We can require a SIG to introduce a regression-catching benchmark test for a
scalability-critical functionality.

[performance/scalability SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md

## Roles and Organization Management

This sig follows adheres to the Roles and Organization Management outlined in
[sig-governance] and opts-in to updates and modifications to [sig-governance].

[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md

### Subproject Creation

SIG Scalability delegates subproject approval to Technical Leads. See [Subproject creation - Option 1].

[Subproject creation - Option 1]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md#subproject-creation