-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SIG Scalability Charter #2149
SIG Scalability Charter #2149
Conversation
sig-scalability/charter.md
Outdated
- An initial set of technical leads was set to long-standing group of SIG leads: | ||
Wojciech Tyczynski and Bob Wise | ||
- Technical leads must have demonstrated deep understanding of the whole system | ||
that is sufficient to access impact of different changes on Kubernetes scalability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/access/assess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sig-scalability/charter.md
Outdated
- Number: 2-3 | ||
- Run operations and processes governing the SIG | ||
- A majority of chairs cannot be from a single company. | ||
- An initial set of chairs was established at the time the SIG was founded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who are the initial chairs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bob Wise and myself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put it into the charter given that you have put TLs' names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
18826ab
to
786f992
Compare
/approve |
@thockin @smarterclayton @bgrant0607 @spiffxp - can you please take a look? |
This is in the Steering Committee backlog under: kubernetes/steering#31 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks OK but could do with some more contextual linking and information as commented.
sig-scalability/charter.md
Outdated
single place of Kubernetes may affect the whole system. As a result, to | ||
effectively ensure Kubernetes scales, we need a special cross-SIG privileges. | ||
|
||
- We can rollback any merged PR if it has been identified as a cause of an SLO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you link to the SLOs? Where are they? Who defines them? Is the API machinery one below the only example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linked (that answers where they are).
Who defines them: sig-scalability with cooperation of sigs that these SLIs/SLOs are touching.
Re API-machinery one - yes (though it was reoriganized recently so updated the links).
sig-scalability/charter.md
Outdated
<td>Cluster loader</td> | ||
</tr> | ||
<tr> | ||
<td>Scalability and performanc tests</td> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perfomanc spelling error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sig-scalability/charter.md
Outdated
TODO(wojtek-t, shyamjvs): Write it down and link here. | ||
- We require significant changes (in terms of impact, such as: update of etcd, | ||
update of Go version, major architectural changes, etc.) may only be merged: | ||
- with an explicit approval from a SIG-scalability approver and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where are the approvers documented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added subsection in the roles section.
sig-scalability/charter.md
Outdated
<tr> | ||
<td>Subproject</td> | ||
<td>Description</td> | ||
<td>Example Artifacts</td> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this table have a column for a link to the OWNERS file? It requires a bit of digging for each of these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (modulo one line, which I would like to figure out independently in parallel - hopefully that doesn't block charter).
33a545a
to
dfc57ed
Compare
95a5f64
to
8b24001
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
sig-scalability/charter.md
Outdated
## Scope | ||
|
||
SIG Scalability primary responsibilities are to define scalability goals for | ||
Kubernetes, define, test and measure performance and scalability related |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for suggestion - done.
sig-scalability/charter.md
Outdated
|
||
[performance/scalability SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md | ||
|
||
### SIG Scalability approvers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I thought a bit more about that, and:
- longer term it may make sense to move to sigs.yaml
- currently, it's not supported there (it's somewhat sig-specific role) and i don't want to add that now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. There are a couple of areas I would like to see some more explicit definition of scope. I may not have time to review the next iteration until Thursday.
sig-scalability/charter.md
Outdated
|
||
[performance/scalability SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md | ||
|
||
### SIG Scalability approvers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like you're listing Tech Leads, which yes, should be in sigs.yaml, and is supported as leadership.tech_leads
If you're trying to define a decision making process (ie: at least 3 of these must approve, any one of these has the authority to approve, etc) I would instead put this under a Technical Processes header, and add boilerplate about how you're overriding sig-governance in the following ways...
sig-scalability/charter.md
Outdated
|
||
## Scope | ||
|
||
SIG Scalability primary responsibilities are to define scalability goals for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: SIG Scalability's
to define and drive scalability goals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sig-scalability/charter.md
Outdated
|
||
SIG Scalability primary responsibilities are to define scalability goals for | ||
Kubernetes. This involves defining, testing and measuring performance and | ||
scalability related to Service Level Indicators (SLIs) and ensure that every |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/to//
nit: s/ensure/ensuring/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sig-scalability/charter.md
Outdated
SIG Scalability primary responsibilities are to define scalability goals for | ||
Kubernetes. This involves defining, testing and measuring performance and | ||
scalability related to Service Level Indicators (SLIs) and ensure that every | ||
Kubernetes release meets Service Level Objectives (SLOs) build on top of those |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/build/built
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would tighten this statement up a bit. By saying "release" this implies you're only enforcing Kubernetes releases meet these SLOs, and that it is OK for alphas, betas, rcs, and other builds to fail to meet them. Your processes below make me think this is not the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully agree. What we really care is that the release that our users use in production (I really doubt someone is using alpha or beta cut in production) meet scalability needs.
So in theory I would say it's fine if those doesn't meet SLOs.
In practise, it's hard to do stuff and only then see if we didn't regress. So we're trying to ensure that those are met all the time (though we are not there yet).
But this is "implementation detail of the strategy", not something that should be part of the charter in my opinion.
sig-scalability/charter.md
Outdated
## Scope | ||
|
||
SIG Scalability primary responsibilities are to define scalability goals for | ||
Kubernetes. This involves defining, testing and measuring performance and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can more concretely define which parts of Kubernetes you're setting scalability goals for? It might help to refer to the architecture roadmap eg:
- On one extreme, you probably don't intend to have purview over the scalability goals of everything in the entire Kubernetes project (which I define as all repos in kubernetes, kubernetes-sigs, kubernetes-client, etc)
- On the other extreme, you probably have purview over more than just the contents of kubernetes/kubernetes (and even within this, you probably don't have purview over all of the addons contained within)
- You probably don't mean scalability of the project itself, since that falls under contributor experience and to some degree architecture
Another way to consider this: what plugins, runtimes, hardware, workloads, etc. result in a cluster not being held to scalability SLO's?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth considering taking an approach similar to SIG Instrumentation, another horizontal SIG with some areas of discrete ownership:
- Own performance and scalability best practices for the project as a whole
- Provide reviews, and guidance to other SIGs
- Own the definition, implementation, and enforcement of scalability SLIs and SLOs for the artifacts produced as part of a Kubernetes release
sig-scalability/charter.md
Outdated
|
||
- We can rollback any merged PR if it has been identified as a cause of any | ||
[performance/scalability SLOs] regression. The offending PR should only be | ||
merged again after proving to pass tests at scale. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't necessarily have to do this within this charter, but I'm going to suggest that the release team require regular participation from a member of SIG Scalability until we have such automation in place. Chasing down scalability related failures is too much an additional burden on the CI Signal role, and we often have little visibility into whether or how failures are being worked.
As an example both https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-scale-performance and https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-scale-correctness have been failing for over two weeks. This is being "explained away" as a quota issue, with code continuing to merge, rather than being enforced as a hard stop until the cause is identified.
sig-scalability/charter.md
Outdated
TODO(wojtek-t, shyamjvs): Write it down and link here. | ||
- We require significant changes (in terms of impact, such as: update of etcd, | ||
update of Go version, major architectural changes, etc.) may only be merged: | ||
- with an explicit approval from a [SIG-scalability approver](#sig-scalability-approvers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tech lead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be tech lead - it's a much wider role and this group may involve other people not necessary being members of sig scalability.
But I'm tired explaining this over and over again, so I'm changing this to tech leads with a TODO for the future.
sig-scalability/charter.md
Outdated
update of Go version, major architectural changes, etc.) may only be merged: | ||
- with an explicit approval from a [SIG-scalability approver](#sig-scalability-approvers) | ||
and | ||
- after having passed performance testing on biggest supported clusters (unless |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a future where there are large cluster jobs being run on multiple cloud providers, will you require that they all pass?
How does someone run these jobs as a presubmit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a future where there are large cluster jobs being run on multiple cloud providers, will you require that they all pass?
If they will be added as release blockers - then yes.
How does someone run these jobs as a presubmit?
They are already optional presubmits that one can manually trigger.
But this is technical detail - such things shouldn't be part of charter.
sig-scalability/charter.md
Outdated
causes a significant degradation of overall Kubernetes scalability/performance. | ||
(Ideally it would be “SLI degradation of more than X%” or “breaking SLO”, but | ||
initially it may also be SIG-scalability decision based on public test results). | ||
- We can block a feature from transitioning to GA status if it cannot be used at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this different than the requirements spelled out for beta status? I'd like to see this phrased in terms of SLI/SLO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was clear than it apparently is. The overall idea is:
-
Alpha->Beta: this can't break existing SLOs (but feature itself doesn't need to scale well)
[As an example: we used to have a store with pod anti-affinity that even if it wasn't used, it was completely breaking scheduling throughput. That's not acceptable.] -
Beta->GA: feature itself has to scale.
This may (in rare cases) require introducing a new SLI/SLO to verify that, but in majority just using that feature in our scalability tests and verifying that already existing SLOs are met.
Rephrased to better match what i wrote above.
sig-scalability/charter.md
Outdated
and | ||
- after having passed performance testing on biggest supported clusters (unless | ||
found unnecessary by scalability approver) | ||
- We can block a feature from transitioning to Beta status if (when turned on) it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this idea. But, you might want to consider rephrasing it such that SIG Architecture actually owns this enforcement, and you are in charge of defining the scalability SLI's/SLO's that they use. Ultimately I think it's going to be less confusing for feature owners to look to a single SIG for go/no-go, and use a checklist of requirements you've helped write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well - it's definitely not a goal to have a dedicated SLO for every single feature. That doesn't scale and such a huge list of SLIs would be impossible to track especially for users.
About moving the responsibility to sig architecture: TBH I don't really understand your point. SIG architecture is NOT stamping every single transition between phases for every single feature from what I know.
Assuming that I'm not missing anything here, the point of moving the responsibility to sig architecture doesn't make sense because ultimately it would be us saying sig architecture if they say yes or no. That doesn't make sense to me.
8b24001
to
f9777ff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
sig-scalability/charter.md
Outdated
|
||
## Scope | ||
|
||
SIG Scalability primary responsibilities are to define scalability goals for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sig-scalability/charter.md
Outdated
|
||
SIG Scalability primary responsibilities are to define scalability goals for | ||
Kubernetes. This involves defining, testing and measuring performance and | ||
scalability related to Service Level Indicators (SLIs) and ensure that every |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sig-scalability/charter.md
Outdated
SIG Scalability primary responsibilities are to define scalability goals for | ||
Kubernetes. This involves defining, testing and measuring performance and | ||
scalability related to Service Level Indicators (SLIs) and ensure that every | ||
Kubernetes release meets Service Level Objectives (SLOs) build on top of those |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully agree. What we really care is that the release that our users use in production (I really doubt someone is using alpha or beta cut in production) meet scalability needs.
So in theory I would say it's fine if those doesn't meet SLOs.
In practise, it's hard to do stuff and only then see if we didn't regress. So we're trying to ensure that those are met all the time (though we are not there yet).
But this is "implementation detail of the strategy", not something that should be part of the charter in my opinion.
- [Cluster loader](https://github.com/kubernetes/perf-tests/tree/master/clusterloader2) | ||
- [Kubemark](https://github.com/kubernetes/kubernetes/tree/master/cmd/kubemark) | ||
- Scalability and performance tests: | ||
- [Tests](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sig-scalability/charter.md
Outdated
- [Kubemark](https://github.com/kubernetes/kubernetes/tree/master/cmd/kubemark) | ||
- Scalability and performance tests: | ||
- [Tests](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than kubemark and tests themselves, no.
sig-scalability/charter.md
Outdated
TODO(wojtek-t, shyamjvs): Write it down and link here. | ||
- We require significant changes (in terms of impact, such as: update of etcd, | ||
update of Go version, major architectural changes, etc.) may only be merged: | ||
- with an explicit approval from a [SIG-scalability approver](#sig-scalability-approvers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be tech lead - it's a much wider role and this group may involve other people not necessary being members of sig scalability.
But I'm tired explaining this over and over again, so I'm changing this to tech leads with a TODO for the future.
sig-scalability/charter.md
Outdated
update of Go version, major architectural changes, etc.) may only be merged: | ||
- with an explicit approval from a [SIG-scalability approver](#sig-scalability-approvers) | ||
and | ||
- after having passed performance testing on biggest supported clusters (unless |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a future where there are large cluster jobs being run on multiple cloud providers, will you require that they all pass?
If they will be added as release blockers - then yes.
How does someone run these jobs as a presubmit?
They are already optional presubmits that one can manually trigger.
But this is technical detail - such things shouldn't be part of charter.
sig-scalability/charter.md
Outdated
and | ||
- after having passed performance testing on biggest supported clusters (unless | ||
found unnecessary by scalability approver) | ||
- We can block a feature from transitioning to Beta status if (when turned on) it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well - it's definitely not a goal to have a dedicated SLO for every single feature. That doesn't scale and such a huge list of SLIs would be impossible to track especially for users.
About moving the responsibility to sig architecture: TBH I don't really understand your point. SIG architecture is NOT stamping every single transition between phases for every single feature from what I know.
Assuming that I'm not missing anything here, the point of moving the responsibility to sig architecture doesn't make sense because ultimately it would be us saying sig architecture if they say yes or no. That doesn't make sense to me.
sig-scalability/charter.md
Outdated
causes a significant degradation of overall Kubernetes scalability/performance. | ||
(Ideally it would be “SLI degradation of more than X%” or “breaking SLO”, but | ||
initially it may also be SIG-scalability decision based on public test results). | ||
- We can block a feature from transitioning to GA status if it cannot be used at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was clear than it apparently is. The overall idea is:
-
Alpha->Beta: this can't break existing SLOs (but feature itself doesn't need to scale well)
[As an example: we used to have a store with pod anti-affinity that even if it wasn't used, it was completely breaking scheduling throughput. That's not acceptable.] -
Beta->GA: feature itself has to scale.
This may (in rare cases) require introducing a new SLI/SLO to verify that, but in majority just using that feature in our scalability tests and verifying that already existing SLOs are met.
Rephrased to better match what i wrote above.
sig-scalability/charter.md
Outdated
|
||
[performance/scalability SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md | ||
|
||
### SIG Scalability approvers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave up for now and switched to tech leads with a TODO to create a dedicated group for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think fixup the TODOs and minor wording order, then LGTM.
sig-scalability/charter.md
Outdated
identified and mitigated. | ||
The “Rules of engagement” of pausing merge-queue and rationale for | ||
necessity of its introduce are explained in a separate doc. <br/> | ||
TODO(wojtek-t, shyamjvs): Write it down and link here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
sig-scalability/charter.md
Outdated
- We can require a SIG to introduce a regression-catching benchmark test for a | ||
scalability-critical functionality. | ||
|
||
For the record, by regression above we mean a regression identified by the set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might just want to define taxonomy at the beginning or end, this seems weirdly placed. Could just remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
f9777ff
to
e2e2569
Compare
PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing then LGTM
sig-scalability/charter.md
Outdated
- with an explicit approval from a SIG-scalability tech lead and | ||
- after having passed performance testing on biggest supported clusters (unless | ||
found unnecessary by approver) | ||
TODO(wojtek-t): Create a larger group (whose members doesn't have to necessary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO - still exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to resolve it now - this is a bigger thing and for now I think it's fine to start with tech leads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just remove that sentence then, and PR the charter at a later time when TODO is less ambiguous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense - done.
sig-scalability/block_merges.md
Outdated
- Mitigate the regression. This may mean e.g.: | ||
- reverting the PR | ||
- switching a feature off (preferably by default, as last resort only in tests) | ||
- ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this ...
. It doesn't add anything to the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair enough - done.
e2e2569
to
b6ea653
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for comments - PTAL
sig-scalability/block_merges.md
Outdated
- Mitigate the regression. This may mean e.g.: | ||
- reverting the PR | ||
- switching a feature off (preferably by default, as last resort only in tests) | ||
- ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair enough - done.
sig-scalability/charter.md
Outdated
- with an explicit approval from a SIG-scalability tech lead and | ||
- after having passed performance testing on biggest supported clusters (unless | ||
found unnecessary by approver) | ||
TODO(wojtek-t): Create a larger group (whose members doesn't have to necessary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense - done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
@wojtek-t thank you for seeing this through!
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: countspongebob, timothysc The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@timothysc I do not appreciate that this was merged without me having gotten another pass at it. @wojtek-t reached out to me privately asking for another look, which I was in the middle of doing when this was merged.
There a few things that need to be followed up on
[performance/scalability SLOs] regression (identified by the set of release | ||
blocking scalability/performance tests). The offending PR should only be | ||
merged again after proving to pass tests at scale. | ||
- In the even of a performance regression, we can block all PRs from being |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: event
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in #2617 (thanks)
merged again after proving to pass tests at scale. | ||
- In the even of a performance regression, we can block all PRs from being | ||
merged into the relevant repos until the cause of the regression is | ||
identified and mitigated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: whitespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in #2617 - thanks
necessity of its introduce are explained in [a separate doc](./block_merges.md). | ||
- We require significant changes (in terms of impact, such as: update of etcd, | ||
update of Go version, major architectural changes, etc.) may only be merged: | ||
- with an explicit approval from a SIG-scalability tech lead and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see your tech leads in sigs.yaml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding now. Following up on that - expect a PR in the next few days.
### Rules of engagement. | ||
The rules of engagement for blocking merges are as following: | ||
|
||
- Observe as scalability regression on one of release-blocking test suites. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which suites specifically?
How is a regression observed? The suites have been continually failing, so it doesn't seem like it's a green-to-red transition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's "green-red transition".
The fact that tests are red now and it's not possible, shouldn simply mean that we don't have the right.
We need to transition from green to red to declare a regression.
Clarifying that.
- reverting the PR | ||
- switching a feature off (preferably by default, as last resort only in tests) | ||
- fixing the problem (if it's easy and quick to fix) | ||
- Unblock PR merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "Unblock merges of all PRs to the relevant repos"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
send out a PR with that.
The rules of engagement for blocking merges are as following: | ||
|
||
- Observe as scalability regression on one of release-blocking test suites. | ||
- Block merges of all PRs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Block merges of all PRs to the relevant repos, declaring which repos those are and why
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
send out a PR with that
- fixing the problem (if it's easy and quick to fix) | ||
- Unblock PR merged. | ||
|
||
The exact technical mechanisms for it are out of scope for this document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not here then where are these mechanisms documented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be hidden in the test-infrastructure (in the past we were talking about opening a bug that is appropriate labeled that merge-bot is looking into). But no matter what the mechanism will be, this is implementation detail and shouldn't be part of this high-level doc.
- once a regression is merged, and no other action is taken, it is only | ||
a matter of time until another regression is merged on top of it, | ||
- debugging the cause of two simultaneous (piled-up) regressions is | ||
exponentially harder, see issue 53255 which links to past experience |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: link to kubernetes/kubernetes#53255
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
send out a PR for that
a matter of time until another regression is merged on top of it, | ||
- debugging the cause of two simultaneous (piled-up) regressions is | ||
exponentially harder, see issue 53255 which links to past experience | ||
- we need to keep flakiness of merge-blocking jobs very low: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the threshold? I suspect we are not meeting this today, so I would like to quantify how we're approaching this delta from actual to intended
@timothysc Next time please give a last call ping prior to merging |
@kubernetes/sig-scalability-api-reviews
@countspongebob @shyamjvs @wasylkowski-a @gmarek
cc-ing everyone who was on the initial version:
/cc @thockin @smarterclayton @spiffxp @jdumars @bgrant0607 @timothysc