Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIG Scalability Charter #2149

Merged
merged 2 commits into from
Aug 31, 2018

Conversation

wojtek-t
Copy link
Member

@kubernetes/sig-scalability-api-reviews
@countspongebob @shyamjvs @wasylkowski-a @gmarek

cc-ing everyone who was on the initial version:
/cc @thockin @smarterclayton @spiffxp @jdumars @bgrant0607 @timothysc

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 14, 2018
This was referenced May 14, 2018
- An initial set of technical leads was set to long-standing group of SIG leads:
Wojciech Tyczynski and Bob Wise
- Technical leads must have demonstrated deep understanding of the whole system
that is sufficient to access impact of different changes on Kubernetes scalability.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/access/assess

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

- Number: 2-3
- Run operations and processes governing the SIG
- A majority of chairs cannot be from a single company.
- An initial set of chairs was established at the time the SIG was founded.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who are the initial chairs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bob Wise and myself

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put it into the charter given that you have put TLs' names?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@wojtek-t wojtek-t force-pushed the sig_scalability_charter branch from 18826ab to 786f992 Compare May 18, 2018 09:43
@pwittrock pwittrock added the committee/steering Denotes an issue or PR intended to be handled by the steering committee. label Jun 21, 2018
@countspongebob
Copy link
Contributor

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2018
@wojtek-t
Copy link
Member Author

@thockin @smarterclayton @bgrant0607 @spiffxp - can you please take a look?
Is there any official procedure to have the charter approved?

@jdumars
Copy link
Member

jdumars commented Jun 29, 2018

This is in the Steering Committee backlog under: kubernetes/steering#31

philips
philips previously requested changes Jul 3, 2018
Copy link
Contributor

@philips philips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks OK but could do with some more contextual linking and information as commented.

single place of Kubernetes may affect the whole system. As a result, to
effectively ensure Kubernetes scales, we need a special cross-SIG privileges.

- We can rollback any merged PR if it has been identified as a cause of an SLO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you link to the SLOs? Where are they? Who defines them? Is the API machinery one below the only example?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linked (that answers where they are).

Who defines them: sig-scalability with cooperation of sigs that these SLIs/SLOs are touching.

Re API-machinery one - yes (though it was reoriganized recently so updated the links).

<td>Cluster loader</td>
</tr>
<tr>
<td>Scalability and performanc tests</td>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfomanc spelling error

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

TODO(wojtek-t, shyamjvs): Write it down and link here.
- We require significant changes (in terms of impact, such as: update of etcd,
update of Go version, major architectural changes, etc.) may only be merged:
- with an explicit approval from a SIG-scalability approver and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where are the approvers documented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added subsection in the roles section.

<tr>
<td>Subproject</td>
<td>Description</td>
<td>Example Artifacts</td>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this table have a column for a link to the OWNERS file? It requires a bit of digging for each of these.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (modulo one line, which I would like to figure out independently in parallel - hopefully that doesn't block charter).

@wojtek-t wojtek-t force-pushed the sig_scalability_charter branch 8 times, most recently from 33a545a to dfc57ed Compare July 3, 2018 09:28
@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 24, 2018
@wojtek-t wojtek-t force-pushed the sig_scalability_charter branch from 95a5f64 to 8b24001 Compare August 27, 2018 08:49
Copy link
Member Author

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL

## Scope

SIG Scalability primary responsibilities are to define scalability goals for
Kubernetes, define, test and measure performance and scalability related
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for suggestion - done.


[performance/scalability SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md

### SIG Scalability approvers
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I thought a bit more about that, and:

  • longer term it may make sense to move to sigs.yaml
  • currently, it's not supported there (it's somewhat sig-specific role) and i don't want to add that now.

Copy link
Member

@spiffxp spiffxp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. There are a couple of areas I would like to see some more explicit definition of scope. I may not have time to review the next iteration until Thursday.


[performance/scalability SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md

### SIG Scalability approvers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like you're listing Tech Leads, which yes, should be in sigs.yaml, and is supported as leadership.tech_leads

If you're trying to define a decision making process (ie: at least 3 of these must approve, any one of these has the authority to approve, etc) I would instead put this under a Technical Processes header, and add boilerplate about how you're overriding sig-governance in the following ways...


## Scope

SIG Scalability primary responsibilities are to define scalability goals for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: SIG Scalability's

to define and drive scalability goals

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


SIG Scalability primary responsibilities are to define scalability goals for
Kubernetes. This involves defining, testing and measuring performance and
scalability related to Service Level Indicators (SLIs) and ensure that every
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/to//

nit: s/ensure/ensuring/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

SIG Scalability primary responsibilities are to define scalability goals for
Kubernetes. This involves defining, testing and measuring performance and
scalability related to Service Level Indicators (SLIs) and ensure that every
Kubernetes release meets Service Level Objectives (SLOs) build on top of those
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/build/built

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would tighten this statement up a bit. By saying "release" this implies you're only enforcing Kubernetes releases meet these SLOs, and that it is OK for alphas, betas, rcs, and other builds to fail to meet them. Your processes below make me think this is not the case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully agree. What we really care is that the release that our users use in production (I really doubt someone is using alpha or beta cut in production) meet scalability needs.
So in theory I would say it's fine if those doesn't meet SLOs.
In practise, it's hard to do stuff and only then see if we didn't regress. So we're trying to ensure that those are met all the time (though we are not there yet).

But this is "implementation detail of the strategy", not something that should be part of the charter in my opinion.

## Scope

SIG Scalability primary responsibilities are to define scalability goals for
Kubernetes. This involves defining, testing and measuring performance and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can more concretely define which parts of Kubernetes you're setting scalability goals for? It might help to refer to the architecture roadmap eg:

  • On one extreme, you probably don't intend to have purview over the scalability goals of everything in the entire Kubernetes project (which I define as all repos in kubernetes, kubernetes-sigs, kubernetes-client, etc)
  • On the other extreme, you probably have purview over more than just the contents of kubernetes/kubernetes (and even within this, you probably don't have purview over all of the addons contained within)
  • You probably don't mean scalability of the project itself, since that falls under contributor experience and to some degree architecture

Another way to consider this: what plugins, runtimes, hardware, workloads, etc. result in a cluster not being held to scalability SLO's?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth considering taking an approach similar to SIG Instrumentation, another horizontal SIG with some areas of discrete ownership:

  • Own performance and scalability best practices for the project as a whole
  • Provide reviews, and guidance to other SIGs
  • Own the definition, implementation, and enforcement of scalability SLIs and SLOs for the artifacts produced as part of a Kubernetes release


- We can rollback any merged PR if it has been identified as a cause of any
[performance/scalability SLOs] regression. The offending PR should only be
merged again after proving to pass tests at scale.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't necessarily have to do this within this charter, but I'm going to suggest that the release team require regular participation from a member of SIG Scalability until we have such automation in place. Chasing down scalability related failures is too much an additional burden on the CI Signal role, and we often have little visibility into whether or how failures are being worked.

As an example both https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-scale-performance and https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-scale-correctness have been failing for over two weeks. This is being "explained away" as a quota issue, with code continuing to merge, rather than being enforced as a hard stop until the cause is identified.

TODO(wojtek-t, shyamjvs): Write it down and link here.
- We require significant changes (in terms of impact, such as: update of etcd,
update of Go version, major architectural changes, etc.) may only be merged:
- with an explicit approval from a [SIG-scalability approver](#sig-scalability-approvers)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tech lead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be tech lead - it's a much wider role and this group may involve other people not necessary being members of sig scalability.

But I'm tired explaining this over and over again, so I'm changing this to tech leads with a TODO for the future.

update of Go version, major architectural changes, etc.) may only be merged:
- with an explicit approval from a [SIG-scalability approver](#sig-scalability-approvers)
and
- after having passed performance testing on biggest supported clusters (unless
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a future where there are large cluster jobs being run on multiple cloud providers, will you require that they all pass?

How does someone run these jobs as a presubmit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a future where there are large cluster jobs being run on multiple cloud providers, will you require that they all pass?

If they will be added as release blockers - then yes.

How does someone run these jobs as a presubmit?

They are already optional presubmits that one can manually trigger.
But this is technical detail - such things shouldn't be part of charter.

causes a significant degradation of overall Kubernetes scalability/performance.
(Ideally it would be “SLI degradation of more than X%” or “breaking SLO”, but
initially it may also be SIG-scalability decision based on public test results).
- We can block a feature from transitioning to GA status if it cannot be used at
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this different than the requirements spelled out for beta status? I'd like to see this phrased in terms of SLI/SLO

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was clear than it apparently is. The overall idea is:

  • Alpha->Beta: this can't break existing SLOs (but feature itself doesn't need to scale well)
    [As an example: we used to have a store with pod anti-affinity that even if it wasn't used, it was completely breaking scheduling throughput. That's not acceptable.]

  • Beta->GA: feature itself has to scale.
    This may (in rare cases) require introducing a new SLI/SLO to verify that, but in majority just using that feature in our scalability tests and verifying that already existing SLOs are met.

Rephrased to better match what i wrote above.

and
- after having passed performance testing on biggest supported clusters (unless
found unnecessary by scalability approver)
- We can block a feature from transitioning to Beta status if (when turned on) it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this idea. But, you might want to consider rephrasing it such that SIG Architecture actually owns this enforcement, and you are in charge of defining the scalability SLI's/SLO's that they use. Ultimately I think it's going to be less confusing for feature owners to look to a single SIG for go/no-go, and use a checklist of requirements you've helped write.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well - it's definitely not a goal to have a dedicated SLO for every single feature. That doesn't scale and such a huge list of SLIs would be impossible to track especially for users.

About moving the responsibility to sig architecture: TBH I don't really understand your point. SIG architecture is NOT stamping every single transition between phases for every single feature from what I know.
Assuming that I'm not missing anything here, the point of moving the responsibility to sig architecture doesn't make sense because ultimately it would be us saying sig architecture if they say yes or no. That doesn't make sense to me.

@wojtek-t wojtek-t force-pushed the sig_scalability_charter branch from 8b24001 to f9777ff Compare August 28, 2018 07:00
Copy link
Member Author

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL


## Scope

SIG Scalability primary responsibilities are to define scalability goals for
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


SIG Scalability primary responsibilities are to define scalability goals for
Kubernetes. This involves defining, testing and measuring performance and
scalability related to Service Level Indicators (SLIs) and ensure that every
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

SIG Scalability primary responsibilities are to define scalability goals for
Kubernetes. This involves defining, testing and measuring performance and
scalability related to Service Level Indicators (SLIs) and ensure that every
Kubernetes release meets Service Level Objectives (SLOs) build on top of those
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully agree. What we really care is that the release that our users use in production (I really doubt someone is using alpha or beta cut in production) meet scalability needs.
So in theory I would say it's fine if those doesn't meet SLOs.
In practise, it's hard to do stuff and only then see if we didn't regress. So we're trying to ensure that those are met all the time (though we are not there yet).

But this is "implementation detail of the strategy", not something that should be part of the charter in my opinion.

- [Cluster loader](https://github.com/kubernetes/perf-tests/tree/master/clusterloader2)
- [Kubemark](https://github.com/kubernetes/kubernetes/tree/master/cmd/kubemark)
- Scalability and performance tests:
- [Tests](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

- [Kubemark](https://github.com/kubernetes/kubernetes/tree/master/cmd/kubemark)
- Scalability and performance tests:
- [Tests](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than kubemark and tests themselves, no.

TODO(wojtek-t, shyamjvs): Write it down and link here.
- We require significant changes (in terms of impact, such as: update of etcd,
update of Go version, major architectural changes, etc.) may only be merged:
- with an explicit approval from a [SIG-scalability approver](#sig-scalability-approvers)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be tech lead - it's a much wider role and this group may involve other people not necessary being members of sig scalability.

But I'm tired explaining this over and over again, so I'm changing this to tech leads with a TODO for the future.

update of Go version, major architectural changes, etc.) may only be merged:
- with an explicit approval from a [SIG-scalability approver](#sig-scalability-approvers)
and
- after having passed performance testing on biggest supported clusters (unless
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a future where there are large cluster jobs being run on multiple cloud providers, will you require that they all pass?

If they will be added as release blockers - then yes.

How does someone run these jobs as a presubmit?

They are already optional presubmits that one can manually trigger.
But this is technical detail - such things shouldn't be part of charter.

and
- after having passed performance testing on biggest supported clusters (unless
found unnecessary by scalability approver)
- We can block a feature from transitioning to Beta status if (when turned on) it
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well - it's definitely not a goal to have a dedicated SLO for every single feature. That doesn't scale and such a huge list of SLIs would be impossible to track especially for users.

About moving the responsibility to sig architecture: TBH I don't really understand your point. SIG architecture is NOT stamping every single transition between phases for every single feature from what I know.
Assuming that I'm not missing anything here, the point of moving the responsibility to sig architecture doesn't make sense because ultimately it would be us saying sig architecture if they say yes or no. That doesn't make sense to me.

causes a significant degradation of overall Kubernetes scalability/performance.
(Ideally it would be “SLI degradation of more than X%” or “breaking SLO”, but
initially it may also be SIG-scalability decision based on public test results).
- We can block a feature from transitioning to GA status if it cannot be used at
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was clear than it apparently is. The overall idea is:

  • Alpha->Beta: this can't break existing SLOs (but feature itself doesn't need to scale well)
    [As an example: we used to have a store with pod anti-affinity that even if it wasn't used, it was completely breaking scheduling throughput. That's not acceptable.]

  • Beta->GA: feature itself has to scale.
    This may (in rare cases) require introducing a new SLI/SLO to verify that, but in majority just using that feature in our scalability tests and verifying that already existing SLOs are met.

Rephrased to better match what i wrote above.


[performance/scalability SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md

### SIG Scalability approvers
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave up for now and switched to tech leads with a TODO to create a dedicated group for it.

Copy link
Member

@timothysc timothysc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think fixup the TODOs and minor wording order, then LGTM.

identified and mitigated.
The “Rules of engagement” of pausing merge-queue and rationale for
necessity of its introduce are explained in a separate doc. <br/>
TODO(wojtek-t, shyamjvs): Write it down and link here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

- We can require a SIG to introduce a regression-catching benchmark test for a
scalability-critical functionality.

For the record, by regression above we mean a regression identified by the set
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might just want to define taxonomy at the beginning or end, this seems weirdly placed. Could just remove.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@wojtek-t wojtek-t force-pushed the sig_scalability_charter branch from f9777ff to e2e2569 Compare August 29, 2018 10:13
@wojtek-t
Copy link
Member Author

PTAL

Copy link
Member

@timothysc timothysc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing then LGTM

- with an explicit approval from a SIG-scalability tech lead and
- after having passed performance testing on biggest supported clusters (unless
found unnecessary by approver)
TODO(wojtek-t): Create a larger group (whose members doesn't have to necessary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO - still exists.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to resolve it now - this is a bigger thing and for now I think it's fine to start with tech leads.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just remove that sentence then, and PR the charter at a later time when TODO is less ambiguous.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense - done.

- Mitigate the regression. This may mean e.g.:
- reverting the PR
- switching a feature off (preferably by default, as last resort only in tests)
- ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this .... It doesn't add anything to the list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough - done.

@wojtek-t wojtek-t force-pushed the sig_scalability_charter branch from e2e2569 to b6ea653 Compare August 30, 2018 06:42
Copy link
Member Author

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for comments - PTAL

- Mitigate the regression. This may mean e.g.:
- reverting the PR
- switching a feature off (preferably by default, as last resort only in tests)
- ...
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough - done.

- with an explicit approval from a SIG-scalability tech lead and
- after having passed performance testing on biggest supported clusters (unless
found unnecessary by approver)
TODO(wojtek-t): Create a larger group (whose members doesn't have to necessary
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense - done.

Copy link
Member

@timothysc timothysc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@wojtek-t thank you for seeing this through!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 31, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: countspongebob, timothysc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@timothysc timothysc removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Aug 31, 2018
@timothysc timothysc dismissed stale reviews from philips and spiffxp August 31, 2018 17:52

out-of-date

@k8s-ci-robot k8s-ci-robot merged commit 55d3d7c into kubernetes:master Aug 31, 2018
Copy link
Member

@spiffxp spiffxp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timothysc I do not appreciate that this was merged without me having gotten another pass at it. @wojtek-t reached out to me privately asking for another look, which I was in the middle of doing when this was merged.

There a few things that need to be followed up on

[performance/scalability SLOs] regression (identified by the set of release
blocking scalability/performance tests). The offending PR should only be
merged again after proving to pass tests at scale.
- In the even of a performance regression, we can block all PRs from being
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: event

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #2617 (thanks)

merged again after proving to pass tests at scale.
- In the even of a performance regression, we can block all PRs from being
merged into the relevant repos until the cause of the regression is
identified and mitigated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: whitespace

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #2617 - thanks

necessity of its introduce are explained in [a separate doc](./block_merges.md).
- We require significant changes (in terms of impact, such as: update of etcd,
update of Go version, major architectural changes, etc.) may only be merged:
- with an explicit approval from a SIG-scalability tech lead and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see your tech leads in sigs.yaml

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding now. Following up on that - expect a PR in the next few days.

### Rules of engagement.
The rules of engagement for blocking merges are as following:

- Observe as scalability regression on one of release-blocking test suites.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which suites specifically?

How is a regression observed? The suites have been continually failing, so it doesn't seem like it's a green-to-red transition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's "green-red transition".
The fact that tests are red now and it's not possible, shouldn simply mean that we don't have the right.
We need to transition from green to red to declare a regression.

Clarifying that.

- reverting the PR
- switching a feature off (preferably by default, as last resort only in tests)
- fixing the problem (if it's easy and quick to fix)
- Unblock PR merged.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "Unblock merges of all PRs to the relevant repos"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

send out a PR with that.

The rules of engagement for blocking merges are as following:

- Observe as scalability regression on one of release-blocking test suites.
- Block merges of all PRs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Block merges of all PRs to the relevant repos, declaring which repos those are and why

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

send out a PR with that

- fixing the problem (if it's easy and quick to fix)
- Unblock PR merged.

The exact technical mechanisms for it are out of scope for this document.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not here then where are these mechanisms documented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be hidden in the test-infrastructure (in the past we were talking about opening a bug that is appropriate labeled that merge-bot is looking into). But no matter what the mechanism will be, this is implementation detail and shouldn't be part of this high-level doc.

- once a regression is merged, and no other action is taken, it is only
a matter of time until another regression is merged on top of it,
- debugging the cause of two simultaneous (piled-up) regressions is
exponentially harder, see issue 53255 which links to past experience
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

send out a PR for that

a matter of time until another regression is merged on top of it,
- debugging the cause of two simultaneous (piled-up) regressions is
exponentially harder, see issue 53255 which links to past experience
- we need to keep flakiness of merge-blocking jobs very low:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the threshold? I suspect we are not meeting this today, so I would like to quantify how we're approaching this delta from actual to intended

@spiffxp
Copy link
Member

spiffxp commented Aug 31, 2018

@timothysc Next time please give a last call ping prior to merging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. committee/steering Denotes an issue or PR intended to be handled by the steering committee. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants