
KEP-3138: Increase the Reliability Bar proposal #3139

Closed

Conversation

@wojtek-t (Member)

/assign @johnbelamaric @derekwaynecarr @dims

/cc @smarterclayton @deads2k @liggitt @thockin

@k8s-ci-robot added the cncf-cla: yes (the PR's author has signed the CNCF CLA) and size/L (changes 100-499 lines, ignoring generated files) labels on Jan 14, 2022.
@wojtek-t (Member Author)

/hold

To give people more time to react.

@k8s-ci-robot added the do-not-merge/hold (someone has issued a /hold command), kind/kep (KEP tracking issues and PRs modifying the KEP directory), and sig/architecture (relevant to SIG Architecture) labels on Jan 14, 2022.
@wojtek-t force-pushed the increase_reliability_bar branch from a632ea2 to 45a662f on January 14, 2022 at 13:01.
issues, test flakes etc. This approach isn’t sustainable and doesn’t scale.
We need to make it the business of everyone in the community.

[Production Readiness Survey]: https://datastudio.google.com/reporting/2e9c7439-202b-48a9-8c57-4459e0d69c8d/page/GC5HB
@ahg-g (Member), Jan 14, 2022

Can we open up access to this dashboard somehow? I still can't access it.

wojtek-t (Member Author)

@deads2k @johnbelamaric - will you be able to open it?

forward is to increase our observability by adding logs, metrics, events,
etc. to allow debugging future instances of the flake.

1. Once debugged and fixed, the issue should be labeled with the root cause
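As a purely illustrative sketch of the kind of instrumentation the excerpt above asks for (using the Python prometheus_client library and invented metric names, not anything taken from the KEP or an existing Kubernetes component), adding a flake-oriented metric could look roughly like this:

```python
# Hypothetical example only: metric names and labels are invented for illustration.
from prometheus_client import Counter, start_http_server

# Count suspected flaky failures, labeled by test name and root-cause category,
# so recurring flakes can be queried later instead of being lost in job logs.
flaky_failures_total = Counter(
    "e2e_flaky_failures_total",
    "Test failures suspected to be flakes",
    ["test", "root_cause"],
)

if __name__ == "__main__":
    start_http_server(9090)  # expose /metrics for scraping
    # Record one suspected flake; in real instrumentation this would happen
    # wherever the component or test harness detects the failure.
    flaky_failures_total.labels(test="example-test", root_cause="timeout").inc()
```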
Member

IME of fixing Kubelet e2es lately, it would be useful to also explicitly track issues where features were modified but tests were not updated properly.

When optional jobs are flaky, a lot of legitimate failures get missed because folks stop reading the results.

wojtek-t (Member Author)

> When optional jobs are flaky, a lot of legitimate failures get missed because folks stop reading the results.

+1

Regarding tracking - do you have a more specific suggestion? In principle, we should be updating tests together with code, so I'm assuming no one is consciously ignoring them. And if it's not conscious, then we can't really track it.
I guess I'm missing some point in your thinking - can you clarify?

Member

> a lot of legitimate failures get missed because folks stop reading the results.

Which folks do you refer to here? I think each SIG is responsible for reading its own results; there is a sig-release group that monitors the critical jobs, but not the optional ones.

wojtek-t (Member Author)

> Which folks do you refer to here? I think each SIG is responsible for reading its own results,

We have so many jobs now that there are many no one is really looking at (or at least not looking closely enough to find flakes - rather just checking that something isn't hard down).

Member

Yeah, job ownership is another problem.

@wojtek-t force-pushed the increase_reliability_bar branch from 45a662f to 7b2ca5f on January 17, 2022 at 09:49.
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wojtek-t
To complete the pull request process, please ask for approval from johnbelamaric after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jberkus (Contributor) commented Jan 17, 2022

Given the nature of this KEP, it needs sign-off by a majority of SIG Leads, or by the Steering Committee, or possibly both. Not just Arch.

Comment on lines +233 to +235
1. We will use [Sippy][] as our source of truth for determining the health
of the individual jobs and tests.
We’ve already set up [Kubernetes Sippy][] that is monitoring the jobs
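As a rough, hedged illustration of what "health of the individual jobs" boils down to (a toy calculation only, not Sippy's actual data model, API, or thresholds; the 95% cutoff is an arbitrary example), the signal is essentially a pass rate aggregated over recent runs:

```python
# Toy example only: not Sippy's real implementation, schema, or thresholds.
from collections import defaultdict

def job_health(runs, threshold=0.95):
    """runs: iterable of (job_name, passed) tuples from recent CI history."""
    totals = defaultdict(lambda: [0, 0])  # job -> [passes, total runs]
    for job, passed in runs:
        totals[job][1] += 1
        if passed:
            totals[job][0] += 1
    return {
        job: {"pass_rate": passes / total, "healthy": passes / total >= threshold}
        for job, (passes, total) in totals.items()
    }

if __name__ == "__main__":
    history = [("ci-kubernetes-e2e-gce", True)] * 48 + [("ci-kubernetes-e2e-gce", False)] * 2
    print(job_health(history))  # {'ci-kubernetes-e2e-gce': {'pass_rate': 0.96, 'healthy': True}}
```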
Member

We have to make some commitment that the tool will be maintained, and decide which SIG is going to own it.

wojtek-t (Member Author)

Agree.

Another comment:

It sounds like a prereq for this work or step 0 is to solidify ownership of the tools and the jobs (which was also noted below).

- The proper labeling of issues would mean that people would be able to easily
find issues currently open for a given SIG, via a simple GitHub query like
this: `https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Asig%2Ffoo+label%3Akind%2Freliability-issue`
- Notify the SIG on Slack asking for explicit acknowledgement on the issue
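For illustration only (an editorial sketch, not part of the KEP: sig/foo is a placeholder and kind/reliability-issue is merely the label proposed above, which may not exist), the same label query can also be run programmatically against the GitHub search API:

```python
# Sketch of querying the GitHub search API for the labels proposed above.
# "sig/foo" is a placeholder and "kind/reliability-issue" may not exist yet.
import json
import urllib.parse
import urllib.request

query = (
    "repo:kubernetes/kubernetes is:open is:issue "
    "label:sig/foo label:kind/reliability-issue"
)
url = "https://api.github.com/search/issues?q=" + urllib.parse.quote(query)

with urllib.request.urlopen(url) as resp:
    result = json.load(resp)

print(f"{result['total_count']} open reliability issue(s) for sig/foo")
for item in result["items"]:
    print(f"#{item['number']}: {item['title']}")
```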
Member

I foresee this being a point of conflict or bikeshedding.
What if the notification used multiple channels in parallel, Slack and email to start with?

wojtek-t (Member Author)

You mean saying here "Notify the SIG on both Slack and email (and whatever else you have on your mind) ..."?

What would it give us? Given that we need an explicit ack from them, what would multiple channels buy us?

Member

Yeah, forget that comment - I don't have a good answer, sorry. Just picking one channel seems like the best solution.

@neolit123 (Member) left a comment

Thank you for preparing this, @wojtek-t.
A couple of questions:

  • Do you have an approximate duration for the first milestone, or even the later ones?

  • The proposal talks about long-term investments in raising the bar with additional process. Who would ensure the new process functions over time (i.e. who are the actors)? For example, would it fall on the release team or WG Reliability to track / block specific graduations until older problems are resolved?

  • (Related to the above) WGs are ephemeral. If the WG has to continue to keep the bar high over time, do you have plans to move the WG to a subproject of SIG Architecture or another SIG in the future?

  • Would this proposal apply only to what we consider "core" Kubernetes, or would additional tools (that are part of the release) such as kubectl and kubeadm also be in scope? This might need clarification in "Goals/Non-goals", since between the k and k-sigs orgs there is a lot of code that is not "core", and kubernetes/kubernetes might split further in the future.

@kikisdeliveryservice changed the title from "Increase the Reliability Bar proposal" to "KEP-3138: Increase the Reliability Bar proposal" on Feb 3, 2022.
with a realistic (and non-conservative) deadline.
*We really hope that this will be effective in the majority of cases and
we will never be forced to use the point (2) special powers.*
1. if the issue (reported above) is not addressed before the deadline, block
Member

Proposing to punish SIGs is harmful to the spirit of supportive collaboration that is central to Kubernetes's success, and runs counter to the project's values. The reliability bar can be kept high and raised without compulsion. No potential reliability improvements are worth the loss of community goodwill caused by establishing a punitive system of enforcement.

wojtek-t (Member Author)

I don't think I follow this comment.
This proposal isn't about punishing anyone - I don't know why you're interpreting it as punishment. It's about saying that, from the project's perspective, some things are more important than others.

If you're working on a project and there are things that have to be done but no one is willing to pick them up, the end state isn't saying "forget it - let's work on whatever you want to work on". That's not how it works. If some work needs to be done, it needs to be done. And we can't replace that work with a different kind of work.

You could argue that "X isn't important and shouldn't belong to this category" - that's a discussion I'm definitely open to having. But that's not what you're saying.

> No potential reliability improvements are worth the loss of community goodwill caused by establishing a punitive system of enforcement.

I don't see how goodwill is related to this. How is goodwill connected to us saying that some things are more important than others?
Regarding it being "worth it": the whole Kubernetes project (like any other project) only makes sense as long as someone wants to use it. If it becomes unreliable, real users aren't going to use it, because it won't satisfy their needs. Something else will be created instead.

The fact that Kubernetes has the traction it has now doesn't mean it will have the same traction in two years.
If we don't provide feature X, it may prevent the group of users who want it from using Kubernetes. That's obviously bad. But if the system becomes more and more unreliable, we can lose every single user.

And again - it's not a punishment. It's just saying that some things are more important than others.

In reality - I would prefer that we never have to use this right. And we can always use exceptions to not use it. But we need to be able to achieve certain things.

I can give you an actual example - if you look at the SIG Scalability charter, we have the right to roll back any PR that breaks scalability:
https://github.com/kubernetes/community/blob/master/sig-scalability/charter.md#what-can-we-dorequire-from-other-sigs
This was approved by steering many years ago.
Do we want to use it? Of course not. Do regressions happen? They do. Do we immediately roll back? Of course not - we ask people to fix things. Did we use it in the last 5 releases (I can probably even say 10 or more)? No (I remember we used it once around ~1.10 or so). Does it hurt the community? I doubt it.

Now - I'm definitely happy to start this proposal in "dry-run" mode (without any enforcement), run it like that for say 1-2 releases, and see what happens. If it works as expected, we can leave it like that. But if it doesn't, then we should do something.

Contributor

The sig-scalability rollback was never approved by Steering -- you added it back when Steering didn't exist and SIG charters did not need approval. And you've never used it, and if you actually tried, you might find that you don't actually have that power.

wojtek-t (Member Author)

> The sig-scalability rollback was never approved by Steering --

Yes, it was approved by Steering.

> And you've never used it

Yes, I did use that power. I blocked making CoreDNS the default back then.

wojtek-t (Member Author)

Just to back that up:

kubernetes/community#2149 is the PR (there was also a follow-up to it, linked from there) where you can clearly see Steering Committee approval of the charter.

And this is the revert where we used that power:
kubernetes/kubernetes#68629
FWIW - it's the only case that I'm aware of where we used that (so almost 12 releases ago).

@aojea (Member), Feb 17, 2022

To try to refocus the conversation and not diverge into specifics: Kubernetes, like other big open-source projects, is a magnet for companies and/or people who want to add features and enhance their portfolios or CVs... but the reality is that, day to day, only a few people do the dirty work - fixing flakes, upgrading dependencies, ... - work that doesn't shine, that you can't brag about in a tweet or add to your CV, and that doesn't attract companies, which are not willing to spend their employees' time on those duties.

I think this is about accountability and responsibility. A SIG is not only there to add features and to decide what goes in or not; it has to maintain its area of interest, add tests, check that they work, and improve reliability and performance... I can't see how that is not a fair request. Maybe we can reword it, but this is a community project - these things are not something you can hand over to a QE department or the like.

Another comment:

Two notes:

  1. "we can always use exceptions to not use it", these exceptions should be memorialized in policy (in this enhancement)
  2. " I'm definitely happy to start this proposal in "dry-run" mode (without any enforcement) and run it like that for say 1-2 releases and see what happens" I think the rollout process ie dry-run needs to be memorialized here as well.

These are non-trivial changes will high impact across the community, so let's try to incorporate all of this into the plan so that it is clear what people can expect at each stage of the rollout.

@wojtek-t (Member Author)

@neolit123 - thanks a lot for the feedback (and sorry for the delay). Answering your questions below:

> Do you have an approximate duration for the first milestone, or even the later ones?

I expect that the first milestone will not really end. Even if we reach a perfect state at some point, new flakes will keep appearing as new code is created. But I expect this will eventually be handed over to the CI Signal team.

> The proposal talks about long-term investments in raising the bar with additional process. Who would ensure the new process functions over time (i.e. who are the actors)? For example, would it fall on the release team or WG Reliability to track / block specific graduations until older problems are resolved?

It's definitely on WG Reliability for at least the first few releases. We can discuss later whether we can merge it into something else.

> (Related to the above) WGs are ephemeral. If the WG has to continue to keep the bar high over time, do you have plans to move the WG to a subproject of SIG Architecture or another SIG in the future?

I'm thinking about this. I was also considering merging it a bit with Production Readiness. But there are no clear plans for now.

> Would this proposal apply only to what we consider "core" Kubernetes, or would additional tools (that are part of the release) such as kubectl and kubeadm also be in scope? This might need clarification in "Goals/Non-goals", since between the k and k-sigs orgs there is a lot of code that is not "core", and kubernetes/kubernetes might split further in the future.

This is a great point. Those should generally be in scope eventually (what if kubectl is completely unreliable?), but for simplicity we may want to start with just the "core", see how it works, and extend the scope later. Does that make sense?

@jberkus (Contributor) commented Feb 16, 2022

This KEP is missing a necessary Phase I before the Phase I defined in the KEP.

Right now, we do not have a definition of reliability that everyone agrees on and that can be empirically defined and observed in our project metrics. We have test job flakiness, but even flakiness is being measured qualitatively, and surely flakiness is not the entirety of reliability?

Phase I should consist entirely of three things:

  1. Creating a definition of "reliability" on which the majority of the project (meaning a majority of maintainers across all SIGs) agrees.
  2. Improving existing tools, and adding new ones, that allow all contributors to clearly and easily see how we're doing with regard to this definition of reliability.
  3. Setting a series of empirical targets for meeting that definition of reliability, likely escalating with each successive release.

Once a definition of reliability exists that's backed up by easily read dashboards, then the next step will be to see if reliability improves organically through contributor efforts; it's entirely possible that better visibility is all that's needed.

@wojtek-t (Member Author)

> We have test job flakiness, but even flakiness is being measured qualitatively, and surely flakiness is not the entirety of reliability?

I agree with that.
But I also believe that "good is better than perfect". Everyone agrees that flakiness is a component of reliability, so instead of spending more time on definitions, we should just start (as I mentioned in another comment, that may mean starting in dry-run mode) and begin making progress, instead of trying to figure out everything at the very beginning.

@jberkus (Contributor) commented Feb 17, 2022

> But I also believe that "good is better than perfect". Everyone agrees that flakiness is a component of reliability, so instead of spending more time on definitions, we should just start (as I mentioned in another comment, that may mean starting in dry-run mode) and begin making progress, instead of trying to figure out everything at the very beginning.

If you try to enforce acceptance before establishing metrics, what you're doing is not good, and it will not have good results.

Saying "I can't define reliability but I know it when I see it" is absolutely no way to run a shared project.

Further, if this is just about flakiness, then this effort doesn't need a WG and it doesn't need any special powers. It's just a technical effort under SIG-Testing or SIG-Release.

@wojtek-t (Member Author) commented Mar 25, 2022

Sorry - catching up after vacation. Trying to respond to the things above.

> The perceived, or estimated, net value of the contribution is always positive at the time it is accepted, or it would not be accepted.
>
> If it turns out that the perception is in error, it does not fix that error to then block other (EDIT: unrelated) contributions. You can't fix a type 2 error by also committing a type 1 error later!

@lavalamp
I don't fully agree with this logic. The part that I disagree with is that "those are unrelated".
I agree that at the time of submission we believe a contribution has positive value. But as you point out, we're making errors. So let's imagine that a couple of days/weeks/months later we realize the contribution actually didn't have positive value, and we need to invest in improving it to actually make its value positive.

And here is where the "relation" comes in. We need someone to actually fix that problem - otherwise we're on a straight road to catastrophe (because such situations happen all the time).
We can't force individuals to do any particular job, but as a group (any individual SIG) we have some capacity to do things (I agree that we're all understaffed, but we have to admit that we have some capacity - otherwise we wouldn't be moving at all).
So now the question is how we spend this capacity. We can spend it the way we're spending it now (reviewing incoming proposals and PRs actually consumes a lot of our capacity), or we can invest the capacity saved by not reviewing certain contributions into fixing the mistakes of the past.
If no one volunteers to pick up the work that has to happen, I claim it's the responsibility of the maintainers to do this job. And if they do, that automatically means they won't have enough capacity to review new proposals, code, etc.
So this proposal only makes the above an explicit mandate for things we should have been doing for a long time. And (apart from the heroics mentioned a couple of times by different people above) we're not doing the best job here.

In other words - I agree that every single feature on its own has some positive value for our end users. But we're all humans who make mistakes, and we need to admit that those mistakes have to keep getting addressed.
I think we all agree that "addressing mistakes" also has positive value. And if we don't have the capacity to do both - we have to prioritize.
This proposal is effectively saying that a certain fraction of capacity has to go into addressing mistakes, even if that results in some feature work not happening or being postponed.

> I understand that perception, but I don't see this as punitive. My primary concern is actual further destabilization of areas by addition of new features/complexity, and preventing new features in unhealthy areas seems like a reasonable proposal in response.

To Jordan's point above - I hugely agree with this.
And on the "being punitive" part - we definitely don't want to penalize anyone. This is one of the reasons (maybe we should make it more explicit in the proposal) why blocking contributions should never be automated. There can be many objective reasons why a certain thing didn't happen, and in the end it will be our decision (as a community - whether that's the Release Team, SIG Architecture, or some other group, I don't have a strong opinion) and we can always decide not to block.
But we need a way to stop degrading the quality of the product.

> I think the better way to make progress on this is to make scalability tests which are cheap enough to run pre-merge. I think most things that break a cluster at large scale should also break a smaller but appropriately resource-constrained scenario?

This isn't a solution. I've seen a ton of issues that weren't related to scale at all (in fact, the huge majority of bugs I see in our production are not related to scalability).
I'm not saying we don't have scalability-related ones - we definitely do.
But you can't claim that if we run scalability tests pre-merge we will be done.

FWIW, our 5k-node scalability tests running daily are solid green: https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-performance. So the problem isn't that we don't run something in presubmits.

> It's not just scalability, though, or even primarily. Downgrade tests have been broken forever and nobody's fixing them, and we've kinda given up on them. And currently an upgrade test on master-blocking has been broken for two weeks and is not getting fixed.

+1

> Now we have them green, but we still regularly have critical bugs escape due to missing coverage, and attempts at increasing coverage can be incredibly hard/slow to land :(.

@endocrimes

+100 - I agree that our top problem in the testing area is test coverage.

The only point is that we can't just start adding more tests, as we will soon drown. We have to ensure that the current set of "important tests" is solid (I'm thinking of the release-blocking/release-informing ones) and over time extend this set more and more. Otherwise, we will end up with even more gazillions of flaky tests and their value will be negligible, because everyone will stop paying attention to them.

> Since reliability will always be an issue and no silver bullet can be found for it, I think it would be useful to find a first start (without the drastic measures described) and increase reliability with many small following steps (I'm +1 for the proposed dry run).

@leonardpahlke

I'm happy to start with a dry run, but I'm afraid we all know it won't do the job - we're just afraid to say that out loud.
The CI Signal team is doing a bunch of work to track where we are, but they lack the power to get the things they find addressed (I don't want to point at anyone, but I also want to give a specific example - so here is one: kubernetes/kubernetes#107530).

We need to ensure that if, as a community, we point at issues that should be addressed (and in particular for flaky tests the CI Signal team represents our community here), they really do get addressed.
And honestly - excluding some heroics - we don't have a good track record there. And I don't know how a "dry run" would change that.

[I'm fine with a dry run in the sense that it would allow us to figure out implementation details of the process that we might be missing here. So it makes sense. But I wouldn't count on it solving the problem.]

@leonardpahlke (Member) commented Mar 25, 2022

> The CI Signal team is doing a bunch of work to track where we are, but they lack the power to get the things they find addressed (I don't want to point at anyone, but I also want to give a specific example - so here is one: kubernetes/kubernetes#107530).

+1 - I will take this discussion to sig-release. Added to the agenda of the next meeting (Apr 5). I'm not sure what these "special powers" should look like (:eyes: avoid blocking SIGs & blocking new features).

Besides the "lack of authority", I see another issue that we need to discuss in sig-release. I think that if reliability monitoring is added to the CI Signal responsibilities, something more is needed than just changing the description in the handbook. I don't think the team structure is designed to handle the added responsibility.

  • Rotation of the team for each cycle (-> makes it difficult to build knowledge and experience)
  • Team structure, 1 Lead & 4 Shadows (-> CI Signal already has a packed job and handbook)

I think we can resolve those questions. I will also carry this discussion over to the next sig-release call (see the agenda).

@jberkus (Contributor) commented Mar 26, 2022

We had an extensive discussion at the last Community Meeting, which included a number of ideas for improving reliability.

@wojtek-t (Member Author)

> We had an extensive discussion at the last Community Meeting, which included a number of ideas for improving reliability.

Thanks @jberkus. I just watched it (unfortunately I couldn't attend the actual meeting) and here is my subjective summary:

  • We're lacking incentives for fixing tests (including that contributors come and go, often promising to deliver tests later and not doing so, but even more often simply moving on and leaving the particular feature unmaintained).
  • We have a lot of tests and we should audit them.
  • Tooling and knowledge for debugging tests is an important gap that causes a lot of pain for many people.
  • Test coverage & flakiness should be considered as an input for approving features in a given area.
  • Proper test coverage should be a requirement for merging the actual code.
  • We're relying too heavily on integration/e2e tests - and missing a bunch of unit test coverage.

So I personally read it as two categories:

  1. Problems with ensuring that issues (in particular flaky tests) in existing features get addressed (amplified by the lack of knowledge/tooling on how to debug those issues).
  2. Problems with test coverage.

So I would claim that this is actually aligned with the proposal here - with the first milestone being focused on (1) and the second milestone being focused on (2).

My motivation for first focusing on flaky tests and only then on coverage was mainly the following:

  • If we start adding more and more tests without systematically addressing our flakiness problems, the number of flakes will only grow, making our problems in this category worse over time.
  • So I think we should first address the flakiness aspect, as preparation for increasing our coverage.

That said, as already admitted in the proposal, I definitely agree that lack of proper coverage is one of our top problems. And if there is wider agreement on pushing in that direction first - we can definitely change the order.
My main thought here is that, without addressing the flakiness problem, it will only keep growing, so if we believe we should start with coverage first, we have to attack flakiness almost immediately afterwards.

@jberkus (Contributor) commented Apr 4, 2022

Wojtek:

This is a big problem, and I think we really need to attack it on multiple fronts. Yes, we need to address flakiness in existing test jobs. But that's not a separate issue from coverage, because one of the reasons test jobs are flaky is functionality that lacks good tests and gets broken in ways that don't show up until the E2E tests. Both things need to be addressed in tandem.

Also, the longer we wait to examine testing standards for new code, the more cleanup we have to do when we get there.

@aojea (Member) commented Apr 5, 2022

> Also, the longer we wait to examine testing standards for new code, the more cleanup we have to do when we get there.

Do we agree this is a SIG responsibility that should be "enforced" during the review process?

@lavalamp (Member)

Sorry for the belated response. @wojtek-t

> This proposal is effectively saying that a certain fraction of capacity has to go into addressing mistakes, even if that results in some feature work not happening or being postponed.

If I believed that would be the actual result of this proposal, I'd likely be in favor. Instead, I believe the result of this proposal is likely (50% chance?) to be a permanent reduction in future capacity, if the rule is invoked, as marginal contributors or even maintainers are driven away and don't come back.

(Put another way: you are telling me the policy's intended effect -- I agree that your intentions are good and hopefully it didn't sound like I was saying otherwise. I am complaining about the effect I actually expect the policy to have. I understand what you are trying to accomplish, I just think this proposed policy does not in fact accomplish that.)

> we definitely don't want to penalize anyone. This is one of the reasons (maybe we should make it more explicit in the proposal) why blocking contributions should never be automated.

IMO that makes it worse, not better -- it will be a big political ordeal to invoke this rule, if that is the plan?

> I've seen a ton of issues that weren't related to scale at all (in fact, the huge majority of bugs I see in our production are not related to scalability).

But that's great news -- we can fix it with regular testing improvements, then?

Reading subsequent comments does make it seem like there's a lot to be gained from improved testing.

@jberkus (Contributor) commented Apr 12, 2022

> > Also, the longer we wait to examine testing standards for new code, the more cleanup we have to do when we get there.
>
> Do we agree this is a SIG responsibility that should be "enforced" during the review process?

It is everyone's responsibility. The approving SIG(s), any other reviewers, SIGs Testing, Architecture, and Release. Heck, even the Steering Committee. KEP reviews and code reviews involve a lot of people, not all of them from the owning SIG(s), and all of those people should be looking at the quality of testing for the new features/API changes/whatever.

As a project, we need to stop accepting enhancements with inadequate testing. Because, not infrequently, once a feature gets merged, we never see some of the people involved again.

@liggitt (Member) commented Apr 12, 2022

> As a project, we need to stop accepting enhancements with inadequate testing.

Even beyond this, we (area/component/sig leads) need to stop accepting enhancements that significantly modify an existing area which is inadequately tested. Even if the newly added feature/code is well tested, regressing existing capabilities and users is a real risk we've seen manifest more frequently in the past few releases.

Being aware of where there have been regressions, and where testing gaps or fragility remain, is important to inform what changes are accepted when, and what work is prioritized.

@lavalamp (Member)

> stop accepting enhancements that significantly modify an existing area which is inadequately tested

This one I can get behind!!

@wojtek-t (Member Author)

@lavalamp

> (Put another way: you are telling me the policy's intended effect -- I agree that your intentions are good and hopefully it didn't sound like I was saying otherwise. I am complaining about the effect I actually expect the policy to have. I understand what you are trying to accomplish, I just think this proposed policy does not in fact accomplish that.)

You didn't sound like you were implying something different - we all want what's best for the project. It's just not black-and-white, because different things are more important to different people (which is absolutely OK).

That being said - do you have any other suggestions for how we can make more people interested in our testing quality? I would be more than happy to hear them - it's just that I don't know what else we can do (for already-existing stuff - I agree we can do much more for newly introduced stuff).

> IMO that makes it worse, not better -- it will be a big political ordeal to invoke this rule, if that is the plan?

Which I can live with - as everyone is pointing out (and I fully agree), the consequences are potentially significant, so we really want to treat it as a last resort. If we (as a project) decide that even though SIG Foo didn't do X as they should have, there were good reasons for it and we shouldn't take any action - that's fine. Even having those discussions would benefit the project, IMHO.

@liggitt

> Even beyond this, we (area/component/sig leads) need to stop accepting enhancements that significantly modify an existing area which is inadequately tested. Even if the newly added feature/code is well tested, regressing existing capabilities and users is a real risk we've seen manifest more frequently in the past few releases.

Huge +100.
I actually went ahead and opened #3279 as a discussion starter for a more concrete proposal in this area. I'm fine with changing anything there - it's just easier to have an entry point for the discussion.

I personally don't think it is enough, and I still think a proposal like this is needed.
But it will certainly help us, and for the sake of making progress we can absolutely start with something on which we seem to have high-level consensus.

@jberkus (Contributor) commented Apr 14, 2022

BTW, I will point out that the Release Team just turned down an enhancement exception on the basis of stability. I think a piece of this puzzle is to help the Release and Enhancements teams better judge stability effects and feel that they have the project's backing to reject things.

@wojtek-t (Member Author)

Two issues I opened today that are related to this:
kubernetes/test-infra#25966
kubernetes/test-infra#25967

@dims (Member) commented Apr 18, 2022

We have to hold our SIGs accountable for CI jobs in their own areas; see:
kubernetes/kubernetes#109521

We are sorely failing to even do the minimum of keeping our CI jobs up-to-date and 💚

Also see kubernetes/test-infra#18600 and kubernetes/test-infra#18601

@lavalamp (Member)

Given #3279 and my recent open letter to the dev@kubernetes.io list, I think we should take a bit of time to see if those have positive effects. I propose we re-evaluate this, say, when we're in code freeze for 1.25 (one release from now).

@wojtek-t (Member Author)

> Given #3279 and my recent open letter to the dev@kubernetes.io list, I think we should take a bit of time to see if those have positive effects. I propose we re-evaluate this, say, when we're in code freeze for 1.25 (one release from now).

Yes - let's not try to target 1.25 with this.
Let's see what changes with what we're doing now, try to build a bit more tooling (e.g. around coverage), and see where that gets us.

I would like to leave this PR open so we don't forget about it, but the KEP as described is definitely not targeting 1.25.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (the issue or PR has remained open with no activity and has become stale) on Sep 2, 2022.
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label (the issue or PR has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label on Oct 2, 2022.
@marcindulak

👀

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closed this PR.

In response to the triage robot's /close command above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cncf-cla: yes, do-not-merge/hold, kind/kep, lifecycle/rotten, sig/architecture, size/L