KEP-3138: Increase the Reliability Bar proposal #3139
Conversation
/hold To give people more time to react.
Force-pushed from a632ea2 to 45a662f.
issues, test flakes, etc. This approach isn't sustainable and doesn't scale.
We need to make it a business of everyone in the whole community.

[Production Readiness Survey]: https://datastudio.google.com/reporting/2e9c7439-202b-48a9-8c57-4459e0d69c8d/page/GC5HB
can we open up access to this dashboard somehow? I still can't access it.
@deads2k @johnbelamaric - will you be able to open it?
forward is to increase our observability by adding logs, metrics, events,
etc. to allow debugging future instances of the flake.

1. Once debugged and fixed, the issue should be labeled with the root cause
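To make the quoted guidance concrete, here is a minimal, hypothetical sketch (not part of the KEP; the function and field names are invented) of the kind of structured logging that makes a future flake debuggable from CI logs alone, using `k8s.io/klog/v2`:

```go
package nodelifecycle

import (
	"time"

	"k8s.io/klog/v2"
)

// syncNode is a hypothetical reconcile step; only the structured log lines
// matter here. When the related e2e test flakes again, the key/value fields
// (node, attempt, elapsed) let us reconstruct what happened from CI logs alone.
func syncNode(nodeName string, attempt int) error {
	start := time.Now()
	klog.InfoS("Starting node sync", "node", nodeName, "attempt", attempt)

	if err := doSync(nodeName); err != nil {
		klog.ErrorS(err, "Node sync failed",
			"node", nodeName, "attempt", attempt, "elapsed", time.Since(start))
		return err
	}

	klog.InfoS("Node sync finished", "node", nodeName, "elapsed", time.Since(start))
	return nil
}

// doSync is a placeholder for the real work.
func doSync(nodeName string) error { return nil }
```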
In my experience fixing Kubelet e2es lately, it would be useful to also explicitly track issues where features were modified but the tests were not updated properly.
When optional jobs are flaky, a lot of legitimate failures get missed because folks stop reading the results.
When optional jobs are flaky, a lot of legitimate failures get missed because folks stop reading the results.
+1
Regarding tracking - do you have a more specific suggestion? In principle, we should be updating tests together with code, so I'm assuming no one is consciously ignoring them. And if it's not conscious, then we can't really track it.
I guess I'm missing some point in your thinking - can you clarify?
a lot of legitimate failures get missed because folks stop reading the results.
which folks are you referring to here? I think that each SIG is responsible for reading their own results; there is a sig-release group that monitors the critical jobs, but not the optional ones
which folks are you referring to here? I think that each SIG is responsible for reading their own results,
We have so many jobs now that there are many that no one is really looking at (or not looking closely enough to find flakes - rather just checking that something isn't hard down).
yeah, job ownership is another problem
Force-pushed from 45a662f to 7b2ca5f.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: wojtek-t. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Given the nature of this KEP, it needs sign-off by a majority of SIG Leads, or by the Steering Committee, or possibly both. Not just Arch.
1. We will use [Sippy][] as our source of truth for determining the health
   of the individual jobs and tests.
   We've already set up [Kubernetes Sippy][] that is monitoring the jobs
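For intuition only - this is not how Sippy is actually implemented - here is a minimal sketch of the basic per-test health signal such a dashboard derives, assuming we already have a flat list of pass/fail results per test:

```go
package health

// Result is one recorded outcome of a test in a CI job run.
type Result struct {
	TestName string
	Passed   bool
}

// PassRates aggregates raw results into a pass rate per test - the basic
// signal a dashboard in the spirit of Sippy exposes when deciding whether a
// test is healthy, flaky, or permanently failing.
func PassRates(results []Result) map[string]float64 {
	passes := map[string]int{}
	totals := map[string]int{}
	for _, r := range results {
		totals[r.TestName]++
		if r.Passed {
			passes[r.TestName]++
		}
	}
	rates := make(map[string]float64, len(totals))
	for name, total := range totals {
		rates[name] = float64(passes[name]) / float64(total)
	}
	return rates
}
```

A real tool additionally windows the data by time and by job, and treats a fail-then-pass within a single run as a flake rather than a hard failure.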
we have to make some commitment that the tool will be maintained, and decide which SIG is going to own it
Agree.
It sounds like a prereq for this work or step 0 is to solidify ownership of the tools and the jobs (which was also noted below).
- The proper labeling of issues would mean that people would be able to easily
  find issues currently opened for a given SIG, via a simple GitHub query like
  this `https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Asig%2Ffoo+label%3Akind%2Freliability-issue`
- Notify the SIG on Slack asking for explicit acknowledgement on the issue
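The GitHub query quoted above can also be run programmatically. As a hedged sketch (the `kind/reliability-issue` label is only proposed by this KEP and does not exist yet, and `sig/foo` is a placeholder), the same filter against the GitHub search API looks roughly like this:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// Same filter as the query in the KEP text: open kubernetes/kubernetes
	// issues carrying both the SIG label and the proposed reliability label.
	q := `repo:kubernetes/kubernetes is:issue is:open label:sig/foo label:kind/reliability-issue`
	resp, err := http.Get("https://api.github.com/search/issues?q=" + url.QueryEscape(q))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		TotalCount int `json:"total_count"`
		Items      []struct {
			Number int    `json:"number"`
			Title  string `json:"title"`
		} `json:"items"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}

	fmt.Printf("%d open reliability issues for sig/foo\n", out.TotalCount)
	for _, it := range out.Items {
		fmt.Printf("#%d %s\n", it.Number, it.Title)
	}
}
```

Unauthenticated requests to the search API are heavily rate-limited, so a real integration would send a token in the Authorization header.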
I foresee this being a point of conflict or bikeshedding.
What if the notification used multiple channels in parallel, Slack and email to start with?
You mean saying here "Notify the SIG on both Slack and email (and whatever else you have in mind) ..."?
What would that give us? Given that we need an explicit ack from them, what do multiple channels give us?
yeah, forget that comment, I don't have a good answer, sorry - just picking one channel seems the best solution
thank you for preparing this @wojtek-t,
a couple of questions:
- do you have an approximate duration for the first milestone or even the later ones?
- the proposal talks about long-term investments for raising the bar with additional process. who would ensure the new process functions over time (i.e. who are the actors)? for example, would it fall on the release team or WG Reliability to track / block specific graduations until older problems are resolved?
- (related to the above) WGs are ephemeral. if the WG has to continue to keep the bar high over time, do you have plans to move the WG to a subproject of Arch or a SIG in the future?
- would this proposal apply only to what we consider "core" Kubernetes, or would additional tools (that are part of the release) such as kubectl and kubeadm also be in scope? this might need clarification in "Goals/Non-goals", since between the k and k-sigs orgs there is a lot of code that is not "core", and kubernetes/kubernetes might split further in the future.
with a realistic (and non-conservative) deadline.
*We really hope that this will be effective in the majority of cases and
we will never be forced to use the point (2) special powers.*
1. if the issue (reported above) is not addressed before the deadline, block
Proposing to punish SIGs is harmful to the spirit of supportive collaboration that is central to Kubernetes's success, and runs counter to the project's values. The reliability bar can be kept high and raised without compulsion. No potential reliability improvements are worth the loss of community goodwill caused by establishing a punitive system of enforcement.
I think I completely don't follow this comment.
This proposal isn't about punishing anyone - I don't know why you're interpreting it as punishment. It's about saying that, from the project perspective, some things are more important than other things.
If you're working on a project and there are things that have to be done but no one is willing to pick them up, the end state isn't saying "forget it - let's work on whatever you want to work on". That's not how it works. If some work needs to be done, it needs to be done. And we can't replace that work with some other kind of work.
You could argue that "X isn't important and shouldn't belong to this category" - that's a discussion I'm definitely open to have. But that's not what you're saying.
No potential reliability improvements are worth the loss of community goodwill caused by establishing a punitive system of enforcement.
I don't see how goodwill is related to it. How is goodwill related to the fact that we're saying that some things are important?
Regarding "worth it": the whole Kubernetes project (like any other project) makes sense only as long as someone wants to use it. If it becomes unreliable, no real users are going to use it, because it won't satisfy their needs. Something else will be created.
The fact that Kubernetes has the traction it has now doesn't mean it will have the same traction in 2 years.
If we don't provide feature X, it may prevent some group of users who want it from using Kubernetes. That's obviously bad. But if the system becomes more and more unreliable, we can lose every single user.
And again - it's not a punishment. It's just saying that some things are more important than others.
In reality, I would prefer that we never have to use this right. And we can always use exceptions to not use it. But we need to be able to achieve certain things.
I can give you an actual example - if you look at the SIG Scalability charter, we have the right to roll back any PR that is breaking scalability:
https://github.com/kubernetes/community/blob/master/sig-scalability/charter.md#what-can-we-dorequire-from-other-sigs
This was approved by steering many years ago.
Do we want to use it? Of course not. Do regressions happen? They do. Do we immediately roll back? Of course not - we ask people to fix things. Did we use it in the last 5 releases (I can probably even say 10 or more)? No (I remember we used it once around ~1.10 or so). Does it hurt the community? I doubt it.
Now - I'm definitely happy to start this proposal in "dry-run" mode (without any enforcement) and run it like that for, say, 1-2 releases and see what happens. If it works as expected, we can leave it like that. But if it doesn't work, then we should do something.
The sig-scalability rollback was never approved by Steering -- you added it back in the time when steering didn't exist and SIG charters did not need approval. And you've never used it, and if you actually tried you might find that you don't actually have that power.
The sig-scalability rollback was never approved by Steering --
yes - it was approved by steering.
And you've never used it
Yes, I used that power. I myself blocked making CoreDNS the default back then.
Just to back it up:
kubernetes/community#2149 is the PR (there was also a follow up to it linked from there) where you clearly see steering committee approval on the charter.
And this is the revert where we used that power:
kubernetes/kubernetes#68629
FWIW - it's the only case that I'm aware of where we used that (so almost 12 releases ago).
to try to refocus the conversation and not diverge into specifics: Kubernetes, like other big open-source projects, is a magnet for companies and/or people to add features and enhance their portfolios or CVs... but the reality is that, on a day-to-day basis, there are only a few people doing the dirty work - fixing flakes, upgrading dependencies, ... - work that doesn't shine, that you can't brag about in a tweet or add to your CV. It also doesn't attract companies, which are not willing to spend their employees' time on those duties.
I think that this is about accountability and responsibility. A SIG is not only there to add features and to decide what goes in or not; it has to maintain its area of interest, add tests, check that they work, improve reliability, performance... I can't see how that is not a fair request. Maybe we can reword it, but this is a community project - these things are not something that you can hand over to the QE department or something like that.
Two notes:
- "we can always use exceptions to not use it", these exceptions should be memorialized in policy (in this enhancement)
- " I'm definitely happy to start this proposal in "dry-run" mode (without any enforcement) and run it like that for say 1-2 releases and see what happens" I think the rollout process ie dry-run needs to be memorialized here as well.
These are non-trivial changes with high impact across the community, so let's try to incorporate all of this into the plan so that it is clear what people can expect at each stage of the rollout.
@neolit123 - thanks a lot for the feedback (and sorry for the delay). Answering your questions below:
I expect that the first milestone will not really end. Even if we reach a perfect state at some point, new flakes will keep appearing as new code is created. But I expect that this will eventually be handed over to the CI Signal team.
It's definitely on WG Reliability for at least a few initial releases. We can then discuss whether we can merge it into something else.
I'm thinking about this. I was also thinking about merging it a bit with Production Readiness. But no clear plans for now.
This is a great point. Those generally should be in scope eventually (what if kubectl is completely unreliable?), but for simplicity we may want to start with just the "core", see how it works, and extend it later. Does that make sense?
This KEP is missing a necessary Phase I before the Phase I defined in the KEP. Right now, we do not have a definition of reliability on which everyone agrees and which can be empirically defined and observed in our project metrics. We have test-job flakiness, but even flakiness is being measured qualitatively, and surely flakiness is not the entirety of reliability? Phase I should entirely and completely consist of three things:
Once a definition of reliability exists that's backed up by easily read dashboards, then the next step will be to see if reliability improves organically through contributor efforts; it's entirely possible that better visibility is all that's needed.
I agree with that.
If you try to enforce acceptance before establishing metrics, what you're doing is not good, and it will not have good results. Saying "I can't define reliability but I know it when I see it" is absolutely no way to run a shared project. Further, if this is just about flakiness, then this effort doesn't need a WG and it doesn't need any special powers. It's just a technical effort under SIG-Testing or SIG-Release.
Sorry - catching up after vacation. Trying to respond to the points above.
@lavalamp And here is where the "relation" comes in. Because we need someone to actually fix that problem - otherwise we're on a straight road to catastrophe (because such situations happen all the time). In other words - I agree that every single feature on its own has some positive value for our end users. But we're all humans who make mistakes. And we need to admit that those mistakes have to keep getting addressed.
To Jordan's point above - I hugely agree with this.
This isn't a solution. I've seen a ton of issues that weren't related to scale at all (in fact, the huge majority of bugs I see in our production is not related to scalability). FWIW, our 5k-node scalability tests running daily are solid green: https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-performance. So the problem isn't that we don't run something in presubmits.
+1
+100 - I agree that our top problem in the testing area is test coverage. The only point is that we can't just start adding more tests, as we will soon drown. We have to ensure that the current set of "important tests" is solid (I'm thinking about release-blocking/release-informing ones) and over time extend this set more and more. Otherwise, we will end up with even more gazillions of flaky tests, and their value will be negligible because everyone will stop paying attention to them.
I'm happy to start with a dry-run, but I'm afraid we all know that it won't do the job - we're just afraid to say it out loud. We need to ensure that if, as a community, we're pointing at issues that should be addressed (and in particular, for flaky tests, the CI Signal team represents our community here), they really will get addressed. [I'm fine with a dry-run from the perspective that it would allow us to figure out implementation details of the process that we might be missing here. So it makes sense. But I wouldn't count on it solving the problem.]
+1 - I will take this discussion to sig-release. Added to the agenda of the next meeting (Apr 5). Not sure what these "special powers" should look like (:eyes: avoid blocking SIGs & blocking new features). Besides the "lack of authority", I see another issue that we need to discuss in sig-release: if reliability monitoring is added to the CI Signal responsibilities, something more is needed than just changing the description in the handbook. I don't think the team structure is designed to handle the added responsibility.
I think we can solve those questions. I will also carry this discussion over to the next sig-release call (see agenda).
Thanks @jberkus . I just watched that (unfortunately couldn't attend the actual meeting) and here is my subjective summary of it:
So I can personally read it as two categories:
So I would claim that this is actually aligned with the proposal here - with the first milestone being focused on (1) and the second milestone being focused on (2). My motivation for first focusing on flaky tests and only then on coverage was mainly the following:
That said, as already admitted in the proposal, I definitely agree that lack of proper coverage is certainly one of our top problems. And if there is wider agreement on pushing in this direction first - we can definitely change the order.
Wojtek: This is a big problem, and I think we really need to attack it on multiple fronts. Yes, we need to address flakiness in existing test jobs. But that's not a separate issue from coverage, because one of the reasons test jobs are flaky is functionality that lacks good tests and gets broken in ways that don't show up until the E2E tests. Both things need to be addressed in tandem. Also, the longer we wait to examine testing standards for new code, the more cleanup we have to do when we get there.
do we agree this is a SIG responsibility, one that should be "enforced" during the review process?
Sorry for the belated response. @wojtek-t
If I believed that would be the actual result of this proposal, I'd likely be in favor. Instead, I believe the result of this proposal is likely (50% chance?) to be a permanent reduction in future capacity, if the rule is invoked, as marginal contributors or even maintainers are driven away and don't come back. (Put another way: you are telling me the policy's intended effect -- I agree that your intentions are good and hopefully it didn't sound like I was saying otherwise. I am complaining about the effect I actually expect the policy to have. I understand what you are trying to accomplish, I just think this proposed policy does not in fact accomplish that.)
IMO that makes it worse, not better -- it will be a big political ordeal to invoke this rule, if that is the plan?
But that's great news -- we can fix it with regular testing improvements, then? Reading subsequent comments does make it seem like there's a lot to be gained from improved testing.
It is everyone's responsibility. The approving SIG(s), any other reviewers, SIGs Testing, Architecture, and Release. Heck, even the Steering Committee. KEP reviews and code reviews involve a lot of people, not all of them from the owning SIG(s), and all of those people should be looking at the quality of testing for the new features/API changes/whatever. As a project, we need to stop accepting enhancements with inadequate testing. Because, not infrequently, once a feature gets merged, we never see some of the people involved again.
Even beyond this, we (area/component/sig leads) need to stop accepting enhancements that significantly modify an existing area which is inadequately tested. Even if the newly added feature/code is well tested, regressing existing capabilities and users is a real risk we've seen manifest more frequently in the past few releases. Being aware of where there have been regressions, and where there remain testing gaps or fragility, is important to inform what changes are accepted when, and what work is prioritized.
This one I can get behind!!
You didn't sound like you were implying something different - we all want what's best for the project. It's just not black-and-white, because different things are more important to different people (which is absolutely ok). That being said - do you have any other suggestions on how we can make more people interested in our testing quality? I would be more than happy to hear them - it's just that I don't know what else we can do (for already existing stuff - I agree we can do much more for newly introduced stuff).
Which I can live with - as everyone is pointing out (and I fully agree with that), the consequences are potentially significant. So we really want to treat it as a last resort. If we (as a project) decide that even though SIG Foo didn't do X as they should have, we believe there were good reasons for it and we shouldn't take any actions - that's fine. Even having those discussions would benefit the project imho.
Huge +100. I personally don't think that it is enough, and I still think a proposal like this is needed.
BTW, I will point out that the Release Team just turned down an enhancement exception on the basis of stability. I think a piece of this puzzle is to help the Release and Enhancements Team better judge stability effects and feel that they have the project backing to reject things.
Two issues that I opened today that are related to it:
We have to hold our SIGs accountable for CI jobs in their own area; see "We are sorely failing to even do the minimum of keeping our CI jobs up-to-date and 💚". Also see kubernetes/test-infra#18600 and kubernetes/test-infra#18601.
Given #3279 and my recent open letter to the dev@kubernetes.io list, I think we should take a bit of time to see if those have positive effects. I propose we re-evaluate this, say, when we're in code freeze for 1.25 (one release from now).
Yes - let's not try to target 1.25 with this. I would like to leave this PR open so we don't forget about it, but the KEP as described is definitely not targeting 1.25.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
👀
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
One-line PR description: Increase Reliability Bar proposal
Issue link: Increase Reliability Bar #3138
Other comments:
/assign @johnbelamaric @derekwaynecarr @dims
/cc @smarterclayton @deads2k @liggitt @thockin