Kubernetes CI Policy: remove egregiously perma-failing jobs #18600

spiffxp · 2020-08-01T03:53:19Z

Why this is important:

Jobs that have been failing for hundreds of days are a drain on community resources.
The fact that they've been failing this long means we've been getting by without their signal, so it's probably more economical to cut our losses than to make diving saves.

http://storage.googleapis.com/k8s-metrics/failures-latest.json provides a list of jobs that have been failing continuously based on results stored in GCS. Note that not everything stored in GCS comes from prow.k8s.io; we allow for federated test results via https://github.com/kubernetes/test-infra/blob/master/kettle/buckets.yaml

Good candidates for removal include:

failing > 365 days
runs on prow.k8s.io but is testing out-of-support releases

Make sure to include either @spiffxp or @BenTheElder on PRs for these. Not all of these are clear-cut removals, and we may want to make an effort to find a job owner or otherwise find a way to mitigate.

We should close this issue once we decide what a formal definition of "egregious" is and verify that we've handled everything that meets it. We should then feed whatever we've learned here into a policy of maintaining job health going forward (which is basically the end goal of #18599 as well).

fejta-bot · 2020-10-30T04:31:20Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2020-11-29T05:15:16Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

BenTheElder · 2020-12-01T00:14:51Z

/remove-lifecycle rotten

spiffxp · 2021-01-08T21:38:57Z

We still have egregiously perma-failing jobs. For example, the top 3 from http://storage.googleapis.com/k8s-metrics/failures-latest.json

  "ci-kubernetes-node-kubelet-serial": {
    "failing_days": 1098
  },
  "ci-kubernetes-e2enode-ubuntu2-k8sstable3-gkespec": {
    "failing_days": 1021
  },
  "ci-kubernetes-e2e-gci-gce-statefulset": {
    "failing_days": 969
  },
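As a rough illustration of how the "failing > 365 days" criterion could be applied to this data, here is a minimal Python sketch. It assumes the full failures-latest.json file is a single object mapping job names to entries with a `failing_days` field, as the excerpt above suggests; the real file may carry other fields. The function name `egregious_jobs`, the 365-day default, and the `hypothetical-recently-broken-job` entry are illustrative only and not part of any test-infra tooling.

```python
import json


def egregious_jobs(failures, min_days=365):
    """Return (job, failing_days) pairs failing longer than min_days, worst first.

    `failures` is assumed to map job name -> {"failing_days": int, ...}.
    """
    hits = [
        (name, info["failing_days"])
        for name, info in failures.items()
        if info.get("failing_days", 0) > min_days
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Sample data: the first three entries are copied from the comment above;
# the last one is a made-up job added to show the filter excluding something.
sample = json.loads("""
{
  "ci-kubernetes-node-kubelet-serial": {"failing_days": 1098},
  "ci-kubernetes-e2enode-ubuntu2-k8sstable3-gkespec": {"failing_days": 1021},
  "ci-kubernetes-e2e-gci-gce-statefulset": {"failing_days": 969},
  "hypothetical-recently-broken-job": {"failing_days": 12}
}
""")

for name, days in egregious_jobs(sample):
    print(f"{days:5d}  {name}")
```

A list like this is only a starting point: as noted earlier in the thread, not every long-failing job is a clear-cut removal, so each candidate still needs a human look for an owner or a mitigation.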

spiffxp · 2021-03-04T00:02:25Z

#21141 removed one

Need to refresh where we're at here.

liggitt · 2021-03-06T17:11:23Z

Jobs that fail 100% of Up or Test are good candidates - https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=%5E(Up%7CTest)%24

fejta-bot · 2021-06-04T17:16:01Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

fejta-bot · 2021-07-04T17:56:52Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

k8s-triage-robot · 2021-08-03T18:45:59Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

k8s-ci-robot · 2021-08-03T18:46:07Z

@k8s-triage-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-triage-robot · 2021-11-08T17:04:39Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

BenTheElder · 2021-11-08T18:01:35Z

/remove-lifecycle stale
/lifecycle frozen
These jobs aren't going anywhere and this has to be dealt with someday

dims · 2022-04-18T18:37:39Z

xref: kubernetes/kubernetes#109521

dims · 2022-04-18T19:06:15Z

/assign

spiffxp added the kind/cleanup, sig/testing, and area/jobs labels Aug 1, 2020

This was referenced Aug 1, 2020

Kubernetes CI Policy: create and enforce policy of removing continuously unhealthy jobs #18601

Open

Kubernetes CI Policy (Umbrella issue) #18551

Open

spiffxp mentioned this issue Aug 21, 2020

jobs: remove jobs that have been continuously failing for over N days #8861

Closed

MushuEE mentioned this issue Sep 23, 2020

Unhealthy Job AutoJailer Bot #19320

Closed

k8s-ci-robot added the lifecycle/stale label Oct 30, 2020

k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Nov 29, 2020

k8s-ci-robot removed the lifecycle/rotten label Dec 1, 2020

spiffxp mentioned this issue Mar 4, 2021

delete jobs for github.com/apache-spark-on-k8s/spark-integration #21141

Merged

k8s-ci-robot added the lifecycle/stale label Jun 4, 2021

k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Jul 4, 2021

k8s-ci-robot closed this as completed Aug 3, 2021

k8s-ci-robot added this to the v1.23 milestone Aug 10, 2021

k8s-ci-robot added the lifecycle/stale label Nov 8, 2021

k8s-ci-robot added lifecycle/frozen and removed lifecycle/stale labels Nov 8, 2021

dims mentioned this issue Apr 18, 2022

Perma failing Jobs in test-grid kubernetes/kubernetes#109521

Closed

28 tasks

dims mentioned this issue Apr 18, 2022

KEP-3138: Increase the Reliability Bar proposal kubernetes/enhancements#3139

Closed

k8s-ci-robot assigned dims Apr 18, 2022

BenTheElder modified the milestones: v1.23, v1.25 Apr 19, 2022
