boskos/janitor: track when cleanup fails repeatedly for the same resource #15866
Due to programming errors, the janitor may continuously fail to clean up a resource. Two examples I just discovered:

possibly an order-of-deletion issue:

likely incorrect flags (gcloud changed but we didn't?):

It'd be good to have some way of detecting when we're repeatedly failing to clean up a resource. Not sure yet what the best way would be to track that.
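One rough sketch of what this could look like (not how boskos actually does it; the metric name, package layout, and failureTracker type below are all assumptions for illustration): keep a per-resource count of consecutive cleanup failures and export it as a Prometheus gauge, so an alert can fire once a resource keeps failing.

```go
package janitor

import (
	"log"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// cleanupFailures exposes the number of consecutive cleanup failures per
// resource, so repeated failures can be detected (e.g. via an alert rule).
// Hypothetical metric name; not an existing boskos metric.
var cleanupFailures = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "boskos_janitor_consecutive_cleanup_failures",
		Help: "Number of consecutive cleanup failures per resource.",
	},
	[]string{"resource"},
)

func init() {
	prometheus.MustRegister(cleanupFailures)
}

// failureTracker keeps an in-memory count of consecutive failures per resource.
type failureTracker struct {
	mu     sync.Mutex
	counts map[string]int
}

func newFailureTracker() *failureTracker {
	return &failureTracker{counts: map[string]int{}}
}

// recordResult updates the per-resource counter after each cleanup attempt and
// logs loudly once a resource has failed more than threshold times in a row.
func (t *failureTracker) recordResult(resource string, err error, threshold int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if err == nil {
		// Success clears the streak so only stuck resources stay flagged.
		delete(t.counts, resource)
		cleanupFailures.WithLabelValues(resource).Set(0)
		return
	}
	t.counts[resource]++
	cleanupFailures.WithLabelValues(resource).Set(float64(t.counts[resource]))
	if t.counts[resource] >= threshold {
		log.Printf("janitor: resource %q has failed cleanup %d times in a row: %v",
			resource, t.counts[resource], err)
	}
}
```

Resetting the count on the first successful cleanup keeps the gauge focused on resources that are stuck, rather than ones that hit a one-off transient failure.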
Comments

/area boskos
@ixdy it would also help if we can publish the logs from boskos somewhere public.
@dims I'm not sure where we'd publish them, and I'm also not sure we've done a great job of sanitizing the logs yet (to ensure that they're not leaking any sensitive information). The logs are visible to anyone maintaining the prow cluster, though they are in some cases lacking useful information. Regarding tracking cleanup failures, I have a few potential ideas:

Option 2 is closely aligned with #14715.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale

still want to do this.
Moving to kubernetes-sigs/boskos#15.
@ixdy: Closing this issue.

In response to this:

> Moving to kubernetes-sigs/boskos#15.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.