Improve cancel reliability by force delete and continuing partial delete #831

Open · bwplotka opened this issue Feb 5, 2025 · 0 comments

bwplotka (Member) commented Feb 5, 2025

Pods can get stuck in creation or termination, and cancel can also fail at some mid-step.

I propose we make cancel more reliable by adding a mechanism to the cancel script that force-deletes the various resources after some timeout (or always forces deletion). We also need to ensure that errors caused by a missing resource are treated as success, since partial installations and partial deletions can happen.
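As a minimal sketch of the idea, assuming the delete step were driven by kubectl directly (the real path goes through gke.go, so this is illustrative only, and $MANIFEST is a placeholder for one of the benchmark manifests):

 # Try a graceful delete with a timeout first; if it doesn't complete,
 # fall back to a force delete (grace period 0). --ignore-not-found makes
 # a missing resource count as success, so a partial delete doesn't abort cancel.
 kubectl delete -f "$MANIFEST" --timeout=5m --ignore-not-found \
   || kubectl delete -f "$MANIFEST" --grace-period=0 --force --ignore-not-found

Note that a force delete only removes the API object immediately without waiting for kubelet confirmation, which seems acceptable here since the node pools are torn down right after.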

This happened recently:

  1. I started prombench on textparse: Optimized protobuf parser with custom streaming unmarshal prometheus#15731 (comment).
  2. The job fails:
21:52:25 provider.go:71: Request for 'applying deployment:prometheus-test-pr-15731' is in progress. Checking in 10s
21:52:35 provider.go:71: Request for 'applying deployment:prometheus-test-pr-15731' is in progress. Checking in 10s
21:52:35 gke.go:582: error while applying a resource err:error applying '/tmp/tmp.NOMpbH/test-infra/prombench/manifests/prombench/benchmark/3_prometheus-test-pr_deployment.yaml' err: Request for 'applying deployment:prometheus-test-pr-15731' hasn't completed after retrying 50 times
make: *** [Makefile:88: resource_apply] Error 1
  3. When I logged on to GKE, the Pod looked pending (waiting for Prometheus to start) and the init container was green, yet there were no logs other than some npm warnings from the build.

  4. Triggered prombench cancel.

  5. Cancel fails:


22:19:05 provider.go:71: Request for 'deleting namespace:prombench-15731' is in progress. Checking in 10s
22:19:15 provider.go:71: Request for 'deleting namespace:prombench-15731' is in progress. Checking in 10s
22:19:15 gke.go:590: error while deleting objects from a manifest file err:error deleting './manifests/prombench/benchmark/1a_namespace.yaml' err: Request for 'deleting namespace:prombench-15731' hasn't completed after retrying 100 times
make: *** [Makefile:97: resource_delete] Error 1
  6. Logged on to GKE again and saw that everything for the benchmark was deleted except the namespace and the previously stuck Prometheus pod, which was now waiting for termination forever (at least 8h).

  7. I force-deleted it from the GKE shell:

 gcloud container clusters get-credentials test-infra --zone europe-west3-a --project macro-mile-203600  && kubectl delete pod prometheus-test-pr-15731-78bdf7bf67-xf6sm --namespace prombench-15731 --grace-period=0 --force
  8. Ran prombench cancel again, and it failed again because the loadgen resources were already gone (expected after a partial delete):
	-f ./manifests/prombench/benchmark/1a_namespace.yaml
08:38:22 gke.go:590: error while deleting objects from a manifest file err:error deleting './manifests/prombench/benchmark/1c_cluster-role-binding.yaml' err: resource delete failed - kind: Role, name: loadgen-scaler: roles.rbac.authorization.k8s.io "loadgen-scaler" not found
make: *** [Makefile:97: resource_delete] Error 1
  9. I had to delete the node pools manually (doing by hand what cancel is supposed to do); the commands are sketched below.
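For reference, the manual node-pool cleanup is roughly the following (a hedged sketch reusing the cluster, zone, and project from the force-delete command above; <pool-name> is a placeholder for whatever pools prombench created for this PR):

 # List the node pools in the cluster, then delete the benchmark ones.
 gcloud container node-pools list --cluster test-infra --zone europe-west3-a --project macro-mile-203600
 # <pool-name> is a placeholder; --quiet skips the confirmation prompt.
 gcloud container node-pools delete <pool-name> --cluster test-infra --zone europe-west3-a --project macro-mile-203600 --quiet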

Ideally, steps (5) and (8) would not fail, so resetting the benchmark would be easier and perhaps faster.
