Improve cancel reliability by force delete and continuing partial delete #831

Open · bwplotka opened this issue Feb 5, 2025 · 0 comments

bwplotka (Member) commented Feb 5, 2025

Pods can get stuck in creation or termination, and cancel can also fail at some mid-step.

I propose we make cancel more reliable by adding a mechanism to the cancel script that force-deletes the various resources after some timeout (or always forces deletion). We also need to ensure that errors caused by a missing resource are treated as success, since partial installations and partial deletions can happen.
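As a minimal sketch of the idea, assuming the delete step were driven by kubectl directly (the real path goes through gke.go, so this is illustrative only, and $MANIFEST is a placeholder for one of the benchmark manifests):

 # Try a graceful delete with a timeout first; if it doesn't complete,
 # fall back to a force delete (grace period 0). --ignore-not-found makes
 # a missing resource count as success, so a partial delete doesn't abort cancel.
 kubectl delete -f "$MANIFEST" --timeout=5m --ignore-not-found \
   || kubectl delete -f "$MANIFEST" --grace-period=0 --force --ignore-not-found

Note that a force delete only removes the API object immediately without waiting for kubelet confirmation, which seems acceptable here since the node pools are torn down right after.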

This happened recently:

  1. I started prombench on textparse: Optimized protobuf parser with custom streaming unmarshal prometheus#15731 (comment).
  2. The job fails:
21:52:25 provider.go:71: Request for 'applying deployment:prometheus-test-pr-15731' is in progress. Checking in 10s
21:52:35 provider.go:71: Request for 'applying deployment:prometheus-test-pr-15731' is in progress. Checking in 10s
21:52:35 gke.go:582: error while applying a resource err:error applying '/tmp/tmp.NOMpbH/test-infra/prombench/manifests/prombench/benchmark/3_prometheus-test-pr_deployment.yaml' err: Request for 'applying deployment:prometheus-test-pr-15731' hasn't completed after retrying 50 times
make: *** [Makefile:88: resource_apply] Error 1
  3. When I logged on to GKE, the Pod looked pending (waiting for Prometheus to start) and the init container was green, yet there were no logs other than some npm warnings from the build.

  4. Triggered prombench cancel.

  5. Cancel fails:


22:19:05 provider.go:71: Request for 'deleting namespace:prombench-15731' is in progress. Checking in 10s
22:19:15 provider.go:71: Request for 'deleting namespace:prombench-15731' is in progress. Checking in 10s
22:19:15 gke.go:590: error while deleting objects from a manifest file err:error deleting './manifests/prombench/benchmark/1a_namespace.yaml' err: Request for 'deleting namespace:prombench-15731' hasn't completed after retrying 100 times
make: *** [Makefile:97: resource_delete] Error 1
  6. Logged on to GKE again and saw that everything for the benchmark was deleted except the namespace and the previously stuck Prometheus pod, which was now waiting for termination forever (at least 8h).

  7. I force-deleted it from the GKE shell:

 gcloud container clusters get-credentials test-infra --zone europe-west3-a --project macro-mile-203600  && kubectl delete pod prometheus-test-pr-15731-78bdf7bf67-xf6sm --namespace prombench-15731 --grace-period=0 --force
  8. Ran prombench cancel again, and it failed again because the loadgen resources were already gone (expected after a partial delete):
	-f ./manifests/prombench/benchmark/1a_namespace.yaml
08:38:22 gke.go:590: error while deleting objects from a manifest file err:error deleting './manifests/prombench/benchmark/1c_cluster-role-binding.yaml' err: resource delete failed - kind: Role, name: loadgen-scaler: roles.rbac.authorization.k8s.io "loadgen-scaler" not found
make: *** [Makefile:97: resource_delete] Error 1
  9. I had to delete the node pools manually (doing by hand what cancel is supposed to do); the commands are sketched below.
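For reference, the manual node-pool cleanup is roughly the following (a hedged sketch reusing the cluster, zone, and project from the force-delete command above; <pool-name> is a placeholder for whatever pools prombench created for this PR):

 # List the node pools in the cluster, then delete the benchmark ones.
 gcloud container node-pools list --cluster test-infra --zone europe-west3-a --project macro-mile-203600
 # <pool-name> is a placeholder; --quiet skips the confirmation prompt.
 gcloud container node-pools delete <pool-name> --cluster test-infra --zone europe-west3-a --project macro-mile-203600 --quiet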

Ideally, steps (5) and (8) would not fail, so resetting the benchmark would be easier and perhaps faster.
