Pods can get stuck in creation and in termination, and cancel can also fail at some mid-step.

I propose we make cancel more reliable by adding a mechanism to the cancel script that force-deletes the various resources after some timeout (or forces deletion always). We then need to make sure that errors about a missing resource are treated as success, since partial installations and partial deletions can happen.
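As a minimal sketch of the "force after a timeout, tolerate missing resources" idea, assuming plain kubectl is available to the cancel script (the 60s timeout is an arbitrary choice, and the resource names are taken from the run below):

```sh
NS=prombench-15731               # namespace of the failed run
DEPLOY=prometheus-test-pr-15731  # deployment that got stuck

# Graceful delete first, bounded by a timeout. --ignore-not-found makes a
# missing resource count as success, so partial installs/deletes are fine.
if ! kubectl delete deployment "$DEPLOY" -n "$NS" --ignore-not-found --timeout=60s; then
  # Timed out: delete again without waiting for graceful termination.
  kubectl delete deployment "$DEPLOY" -n "$NS" --ignore-not-found \
    --grace-period=0 --force
fi
```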
This happened recently. First, applying the benchmark got stuck and eventually timed out:

```
21:52:25 provider.go:71: Request for 'applying deployment:prometheus-test-pr-15731' is in progress. Checking in 10s
21:52:35 provider.go:71: Request for 'applying deployment:prometheus-test-pr-15731' is in progress. Checking in 10s
21:52:35 gke.go:582: error while applying a resource err:error applying '/tmp/tmp.NOMpbH/test-infra/prombench/manifests/prombench/benchmark/3_prometheus-test-pr_deployment.yaml' err: Request for 'applying deployment:prometheus-test-pr-15731' hasn't completed after retrying 50 times
make: *** [Makefile:88: resource_apply] Error 1
```
When I logged on to GKE, the Pod was Pending (waiting for Prometheus to start) and the init container was green, yet there were no logs other than some npm warnings from the build. I then triggered prombench cancel, which also failed:
```
22:19:05 provider.go:71: Request for 'deleting namespace:prombench-15731' is in progress. Checking in 10s
22:19:15 provider.go:71: Request for 'deleting namespace:prombench-15731' is in progress. Checking in 10s
22:19:15 gke.go:590: error while deleting objects from a manifest file err:error deleting './manifests/prombench/benchmark/1a_namespace.yaml' err: Request for 'deleting namespace:prombench-15731' hasn't completed after retrying 100 times
make: *** [Makefile:97: resource_delete] Error 1
```
Logging on to GKE again, everything for the benchmark had been deleted except the namespace and the previously stuck Prometheus pod, which was now waiting for termination forever (at least 8h); the namespace cannot finish deleting while the stuck pod remains in it.
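To see what is still blocking a namespace like this, something along these lines helps (names taken from the logs above):

```sh
# Anything still listed here keeps the namespace stuck in Terminating;
# a pod that cannot terminate shows STATUS=Terminating indefinitely.
kubectl get all -n prombench-15731
```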
I force-deleted it from the GKE shell:
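The exact command is not recorded here; a typical force delete of a stuck pod looks like the following, where STUCK_POD stands for the full name of the stuck Prometheus pod (the suffix varies per run):

```sh
# Bypass graceful termination and remove the pod object immediately.
kubectl delete pod "$STUCK_POD" -n prombench-15731 --grace-period=0 --force
```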
Ideally, neither the cancel nor the manual cleanup would fail, so resetting the benchmark would be easier and perhaps faster.