We have cluster-autoscaler installed on our Kubernetes cluster to automatically scale up nodes when none are available.

If I start a Workflow that runs a Pod, and then stop that Workflow before the Pod starts running (i.e. while the Pod is still Pending, waiting to be scheduled onto a node), the Workflow stops fine, but the Pod still gets scheduled and runs until all of its containers complete.

It takes a few minutes for the autoscaler to bring up a new node. If someone triggers a workflow and then aborts it during the window in which the node is scaling up and the pod is waiting to be scheduled, the pod can end up in a zombie state: running even though the Workflow has already been stopped.
Version
3.4.3
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Any workflow that triggers a Pod should work. I triggered a workflow `postman-test-5lrl7`
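Since the report says any workflow that launches a Pod should reproduce this, here is a minimal sketch of such a workflow. This manifest is not from the original report; the image, names, and the oversized CPU request are assumptions chosen so that the pod stays Pending long enough (waiting for the autoscaler to add a node) to stop the Workflow first.

```yaml
# Hypothetical reproduction workflow (an assumption, not the reporter's actual manifest).
# The large CPU request is sized to exceed free cluster capacity, so the pod
# stays Pending until cluster-autoscaler brings up a node, leaving a window
# in which the Workflow can be stopped while the pod is still unscheduled.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pending-pod-repro-
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.18
        command: [sleep, "300"]
        resources:
          requests:
            cpu: "8"  # adjust so no existing node can satisfy the request
```

To reproduce: submit the workflow (e.g. with `argo submit`), then stop it (e.g. with `argo stop`) while the pod is still Pending, and watch whether the pod later starts running once the autoscaler adds a node.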
Hey @rajaie-sg - can you retest with #10523 in the latest commits? We think it might be resolved by proper system call handling. Let us know the result.
Hi @JPZ13 - that PR was already merged when I tested with latest, so I don't think it fixed the issue. It seems that PR has more to do with Pods that are already running, but in the scenario I described, we are stopping the Workflow before the Pod has been scheduled onto a node (still in Pending status).