Description
For non-ephemeral runners the status of the workflow job is checked, and scaling is only done for jobs that are still queued. For ephemeral runners this check is not applied, because the assumption was that every job needs a runner.
We found out that this idea did not work as expected once we started scaling to a couple of hundred runners. When a large number of jobs are cancelled, for example because of a job timeout, the corresponding events are still on the queue. This is typically the case when we have reached the maximum number of runners. The lambdas will create runners for all of those events, but the runners will remain idle since the jobs are cancelled. With a few cancelled jobs this is not a problem, but with a huge number of cancelled jobs it can result in a large fleet of useless runners.
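To make the behaviour concrete, here is a minimal sketch of the scale-up decision described above; the helper names (`getJobStatus`, `createRunner`, `runnersAreEphemeral`) are hypothetical and only stand in for the real lambda code:

```typescript
// Sketch only: hypothetical names, not the actual scale-up lambda.
type ScaleUpEvent = { owner: string; repo: string; jobId: number };
type JobStatus = 'queued' | 'in_progress' | 'completed';

declare function getJobStatus(event: ScaleUpEvent): Promise<JobStatus>; // e.g. via the GitHub API
declare function createRunner(event: ScaleUpEvent): Promise<void>;      // launch a runner instance
declare const runnersAreEphemeral: boolean;

// Current behaviour: the job status check is skipped for ephemeral runners,
// so an event for a job that has meanwhile been cancelled still creates a runner.
export async function scaleUp(event: ScaleUpEvent): Promise<void> {
  if (runnersAreEphemeral) {
    await createRunner(event);
    return;
  }
  if ((await getJobStatus(event)) === 'queued') {
    await createRunner(event);
  }
}
```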
Solution
We have tested a modified scale-up lambda in which the job status check is applied in the same way as for non-ephemeral runners. In our case this solved the problem. However, since there is no correlation between a job and a runner, this approach could mean that some events are not used for scaling in cases where they should be. As a mitigation we keep a very small fleet of runners in the pool to catch the jobs behind those missed events.
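A sketch of the modification we tested, reusing the hypothetical helpers from the block above: the status check is applied regardless of runner type, and the small standing pool acts as a safety net for jobs whose events were dropped:

```typescript
// Modified behaviour: apply the job status check for ephemeral runners too.
// Trade-off: since a runner is not tied to a specific job, dropping an event
// can leave a job that does need a runner without one; a small pool of
// standing runners is kept as a safety net for those jobs.
export async function scaleUpEphemeralWithCheck(event: ScaleUpEvent): Promise<void> {
  if ((await getJobStatus(event)) !== 'queued') {
    return; // job was cancelled or already picked up: do not create a runner
  }
  await createRunner(event);
}
```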
I'm working on an adjacent solution using Firecracker and pools of agents, but solely with ephemeral runners to ensure complete isolation and a fresh environment for each run.
My question for you is: if you were only using ephemeral runners and creating new VMs for each workflow job event, how do you handle a cancelled workflow run? Let's say that your run created 20 jobs, so 20 VMs were started.
If each job has been allocated to a runner and started executing when the run is cancelled, then the runner exits and everything is cleaned up.
But the challenge is if that run and its 20 jobs are cancelled before being allocated to a runner. At that point we have 20 VMs running and no good way of knowing that we should shut them down or reap them.