Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment
Tell us about your request
I've noticed that, during a deployment, an ECS Service (according to the AWS Console UI) will no longer count a task against the currently running count. This happens when using ECS on EC2 with stop timeouts configured either in the task definition or on the ECS agent via the ECS_CONTAINER_STOP_TIMEOUT setting.
It's unclear whether this is a bug or not, but I've tagged it as one since the behavior is unexpected to me.
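For reference, a minimal sketch of how the long stop timeout can be set in each place (the family name, image, and resource sizes below are illustrative placeholders, not our exact configuration):

```python
import boto3

ecs = boto3.client("ecs")

# Option 1: per-container stopTimeout in the task definition (EC2 launch type).
ecs.register_task_definition(
    family="long-running-worker",          # placeholder family name
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "worker",
            "image": "example/worker:latest",  # placeholder image
            "memory": 2048,
            "essential": True,
            # Up to 2 hours between SIGTERM and SIGKILL when the task is stopped.
            "stopTimeout": 7200,
        }
    ],
)

# Option 2: raise the default on the container instance itself by setting
# ECS_CONTAINER_STOP_TIMEOUT=2h in /etc/ecs/ecs.config before the agent starts.
```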
Which service(s) is this request for?
ECS on EC2
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
We have long-running tasks on ECS on EC2 configured with correspondingly long stop timeouts (e.g. 2 hours). During a deployment, the ECS Service's running task count does not reflect the number of actively running tasks; it appears to stop counting the old tasks (the ones with non-standard stop timeouts). Additionally, the resources held by the still-running old tasks are not reflected in the infrastructure view.
This is problematic because it blocks scaling: the ECS Service believes there is enough capacity available in the cluster, but fails to schedule new tasks because the instance is still running a number of old tasks and does not in fact have enough CPU and/or memory to start the new ones.
As a result, a deployment not only takes a long time to complete but also leaves our capacity misaligned. Either we are under-provisioned, because we cannot respond to increased demand/scaling alerts, or over-provisioned, because new tasks are interleaved with tasks waiting to be stopped, leaving excess nodes (although that could always be the case). Initially we are more concerned with the under-provisioned scenario.
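For illustration, this is roughly how we compare what the service reports with what is actually still running on an instance (cluster, service, and container instance identifiers are placeholders):

```python
import boto3

ecs = boto3.client("ecs")
cluster = "my-cluster"                                            # placeholder
instance = "arn:aws:ecs:region:acct:container-instance/placeholder"

# What the service reports during the deployment.
svc = ecs.describe_services(cluster=cluster, services=["my-service"])["services"][0]
print("service runningCount:", svc["runningCount"], "desired:", svc["desiredCount"])

# Tasks on the instance: new tasks (desired RUNNING) plus old tasks already
# told to stop (desired STOPPED) but still draining under the long stop timeout.
arns = []
for desired in ("RUNNING", "STOPPED"):
    arns += ecs.list_tasks(
        cluster=cluster, containerInstance=instance, desiredStatus=desired
    )["taskArns"]

# Sketch only: ignores the 100-task limit of describe_tasks.
still_running = []
if arns:
    still_running = [
        t for t in ecs.describe_tasks(cluster=cluster, tasks=arns)["tasks"]
        if t["lastStatus"] == "RUNNING"
    ]
print("tasks actually RUNNING on the instance:", len(still_running))
```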
Are you currently working around this issue?
Not implemented yet, but we plan to test using a new service per deployment. The new service would be launched with its desired count set to match the old service before the old service is signaled to shut down. This should bring up new nodes running primarily new tasks, allowing the older tasks and their corresponding nodes to be reaped once their work is done. A sketch of this flow is included below.
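A sketch of that flow, assuming one service per deployment (service names, task definition revision, and launch settings are hypothetical; waiting and error handling are omitted):

```python
import boto3

ecs = boto3.client("ecs")
cluster = "my-cluster"  # placeholder

# The service from the previous deployment.
old = ecs.describe_services(cluster=cluster, services=["worker-v41"])["services"][0]

# 1. Launch a brand-new service for this deployment, sized to match the old one.
ecs.create_service(
    cluster=cluster,
    serviceName="worker-v42",                 # hypothetical per-deployment name
    taskDefinition="long-running-worker:42",  # hypothetical new revision
    desiredCount=old["desiredCount"],
    launchType="EC2",
)

# 2. Once the new service is steady, signal the old one to wind down; its tasks
#    receive SIGTERM and drain under their long stop timeout.
ecs.update_service(cluster=cluster, service="worker-v41", desiredCount=0)

# 3. After the old tasks finish, the old service can be deleted and the ASG
#    allowed to scale in the now-empty instances.
```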
Additional context
We have used ECS task protection in the past, but it prevents ECS tasks from receiving any signals. We use SIGTERM to tell our long-running tasks to stop taking on new work, finish their existing work, and then exit (roughly the pattern sketched below). Moving to EC2, we thought we could leverage long stop timeouts and ASG lifecycle hooks to provide essentially the same guarantee.
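For context, the shutdown behavior our workers implement is roughly the following pattern (a sketch only; fetch_next_job and process are stand-ins for our real queue logic):

```python
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # ECS delivers SIGTERM at the start of the stop-timeout window.
    global shutting_down
    shutting_down = True

def fetch_next_job():
    # Stand-in for pulling the next unit of work from our queue (hypothetical).
    time.sleep(1)
    return None

def process(job):
    # Stand-in for the long-running unit of work (hypothetical).
    pass

signal.signal(signal.SIGTERM, handle_sigterm)

while True:
    if shutting_down:
        sys.exit(0)         # exit cleanly before SIGKILL at the end of the timeout
    job = fetch_next_job()  # no new work is picked up once shutting_down is set
    if job is not None:
        process(job)        # an in-flight job is allowed to run to completion
```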
Attachments
Screenshots of the containers/tasks running on the instance compared with what the ECS Service reports.