[ECS] [bug]: ECS tasks with long timeouts become untracked by ECS Service #2485

Open
mjcorwin opened this issue Dec 4, 2024 · 0 comments

mjcorwin commented Dec 4, 2024

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request

I've noticed a behavior where an ECS Service (according to the AWS Console UI) stops counting a task against the currently running count during a deployment. This happens when using ECS on EC2 with long stop timeouts configured either in the task definition or on the ECS agent via the ECS_CONTAINER_STOP_TIMEOUT setting.

I'm not sure whether this is a bug, but I've tagged it as one since the behavior is unexpected to me.
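
For reference, here's roughly how we configure the long stop timeout (a minimal boto3 sketch with placeholder names and values):

```python
import boto3  # assumes AWS credentials/region are already configured

ecs = boto3.client("ecs")

# Per-container stop timeout in the task definition (seconds). On the EC2
# launch type this can exceed the agent's default of 30 seconds.
ecs.register_task_definition(
    family="long-running-worker",              # placeholder family name
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "worker",
            "image": "example.com/worker:latest",  # placeholder image
            "memory": 2048,
            "stopTimeout": 7200,                   # 2 hours between SIGTERM and SIGKILL
        }
    ],
)

# Alternatively, the instance-wide default comes from the ECS agent config,
# e.g. in /etc/ecs/ecs.config on the container instance:
#   ECS_CONTAINER_STOP_TIMEOUT=2h
```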

Which service(s) is this request for?

ECS on EC2

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

We have long-running tasks on ECS on EC2 configured with correspondingly long stop timeouts (e.g. 2 hours). During a deployment, the ECS Service's running task count does not reflect the number of actively running tasks; it appears to stop counting the old tasks (the ones with non-standard stop timeouts). Additionally, the resources held by the still-running old tasks are not reflected in the infrastructure view.

This is problematic because it prevents scaling from happening: the ECS Service believes there is enough capacity available in the cluster, but fails to schedule new tasks, because the instances are still running a number of old tasks and there is not in fact enough CPU and/or memory to start the new ones.

As a result, a deployment not only takes a long time to complete but also leaves our capacity misaligned. Either we are under-provisioned, because we cannot respond to increased demand/scaling alerts, or over-provisioned, because new tasks are interleaved with tasks waiting to be stopped, leaving excess nodes (although that could always be the case). Initially we are more concerned with the under-provisioned scenario.
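
To illustrate how we're observing the mismatch, a rough boto3 sketch (cluster and service names are placeholders) that compares the service's reported runningCount with tasks whose last status is still RUNNING, including old tasks draining under the long stop timeout:

```python
import boto3  # cluster/service names below are placeholders

ecs = boto3.client("ecs")
cluster, service = "my-cluster", "my-service"

# What the service scheduler reports.
svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
print("service runningCount:", svc["runningCount"])

# Tasks whose containers are actually still running, including old tasks
# that are desired STOPPED but still draining under the long stop timeout.
still_running = 0
for desired in ("RUNNING", "STOPPED"):
    arns = ecs.list_tasks(cluster=cluster, serviceName=service,
                          desiredStatus=desired)["taskArns"]
    if arns:
        tasks = ecs.describe_tasks(cluster=cluster, tasks=arns)["tasks"]
        still_running += sum(1 for t in tasks if t["lastStatus"] == "RUNNING")
print("tasks with lastStatus RUNNING:", still_running)
```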

Are you currently working around this issue?

Not yet implemented, but we plan to test using a new service per deployment. The new service would be launched with its desired count set to match the old service before the old service is signaled to shut down. This should bring in new nodes running primarily new tasks, allowing the older tasks and their corresponding nodes to be reaped once their work is done (rough sketch below).
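
A sketch of what we have in mind, assuming boto3 and placeholder cluster/service/task-definition names:

```python
import boto3  # all names here are placeholders for illustration

ecs = boto3.client("ecs")
cluster = "my-cluster"

# Match the new service's capacity to the old one before cutting over.
old = ecs.describe_services(cluster=cluster, services=["worker-blue"])["services"][0]

ecs.create_service(
    cluster=cluster,
    serviceName="worker-green",                # new service for this deployment
    taskDefinition="long-running-worker:42",   # new task definition revision
    desiredCount=old["desiredCount"],
    launchType="EC2",
)

# Once the new service is steady, signal the old one to shut down; its tasks
# receive SIGTERM and drain under their long stop timeouts.
ecs.update_service(cluster=cluster, service="worker-blue", desiredCount=0)
```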

Additional context

We have used ECS Task Protection in the past, but it prevents ECS tasks from receiving any signals. We use SIGTERM to indicate that our long-running tasks should stop taking on new work, finish their existing work, and then exit. Moving to EC2, we thought we could leverage long stop timeouts and ASG lifecycle events to provide essentially the same guarantee.
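
For context, the SIGTERM handling in our workers looks roughly like this (simplified sketch; the job functions are placeholders):

```python
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # ECS sends SIGTERM when stopping the task; stop accepting new work.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def get_next_job():   # placeholder for the real work source
    return None

def process(job):     # placeholder for the real work
    time.sleep(1)

while not shutting_down:
    job = get_next_job()
    if job is not None:
        process(job)
    else:
        time.sleep(5)
# Exiting here, before the stop timeout elapses, avoids being SIGKILLed by ECS.
```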

Attachments

Screenshots of the containers/tasks running on the instance compared with what the ECS Service reports:

[Screenshots: CleanShot 2024-12-04 at 17 55 28, 17 55 07, 17 52 06, 17 52 49]

mjcorwin added the Proposed (Community submitted issue) label on Dec 4, 2024
vibhav-ag added the ECS (Amazon Elastic Container Service) label on Dec 9, 2024