
Tasks stuck in PENDING with containers stuck in Created state #306

Closed
trharris78 opened this issue Feb 8, 2016 · 2 comments

@trharris78

I'm using RunTask to schedule hundreds of tasks on a small ECS cluster. The tasks are short-lived, a minute or two at most. Tasks that fail to schedule with a RunTask call are left on a queue, and I retry the RunTask call later. This works for a while, but eventually the cluster bogs down and tasks get stuck in the PENDING state. Running `docker ps -a` shows a large number of containers in the Created state, and they appear to be stuck there indefinitely.
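
To illustrate the pattern, here's a minimal boto3 sketch (the cluster and task definition names are placeholders, and this isn't our actual code):

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

def drain_queue(task_queue, cluster="my-cluster", task_def="my-short-task"):
    """Try to place each queued task; re-queue anything ECS can't run yet."""
    retry = []
    for item in task_queue:
        resp = ecs.run_task(cluster=cluster, taskDefinition=task_def, count=1)
        # RunTask reports placement problems (e.g. insufficient capacity)
        # in the `failures` list rather than raising an exception.
        if resp["failures"]:
            retry.append(item)
    return retry  # handed back to the queue and retried on the next pass
```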

I'm also using DescribeTasks to check on task states, and I'm getting DockerTimeoutError and CannotInspectContainerError fairly often. These errors begin at about the same time the containers start getting stuck in the Created state. Also, `docker ps -a` becomes unresponsive, sometimes hanging for 10 minutes or more before I hit Ctrl+C.
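
The state check is just DescribeTasks over the ARNs we're tracking, roughly like this (a sketch; the errors surface in the task/container reason fields):

```python
def check_states(ecs, task_arns, cluster="my-cluster"):
    """Poll task state; agent errors such as DockerTimeoutError and
    CannotInspectContainerError show up in the reason fields."""
    # DescribeTasks accepts at most 100 task ARNs per call.
    for i in range(0, len(task_arns), 100):
        resp = ecs.describe_tasks(cluster=cluster, tasks=task_arns[i:i + 100])
        for task in resp["tasks"]:
            print(task["taskArn"], task["lastStatus"],
                  task.get("stoppedReason", ""))
```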

I originally thought that the Docker daemon was getting overwhelmed with hundreds of exited containers, so I built the amazon-ecs-agent dev branch to try the new ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION variable. Containers now get cleaned up after a few minutes, but the PENDING problem persists.
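
For reference, it's just an agent environment variable; on the ECS-optimized AMI I set it in `/etc/ecs/ecs.config` (the value below is an example, not a recommendation):

```
# /etc/ecs/ecs.config
# Shorten the wait before stopped tasks and their containers are
# cleaned up (the agent's default wait is much longer).
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=5m
```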

I also tried updating to Docker 1.10.0 but that didn't help, either.

I'm using the ECS-optimized AMI in us-east-1 (ami-cb2305a1), which includes Docker 1.9.1. Issues #296, #300, and #305 look similar and might be related.

Docker and ecs-agent logs are attached.

logs.zip

@samuelkarp
Contributor

Hi @trharris78, thanks for providing logs.

Your Docker logs show a panic that looks related to moby/moby#18481. The Agent logs show lines like this:

```
2016-02-08T17:00:00Z [INFO] Error transitioning container module="TaskEngine" task="xxxx:2 arn:aws:ecs:us-east-1:xxxx:task/xxxx, Status: (NONE->RUNNING) Containers: [xxxx (PULLED->RUNNING),]" container="xxxx(xxxx/xxxx) (PULLED->RUNNING)" state="CREATED"
2016-02-08T17:00:00Z [WARN] Error with docker; stopping container module="TaskEngine" task="xxxx:2 arn:aws:ecs:us-east-1:xxxx:task/xxxx, Status: (NONE->RUNNING) Containers: [xxxx (CREATED->RUNNING),]" container="xxxx(xxxx/xxxx) (CREATED->RUNNING)" err="Could not transition to created; timed out after waiting 3m0s"
```

The Agent logs are consistent with what I'd expect from that Docker issue: the Agent is attempting to talk to the Docker daemon, but the daemon is unresponsive. Can you provide the output of `docker info`, `sudo vgdisplay`, and `sudo lvdisplay`, plus anything that looks suspicious in `dmesg`? I think this is unrelated to #300 (that one involves Docker 1.7.1 and behavior around instance termination); it could be related to the other two issues you linked, but that's unclear as yet.
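
Something like this will capture all of that in one place (the file names are just suggestions):

```
docker info > docker-info.txt
sudo vgdisplay > vgdisplay.txt
sudo lvdisplay > lvdisplay.txt
dmesg > dmesg.txt
```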

@trharris78
Author

I'm unable to provide the requested logs because we re-architected our system to use services instead of standalone tasks. We may revisit this in the future.
