[Docker 1.7.1] Task stuck at PENDING #300

Closed
shibai opened this issue Feb 1, 2016 · 12 comments
Comments

@shibai

shibai commented Feb 1, 2016

Hi there,

I am running DynamoDB Cross-Region Replication from the AWS walkthrough (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.CrossRegionRepl.Walkthrough.Step2.html), but with the changes you provided in #277, in which it uses ecs-init with Docker 1.7.1. The problem appears after it has been running for about 3 days.

The problem is:
1. One of the EC2 instances crashes (or stops); it shuts down the task running on it, but it doesn't de-register itself from ECS.
2. ECS starts a new task on that failing instance.
3. Auto Scaling terminates the old instance and launches a new one. The new one registers.

In step 2, the task is always stuck in PENDING:
(Five screenshots attached, showing the task stuck in the PENDING state.)

This issue also happens when I manually stop one of the instances from the EC2 console.
Thanks,
Shibai
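
For reference, the stuck state described above can be inspected from the AWS CLI. The following is a minimal sketch, not taken from this thread; the cluster name "my-cluster" is a placeholder:

aws ecs list-tasks --cluster my-cluster --desired-status PENDING   # tasks that were placed but never started
aws ecs describe-container-instances --cluster my-cluster \
    --container-instances $(aws ecs list-container-instances --cluster my-cluster \
        --query 'containerInstanceArns[]' --output text) \
    --query 'containerInstances[].[ec2InstanceId,status,agentConnected]' --output table
# An instance that has crashed but never de-registered shows agentConnected=false; it can be
# removed by hand so the scheduler stops placing tasks on it:
# aws ecs deregister-container-instance --cluster my-cluster --container-instance <arn> --force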

@davidkelley

We're also seeing this problem frequently. Running around 200 tasks, there are always about 10-20 tasks stuck in PENDING due to the same problem described above.


@ghost

ghost commented Feb 4, 2016

Running only seven tasks on a t2.small container instance, I am seeing this on nearly every attempted deployment / service update. At the same time, "docker ps" is taking nearly forever to complete:

real 1m21.857s
user 0m0.024s
sys 0m0.000s

There is an issue for Docker 1.9 that may be related. Could the agent's frequent API calls be causing this? We are running ami-9886a0f2 on the container instance with Docker 1.9.1.
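
A rough way to separate a slow Docker daemon from the agent's own polling (a sketch; the log paths assume the ECS-optimized Amazon Linux AMI defaults):

time docker ps                                            # daemon API latency as seen by any client
time docker info > /dev/null                              # a second daemon call, to rule out container listing itself
grep -i timeout /var/log/ecs/ecs-agent.log | tail -n 20   # agent-side timeouts on its Docker calls
tail -n 50 /var/log/docker                                # daemon-side errors around the same time

If "docker ps" stays slow even with the agent stopped (sudo stop ecs on that AMI), the bottleneck is the daemon itself rather than the agent's API call volume.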

@samuelkarp
Contributor

@davidkelley @grantatsyncbak Are you seeing this with Docker 1.7.1 (like @shibai) or with Docker 1.9.1?

@gkeiser

gkeiser commented Feb 18, 2016

We are using ami-9886a0f2 with Docker 1.9.1.

@gkeiser

gkeiser commented Feb 18, 2016

This has been the case for some time, and Docker is unresponsive; it will not even stop with "sudo service docker stop".

The latest log entries:

D, Known Sent: NONE"
2016-02-16T10:55:52Z [INFO] Sending container change module="eventhandler" event="ContainerChange: arn:aws:ecs:us-east-1:501027711207:task/185eaaf6-7d17-4d45-9fb7-fe0ddbb0b06a ImageServiceDev -> STOPPED, Reason CannotPullContainerError: dial unix /var/run/docker.sock: too many open files, Known Sent: NONE" change="ContainerChange: arn:aws:ecs:us-east-1:501027711207:task/185eaaf6-7d17-4d45-9fb7-fe0ddbb0b06a ImageServiceDev -> STOPPED, Reason CannotPullContainerError: dial unix /var/run/docker.sock: too many open files, Known Sent: NONE"
2016-02-16T10:55:52Z [INFO] Saving state! module="statemanager"
2016-02-16T10:55:52Z [ERROR] Error saving state; could not create temp file to save state module="statemanager" err="open /data/tmp_ecs_agent_data840630121: too many open files"
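
The "too many open files" errors suggest the agent process has exhausted its file-descriptor limit. A rough check (run as root; a sketch that assumes the agent container is named "ecs-agent" as on the stock ECS-optimized AMI and that the daemon still answers "docker inspect"):

agent_pid=$(docker inspect --format '{{.State.Pid}}' ecs-agent)
ls /proc/$agent_pid/fd | wc -l                                     # descriptors the agent currently holds
awk '/Max open files/ {print "soft:", $4, "hard:", $5}' /proc/$agent_pid/limits
cat /proc/sys/fs/file-nr                                           # system-wide: allocated vs. maximum

A count near the soft limit would be consistent with the CannotPullContainerError and statemanager failures in the log above.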

@samuelkarp
Contributor

@gkeiser Can you open a separate issue? (Or maybe what you're seeing is related to #313?) I'd like to keep this issue focused on @shibai's issue with Docker 1.7.1.

samuelkarp changed the title from "Task stuck at PENDING" to "[Docker 1.7.1] Task stuck at PENDING" on Feb 19, 2016
@aaithal
Contributor

aaithal commented May 2, 2016

@shibai Please let us know if you are still seeing this issue on the latest AMI.

@ghost

ghost commented Jun 13, 2016

@shibai Thank you for reporting this. We are aware of an issue where the ECS scheduler may place tasks on instances that are in the process of stopping. We are investigating this issue and will provide an update as soon as we have more to share.

@jawang35

jawang35 commented Dec 7, 2016

@MarcelvR Is there any news on this? I am experiencing the same problem; my task instances are in an unstable state, cycling up and down.

@milla

milla commented Jun 9, 2017

I am experiencing the same problem.

@zaakiy

zaakiy commented Jul 24, 2017

I am also experiencing this same problem on ECS. It had been working fine for 2 months, but I am now seeing the issue exactly as described above.
Agent version 1.14.3
Docker version 17.03.1-ce

Update: I suspect it was a problem with the ECS container host. I launched another container host into the cluster and it handled things fine (i.e., no long delays in the PENDING state). I terminated the old host, and now I don't seem to be having problems.
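
For anyone replacing a host this way on a reasonably recent agent, the old instance can be put into DRAINING before it is terminated so the scheduler stops placing new tasks on it. A sketch, with the cluster name and instance ARN as placeholders:

aws ecs update-container-instances-state --cluster my-cluster \
    --container-instances <container-instance-arn> --status DRAINING
# Service tasks on the draining instance are rescheduled elsewhere before you terminate it.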

aaithal assigned and then unassigned richardpen on Jul 25, 2017
@adnxn
Contributor

adnxn commented Nov 28, 2017

@shibai The original issue appears to be related to reaping zombie tasks on unresponsive instances. I'm closing this in favor of #1115.

adnxn closed this as completed on Nov 28, 2017