Performance degradation with most recent ecs-optimized AMI #305

Closed
arjun810 opened this issue Feb 5, 2016 · 9 comments
@arjun810

arjun810 commented Feb 5, 2016

I tried to post this over on the ECS forums, but it wouldn't let me post, so I'm filing it here.

I recently updated our cluster to use the 2015.09.e AMI in us-west-2 (ami-1ead497e). This led to several performance issues. They didn't seem to affect every machine, but they did affect many:

  1. Tasks would stay in the 'pending' state much longer than usual.
  2. Running any Docker command would take a very long time. 'docker ps' would often just hang until I killed it a minute later.
  3. Tasks scheduled by ecs-agent would often take significantly longer than expected to complete.
  4. Looking at CPU utilization, the misbehaving instances would often sit below 5% CPU even though they had been given a task that should have consumed an entire CPU until completion.

Downgrading to 2015.09.c seemed to get rid of the issue, but we'd like to use a newer version of docker.

Has anyone seen anything similar?
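For anyone trying to confirm the same symptoms, a minimal check for items 1 and 2 above is sketched here. The cluster name `my-cluster` and the 30-second cutoff are placeholders, and it assumes the AWS CLI is configured on the machine running it:

```
# Flag a hung `docker ps`: give up after 30 seconds instead of waiting indefinitely.
timeout 30 docker ps > /dev/null && echo "docker ps OK" || echo "docker ps hung or failed"

# Count tasks currently stuck in PENDING on the cluster.
aws ecs list-tasks --cluster my-cluster --desired-status PENDING \
    --query 'length(taskArns)' --output text
```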

@mikealbert

We are seeing the same hanging issues with 'docker ps' on the latest AMI. Seems like it may be due to this bug in Docker 1.9:

moby/moby#19328

@arjun810
Author

arjun810 commented Feb 6, 2016

Yeah, that seems like a possible candidate. It's definitely not just docker ps though, and it's actually affecting the runtime of tasks as well.

arjun810 closed this as completed on Feb 6, 2016
arjun810 reopened this on Feb 6, 2016
@hridyeshpant

This is the same issue we are facing after moving to "amzn-ami-2015.09.d-amazon-ecs-optimized":

#296 (comment)

@nzoschke

I also encountered serious problems on AMI upgrade. You can see a full account of my working through it here:

https://github.com/convox/rack/issues/314

The root cause was upgrading to the .d AMI and missing this important note in the Launching an Amazon ECS Container Instance docs:

If you are using the 2015.09.d or later Amazon ECS-optimized AMI, your instance has two volumes configured. The Root volume is for the operating system's use, and the second Amazon EBS volume (attached to /dev/xvdcz) is for Docker's use.

The worst problem was that, on some workloads, the Docker volume would fill to 100%, locking up the filesystem and Docker.
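For anyone checking whether they're hitting the same thing: on the .d and later AMIs, Docker's storage sits on the second EBS volume (/dev/xvdcz) as an LVM-backed devicemapper pool, so a quick look at its usage is roughly the sketch below (exact field names depend on the Docker version and storage driver):

```
# Thin-pool usage as Docker itself reports it (Data Space Used / Available / Total).
docker info 2>/dev/null | grep -i 'data space'

# LVM view of the pool, plus block devices, to confirm xvdcz is the backing volume.
sudo lvdisplay
lsblk
```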

Even after implementing a better container/image cleaner, I am still seeing some docker ps lockups. Thanks for the tip there, @mikealbert.
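As context for the "container/image cleaner" mentioned above, a minimal sketch of that kind of periodic cleanup (not the exact script referenced, and written for Docker 1.9, which predates `docker system prune`) looks like this:

```
#!/bin/sh
# Remove containers that have already exited; ignore the error when there are none.
docker rm $(docker ps -aq -f status=exited) 2>/dev/null

# Remove dangling (untagged) images left behind by pulls and builds.
docker rmi $(docker images -q -f dangling=true) 2>/dev/null
```

Run on a schedule (cron or similar), this keeps exited containers and orphaned image layers from slowly filling the /dev/xvdcz volume.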

@samuelkarp
Contributor

@arjun810 Thanks for reporting this issue. Do you have logs that you can provide from the ECS Agent and the Docker daemon as well as the output of docker info, sudo vgdisplay, sudo lvdisplay, and anything that looks suspicious in dmesg?
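For reference, gathering that information on the ECS-optimized AMI looks roughly like the sketch below (log paths assume the standard Amazon Linux locations for the ECS agent and the Docker daemon; adjust if your setup differs):

```
# ECS agent logs (written under /var/log/ecs on the ECS-optimized AMI).
sudo tar czf ecs-agent-logs.tgz /var/log/ecs/

# Docker daemon log on Amazon Linux.
sudo cp /var/log/docker docker-daemon.log

# Daemon and storage state requested above.
docker info > docker-info.txt
sudo vgdisplay > vgdisplay.txt
sudo lvdisplay > lvdisplay.txt

# Recent kernel messages (hung-task warnings, I/O errors, OOM kills).
dmesg | tail -n 200 > dmesg-tail.txt
```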

@samuelkarp
Contributor

@arjun810 Are you still seeing this problem with the most recent AMI (2015.09.g)?

@samuelkarp
Contributor

@arjun810 Since we haven't heard back from you, I'm going to close this issue. If you do get the chance to gather the debugging information I've requested, please feel free to reopen this issue.

@arjun810
Author

Sorry for taking a bit to check it out. I did just try out the new image and haven't experienced any issues so far. Thanks so much!

@arjun810
Author

Actually, it looks like I spoke too soon. I upgraded our cluster to this image last night, and now we're seeing similar issues sporadically. Specifically, tasks will randomly take significantly longer to complete than they should.

When I SSH in, I sometimes (but rarely) see docker ps hang, and I occasionally see docker run hang as well. The symptoms are quite sporadic.

However, I can definitely say that it's not as stable as the old image I was using -- we used that for weeks without issue.

I'll collect the requested logs, but the intermittent nature is pretty strange. I don't see high CPU usage or anything like that when docker commands hang.
