Performance degradation with most recent ecs-optimized AMI #305

Closed
arjun810 opened this issue Feb 5, 2016 · 9 comments
@arjun810

arjun810 commented Feb 5, 2016

I tried to post this over on the ECS forums, but it wouldn't let me post, so I'm filing it here.

I recently updated our cluster to use the 2015.09.e AMI in us-west-2 (ami-1ead497e). This led to several performance issues. They didn't seem to affect every machine, but they did affect many:

  1. Tasks would stay in the 'pending' state much longer than usual.
  2. Running any Docker command would take a very long time. 'docker ps' would often just hang until I killed it a minute later.
  3. Tasks scheduled by ecs-agent would often take significantly longer than expected to complete.
  4. Looking at CPU utilization, the misbehaving instances would often sit below 5% CPU even though they had been given a task that should have consumed an entire CPU until completion.

Downgrading to 2015.09.c seemed to get rid of the issue, but we'd like to use a newer version of docker.

Has anyone seen anything similar?
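For anyone trying to confirm the same symptoms, a minimal check for items 1 and 2 above is sketched here. The cluster name `my-cluster` and the 30-second cutoff are placeholders, and it assumes the AWS CLI is configured on the machine running it:

```
# Flag a hung `docker ps`: give up after 30 seconds instead of waiting indefinitely.
timeout 30 docker ps > /dev/null && echo "docker ps OK" || echo "docker ps hung or failed"

# Count tasks currently stuck in PENDING on the cluster.
aws ecs list-tasks --cluster my-cluster --desired-status PENDING \
    --query 'length(taskArns)' --output text
```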

@mikealbert

We are seeing the same hanging issues with 'docker ps' on the latest AMI. Seems like it may be due to this bug in Docker 1.9:

moby/moby#19328

@arjun810
Author

arjun810 commented Feb 6, 2016

Yeah, that seems like a possible candidate. It's definitely not just docker ps though, and it's actually affecting the runtime of tasks as well.

arjun810 closed this as completed on Feb 6, 2016
arjun810 reopened this on Feb 6, 2016
@hridyeshpant

This is the same issue we are facing after moving to "amzn-ami-2015.09.d-amazon-ecs-optimized":

#296 (comment)

@nzoschke

I also encountered serious problems on AMI upgrade. You can see a full account of my working through it here:

https://github.com/convox/rack/issues/314

The root cause was upgrading to the .d AMI and missing this important note in the Launching an Amazon ECS Container Instance docs:

If you are using the 2015.09.d or later Amazon ECS-optimized AMI, your instance has two volumes configured. The Root volume is for the operating system's use, and the second Amazon EBS volume (attached to /dev/xvdcz) is for Docker's use.

The worst problem was that, on some workloads, the Docker volume would fill to 100%, locking up the filesystem and Docker.
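For anyone checking whether they're hitting the same thing: on the .d and later AMIs, Docker's storage sits on the second EBS volume (/dev/xvdcz) as an LVM-backed devicemapper pool, so a quick look at its usage is roughly the sketch below (exact field names depend on the Docker version and storage driver):

```
# Thin-pool usage as Docker itself reports it (Data Space Used / Available / Total).
docker info 2>/dev/null | grep -i 'data space'

# LVM view of the pool, plus block devices, to confirm xvdcz is the backing volume.
sudo lvdisplay
lsblk
```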

Even after implementing a better container/image cleaner, I am still seeing some docker ps lockups. Thanks for the tip there, @mikealbert.
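As context for the "container/image cleaner" mentioned above, a minimal sketch of that kind of periodic cleanup (not the exact script referenced, and written for Docker 1.9, which predates `docker system prune`) looks like this:

```
#!/bin/sh
# Remove containers that have already exited; ignore the error when there are none.
docker rm $(docker ps -aq -f status=exited) 2>/dev/null

# Remove dangling (untagged) images left behind by pulls and builds.
docker rmi $(docker images -q -f dangling=true) 2>/dev/null
```

Run on a schedule (cron or similar), this keeps exited containers and orphaned image layers from slowly filling the /dev/xvdcz volume.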

@samuelkarp
Contributor

@arjun810 Thanks for reporting this issue. Do you have logs that you can provide from the ECS Agent and the Docker daemon as well as the output of docker info, sudo vgdisplay, sudo lvdisplay, and anything that looks suspicious in dmesg?
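For reference, gathering that information on the ECS-optimized AMI looks roughly like the sketch below (log paths assume the standard Amazon Linux locations for the ECS agent and the Docker daemon; adjust if your setup differs):

```
# ECS agent logs (written under /var/log/ecs on the ECS-optimized AMI).
sudo tar czf ecs-agent-logs.tgz /var/log/ecs/

# Docker daemon log on Amazon Linux.
sudo cp /var/log/docker docker-daemon.log

# Daemon and storage state requested above.
docker info > docker-info.txt
sudo vgdisplay > vgdisplay.txt
sudo lvdisplay > lvdisplay.txt

# Recent kernel messages (hung-task warnings, I/O errors, OOM kills).
dmesg | tail -n 200 > dmesg-tail.txt
```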

@samuelkarp
Contributor

@arjun810 Are you still seeing this problem with the most recent AMI (2015.09.g)?

@samuelkarp
Contributor

@arjun810 Since we haven't heard back from you, I'm going to close this issue. If you do get the chance to gather the debugging information I've requested, please feel free to reopen this issue.

@arjun810
Author

Sorry for taking a bit to check it out. I did just try out the new image and haven't experienced any issues so far. Thanks so much!

@arjun810
Author

Actually, it looks like I spoke too soon. I upgraded our cluster to this image last night, and now we're seeing similar issues sporadically. Specifically, tasks will randomly take significantly longer to complete than they should.

When I SSH in, I sometimes (but rarely) see docker ps hang, and I occasionally see docker run hang as well. The symptoms are quite sporadic.

However, I can definitely say that it's not as stable as the old image I was using -- we used that for weeks without issue.

I'll collect the requested logs, but the intermittent nature is pretty strange. I don't see high CPU usage or anything like that when docker commands hang.
