Higher memory usage since upgrading ami #507
Comments
Any update on this one? We have this in all our environments.
I'm seeing additional memory issues as well.
We are seeing this issue as well. Is the docker version upgrade causing this?
I have also started observing memory issues.
Is anyone facing
@dza89 @nitrag @naineel @cshivashankar @sbkg0002 I'm working on reproducing this. My first attempts were unsuccessful. I'm going to keep working at it, but I have a few questions:
If you have any other information regarding your setup that you think might be useful, let me know! If I get a repro, I'll update this issue accordingly.
Cluster A (dev cluster): moderate activity, memory used to be lower.
Cluster B (qa cluster): low activity.
Cluster C (production cluster): high memory, seeing evictions in the kube-system node group because of it.
Want me to upgrade the QA node to 1.15 to see if memory increases? The increased memory utilization is only evident on
@nitrag Sure, that would be helpful. I thought I had a repro, but I was just misreading things. If you can verify that the memory increases, that would be helpful. Basically, I created two 1.16 clusters and created managed node groups with v20200507 in one and v20200618 in the other. I installed Prometheus and a random container to spin up some pods, and the memory usage looks the same across both clusters. Also, what specifically is the memory metric from? I want to make sure I'm using the same metric.
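For anyone comparing numbers here, it may help to name which cAdvisor metric a dashboard is reading, since working set and RSS can diverge. A minimal sketch, assuming Prometheus was installed via the operator (the prometheus-operated service name and monitoring namespace are assumptions) and the default cAdvisor metric names are scraped:

```bash
# Forward the Prometheus API locally (adjust namespace/service to your install)
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
sleep 2

# Working set per pod: what kubelet eviction and most dashboards look at
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(container_memory_working_set_bytes{container!=""}) by (pod)'

# RSS per pod, for comparison; working set can also include active page cache
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(container_memory_rss{container!=""}) by (pod)'
```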
Another observation: I am getting a lot of PLEG issues after using the AMI that has Docker 19.03. Before that, nodes never flipped to the NotReady state.
@cshivashankar I know you've been communicating on the soft lockup issue that we suspect is related to something in the Linux kernel. I'm trying to rule in or rule out a relation between the two issues as I've seen multiple customers experience both of them around the same time. More specifically, I'm curious if you've tried upgrading the kernel via
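For context, here is a minimal sketch of what a kernel update on an Amazon Linux 2 node could look like; the specific upgrade path referred to above is elided, so this only illustrates pulling the latest stock AL2 kernel:

```bash
# Check the kernel the node is currently running
uname -r

# Update to the latest kernel available in the standard AL2 repos, then reboot into it
sudo yum update -y kernel
sudo reboot
```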
Hi @mmerkes, thanks for reaching out. Yes, the soft lockup and PLEG issues are causing a lot of trouble in my environment. Another weird error I observed lately: a container was not cleaned up because of an orphaned pod, and nodes were flipping from NodeReady to NodeNotReady once every 20-30 minutes. When I investigated, I found that the container related to the orphaned pod was not responsive. What's really surprising is why the nodes were flipping every 20 minutes rather than staying constantly Ready or NotReady. This could have happened due to issues in garbage collection, or controllers not being handled well by cgroups.
This hasn't resolved the soft lockup issue, so I think it's unlikely to resolve this memory issue.
I'm still working on this issue but haven't been able to reproduce it, so I'd like additional information if you're willing to share. Feel free to open an AWS support case to share the information; AWS Support can share it with me directly.
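As a hedged sketch (not an official checklist), these are the kinds of node-level details that could be attached to a support case or to this issue:

```bash
# AMI the node was launched from (run on the node; IMDSv1 shown for brevity)
curl -s http://169.254.169.254/latest/meta-data/ami-id

# Kernel, Docker, and kubelet versions (docker may need sudo)
uname -r
docker version --format '{{.Server.Version}}'
kubelet --version

# Current memory picture on the node
free -m
```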
I'm not sure if it's related, but I've been migrating Haskell web APIs from vanilla EC2 to EKS and I'm seeing 3x+ the memory usage in Kubernetes. Apps used to weigh ~60MB and now don't go below 180MB (and often go above 600MB). Both measurements are taken from Datadog's Processes view, which probably uses
I'm running 1.17 and using two AMIs:
Both are showing the same pattern. I'm planning on upgrading to 1.18 and I can grab the new kernel then, to confirm whether this is still an issue.
@omnibs Thanks for the information. Are your Haskell web APIs the only thing running on those nodes, and if not, are your other pods seeing a spike in memory usage as well? What AMI and Docker version were you using on vanilla EC2? Let me know if the kernel upgrade helps. Otherwise, any other details on your setup would be helpful :)
The Haskell apps are the only thing we've migrated so far. We have DaemonSets and the like, but we don't have a baseline for comparison for those. We aren't using Docker on EC2, and we're using custom AMIs we built with Packer based off of
I'll come back with more information once I've upgraded to 1.18 and got a new kernel.
We are seeing this issue as well after upgrading our EKS control plane and worker nodes from 1.16 to 1.17. Over a period of two weeks, memory utilization has gradually gone up to ~95% on two out of three worker nodes. There is no load on the environment, and this pattern is observed on worker nodes in all environments where we upgraded to 1.17.
AMI Name:
Worker node version:
Current worker node memory usage:
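For comparison across clusters, node-level figures like the ones above can also be read through metrics-server, assuming it is installed; a quick sketch:

```bash
# Per-node CPU/memory usage (requires metrics-server)
kubectl top nodes

# Heaviest pods by memory across the cluster
kubectl top pods --all-namespaces --sort-by=memory
```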
We've recently upgraded from
Has anyone here tried upgrading to
Update: We confirm that upgrading to AMI
Hello, we've also been having these issues on
On our end we observed a constant increase in node RAM usage (in orange below) while the sum of RAM usage for all pods (in cyan) remained somewhat stable.
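A rough way to reproduce that node-vs-pods comparison, assuming node-exporter and cAdvisor metrics are scraped under their default names and the Prometheus API is port-forwarded to localhost:9090 as in the earlier sketch:

```bash
# Memory in use per node: total minus available (node-exporter)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes'

# Sum of pod working sets per kubelet instance (cAdvisor), to see the gap between the two curves
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(container_memory_working_set_bytes{container!="",pod!=""}) by (instance)'
```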
Summary
Edited by @mmerkes from AWS
Some customers are reporting increased memory usage in pods after migrating from EKS managed AMIs built in May 2020 or earlier to newer AMIs. We haven't been able to reproduce the issue yet, so additional information would be helpful; see below for the kind of details that would help. You can add information to this issue or open an AWS support case; AWS Support can share information with the service team directly. Possible causes: Docker upgrade, kernel 4.14 issue.
Original Post
Provided by @dza89
What happened:
We've upgraded from v20200507 to v20200618.
Since the upgrade we are experiencing higher memory usage on all our pods.
Example:
Prometheus operator goes from 500MB at idle to 1.4GB.
I would like some help debugging this further.
The only notable difference is the Docker version, which went from 18.09.9ce-2.amzn2 to 19.03.6ce-4.amzn2.
What you expected to happen:
No higher memory usage
How to reproduce it (as minimally and precisely as possible):
Upgrade from v20200507 to v20200618.
Anything else we need to know?:
Environment:
EKS Platform version (aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.2
Kubernetes version (aws eks describe-cluster --name <name> --query cluster.version): 1.16