Higher memory usage since upgrading AMI #507

Closed
dza89 opened this issue Jul 10, 2020 · 21 comments

@dza89

dza89 commented Jul 10, 2020

Summary

Edited by @mmerkes from AWS

Some customers are reporting increased memory usage in pods after migrating from EKS-managed AMIs built in May 2020 or earlier to newer AMIs. We haven't been able to reproduce the issue yet, so additional information would be helpful; see the questions below. You can add information to this issue or open an AWS support case, and AWS Support can share it with the service team directly. Possible causes: the Docker upgrade or a kernel 4.14 issue.

  1. Did you try updating the kernel to 5.4 as described in this soft lockup issue?
  • I don't know that they're related, but I'm curious whether others are seeing both issues.
  2. What kind of workloads are you running on the nodes? As much detail as possible would be helpful.
  • i.e., what pods are running on the nodes? What do they do?
  • High CPU usage? High memory? High IOPS?
  3. What's your base container image? Is it based on Ubuntu, CentOS, etc.? Do you have any specific setup?
  • The pod spec YAML would be super helpful, if part or all of it can be shared.
  4. What EC2 instance types are you noticing this on?
  5. Are you noticing this on all nodes, or just on nodes with specific pods running on them?
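
A rough sketch of commands that can collect most of this information, assuming SSH access to an affected node running the EKS-optimized Amazon Linux 2 AMI and a cluster with metrics-server installed (<node-name> is a placeholder):

# On the node itself:
uname -r                                        # kernel version (4.14.x vs 5.4.x)
docker version --format '{{.Server.Version}}'   # Docker daemon version
curl -s http://169.254.169.254/latest/meta-data/instance-type   # EC2 instance type (IMDSv1)

# From a machine with cluster access:
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
kubectl top pods --all-namespaces --sort-by=memory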

Original Post

Provided by @dza89

What happened:

We've upgraded from v20200507 to v20200618.
Since the upgrade we are experiencing higher memory usage on all our pods.

Example:
The Prometheus operator goes from 500 MB when idle to 1.4 GB.
[screenshot: Prometheus operator memory usage]

I would like some help figuring out how to debug this further.

The only notable difference is the Docker version, which went from 18.09.9ce-2.amzn2 to 19.03.6ce-4.amzn2.

What you expected to happen:

No increase in memory usage.

How to reproduce it (as minimally and precisely as possible):

Upgrade from v20200507 to v20200618.

Anything else we need to know?:

Environment:

  • AWS Region: eu-west-1
  • Instance Type(s): all
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.2
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.16
  • AMI Version: see above
@sbkg0002

Any update on this one? We have this in all our environments.

@nitrag

nitrag commented Jul 29, 2020

I'm seeing additional memory issues as well.

@naineel

naineel commented Aug 5, 2020

We are seeing this issue as well. Is the Docker version upgrade causing this?

@cshivashankar

I have also started observing memory issues.
Any updates on this?

@cshivashankar

Is anyone facing "PLEG is not healthy" or SystemOOM issues after moving to Docker daemon 19.03?
These issues were nonexistent before the Docker version upgrade.
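
A quick sketch of checks for these symptoms, assuming SSH access to the node and that kubelet runs as a systemd unit (as it does on the EKS-optimized AMI); <node-name> is a placeholder:

journalctl -u kubelet --since "1 hour ago" | grep -iE 'pleg|oom'        # PLEG / OOM messages from kubelet
kubectl get events --all-namespaces --field-selector reason=SystemOOM   # SystemOOM node events
kubectl describe node <node-name> | grep -A 10 'Conditions:'            # the Ready condition carries the "PLEG is not healthy" message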

@mmerkes
Member

mmerkes commented Aug 31, 2020

@dza89 @nitrag @naineel @cshivashankar @sbkg0002 I'm working on reproducing this. My first attempts were unsuccessful. I'm going to keep working at it, but I have a few questions:

  1. Is anybody experiencing this on 1.17 AMIs?
  2. Is anybody experiencing this on the latest 1.16 AMI? 1.16.13-20200814
  3. How many nodes are you running in your cluster?
  4. Are you seeing this issue on 100% of your pods, including non-Prometheus pods?

If you have any other information regarding your setup that you think might be useful, let me know! If I get a repro, I'll update this issue accordingly.

@nitrag

nitrag commented Aug 31, 2020

@mmerkes

Cluster A (dev cluster, moderate activity; used to be lower):
EKS: 1.15
EKS Node: v1.15.11-eks-065dce
Memory: 432 MiB
Total Pods: 60
Total Nodes: 9

Cluster B (QA cluster, low activity):
EKS: 1.15
EKS Node: v1.14.9-eks-658790
Memory: 256 MiB
Total Pods: 60
Total Nodes: 9

Cluster C (production cluster, high memory; seeing evictions in the kube-system node group because of it):
EKS: 1.15
EKS Node: v1.15.11-eks-065dce
Memory: 716 MiB
Total Pods: 170
Total Nodes: 16

Want me to upgrade the QA nodes to 1.15 to see if memory increases? The increased memory utilization is only evident on aws-cluster-autoscaler. No attempts at 1.16 or 1.17 yet.

@mmerkes
Member

mmerkes commented Aug 31, 2020

@nitrag Sure, that would be helpful. Thought I had a repro, but was just misreading things. If you can verify that the memory increases, that would be helpful.

Basically, I created two 1.16 clusters and created managed node groups with v20200507 in one and v20200618 in the other. I installed Prometheus and a random container to spin up some pods, and the memory usage looks the same across both clusters.

Also, where specifically does that memory metric come from? I want to make sure I'm using the same metric.

@cshivashankar

(Quoting @mmerkes' questions from the comment above.)

@mmerkes

  1. I am running 1.15, 1.14, and 1.13 nodes.
  2. I have not used it.
  3. 25
  4. I will check and come back.

Another observation: I am getting a lot of PLEG issues after using the AMI that has Docker 19.03. Before that, nodes never flipped to the NotReady state.

@nitrag

nitrag commented Sep 1, 2020

@mmerkes

I am SSHing into the node and running docker stats to grab memory usage.

I just upgraded/replaced the node and it's now running v1.15.11-eks-065dce. Memory usage is 301 MiB, so roughly 18% higher utilization. This is without any load on the application (cluster-autoscaler v1.15.7).
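
For anyone taking the same measurement, a one-shot (non-streaming) variant of that command looks roughly like this:

docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}'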

@mmerkes
Member

mmerkes commented Oct 1, 2020

@cshivashankar I know you've been communicating on the soft lockup issue that we suspect is related to something in the Linux kernel. I'm trying to rule in or rule out a relation between the two issues as I've seen multiple customers experience both of them around the same time. More specifically, I'm curious if you've tried upgrading the kernel via amazon-linux-extras install kernel-ng and noticed any difference in this issue.
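
For reference, a minimal sketch of that kernel upgrade on a node (kernel-ng installs the 5.4 kernel on Amazon Linux 2, and the node needs a reboot to pick it up):

sudo amazon-linux-extras install kernel-ng
sudo reboot
# after the node comes back up:
uname -r    # should now report a 5.4.x kernel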

@cshivashankar

Hi @mmerkes, thanks for reaching out. Yes, soft lockup and PLEG issues are causing a lot of trouble in my environment.
I have tried upgrading the kernel in my test environment but not in production. However, the soft lockup issues are only being observed in production, not in the test environment, and despite trying multiple ways to simulate the error in the test environment, it doesn't happen. So I cannot confirm whether upgrading the kernel made any difference. If required, I can try this in production and report back.

Another weird error I observed lately: a container was not cleaned up because of an orphaned pod, and nodes were flipping from Ready to NotReady once every 20-30 minutes. When I investigated, I found that the container related to the orphaned pod was unresponsive. What's really surprising is that the nodes were flipping every 20 minutes rather than staying constantly Ready or NotReady. This could have happened due to issues in garbage collection, or the controllers not being handled well by cgroups.
This issue might not be related, but I'm sharing my thoughts in case it helps.

@mmerkes
Member

mmerkes commented Oct 2, 2020

Kernel 4.14.198-152.320.amzn2 includes patches that we hope resolve some of the issues, and it is now available via yum update kernel. You could try updating the kernel and see if that fixes your issue.

This hasn't resolved the soft lockup issue, so I think it's unlikely to resolve this memory issue.
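
For anyone who still wants to rule it out, a sketch of the yum-based update mentioned above (run on the node):

sudo yum update kernel
sudo reboot
uname -r    # should report 4.14.198-152.320.amzn2 or later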

@mmerkes
Member

mmerkes commented Oct 13, 2020

I'm still working on this issue, but haven't been able to reproduce, so I'd like additional information if you are willing to share. Feel free to open an AWS support case to share the information. AWS Support can share information with me directly.

  1. Did you try updating the kernel to 5.4 as described in this soft lockup issue?
  • I don't know that they're related, but I'm curious whether others are seeing both issues.
  2. What kind of workloads are you running on the nodes? As much detail as possible would be helpful.
  • i.e., what pods are running on the nodes? What do they do?
  • High CPU usage? High memory? High IOPS?
  3. What's your base container image? Is it based on Ubuntu, CentOS, etc.? Do you have any specific setup?
  • The pod spec YAML would be super helpful, if part or all of it can be shared.
  4. What EC2 instance types are you noticing this on?
  5. Are you noticing this on all nodes, or just on nodes with specific pods running on them?

@omnibs

omnibs commented Oct 28, 2020

I'm not sure if it's related, but I've been migrating Haskell web APIs from vanilla EC2 to EKS and I'm seeing 3x+ the memory usage in Kubernetes.

Apps used to weigh ~60 MB and now don't drop below 180 MB (and often go above 600 MB). Both measurements are taken from Datadog's Processes view, which probably uses ps.
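
A sketch of how those two kinds of measurements can be compared on a node, assuming SSH access; <pid> is a placeholder for the app's process ID:

ps -o pid,rss,vsz,comm -p <pid>   # per-process resident set size (KiB), roughly what a ps-based Processes view reports
docker stats --no-stream          # per-container cgroup accounting, which can differ from summed process RSS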

I'm running 1.17 and using two AMIs:

  • amazon-eks-node-1.17-v20200710 with Docker 19.03.6
  • amazon-eks-node-1.17-v20200710 rebuilt to use Docker 18.09.9

Both are showing the same pattern.

I'm planning on upgrading to 1.18 and I can grab the new kernel then, to confirm whether this is still an issue.

@mmerkes
Member

mmerkes commented Oct 28, 2020

@omnibs Thanks for the information. Are your Haskell web APIs the only thing running on those nodes, and if not, are your other pods showing a spike in memory usage as well? What AMI and Docker version were you using on vanilla EC2?

Let me know if the kernel upgrade helps. Otherwise, any other details on your setup would be helpful :)

@omnibs

omnibs commented Oct 30, 2020

The Haskell apps are the only thing we've migrated so far. We have DaemonSets and the like, but we don't have a baseline for comparison for those.

We aren't using Docker on EC2, and we're using custom AMIs we built with Packer, based off ami-00a0cf500e59c9f7c (ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20200521).

I'll come back with more information once I've upgraded to 1.18 and gotten a new kernel.

@vpayala

vpayala commented Dec 11, 2020

We are seeing this issue as well after upgrading our EKS control plane and worker nodes from 1.16 to 1.17. Over a period of two weeks, memory utilization has gradually gone up to ~95% on two out of three worker nodes. There is no load on the environment, and this pattern is observed on worker nodes in every environment where we upgraded to 1.17.

AMI Name: amazon-eks-node-1.17-v20201112

Worker node version,

$ kubectl get nodes
NAME                                           STATUS   ROLES    AGE   VERSION
ip-x1-x-x-x.us-west-2.compute.internal    Ready    <none>   21d   v1.17.12-eks-7684af
ip-x2-x-x-x.us-west-2.compute.internal   Ready    <none>   21d   v1.17.12-eks-7684af
ip-x3-x-x-x.us-west-2.compute.internal     Ready    <none>   21d   v1.17.12-eks-7684af

Current worker node memory usage,

kubectl top node --sort-by='memory'
NAME                                           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-x1-x-x-x.us-west-2.compute.internal   184m         4%     14376Mi         97%
ip-x2-x-x-x.us-west-2.compute.internal    350m         8%     13854Mi         93%
ip-x3-x-x-x.us-west-2.compute.internal     182m         4%     10911Mi         73%

Worker node memory usage since upgrading to 1.17,
[screenshot: worker node memory usage trend since the 1.17 upgrade]
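
One way to see where the node memory is actually going (pod cgroups, kernel slab, or page cache) is to break it down on the node itself. A rough sketch, assuming SSH access and cgroup v1 with the cgroupfs driver; the kubepods path may differ (e.g. kubepods.slice) under the systemd cgroup driver:

free -m                                                     # overall used / free / buff-cache
grep -E 'MemAvailable|Slab|SReclaimable' /proc/meminfo      # available memory and kernel slab
sudo slabtop -o | head -20                                  # largest slab caches
cat /sys/fs/cgroup/memory/kubepods/memory.usage_in_bytes    # total memory charged to pod cgroups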

@ghost

ghost commented Jan 4, 2021

We've recently upgraded from 1.14 to 1.16. We didn't spend much time running 1.15, so I can't say much about it, but we're definitely seeing the same issue here with v1.16.15-eks-ad4801.

Has anyone here tried upgrading to 1.18?

@ghost

ghost commented Jan 11, 2021

Hey all, in my case, it looks like upgrading to 1.18 has taken care of the memory issue. I'm using v1.18.9-eks-d1db3c

[screenshot: node memory usage before and after the upgrade]
There's a clear difference in memory usage before and after the upgrade.

@seddarj

seddarj commented Jan 20, 2021

Update: We confirm that upgrading to AMI amazon-eks-node-1.18-v20210112 has solved the issue for us 🎉

Hello, we've also been having these issues on amazon-eks-node-1.18-v20201112 and are currently upgrading to amazon-eks-node-1.18-v20210112 (both seem to be using v1.18.9-eks-d1db3c). We'll report back if it solves the issue for us.

On our end we observed a constant increase in node RAM usage (in orange below) while the sum of RAM usage for all pods (in cyan) remained somewhat stable.
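
For reference, a rough sketch of that comparison using metrics-server data (this assumes kubectl top reports pod memory in Mi, and it sums across all pods in the cluster; <node-name> is a placeholder):

# node-level usage as reported by metrics-server:
kubectl top node <node-name>
# sum of pod-level usage (column 4 is MEMORY in Mi with --all-namespaces):
kubectl top pods --all-namespaces --no-headers | awk '{gsub("Mi","",$4); sum+=$4} END {print sum " Mi"}'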

[screenshot: node RAM usage (orange) vs. sum of pod RAM usage (cyan)]
