Higher memory usage since upgrading AMI #507

Closed
dza89 opened this issue Jul 10, 2020 · 21 comments

@dza89

dza89 commented Jul 10, 2020

Summary

Edited by @mmerkes from AWS

Some customers are reporting increased memory usage in pods after migrating from EKS-managed AMIs built in May 2020 or earlier to newer AMIs. We haven't been able to reproduce the issue yet, so additional information would be helpful; see the questions below. You can add information to this issue or open an AWS support case, and AWS Support can share it with the service team directly. Possible causes: the Docker upgrade or a kernel 4.14 issue.

  1. Did you try updating the kernel to 5.4 as described in this soft lockup issue?
  • I don't know that they're related, but I'm curious whether others are seeing both issues.
  2. What kind of workloads are you running on the nodes? As much detail as possible would be helpful.
  • i.e., what pods are running on the nodes? What do they do?
  • High CPU usage? High memory? High IOPS?
  3. What's your base container image? Is it based on Ubuntu, CentOS, etc.? Do you have any specific setup?
  • The pod spec YAML would be super helpful, if part or all of it can be shared.
  4. What EC2 instance types are you noticing this on?
  5. Are you noticing this on all nodes, or just on nodes with specific pods running on them?
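
A rough sketch of commands that can collect most of this information, assuming SSH access to an affected node running the EKS-optimized Amazon Linux 2 AMI and a cluster with metrics-server installed (<node-name> is a placeholder):

# On the node itself:
uname -r                                        # kernel version (4.14.x vs 5.4.x)
docker version --format '{{.Server.Version}}'   # Docker daemon version
curl -s http://169.254.169.254/latest/meta-data/instance-type   # EC2 instance type (IMDSv1)

# From a machine with cluster access:
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
kubectl top pods --all-namespaces --sort-by=memory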

Original Post

Provided by @dza89

What happened:

We've upgraded from v20200507 to v20200618.
Since the upgrade we are experiencing higher memory usage on all our pods.

Example:
The Prometheus operator goes from 500 MB when idle to 1.4 GB.
[screenshot: Prometheus operator memory usage]

I would like some help figuring out how to debug this further.

The only notable difference is the Docker version, which went from 18.09.9ce-2.amzn2 to 19.03.6ce-4.amzn2.

What you expected to happen:

No increase in memory usage.

How to reproduce it (as minimally and precisely as possible):

Upgrade from v20200507 to v20200618.

Anything else we need to know?:

Environment:

  • AWS Region: eu-west-1
  • Instance Type(s): all
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.2
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.16
  • AMI Version: see above
@sbkg0002

Any update on this one? We have this in all our environments.

@nitrag

nitrag commented Jul 29, 2020

I'm seeing additional memory issues as well.

@naineel

naineel commented Aug 5, 2020

We are seeing this issue as well. Is the Docker version upgrade causing this?

@cshivashankar

I have also started observing memory issues.
Any updates on this?

@cshivashankar

Is anyone facing "PLEG is not healthy" or SystemOOM issues after moving to Docker daemon 19.03?
These issues were nonexistent before the Docker version upgrade.
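
A quick sketch of checks for these symptoms, assuming SSH access to the node and that kubelet runs as a systemd unit (as it does on the EKS-optimized AMI); <node-name> is a placeholder:

journalctl -u kubelet --since "1 hour ago" | grep -iE 'pleg|oom'        # PLEG / OOM messages from kubelet
kubectl get events --all-namespaces --field-selector reason=SystemOOM   # SystemOOM node events
kubectl describe node <node-name> | grep -A 10 'Conditions:'            # the Ready condition carries the "PLEG is not healthy" message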

@mmerkes
Member

mmerkes commented Aug 31, 2020

@dza89 @nitrag @naineel @cshivashankar @sbkg0002 I'm working on reproducing this. My first attempts were unsuccessful. I'm going to keep working at it, but I have a few questions:

  1. Is anybody experiencing this on 1.17 AMIs?
  2. Is anybody experiencing this on the latest 1.16 AMI? 1.16.13-20200814
  3. How many nodes are you running in your cluster?
  4. Are you seeing this issue on 100% of your pods, including non-Prometheus pods?

If you have any other information regarding your setup that you think might be useful, let me know! If I get a repro, I'll update this issue accordingly.

@nitrag

nitrag commented Aug 31, 2020

@mmerkes

Cluster A (dev cluster, moderate activity; used to be lower):
EKS: 1.15
EKS Node: v1.15.11-eks-065dce
Memory: 432 MiB
Total Pods: 60
Total Nodes: 9

Cluster B (QA cluster, low activity):
EKS: 1.15
EKS Node: v1.14.9-eks-658790
Memory: 256 MiB
Total Pods: 60
Total Nodes: 9

Cluster C (production cluster, high memory; seeing evictions in the kube-system node group because of it):
EKS: 1.15
EKS Node: v1.15.11-eks-065dce
Memory: 716 MiB
Total Pods: 170
Total Nodes: 16

Want me to upgrade the QA nodes to 1.15 to see if memory increases? The increased memory utilization is only evident on aws-cluster-autoscaler. No attempts at 1.16 or 1.17 yet.

@mmerkes
Member

mmerkes commented Aug 31, 2020

@nitrag Sure, that would be helpful. Thought I had a repro, but was just misreading things. If you can verify that the memory increases, that would be helpful.

Basically, I created two 1.16 clusters and created managed node groups with v20200507 in one and v20200618 in the other. I installed Prometheus and a random container to spin up some pods, and the memory usage looks the same across both clusters.

Also, where specifically does that memory metric come from? I want to make sure I'm using the same metric.

@cshivashankar

(Quoting @mmerkes' questions from the comment above.)

@mmerkes

  1. I am running 1.15, 1.14, and 1.13 nodes.
  2. I have not used it.
  3. 25
  4. I will check and come back.

Another observation: I am getting a lot of PLEG issues after using the AMI that has Docker 19.03. Before that, nodes never flipped to the NotReady state.

@nitrag

nitrag commented Sep 1, 2020

@mmerkes

I am SSHing into the node and running docker stats to grab memory usage.

I just upgraded/replaced the node and it's now running v1.15.11-eks-065dce. Memory usage is 301 MiB, so roughly 18% higher utilization. This is without any load on the application (cluster-autoscaler v1.15.7).
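
For anyone taking the same measurement, a one-shot (non-streaming) variant of that command looks roughly like this:

docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}'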

@mmerkes
Member

mmerkes commented Oct 1, 2020

@cshivashankar I know you've been communicating on the soft lockup issue that we suspect is related to something in the Linux kernel. I'm trying to rule in or rule out a relation between the two issues as I've seen multiple customers experience both of them around the same time. More specifically, I'm curious if you've tried upgrading the kernel via amazon-linux-extras install kernel-ng and noticed any difference in this issue.
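
For reference, a minimal sketch of that kernel upgrade on a node (kernel-ng installs the 5.4 kernel on Amazon Linux 2, and the node needs a reboot to pick it up):

sudo amazon-linux-extras install kernel-ng
sudo reboot
# after the node comes back up:
uname -r    # should now report a 5.4.x kernel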

@cshivashankar

Hi @mmerkes, thanks for reaching out. Yes, soft lockup and PLEG issues are causing a lot of trouble in my environment.
I have tried upgrading the kernel in my test environment but not in production. However, the soft lockup issues are only being observed in production, not in the test environment, and despite trying multiple ways to simulate the error in the test environment, it doesn't happen. So I cannot confirm whether upgrading the kernel made any difference. If required, I can try this in production and report back.

Another weird error I observed lately: a container was not cleaned up because of an orphaned pod, and nodes were flipping from Ready to NotReady once every 20-30 minutes. When I investigated, I found that the container related to the orphaned pod was unresponsive. What's really surprising is that the nodes were flipping every 20 minutes rather than staying constantly Ready or NotReady. This could have happened due to issues in garbage collection, or the controllers not being handled well by cgroups.
This issue might not be related, but I'm sharing my thoughts in case it helps.

@mmerkes
Member

mmerkes commented Oct 2, 2020

Kernel 4.14.198-152.320.amzn2 includes patches that we hope resolve some of the issues, and it is now available via yum update kernel. You could try updating the kernel and see if that fixes your issue.

This hasn't resolved the soft lockup issue, so I think it's unlikely to resolve this memory issue.
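
For anyone who still wants to rule it out, a sketch of the yum-based update mentioned above (run on the node):

sudo yum update kernel
sudo reboot
uname -r    # should report 4.14.198-152.320.amzn2 or later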

@mmerkes
Member

mmerkes commented Oct 13, 2020

I'm still working on this issue, but haven't been able to reproduce, so I'd like additional information if you are willing to share. Feel free to open an AWS support case to share the information. AWS Support can share information with me directly.

  1. Did you try updating the kernel to 5.4 as described in this soft lockup issue?
  • I don't know that they're related, but I'm curious whether others are seeing both issues.
  2. What kind of workloads are you running on the nodes? As much detail as possible would be helpful.
  • i.e., what pods are running on the nodes? What do they do?
  • High CPU usage? High memory? High IOPS?
  3. What's your base container image? Is it based on Ubuntu, CentOS, etc.? Do you have any specific setup?
  • The pod spec YAML would be super helpful, if part or all of it can be shared.
  4. What EC2 instance types are you noticing this on?
  5. Are you noticing this on all nodes, or just on nodes with specific pods running on them?

@omnibs

omnibs commented Oct 28, 2020

I'm not sure if it's related, but I've been migrating Haskell web APIs from vanilla EC2 to EKS and I'm seeing 3x+ the memory usage in Kubernetes.

Apps used to weigh ~60 MB and now don't drop below 180 MB (and often go above 600 MB). Both measurements are taken from Datadog's Processes view, which probably uses ps.
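
A sketch of how those two kinds of measurements can be compared on a node, assuming SSH access; <pid> is a placeholder for the app's process ID:

ps -o pid,rss,vsz,comm -p <pid>   # per-process resident set size (KiB), roughly what a ps-based Processes view reports
docker stats --no-stream          # per-container cgroup accounting, which can differ from summed process RSS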

I'm running 1.17 and using two AMIs:

  • amazon-eks-node-1.17-v20200710 with Docker 19.03.6
  • amazon-eks-node-1.17-v20200710 rebuilt to use Docker 18.09.9

Both are showing the same pattern.

I'm planning on upgrading to 1.18 and I can grab the new kernel then, to confirm whether this is still an issue.

@mmerkes
Member

mmerkes commented Oct 28, 2020

@omnibs Thanks for the information. Are your Haskell web APIs the only thing running on those nodes, and if not, are your other pods showing a spike in memory usage as well? What AMI and Docker version were you using on vanilla EC2?

Let me know if the kernel upgrade helps. Otherwise, any other details on your setup would be helpful :)

@omnibs

omnibs commented Oct 30, 2020

The Haskell apps are the only thing we've migrated so far. We have DaemonSets and the like, but we don't have a baseline for comparison for those.

We aren't using Docker on EC2, and we're using custom AMIs we built with Packer, based off ami-00a0cf500e59c9f7c (ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20200521).

I'll come back with more information once I've upgraded to 1.18 and gotten a new kernel.

@vpayala

vpayala commented Dec 11, 2020

We are seeing this issue as well after upgrading our EKS control plane and worker nodes from 1.16 to 1.17. Over a period of two weeks, memory utilization has gradually gone up to ~95% on two out of three worker nodes. There is no load on the environment, and this pattern is observed on worker nodes in every environment where we upgraded to 1.17.

AMI Name: amazon-eks-node-1.17-v20201112

Worker node version,

$ kubectl get nodes
NAME                                           STATUS   ROLES    AGE   VERSION
ip-x1-x-x-x.us-west-2.compute.internal    Ready    <none>   21d   v1.17.12-eks-7684af
ip-x2-x-x-x.us-west-2.compute.internal   Ready    <none>   21d   v1.17.12-eks-7684af
ip-x3-x-x-x.us-west-2.compute.internal     Ready    <none>   21d   v1.17.12-eks-7684af

Current worker node memory usage,

kubectl top node --sort-by='memory'
NAME                                           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-x1-x-x-x.us-west-2.compute.internal   184m         4%     14376Mi         97%
ip-x2-x-x-x.us-west-2.compute.internal    350m         8%     13854Mi         93%
ip-x3-x-x-x.us-west-2.compute.internal     182m         4%     10911Mi         73%

Worker node memory usage since upgrading to 1.17,
[screenshot: worker node memory usage trend since the 1.17 upgrade]
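
One way to see where the node memory is actually going (pod cgroups, kernel slab, or page cache) is to break it down on the node itself. A rough sketch, assuming SSH access and cgroup v1 with the cgroupfs driver; the kubepods path may differ (e.g. kubepods.slice) under the systemd cgroup driver:

free -m                                                     # overall used / free / buff-cache
grep -E 'MemAvailable|Slab|SReclaimable' /proc/meminfo      # available memory and kernel slab
sudo slabtop -o | head -20                                  # largest slab caches
cat /sys/fs/cgroup/memory/kubepods/memory.usage_in_bytes    # total memory charged to pod cgroups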

@ghost

ghost commented Jan 4, 2021

We've recently upgraded from 1.14 to 1.16. We didn't spend much time running 1.15, so I can't say much about it, but we're definitely seeing the same issue here with v1.16.15-eks-ad4801.

Has anyone here tried upgrading to 1.18?

@ghost

ghost commented Jan 11, 2021

Hey all, in my case, it looks like upgrading to 1.18 has taken care of the memory issue. I'm using v1.18.9-eks-d1db3c

[screenshot: node memory usage before and after the upgrade]
There's a clear difference in memory usage before and after the upgrade.

@seddarj

seddarj commented Jan 20, 2021

Update: We confirm that upgrading to AMI amazon-eks-node-1.18-v20210112 has solved the issue for us 🎉

Hello, we've also been having these issues on amazon-eks-node-1.18-v20201112 and are currently upgrading to amazon-eks-node-1.18-v20210112 (both seem to be using v1.18.9-eks-d1db3c). We'll report back if it solves the issue for us.

On our end we observed a constant increase in node RAM usage (in orange below) while the sum of RAM usage for all pods (in cyan) remained somewhat stable.
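
For reference, a rough sketch of that comparison using metrics-server data (this assumes kubectl top reports pod memory in Mi, and it sums across all pods in the cluster; <node-name> is a placeholder):

# node-level usage as reported by metrics-server:
kubectl top node <node-name>
# sum of pod-level usage (column 4 is MEMORY in Mi with --all-namespaces):
kubectl top pods --all-namespaces --no-headers | awk '{gsub("Mi","",$4); sum+=$4} END {print sum " Mi"}'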

[screenshot: node RAM usage (orange) vs. sum of pod RAM usage (cyan)]
