
High memory consumption with v1.25.2 #3443

Open

smartaquarius10 opened this issue Jan 31, 2023 · 151 comments


@smartaquarius10

smartaquarius10 commented Jan 31, 2023

Team,

Since the day I updated AKS to v1.25.2, I have been seeing huge spikes and node memory pressure issues.

Pods are being evicted and the nodes are constantly at 135 to 140% memory usage. While I was on 1.24.9, everything worked fine.

Just now, I saw that portal.azure.com has removed v1.25.2 from the Create new --> Azure Kubernetes cluster section. Does this version of AKS have a known problem? Should we switch to v1.25.4 immediately to resolve the memory issue?

I have also observed that AKS 1.24.x used Ubuntu 18.04 while AKS 1.25.x uses Ubuntu 22.04. Is this the reason behind the high memory consumption?

Kindly suggest.

Regards,
Tanul


My AKS configuration: 8 nodes of Standard B2s size, as it is a non-prod environment.
Pod structure: below are the pods running inside the cluster and their memory consumption, excluding the default Microsoft pods (which take 4705 Mi of memory in total):

  • Daemon set of AAD pod identity: 191 Mi of memory in total
  • 2 pods of Kong: 914 Mi of memory in total
  • Daemon set of Twistlock vulnerability scanner: 1276 Mi of memory in total
  • 10 pods of our .NET microservices: 820 Mi of memory in total
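For anyone who wants to pull similar per-pod numbers on their own cluster, a quick check (assuming metrics-server is running, as it is by default on AKS) is:

kubectl top pods --all-namespaces --sort-by=memory   # per-pod memory, highest first
kubectl top nodes                                    # overall node memory pressure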
@xuanra

xuanra commented Feb 1, 2023

Hello,
We have the same problem with version 1.25.4 in our company AKS.

We are trying to upgrade an app to OpenJDK 17 to check whether this new LTS Java version mitigates the problem.

Edit: In our case, the .NET apps needed to change the NuGet package for Application Insights.

Regards,

@smartaquarius10
Author

@xuanra , My major pain point is these 2 pods (out of the 9 of them):

  • ama-logs
  • ama-logs-rs
    They always take more than 400 Mi of memory. It is very difficult to accommodate them on B2s nodes.

My other pain point is these 16 pods (8 of each):

  • csi-azuredisk-node
  • csi-azurefile-node

They take 910 Mi of memory. I even raised a support ticket, but customer support was unable to figure out whether we are using them or not, and could not advise when or why we should keep them.

Still looking for a better solution to handle the non-prod environment...

@lsavini-orienteed

Hello,
we are facing the same problem of memory spikes after moving from v1.23.5 to v1.25.4.
We had to increase the memory limits of most of our containers.

@smartaquarius10
Author

smartaquarius10 commented Feb 2, 2023

@miwithro @ritazh @Karishma-Tiwari-MSFT @CocoWang-wql @jackfrancis @mainred

Hello,

Extremely sorry for tagging you, but our whole non-prod environment is not working. We haven't upgraded our prod environment yet; however, engineers are unable to work on their applications.

A few days back, we approached customer support about the node performance issues but did not get a good response.

Would be really grateful for help and support on this as it seems to be a global problem.

@smartaquarius10
Author

smartaquarius10 commented Feb 2, 2023

I need to share one finding. I have just created 2 different AKS clusters, with v1.24.9 and v1.25.4, each with 1 node of Standard B2s.

These are the metrics. In the case of v1.25.4 there is a huge spike after enabling monitoring.

image
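For reference, a minimal sketch of how such single-node comparison clusters can be created (the resource group and cluster names here are placeholders):

az aks create -g <resource-group> -n aks-test-124 --kubernetes-version 1.24.9 --node-count 1 --node-vm-size Standard_B2s --enable-addons monitoring
az aks create -g <resource-group> -n aks-test-125 --kubernetes-version 1.25.4 --node-count 1 --node-vm-size Standard_B2s --enable-addons monitoring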

@cedricfortin

We've got the same problem with memory after upgrading AKS from version 1.24.6 to 1.25.4.

In the memory monitoring for the last month of one of our deployments, we can clearly see the memory usage increase after the update (01/23):
image

@xuanra

xuanra commented Feb 3, 2023

Hello,
Our cluster has D4s_v3 machines.
We still haven't found any pattern distinguishing the apps whose memory demand increased from the apps that didn't, across all our Java and .NET pods.
One alternative to upgrading Java from 8 to 17, suggested by one of our providers, is to upgrade our VM size from D4s_v3 to D4s_v5, and we are studying the impact of this change.

Regards,

@smartaquarius10
Author

smartaquarius10 commented Feb 6, 2023

@xuanra , I think in that case B2s machines are totally out of the picture for this upgrade. The most they are capable of supporting is AKS 1.24.x.

@ganga1980

@xuanra , My major pain point is these 2 pods (out of the 9 of them):

  • ama-logs
  • ama-logs-rs
    They always take more than 400 Mi of memory. It is very difficult to accommodate them on B2s nodes.

My other pain point is these 16 pods (8 of each):

  • csi-azuredisk-node
  • csi-azurefile-node

They take 910 Mi of memory. I even raised a support ticket, but customer support was unable to figure out whether we are using them or not, and could not advise when or why we should keep them.

Still looking for a better solution to handle the non-prod environment...

Hi, @smartaquarius10 , thanks for the feedback. We have work planned to reduce the ama-logs agent memory footprint, and we will share the exact timelines and additional details of the improvements in early March. cc: @pfrcks

@smartaquarius10
Author

smartaquarius10 commented Feb 13, 2023

@ganga1980 @pfrcks

Thank you so much, Ganga. We are heavily impacted by this. Up to AKS 1.24.x we were running 3 environments within our cluster, but after upgrading to 1.25.x we are unable to manage even 1 environment.

Each environment has 11 pods.

Would be grateful for your support on this. I have already disabled the CSI pods, as we are not using any storage. For now, should we disable the ama monitoring pods as well?

If yes, then once your team resolves these issues, should we upgrade our AKS again to a specific version, or will Microsoft resolve it from the backend for every version of the AKS infrastructure?

Thank you

Kind Regards,
Tanul

@smartaquarius10
Author

smartaquarius10 commented Feb 24, 2023

Hello @ganga1980 @pfrcks ,

Hope you are doing well. By any chance, is it possible to speed up the process a little? Our 2 environments (22 microservices in total) are down because of this.

Appreciate your help and support in this matter. Thank you. Have a great day.

Hello @xuanra @cedricfortin @lsavini-orienteed,
Did you find any workaround for this? Thanks :)

Kind Regards,
Tanul

@gonpinho

Hi @smartaquarius10, we updated the k8s version of our AKS to 1.25.5 this week and started suffering from the same issue.

In our case, we identified a problem with the JRE version when dealing with cgroups v2. Here I share my findings:

Kubernetes cgroups v2 support reached GA in 1.25.x, and with this change AKS moved the node OS from Ubuntu 18.04 to Ubuntu 22.04, which uses cgroups v2 by default.

The problem with our containerized apps was related to JRE 11.0.14: that JRE did not have support for cgroups v2 container awareness, which means the containers were not able to respect the memory quotas defined in the deployment descriptor.

Oracle and OpenJDK addressed this by supporting cgroups v2 natively in JRE 17 and backporting the fix to JRE 15 and JRE 11.0.16+.

I've updated the base image to use a fixed JRE version (11.0.18) and the memory exhaustion was solved.

Regarding the AMA pods, I've compared the pods running on k8s 1.25.x with those running on 1.24.x, and in my opinion they seem stable, as the memory footprint is practically the same.

Hope this helps!
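For anyone who wants to verify this on their own nodes and images, a rough check (a sketch, assuming shell access to a node or a debug container and a JDK inside the app image) looks like this:

stat -fc %T /sys/fs/cgroup/            # on the node: cgroup2fs means cgroups v2, tmpfs means cgroups v1
java -XshowSettings:system -version    # in the Java container: prints the detected container memory limit
java -Xlog:os+container=info -version  # JDK 11+: logs how the JVM detects the container limits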

@smartaquarius10
Author

smartaquarius10 commented Feb 24, 2023

@gonpinho , Thanks a lot for sharing the details. But the problem is that our containerized apps are not taking extra memory; they still occupy the same amount as they did on 1.24.x.

What I realized is that when I create fresh 1.24.x and 1.25.x clusters, the default memory occupancy is approx. 30% higher on 1.25.x.

One of my environments takes only 1 GB of memory across 11 pods. With AKS 1.24.x I was running 3 environments in total; the moment I moved to 1.25.x, I had to disable 2 environments, along with the Microsoft CSI add-ons, just to accommodate the 11 custom pods, because node memory consumption is already so high.

@smartaquarius10
Author

@gonpinho , If I could downgrade the OS back to Ubuntu 18.04, that would be my first preference. I know the Ubuntu OS upgrade is what is killing the machines; no idea how to handle this.

@pintmil

pintmil commented Mar 2, 2023

Hi, we are facing the same problem after upgrading our dev AKS cluster from 1.23.12 to 1.25.5. Our company develops C/C++ and C# services, so we don't suffer from the JRE cgroup v2 issues. We see that memory usage increases over time, even though nothing but the kube-system pods is running on the cluster.
The symptom is that kubectl top no shows much more memory consumption than free does on the host OS (Ubuntu 22.04). If we force the host OS to drop cached memory with sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches', the used memory doesn't change, but some of the buff/cache memory moves to free, and after that kubectl top no shows a memory usage drop on that node.
We came to the conclusion that k8s counts buff/cache memory as used memory, which is misleading, because Linux will use free memory to buffer I/O and other things, and that is completely normal operation.

kubectl top no before cache drop:
Screenshot_20230302_104524

free before / after cache drop:
Screenshot_20230302_104702

kubectl top no after cache drop:
Screenshot_20230302_104737
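A way to sanity-check this on a node (a sketch, assuming cgroups v2 and the systemd cgroup driver, so paths may differ on other setups) is to compare the kubepods cgroup's total usage with its inactive page cache, since the working-set figure reported by kubectl top is roughly usage minus inactive_file:

cat /sys/fs/cgroup/kubepods.slice/memory.current              # total memory charged to workloads, including page cache
grep inactive_file /sys/fs/cgroup/kubepods.slice/memory.stat  # reclaimable file cache that gets subtracted to compute the working set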

@shiva-appani-hash

Team, we are seeing the same behaviour after upgrading the cluster from 1.23.12 to 1.25.5. All the microservices running in the cluster are .NET 3.1. On raising a support request, we got to know that the cgroup version has been changed to v2; does anyone have a similar scenario?
How do we identify whether cgroup v1 is being used with .NET 3.1, and can it be the cause of the high memory consumption?

@smartaquarius10
Author

Hello @ganga1980, any update on this please? Thank you

@ganga1980

Hello @ganga1980, any update on this please? Thank you
@smartaquarius10 , We are working on rolling out our March agent release, which should bring down the memory usage of the ama-logs daemonset (Linux) by 80 to 100 MB. I don't have your cluster name or cluster resource ID to investigate, and we can't repro the issue you have reported. Please create a support ticket with the clusterResourceId details so that we can investigate.
As a workaround, you can try applying the default configmap with kubectl apply -f https://raw.githubusercontent.com/microsoft/Docker-Provider/ci_prod/kubernetes/container-azm-ms-agentconfig.yaml
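A rough sketch of that workaround, with the kind of settings in that configmap that are commonly tightened to reduce agent load (the exclude_namespaces value below is only an example):

curl -LO https://raw.githubusercontent.com/microsoft/Docker-Provider/ci_prod/kubernetes/container-azm-ms-agentconfig.yaml
# edit the [log_collection_settings.stdout] / [log_collection_settings.stderr] sections,
# e.g. exclude_namespaces = ["kube-system"], before applying
kubectl apply -f container-azm-ms-agentconfig.yaml   # created in kube-system; ama-logs picks up the change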

@smartaquarius10
Author

smartaquarius10 commented Mar 7, 2023

@ganga1980 , Thank you for the reply. Just a quick question: after raising the support ticket, should I send you an email at your Microsoft ID with the support ticket details? Otherwise it will be assigned to L1 support, which will take a lot of time to reach a resolution.

Or, if you allow, I can send you my cluster details on MS Teams.

Whichever way you prefer 😃

Currently, the ama pods are taking approx. 326 Mi of memory per node.

@smartaquarius10
Author

smartaquarius10 commented Mar 7, 2023

@ganga1980, We already have this configmap.

@andyzhangx
Contributor

@ganga1980 Regarding the CSI driver resource usage: if you don't need the CSI drivers, you can disable them by following https://learn.microsoft.com/en-us/azure/aks/csi-storage-drivers#disable-csi-storage-drivers-on-a-new-or-existing-cluster
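Per that doc, a sketch of the disable command (check the linked page for the current flag names):

az aks update -n <cluster-name> -g <resource-group> --disable-disk-driver --disable-file-driver --disable-snapshot-controller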

@Marchelune
Copy link

Hi! It seems we are facing the same issue on 1.25.5. We upgraded a few weeks ago (24.02) and the memory usage (container working set memory) jumped from the moment of the upgrade, according to the metrics tab:
Screenshot 2023-03-09 at 18 06 54 copy

We are using Standard_B2s VMs, as this is an internal development cluster; CSI drivers are not enabled.
Has the issue been identified, or is it still under investigation?

@codigoespagueti

Same issue here after upgrading to 1.25.5.
We are using FS2_v2 machines and we were not able to get the working set memory below 100%, no matter how many nodes we added to the cluster.

Very disappointing that all the memory on the node is used and reserved by Azure pods.

We had to disable Azure Insights in the cluster.

image

@ghost

ghost commented Mar 10, 2023

@vishiy, @saaror would you be able to assist?

Issue Details

Author: smartaquarius10
Assignees: -
Labels: bug, azure/oms, addon/container-insights
Milestone: -

Contributor

Issue needing attention of @Azure/aks-leads

1 similar comment
Contributor

Issue needing attention of @Azure/aks-leads

@Oceanswave

We're seeing this as well: the ama-metrics and ama-logs pods are hitting their AKS-configured memory limits, getting terminated, and restarting.

We've got 4,800+ entries of ama-metrics-operator-* terminations in the past week. Any advice or recommendations here would be useful.

Contributor

Issue needing attention of @Azure/aks-leads

@smartaquarius10
Author

Closing this issue.

@Marchelune

I'm sorry, I may have missed a development of this issue, but is the high memory consumption reporting problem resolved now?

@mhkolk

mhkolk commented Apr 23, 2024

I'm sorry, I may have missed a development of this issue, but is the high memory consumption reporting problem resolved now?

The problem is not resolved; we are seeing this issue, high memory consumption from kube-system ama-metrics on the nodes, as we speak, even after disabling metrics the way @marekr described.

image

@marcindulak

Please reopen @tanulbh - people piggybacked on your issue report.

@gyorireka

@smartaquarius10 could you please update us?

@smartaquarius10
Author

@marcindulak @gyorireka Sure, reopening the issue.

@deyanp

deyanp commented Jun 5, 2024

I am seeing this on a 3-node AKS cluster:

NAME                                            CPU(cores)   MEMORY(bytes)   
ama-logs-4vmcz                                  4m           185Mi           
ama-logs-9f4r9                                  3m           199Mi           
ama-logs-jc7cr                                  3m           198Mi           
ama-logs-rs-794b9b5b76-k5nr7                    7m           250Mi           
ama-metrics-5bf4d7dcc8-sg6cq                    14m          215Mi           
ama-metrics-ksm-d9c6f475b-bf94k                 2m           40Mi            
ama-metrics-node-kcph9                          9m           269Mi           
ama-metrics-node-r6c4v                          12m          212Mi           
ama-metrics-node-s8j8l                          12m          204Mi           
ama-metrics-operator-targets-7c4bf58f46-7c64j   1m           38Mi   

and 200-300 Mi multiplied across all these pods is too much overall just for pushing logs and metrics out ...

Contributor

Issue needing attention of @Azure/aks-leads

@smartaquarius10
Author

I think we can no longer use B-series machines with AKS.

Contributor

Issue needing attention of @Azure/aks-leads

@EvertonSA

In case it helps, we are on 1.30.0 running mostly CBL-Mariner nodes.

This is a dev cluster; not much is happening here, as we use another log solution (Grafana Loki).

The image used seems to be mcr.microsoft.com/azuremonitor/containerinsights/ciprod:3.1.22

OpenTelemetry is not enabled.

image

Contributor

Issue needing attention of @Azure/aks-leads

@monotek

monotek commented Aug 12, 2024

Seems back to normal with the 1.29 update.

Contributor

Issue needing attention of @Azure/aks-leads

Contributor

@ganga1980, @saaror would you be able to assist?

@brgrz

brgrz commented Sep 25, 2024

In case it helps, we are on 1.30.0 running mostly CBL-Mariner nodes.

This is a dev cluster; not much is happening here, as we use another log solution (Grafana Loki).

The image used seems to be mcr.microsoft.com/azuremonitor/containerinsights/ciprod:3.1.22

OpenTelemetry is not enabled.

image

I can 100% confirm these numbers and they are insane: dozens of AMA pods consuming literally GBs of memory (on 8 and 16 GB node VMs), because of which we've had constant memory pressure on our nodes.

No solution from MS, not even from our dedicated support, so I had enough and did this:

Disable Managed Prometheus:

az aks update --disable-azure-monitor-metrics -n <cluster-name> -g <resource-group>

Disable Container Insights:

az aks disable-addons -a monitoring -n <cluster-name> -g <resource-group>

Memory consumption on all nodes went down by about 20%, and I just cut our Azure Log Analytics costs by a couple of hundred euros per month. We'll deploy standalone Grafana and Loki instead of the managed solution.

@deyanp

deyanp commented Sep 26, 2024

The ama-logs pods are clearly too memory- and CPU-hungry for absolutely no reason (these are low-volume clusters). Probably written in .NET, and probably not well written. Compare this with other infrastructure pods written in Go that consume single-digit CPU and double-digit memory ... saying this as a .NET developer ....
