Agent and beats store too much ReplicaSet data in K8s #5623
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
@swiatekm You wrote here
So you had at least 6500 pods running in a 3-node cluster? 110 pods per node is the Kubernetes limit.
No, I had 6500 ReplicaSets, all of which were scaled down to 0. I created them by first creating 1000 Deployments with 0 replicas each, and then changing the container image in each of them a bunch of times, spawning a new ReplicaSet each time.
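For reference, a minimal client-go sketch of how a test environment like this could be reproduced programmatically; the function name, label scheme, and images are hypothetical, not taken from the actual test setup. Each image change on a 0-replica Deployment leaves another empty ReplicaSet behind, up to the Deployment's revision history limit (10 by default).

```go
package example

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// seedReplicaSets creates n Deployments with 0 replicas and then updates the
// container image `revisions` times on each; every update leaves behind an
// additional, scaled-down ReplicaSet.
func seedReplicaSets(ctx context.Context, client kubernetes.Interface, namespace string, n, revisions int) error {
	zero := int32(0)
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("load-test-%d", i)
		labels := map[string]string{"app": name}
		dep := &appsv1.Deployment{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
			Spec: appsv1.DeploymentSpec{
				Replicas: &zero,
				Selector: &metav1.LabelSelector{MatchLabels: labels},
				Template: corev1.PodTemplateSpec{
					ObjectMeta: metav1.ObjectMeta{Labels: labels},
					Spec: corev1.PodSpec{
						Containers: []corev1.Container{{Name: "app", Image: "nginx:1.25.0"}},
					},
				},
			},
		}
		created, err := client.AppsV1().Deployments(namespace).Create(ctx, dep, metav1.CreateOptions{})
		if err != nil {
			return err
		}
		// Each image change produces a new ReplicaSet revision.
		for r := 1; r <= revisions; r++ {
			created.Spec.Template.Spec.Containers[0].Image = fmt.Sprintf("nginx:1.25.%d", r)
			created, err = client.AppsV1().Deployments(namespace).Update(ctx, created, metav1.UpdateOptions{})
			if err != nil {
				return err
			}
		}
	}
	return nil
}
```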
From my personal experience, what you say @MichaelKatsoulis is indeed true, there is such a limit. But this means that the pod will be in Pending state.
Yes, you are right. The pods will either be Pending or not exist at all if the replicas are set to 0.
In the past, the enrichment of a pod with the Deployment name was done differently: for each pod, there was a direct API call to fetch that pod's ReplicaSet, extract the Deployment name, and append it to the pod. We then realized that direct API calls were not the most efficient approach and switched to a ReplicaSet watcher, so that all ReplicaSet data is already in memory and a newly appearing pod can get its Deployment name from there. The problem, as you also mention, is clusters with very many Deployments/ReplicaSets. Still, the watcher approach is better than direct API calls: in normal scenarios the Deployments do create pods, and thousands of pods would lead to thousands of requests. The watcher/informer mechanism is clearly better.
We should keep in mind that the PRs you have created affect the kubernetes provider and the add_kubernetes_metadata processor. For kubernetes metrics collection, there is one watcher created per resource type, so if the state_replicaset datastream is enabled (it is by default) it starts a ReplicaSet watcher that collects everything. That is intentional, as it adds all ReplicaSet metadata to the events, unlike the ReplicaSet watcher from the kubernetes provider and the processor, which is started just for the Deployment name. So in the end, there could be as many as 3 ReplicaSet watchers that get started: one from the kubernetes provider, one from the add_kubernetes_metadata processor, and one from the metrics collection for the state_replicaset datastream.
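To make that concrete, here is a hedged Go sketch (not the actual beats code; the function and the in-memory map are hypothetical) of the lookup a ReplicaSet watcher enables: the Deployment name is resolved purely from the cached ReplicaSet's name and owner references, which is why only that slice of the ReplicaSet object is really needed.

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentNameForPod resolves the Deployment owning a Pod by following the
// Pod -> ReplicaSet -> Deployment owner references. The replicaSets map stands
// in for the watcher's in-memory store and is keyed by "namespace/name".
func deploymentNameForPod(pod *corev1.Pod, replicaSets map[string]*appsv1.ReplicaSet) string {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind != "ReplicaSet" {
			continue
		}
		rs, ok := replicaSets[pod.Namespace+"/"+ref.Name]
		if !ok {
			return "" // ReplicaSet not (yet) in the local cache
		}
		for _, rsRef := range rs.OwnerReferences {
			if rsRef.Kind == "Deployment" {
				return rsRef.Name
			}
		}
	}
	return ""
}
```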
Additionally, the approach of selecting what to keep from the watcher's data could also be used for CronJobs. In environments with too many CronJobs, the pods they generate are enriched with the CronJob name by starting a Job watcher, which gets started just to get the CronJob name. We have also set
Thanks for the explanation @MichaelKatsoulis, what you wrote is also consistent with what I've learned digging through the beats codebase over the past two weeks. And I agree that we can use the same approach for Jobs. I'm doing it for ReplicaSets because this is an issue that currently affects us internally, and I didn't want to complicate things by making another change at the same time.
I think it's fine to limit this watcher to metadata only. It can't use the default transform function I added, as it also needs labels and annotations, but it doesn't need data from the ReplicaSet spec - that is already present in the metric samples collected from kube-state-metrics.
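As a rough illustration of the technique being discussed, here is a hedged client-go sketch, not the actual elastic-agent-autodiscover code: a transform applied to the shared informer drops the ReplicaSet spec and status before objects are cached, while keeping labels, annotations, and owner references.

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

// stripReplicaSet keeps only the ReplicaSet metadata needed for enrichment and
// for connecting Pods to Deployments, discarding the spec and status.
var stripReplicaSet cache.TransformFunc = func(obj interface{}) (interface{}, error) {
	rs, ok := obj.(*appsv1.ReplicaSet)
	if !ok {
		// Leave tombstones and unexpected types untouched.
		return obj, nil
	}
	return &appsv1.ReplicaSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:            rs.Name,
			Namespace:       rs.Namespace,
			UID:             rs.UID,
			ResourceVersion: rs.ResourceVersion,
			Labels:          rs.Labels,
			Annotations:     rs.Annotations,
			OwnerReferences: rs.OwnerReferences,
		},
		// Spec and Status are intentionally dropped; for metrics collection
		// that data already arrives via kube-state-metrics samples.
	}, nil
}
```

A watcher would typically register something like this via the informer's SetTransform hook before it starts, so the trimmed objects are what end up in the in-memory cache.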
FYI I opened this issue for the Jobs watcher.
This is now fixed in both agent and beats, in every maintained 8.x branch. Closing. |
When Deployment metadata is enabled, either in the agent K8s provider or in beats processors, agent and beats keep a local cache of ReplicaSet data. The only part of this cache they actually need is the ReplicaSet name and owner references, so they can connect the Pod to the Deployment.
In small to medium clusters this doesn't make that big a difference. In large clusters, however, you can have quite a few ReplicaSets - up to 10 for each Deployment, by default. This is compounded by the fact that we keep multiple copies of this data:
I strongly suspect this is the primary root cause of the issue reported by our SRE Team in #4729, where elastic-agent is approaching 5 Gi of memory usage in a cluster with ~75k ReplicaSets.
This issue was split off from #4729 to avoid confusing it with an unrelated issue where agent uses too much memory on Pod data.
Data
Production
Below is a heap profile of agent running in the aforementioned cluster, provided by @henrikno in #4729 (comment):
If you also look at the linked output of `ps` in the container, you can see elastic-agent and the metricbeat collecting k8s metrics using a lot more memory than all the other processes.
Test cluster
I created a test environment in a local kind cluster on 3 Nodes, where I manually created 6500 ReplicaSets. I then tested the same elastic-agent workload, using the default standalone manifests, with `deployment` metadata additionally enabled in the kubernetes provider. I also set `GOGC` to 25 to make it easier to see the difference in actual memory usage. Finally, I built an elastic-agent image with all the fixes applied locally. The following shows the difference in memory usage, as measured by the `system.process.memory.size` metric:
Fix
Since the fix will require changes in at least 3 different components across two repositories, most of it should happen in https://github.com/elastic/elastic-agent-autodiscover. It'll require three major changes:
Subsequently, we'll need to use these new components in both agent and beats. A PoC of what this might look like in agent, with all the autodiscovery customizations, can be found here: #5580.