Agent and beats store too much ReplicaSet data in K8s #5623
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
@swiatekm You wrote here
So you had at least 6500 pods running in a 3-node cluster? 110 pods per node is the Kubernetes limit.
No, I had 6500 ReplicaSets, all of which were scaled down to 0. I created them by first creating 1000 Deployments with 0 replicas each, and then changing the container image in each of them a bunch of times, spawning a new ReplicaSet each time.
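For reference, a minimal client-go sketch of how a test environment like this could be reproduced programmatically; the function name, label scheme, and images are hypothetical, not taken from the actual test setup. Each image change on a 0-replica Deployment leaves another empty ReplicaSet behind, up to the Deployment's revision history limit (10 by default).

```go
package example

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// seedReplicaSets creates n Deployments with 0 replicas and then updates the
// container image `revisions` times on each; every update leaves behind an
// additional, scaled-down ReplicaSet.
func seedReplicaSets(ctx context.Context, client kubernetes.Interface, namespace string, n, revisions int) error {
	zero := int32(0)
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("load-test-%d", i)
		labels := map[string]string{"app": name}
		dep := &appsv1.Deployment{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
			Spec: appsv1.DeploymentSpec{
				Replicas: &zero,
				Selector: &metav1.LabelSelector{MatchLabels: labels},
				Template: corev1.PodTemplateSpec{
					ObjectMeta: metav1.ObjectMeta{Labels: labels},
					Spec: corev1.PodSpec{
						Containers: []corev1.Container{{Name: "app", Image: "nginx:1.25.0"}},
					},
				},
			},
		}
		created, err := client.AppsV1().Deployments(namespace).Create(ctx, dep, metav1.CreateOptions{})
		if err != nil {
			return err
		}
		// Each image change produces a new ReplicaSet revision.
		for r := 1; r <= revisions; r++ {
			created.Spec.Template.Spec.Containers[0].Image = fmt.Sprintf("nginx:1.25.%d", r)
			created, err = client.AppsV1().Deployments(namespace).Update(ctx, created, metav1.UpdateOptions{})
			if err != nil {
				return err
			}
		}
	}
	return nil
}
```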
From my personal experience, what you say @MichaelKatsoulis is indeed true, there is such a limit. But this means that the pod will be in Pending state.
Yes, you are right. The pods will either be Pending or not exist at all if the replicas are set to 0.
In the past, the enrichment of a pod with the Deployment name was done differently: for each pod, there was a direct API call to fetch that pod's ReplicaSet, extract the Deployment name, and append it to the pod. We then realized that direct API calls were not the most efficient approach and switched to a ReplicaSet watcher, so that all ReplicaSet data is already in memory and a newly appearing pod can get its Deployment name from there. The problem, as you also mention, is clusters with very many Deployments/ReplicaSets. Still, the watcher approach is better than direct API calls: in normal scenarios the Deployments do create pods, and thousands of pods would lead to thousands of requests. The watcher/informer mechanism is clearly better.
We should keep in mind that the PRs you have created affect the kubernetes provider and the add_kubernetes_metadata processor. For kubernetes metrics collection, there is one watcher created per resource type, so if the state_replicaset datastream is enabled (it is by default) it starts a ReplicaSet watcher that collects everything. That is intentional, as it adds all ReplicaSet metadata to the events, unlike the ReplicaSet watcher from the kubernetes provider and the processor, which is started just for the Deployment name. So in the end, there could be as many as 3 ReplicaSet watchers that get started: one from the kubernetes provider, one from the add_kubernetes_metadata processor, and one from the metrics collection for the state_replicaset datastream.
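To make that concrete, here is a hedged Go sketch (not the actual beats code; the function and the in-memory map are hypothetical) of the lookup a ReplicaSet watcher enables: the Deployment name is resolved purely from the cached ReplicaSet's name and owner references, which is why only that slice of the ReplicaSet object is really needed.

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentNameForPod resolves the Deployment owning a Pod by following the
// Pod -> ReplicaSet -> Deployment owner references. The replicaSets map stands
// in for the watcher's in-memory store and is keyed by "namespace/name".
func deploymentNameForPod(pod *corev1.Pod, replicaSets map[string]*appsv1.ReplicaSet) string {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind != "ReplicaSet" {
			continue
		}
		rs, ok := replicaSets[pod.Namespace+"/"+ref.Name]
		if !ok {
			return "" // ReplicaSet not (yet) in the local cache
		}
		for _, rsRef := range rs.OwnerReferences {
			if rsRef.Kind == "Deployment" {
				return rsRef.Name
			}
		}
	}
	return ""
}
```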
Additionally, the approach of selecting what to keep from the watcher's data could also be used for CronJobs. In environments with too many CronJobs, the pods they generate are enriched with the CronJob name by starting a Job watcher, which gets started just to get the CronJob name. We have also set
Thanks for the explanation @MichaelKatsoulis, what you wrote is also consistent with what I've learned digging through the beats codebase over the past two weeks. And I agree that we can use the same approach for Jobs. I'm doing it for ReplicaSets because this is an issue that currently affects us internally, and I didn't want to complicate things by making another change at the same time.
I think it's fine to limit this watcher to metadata only. It can't use the default transform function I added, as it also needs labels and annotations, but it doesn't need data from the ReplicaSet spec - that is already present in the metric samples collected from kube-state-metrics.
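As a rough illustration of the technique being discussed, here is a hedged client-go sketch, not the actual elastic-agent-autodiscover code: a transform applied to the shared informer drops the ReplicaSet spec and status before objects are cached, while keeping labels, annotations, and owner references.

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

// stripReplicaSet keeps only the ReplicaSet metadata needed for enrichment and
// for connecting Pods to Deployments, discarding the spec and status.
var stripReplicaSet cache.TransformFunc = func(obj interface{}) (interface{}, error) {
	rs, ok := obj.(*appsv1.ReplicaSet)
	if !ok {
		// Leave tombstones and unexpected types untouched.
		return obj, nil
	}
	return &appsv1.ReplicaSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:            rs.Name,
			Namespace:       rs.Namespace,
			UID:             rs.UID,
			ResourceVersion: rs.ResourceVersion,
			Labels:          rs.Labels,
			Annotations:     rs.Annotations,
			OwnerReferences: rs.OwnerReferences,
		},
		// Spec and Status are intentionally dropped; for metrics collection
		// that data already arrives via kube-state-metrics samples.
	}, nil
}
```

A watcher would typically register something like this via the informer's SetTransform hook before it starts, so the trimmed objects are what end up in the in-memory cache.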
FYI I opened this issue for the Jobs watcher.
This is now fixed in both agent and beats, in every maintained 8.x branch. Closing. |
When Deployment metadata is enabled, either in the agent K8s provider or in beats processors, agent and beats keep a local cache of ReplicaSet data. The only part of this cache they actually need is the ReplicaSet name and owner references, so they can connect the Pod to the Deployment.
In small to medium clusters this doesn't make that big a difference. In large clusters, however, you can have quite a few ReplicaSets - up to 10 for each Deployment, by default. This is compounded by the fact that we keep multiple copies of this data:
I strongly suspect this is the primary root cause of the issue reported by our SRE Team in #4729, where elastic-agent is approaching 5 Gi of memory usage in a cluster with ~75k ReplicaSets.
This issue was split off from #4729 to avoid confusing it with an unrelated issue where agent uses too much memory on Pod data.
Data
Production
Below is a heap profile of agent running in the aforementioned cluster, provided by @henrikno in #4729 (comment):
If you also look at the linked output of `ps` in the container, you can see elastic-agent and the metricbeat collecting k8s metrics using a lot more memory than all the other processes.
Test cluster
I created a test environment in a local kind cluster on 3 Nodes, where I manually created 6500 ReplicaSets. I then tested the same elastic-agent workload, using the default standalone manifests, with `deployment` metadata additionally enabled in the kubernetes provider. I also set `GOGC` to 25 to make it easier to see the difference in actual memory usage. Finally, I built an elastic-agent image with all the fixes applied locally. The following shows the difference in memory usage, as measured by the `system.process.memory.size` metric:
Fix
Since the fix will require changes in at least 3 different components across two repositories, most of it should happen in https://github.com/elastic/elastic-agent-autodiscover. It'll require three major changes:
Subsequently, we'll need to use these new components in both agent and beats. A PoC of what this might look like in agent, with all the autodiscovery customizations, can be found here: #5580.