CPU increase and memory leak #19125
Comments
This does look like a bug. Are you able to upgrade to version 0.34.0? There were a couple of memory leaks that were handled in recent versions that may resolve your issue. Also, do you have any metrics that could have unbounded cardinality? This could cause memory growth through the prometheus_exporter sink.
@bruceg thanks.
Hmm, I keyed on the Prometheus component in the config above and misread it as a source (I'm so used to reading sources first). I see you don't have any metric sources other than internal_metrics.
Thanks, yes, an upgrade is certainly an option; we'll try it.
Hey guys, we have the same issue. By the way, I found very strange behavior with the internal_metrics source. Please take a look:
Seems the issue with
The metrics are:
We do have an F5 load balancer in front of each Vector instance, so in these metrics we will only have a set of the F5's IPs and a bunch of ports.
Hello @tomer-epstein, can you try
Hello @pront |
Seeing this issue in version 0.28, will try to upgrade. Will disabling internal metrics solve the issue?
Hi, in addition to @tomer-epstein's comment, we also found that the source of the problem could be related to the metrics Vector exposes. After disabling the metrics, the pods seem to decrease their memory and CPU consumption. Removed from the Vector config:
sources:
  internal_metrics:
    type: internal_metrics
    scrape_interval_secs: 60
sinks:
  prom_exporter:
    type: prometheus_exporter
    inputs:
      - internal_metrics
    address: 0.0.0.0:9598
After we removed the metrics config we saw the Vector pod's resource usage stabilize. Is this a config problem, or could it be a Vector memory leak?
Do you see the cardinality of the metrics exposed by the prometheus_exporter sink increasing? This could be due to some components publishing telemetry with unbounded cardinality. There is also https://vector.dev/docs/reference/configuration/global-options/#expire_metrics_secs which you can use to expire stale metric contexts.
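For readers landing here later, a minimal sketch of what that global option looks like alongside the config shared above (the 300-second value is an arbitrary illustration, not a recommendation from this thread):

# Expire internal metric contexts that have not been updated for this
# many seconds, so stale series are dropped instead of accumulating.
# 300 is an arbitrary example value.
expire_metrics_secs: 300

sources:
  internal_metrics:
    type: internal_metrics
    scrape_interval_secs: 60

sinks:
  prom_exporter:
    type: prometheus_exporter
    inputs:
      - internal_metrics
    address: 0.0.0.0:9598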
I did some investigation into this, and my findings agree with @jszwedko's comment above. I can also confirm that the fix described below resolved it for us.

In our case, a few nodes in the cluster had much heavier pod churn than the rest of the nodes, and the Vector agents we noticed with significantly increasing CPU and memory usage were exclusively on those nodes.

One thing I noticed with internal_metrics is that many of the metrics are tagged with pod_name, so their cardinality grows with every pod scheduled on the node. I expected this might cause extra load on Prometheus, but not on Vector itself. However, it appears Vector's default behavior (at least in v0.31) is that metrics will continue to be emitted for every pod that has ever existed on its node. I could see this reflected in the rate at which events were sent from internal_metrics to the pod's prometheus_exporter server: rather than being constant, this rate was increasing. I believe this is the cause of the memory (and CPU) increasing over time; the amount of data sent from the internal_metrics source keeps growing.

To fix it: I set the (by default unset) global option expire_metrics_secs, as suggested by @jszwedko. I also created a transform that drops the pod_name tag from the metrics that feature it. (I also made an analogous transform for the metrics that carry a similar tag.)

Mem utilization before and after:

CPU utilization before and after:

(In both of these graphs, the very steep line before the fix is the node with very high pod churn.)
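A sketch of what such a tag-dropping transform could look like (the transform name and its placement between internal_metrics and the prometheus_exporter sink are assumptions for illustration, not the commenter's actual config):

transforms:
  drop_pod_name_tag:          # hypothetical name
    type: remap
    inputs:
      - internal_metrics
    source: |
      # Drop the per-pod tag so metric cardinality no longer grows
      # with pod churn on the node.
      del(.tags.pod_name)

sinks:
  prom_exporter:
    type: prometheus_exporter
    inputs:
      - drop_pod_name_tag     # the exporter now reads from the transform
    address: 0.0.0.0:9598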
Interesting! I am curious, what value did you set it to? @jszwedko: I also wonder if we want to consider a breaking change and always set a default for expire_metrics_secs.
@epandolfo-plaid thanks for sharing your info, it adds a lot of value. Can you please share the topology you mentioned? I believe the input isn't the
No problem! The input of the transform is
We found that the
A note for the community
Problem
Hello Vector team,
We see a CPU increase and a memory leak in only one of the agent pods.
We are running the agent as a DaemonSet --> one Vector pod per node.
We are running the aggregator as a StatefulSet.
The agent sends the Kubernetes logs to the aggregator.
Is this a bug, or did we configure something wrong?
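For context, a minimal sketch of the agent side of such a topology (the component names and the aggregator address below are illustrative assumptions; the reporter's actual configuration is not included in this copy of the issue):

# Agent-side sketch (DaemonSet): collect Kubernetes logs and forward
# them to the aggregator StatefulSet over the native vector protocol.
sources:
  k8s_logs:
    type: kubernetes_logs

sinks:
  to_aggregator:
    type: vector
    inputs:
      - k8s_logs
    address: vector-aggregator.vector.svc.cluster.local:6000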
Configuration
Version
0.32.1-distroless-libc
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response