
CPU increase and memory leak #19125

Open

tomer-epstein opened this issue Nov 12, 2023 · 21 comments
Labels
type: bug A code related bug.

Comments

@tomer-epstein

tomer-epstein commented Nov 12, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Hello Vector team,
We see a CPU increase and a memory leak in only one of the agent pods.
We run the agent as a DaemonSet, i.e. one Vector pod per node.
We run the aggregator as a StatefulSet.
The agent sends the Kubernetes logs to the aggregator.
Is this a bug, or did we misconfigure something?

(screenshot attached)

Configuration

  affinity: {}
  args:
  - -w
  - --config-dir
  - /etc/vector/
  - --log-format
  - json
  - -vv
  autoscaling:
    behavior: {}
    customMetric: {}
    enabled: false
    maxReplicas: 10
    minReplicas: 1
    targetCPUUtilizationPercentage: 80
  command: []
  commonLabels: {}
  containerPorts: []
  customConfig:
    api:
      enabled: false
    data_dir: /vector-data-dir
    sinks:
      prom_exporter:
        address: 0.0.0.0:9598
        inputs:
        - internal_metrics
        type: prometheus_exporter
      splunk_hec_dev_test:
        buffer:
          type: memory
        compression: gzip
        default_token: cf69e945-a1b2-a1b2-a1b2-be6777bea1b2
        encoding:
          codec: json
        endpoint: https://hec.a1b2.a234.log.cde.net.abc:443
        index: ugw_statistics
        inputs:
        - transform_remap_gateway_proxy_accesslog_test
        type: splunk_hec_logs
      vector_aggregator:
        address: abc-efgh-vector-aggregator.vector-aggregator.svc.cluster.local:7500
        inputs:
        - kubernetes_logs
        type: vector
    sources:
      internal_metrics:
        scrape_interval_secs: 2
        type: internal_metrics
      kubernetes_logs:
        glob_minimum_cooldown_ms: 1000
        max_read_bytes: 8192
        type: kubernetes_logs
    transforms:
      transform_remap_gateway_proxy_accesslog:
        inputs:
        - transform_route_gateway_proxy.access_log
        source: |
          ., err = parse_json(.message)
          if (err != null) {
            log("Remap gateway-proxy access_log, Unable to parse json: " + err, "error")
            abort
          }
          if (!exists(.targetLog) || .targetLog != "ABC-API-INSIGHTS-ACCESS-LOG") {
            abort
          }
          del(.targetLog)
          # verify insightsFields is not empty
          if (!exists(.insightsFields) || .insightsFields == null || .insightsFields == "") {
            log("Remap gateway-proxy access_log, insightsFields field is missing.")
            abort
          }
          result, err = merge(., .insightsFields)
          if (err != null) {
              log("Remap gateway-proxy access_log, Unable to merge insightsFields: " + err, "error")
          } else {
            . = result
          }
          del(.insightsFields)
          if (!exists(.insightsProfile) || .insightsProfile == null || .insightsProfile == "") {
            log("Remap gateway-proxy access_log, insightsProfile field is missing.")
            abort
          }
          path_parts, err = parse_regex(.path, r'(?P<pathname>[^?#]*)(?P<query>.*)')
          if (err != null) {
            log("Remap gateway-proxy access_log, Unable to parse regex: " + err, "error")
            abort
          }
          .path = path_parts.pathname
          if (.serviceTime != null) {
            .serviceTime = to_int!(.serviceTime)
          }
          if (.jwtZid != null) {
            .spcTenantId = .jwtZid
          }
          del(.jwtZid)
          if (.abcaaTenantId != null) {
            .spcTenantId = .abcaaTenantId
          }
          del(.abcaaTenantId)
          if (.jwtValStatus == null || .jwtSkipFailureReporting == true) {
            del(.jwtValStatus)
          }
          del(.jwtSkipFailureReporting)
          if (.rateLimitLimit == null) {
            del(.rateLimitLimit)
          }
          if (.rateLimitRemaining == null) {
            del(.rateLimitRemaining)
          }
          if (.rateLimitReset == null) {
            del(.rateLimitReset)
          }
          if (.rateLimitPolicy == null) {
            del(.rateLimitPolicy)
          }
        type: remap
      transform_remap_gateway_proxy_accesslog_all:
        inputs:
        - transform_remap_gateway_proxy_accesslog
        source: |
          if (.insightsProfile == "vector-e2e-test") {
            abort
          }
        type: remap
      transform_remap_gateway_proxy_accesslog_test:
        inputs:
        - transform_remap_gateway_proxy_accesslog
        source: |
          if (.insightsProfile != "vector-e2e-test") {
            abort
          }
          del(.insightsProfile)
        type: remap
      transform_route_gateway_proxy:
        inputs:
        - kubernetes_logs
        route:
          '*': .kubernetes.container_name == "gateway-proxy"
          access_log: starts_with(string!(.message), "{") && ends_with(string!(.message), "}")
        type: route
  dataDir: ""
  defaultVolumeMounts:
  - mountPath: /var/log/
    name: var-log
    readOnly: true
  - mountPath: /var/lib
    name: var-lib
    readOnly: true
  defaultVolumes:
  - hostPath:
      path: /var/log/
    name: var-log
  - hostPath:
      path: /var/lib/
    name: var-lib
  dnsConfig: {}
  dnsPolicy: ClusterFirst
  env: []
  envFrom: []
  existingConfigMaps: []
  extraContainers: []
  extraVolumeMounts: []
  extraVolumes:
  - name: secret-volume
    secret:
      secretName: vector-values
  fullnameOverride: ""
  global: {}
  haproxy:
    affinity: {}
    autoscaling:
      customMetric: {}
      enabled: false
      maxReplicas: 10
      minReplicas: 1
      targetCPUUtilizationPercentage: 80
    containerPorts: []
    customConfig: ""
    enabled: false
    existingConfigMap: ""
    extraContainers: []
    extraVolumeMounts: []
    extraVolumes: []
    image:
      pullPolicy: IfNotPresent
      pullSecrets: []
      repository: haproxytech/haproxy-alpine
      tag: 2.6.12
    initContainers: []
    livenessProbe:
      tcpSocket:
        port: 1024
    nodeSelector: {}
    podAnnotations: {}
    podLabels: {}
    podPriorityClassName: ""
    podSecurityContext: {}
    readinessProbe:
      tcpSocket:
        port: 1024
    replicas: 1
    resources: {}
    rollWorkload: true
    securityContext: {}
    service:
      annotations: {}
      externalTrafficPolicy: ""
      ipFamilies: []
      ipFamilyPolicy: ""
      loadBalancerIP: ""
      ports: []
      topologyKeys: []
      type: ClusterIP
    serviceAccount:
      annotations: {}
      automountToken: true
      create: true
    strategy: {}
    terminationGracePeriodSeconds: 60
    tolerations: []
  image:
    pullPolicy: IfNotPresent
    pullSecrets:
    - name: vector
    repository: build-releases-external.common.cdn.repositories.cloud.abc/timberio/vector
    tag: 0.32.1-distroless-libc
  ingress:
    annotations: {}
    className: ""
    enabled: false
    hosts: []
    tls: []
  initContainers: []
  lifecycle: {}
  livenessProbe: {}
  logLevel: info
  minReadySeconds: 0
  nameOverride: ""
  nodeSelector: {}
  persistence:
    accessModes:
    - ReadWriteOnce
    enabled: false
    existingClaim: ""
    finalizers:
    - kubernetes.io/pvc-protection
    hostPath:
      enabled: true
      path: /var/lib/vector
    selectors: {}
    size: 10Gi
  podAnnotations: {}
  podDisruptionBudget:
    enabled: false
    minAvailable: 1
  podHostNetwork: false
  podLabels:
    sidecar.istio.io/inject: "true"
    vector.dev/exclude: "true"
  podManagementPolicy: OrderedReady
  podMonitor:
    additionalLabels: {}
    enabled: false
    honorLabels: false
    honorTimestamps: true
    jobLabel: app.kubernetes.io/name
    metricRelabelings: []
    path: /metrics
    port: prom-exporter
    relabelings: []
  podPriorityClassName: ""
  podSecurityContext: {}
  psp:
    create: false
  rbac:
    create: true
  readinessProbe: {}
  replicas: 1
  resources:
    limits:
      cpu: 1000m
      memory: 2Gi
    requests:
      cpu: 10m
      memory: 128Mi
  role: Agent
  rollWorkload: false
  secrets:
    generic: {}
  securityContext: {}
  service:
    annotations: {}
    enabled: true
    externalTrafficPolicy: ""
    ipFamilies: []
    ipFamilyPolicy: ""
    loadBalancerIP: ""
    ports: []
    topologyKeys: []
    type: ClusterIP
  serviceAccount:
    annotations: {}
    automountToken: true
    create: true
  serviceHeadless:
    enabled: true
  terminationGracePeriodSeconds: 60
  tolerations:
  - effect: NoSchedule
    key: WorkGroup
    operator: Equal
    value: abproxy
  - effect: NoExecute
    key: WorkGroup
    operator: Equal
    value: abproxy
  topologySpreadConstraints: []
  ugwSecretConfigEnabled: false
  updateStrategy: {}
  workloadResourceAnnotations: {}

Version

0.32.1-distroless-libc

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

tomer-epstein added the type: bug (A code related bug.) label on Nov 12, 2023
tomer-epstein changed the title from "CPU increase" to "CPU increase and memory leak" on Nov 12, 2023
@bruceg
Member

bruceg commented Nov 14, 2023

This does look like a bug.

Are you able to upgrade to version 0.34.0? There were a couple of memory leaks that were handled in recent versions that may resolve your issue.

Also, do you have any metrics that could have unbounded cardinality? This could cause memory growth through the prometheus_exporter sink, though perhaps not as rapidly as what you are seeing.

@gadisn
Contributor

gadisn commented Nov 14, 2023

@bruceg thanks.
Can you please elaborate on the unbounded-cardinality question?
The metrics are the built-in ones from the various components (e.g. sinks) that are in use.
Is there a specific source/transform/sink you can point to that might have this problematic metric?

@bruceg
Member

bruceg commented Nov 14, 2023

Hmm, I keyed on the Prometheus component in the config above and misread it as a source (I'm so used to reading sources first). I see you don't have any metric sources other than internal_metrics, which should have effectively single cardinality for its metrics, so that was a red herring. However, there were a couple of buffer-related memory leaks that were addressed between versions 0.32.0 and 0.34.0. Are you able to upgrade to 0.34.0?

@gadisn
Contributor

gadisn commented Nov 15, 2023

Thanks, yes, an upgrade is certainly an option; we'll try it.

@tomer-epstein
Author

We upgraded to version 0.34.0, but we still see CPU and memory growth.

(screenshot attached)

vectordotdev deleted a comment on Nov 20, 2023
@valerypetrov

valerypetrov commented Nov 20, 2023

Hey, we have the same issue. By the way, I found very strange behavior in the internal_metrics output exposed by the prometheus_exporter sink. Please take a look:

 N=10; for ((i=1; i<=N; i++)); do curl -s  http://localhost:9598/metrics | wc -l; sleep 1; done
14329
14330
14332
14333
14333
14335
14336
14337
14337
14337

Seems the issue is with the vector_component_received_bytes_total metric:

curl -s  http://localhost:9598/metrics |  grep vector_component_received_bytes_total | wc -l
13886

The metrics are:

vector_component_received_bytes_total{component_id="tcp_input_no_tls",component_kind="source",component_name="tcp_input_no_tls",component_type="socket",host="HOST",peer_addr="IP:55080",protocol="tcp"} 310 1700521527947
vector_component_received_bytes_total{component_id="tcp_input_no_tls",component_kind="source",component_name="tcp_input_no_tls",component_type="socket",host="HOST",peer_addr="IP:55086",protocol="tcp"} 310 1700521527947

We have an F5 load balancer in front of each Vector instance, so these metrics end up with only a small set of F5 IPs but a large number of ports in the peer_addr tag.

@valerypetrov

Seems like the upgrade to 0.34.1 resolved the issue with the memory leak & CPU increase.
(screenshot attached)

@pront
Member

pront commented Dec 8, 2023

Seems like the upgrade to 0.34.1 resolved the issue with the memory leak & CPU increase.

Hello @tomer-epstein, can you try v0.34.1 as well?

@tomer-epstein
Author

Hello @pront,
We upgraded to version 0.34.1.
We see an improvement, but we still see memory and CPU growth.

(screenshot attached)

@vibhavachar

We are seeing this issue on version 0.28 and will try to upgrade. Would disabling internal metrics solve the issue?

@sharonx
Contributor

sharonx commented Dec 17, 2023

I'm using 0.34.1 and we are experiencing the same issue: 2 out of 14 pods (DaemonSet, agent mode) are running out of memory every hour or so. The memory consumption looks steady for a while, then spikes up very quickly and runs out of memory.

(screenshot attached)

@tomer-epstein
Author

Hello,
We are using version 0.34.1.
The agent pods are running out of memory.
We also see these log messages:

  • "level":"ERROR","msg":"Healthcheck failed.","error":"Queues are full"
  • "level":"ERROR","msg":"Healthcheck timed out."

(screenshot attached)

@mordaby

mordaby commented Jan 18, 2024

Hi, in addition to @tomer-epstein's comment, we also found that the source of the problem could be related to the metrics Vector exposes. After disabling the metrics, the pods' memory and CPU consumption decreased.
We removed the source and the sink responsible for collecting/exposing Vector metrics from the cluster.

Removed from the Vector config:

sources:
  internal_metrics:
    type: internal_metrics
    scrape_interval_secs: 60
sinks:
  prom_exporter:
    type: prometheus_exporter
    inputs:
      - internal_metrics
    address: 0.0.0.0:9598

After we removed the metrics config, the Vector pods' resource usage stabilized.

Is this a config problem, or could it be a Vector memory leak?

@jszwedko
Member

Do you see the cardinality of the metrics exposed by the prometheus_exporter sink growing over time? That's one hunch I would have: that the sink is receiving ever more metric series.

This could be due to some components publishing telemetry with unbounded cardinality (like the file source which tagged internal metrics with a file tag). In v0.35.0, these "high cardinality" tags were removed or changed to be opt-in. Could you try that version?

There is also https://vector.dev/docs/reference/configuration/global-options/#expire_metrics_secs which you can use to expire stale metric contexts.
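
For illustration, a minimal sketch of how that global option could be set in the agent's customConfig shown earlier (the 600-second value here is only an example, not a recommendation from this thread):

  # Global option at the top level of the Vector config (the Helm customConfig above).
  # Metric contexts that receive no updates for this many seconds are dropped.
  expire_metrics_secs: 600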

@epandolfo-plaid

I did some investigation into this, and my findings agree with @jszwedko's comment above. I can also confirm that setting expire_metrics_secs fixed it for me on v0.31 (and presumably on other versions below 0.35).

In our case, a few nodes in the cluster had much heavier pod churn than the rest, and the Vector agents where we noticed significantly increasing CPU and memory usage were exclusively on those nodes. One thing I noticed about internal_metrics is that the vector_component_received_*_total metrics include pod_name as a label. When the Vector agent is on a node with heavy pod churn, the cardinality of this metric grows with every new pod that is created.

I expected this might cause extra load on Prometheus, but not on Vector itself. However, it appears Vector's default behavior (at least in v0.31) is that metrics continue to be emitted for every pod that has ever existed on the node. I could see this reflected in the rate at which events were sent from internal_metrics to the pod's prometheus_exporter server: rather than being constant, this rate kept increasing. I believe this is the cause of the memory (and CPU) growth over time; the amount of data sent from the internal_metrics source increases with every new pod creation but never decreases when an old pod is deleted.

(screenshot attached)

To fix: I set the (by default unset) global option expire_metrics_secs, as suggested by @jszwedko.

I also created a transform that drops the pod_name tag from the metrics that carry it. (I also made an analogous transform for the metrics which have one file tag per pod.) I did this because I am not currently using the pod_name label at all in my dashboards or alerts. It is not necessary for controlling CPU and memory growth once you enable expire_metrics_secs, but I had no use for that label, and dropping it helps reduce the load on Prometheus.
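
For reference, a transform along those lines could look like the sketch below; the transform name is hypothetical and this is not the commenter's exact config:

  transforms:
    drop_pod_name_tag:          # hypothetical name
      type: remap
      inputs:
        - internal_metrics
      source: |
        # Metric events expose their tags under .tags in VRL;
        # removing the per-pod tag caps the series cardinality.
        del(.tags.pod_name)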

Mem utilization before and after:
(screenshot attached)

CPU utilization before and after:
(screenshot attached)

(In both of these graphs, the very steep line before the fix is the node with very high pod churn)

@pront
Member

pront commented Jan 29, 2024

To fix: I set the (by default unset) global option expire_metrics_secs, as suggested by @jszwedko.

Interesting! I am curious what value you set it to.

@jszwedko: I also wonder if we want to consider a breaking change and always set expire_metrics_secs to a large value (e.g. 30 minutes) by default.

@epandolfo-plaid

I set it to 2m, which does somewhat ruin the metric, but I don't use that metric for agent pods. By "ruins the metric" I mean that it drops to 0 whenever the line would otherwise be flat for longer than 2 minutes, and when it spikes back up it causes a massive spike in the rate. Note that in this graph I multiplied the rate (yellow line) by 10 to make it more visible.

(screenshot attached)

Unless I misunderstand the documentation here, which says "Note that internal counters that are expired but are later updated will have their values reset to zero," it appears the expired counter does not have its value reset to 0 but is instead reset to its previous value (visible in the green line). Fixing that should make the rate line much more usable here.

Setting it to 30m as you suggest would definitely help with that, since it would allow more time with no logs recorded before the metric stops being emitted. But it comes at the cost of slightly higher steady-state memory usage. Again, I have no use for this metric on agent pods, so I prefer to optimize for memory savings.

Overall I would agree with setting expire_metrics_secs by default.

@epandolfo-plaid

As an aside, I suspect (but have not investigated deeply) that this issue of expired metrics not being fully garbage collected (as indicated by the green line not resetting to 0 after each gap) is responsible for a small amount of memory growth even with expire_metrics_secs set. One other piece of evidence supporting that theory is this graph, which shows memory usage continuing to increase (at a much lower rate than before) while CPU usage stays flat (whereas before, CPU increased proportionally with memory). This could indicate that these metrics are still being stored after they expire but are never emitted again, so they account for slightly more memory usage but no additional CPU usage.
(screenshot attached)

@gadisn
Contributor

gadisn commented Jan 30, 2024

@epandolfo-plaid thanks for sharing your info. It adds a lot of value.

Can you please share the topology you mentioned in "I also created a transform that dropped the pod_name tag from the metrics that feature it", i.e. what is the input of that transform, and what consumes it?

I believe the input isn't the internal_metrics source, right? At that point the metric has already been collected. Or does dropping it after internal_metrics still lead to memory reduction?

@epandolfo-plaid

No problem! The input of the transform is internal_metrics, and prometheus_exporter consumes it. You are correct that at this point the metric has already been collected, and I believe you are also correct that dropping it after internal_metrics does not reduce Vector's memory usage. The only reason I did this is to limit the cardinality of these metrics stored in Prometheus after they have been scraped from the Vector pod.
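
In other words, the topology would be wired roughly like this (component names are illustrative, matching the agent config earlier in the thread):

  sources:
    internal_metrics:
      type: internal_metrics

  transforms:
    drop_pod_name_tag:          # the tag-dropping remap sketched above
      type: remap
      inputs:
        - internal_metrics
      source: |
        del(.tags.pod_name)

  sinks:
    prom_exporter:
      type: prometheus_exporter
      inputs:
        - drop_pod_name_tag     # consume the transform instead of internal_metrics directly
      address: 0.0.0.0:9598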

@mordaby

mordaby commented Feb 22, 2024

We found that the expire_metrics_secs global option did the trick. We lowered the metric expiry to 15 minutes, and now Vector's stats are stable and the pod functions as expected.
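
For clarity, a 15-minute expiry corresponds to the following global setting (a sketch, not the commenter's exact config):

  expire_metrics_secs: 900   # 15 minutes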
