default agent config causes metric cardinality explosion in Prometheus #229

Closed
ccmtaylor opened this issue Jul 18, 2022 · 4 comments

@ccmtaylor

We recently deployed a Vector agent based on the default configuration to a relatively busy Kubernetes cluster (~300 nodes, ~8000 pods). Some of its internal metrics have unbounded cardinality on certain tags.

In particular, the file-based metrics (vector_files_added_total, vector_files_unwatched_total) have a file tag, causing their cardinality to reach millions of time series over a couple of days. This had a noticeable performance impact on the overall observability infrastructure (based on Prometheus/Thanos).

As a workaround, we're including the following remap transform in our customConfig:

  transforms:
    reduce_metrics:
      type: remap
      inputs: [internal_metrics]
      # The file tag contains the (never re-used) file names. This causes a
      # high cardinality of "file added" and "file removed" counters, each at a
      # value of 1. Remove the tag so that we count each of these events into
      # the overall metric.
      source: 'del(.tags.file)'
  sinks:
    prom_exporter:
      type: prometheus_exporter
      inputs: [reduce_metrics]
      address: 0.0.0.0:9090

Would it make sense to include this in the default configuration?

@ccmtaylor
Author

Note: some other metrics (e.g. vector_component_received_event_bytes_total, vector_checksum_errors_total, vector_component_received_events_total) include the name of the pod that produced the logs as a tag. This technically also causes unbounded cardinality, though at a much lower rate, which is fine for our use case.
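
If that ever does become a problem, the same remap could drop the pod tag too. A rough sketch, assuming the tag is named pod_name (verify the tag names your Vector version actually attaches to these metrics):

  transforms:
    reduce_metrics:
      type: remap
      inputs: [internal_metrics]
      # Drop both the per-file tag and the (assumed) per-pod tag so events are
      # counted into a single series per metric instead of one per file/pod.
      source: |
        del(.tags.file)
        del(.tags.pod_name)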

@tuananhnguyen-ct
Contributor

No, this is being discussed in vectordotdev/vector#11995. Dropping the tags (with a transform) will not reduce the number of metrics Vector collects internally via internal_metrics; it only reduces the data sent out at the sink. A fix in the chart itself won't help in this case.

@spencergilbert
Contributor

Sorry - this clearly escaped my notice! @tuananhnguyen-ct is correct: dropping the tags (or using the tag_cardinality_limit transform) should protect your downstream Prometheus from cardinality issues, but Vector will still be tracking these series internally, and we'll need to solve this in a more complete fashion.
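
For reference, a rough sketch of the tag_cardinality_limit approach, placed between internal_metrics and the exporter (option values here are examples; check the transform docs for the exact option names and defaults in your version):

  transforms:
    limit_metric_tags:
      type: tag_cardinality_limit
      inputs: [internal_metrics]
      # Track tag values exactly and, once a tag exceeds value_limit distinct
      # values, drop that tag rather than dropping the whole event.
      mode: exact
      value_limit: 500
      limit_exceeded_action: drop_tag
  sinks:
    prom_exporter:
      type: prometheus_exporter
      inputs: [limit_metric_tags]
      address: 0.0.0.0:9090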

I'll close this as a duplicate of vectordotdev/vector#11995.

@ccmtaylor
Author

ccmtaylor commented Aug 4, 2022 via email
