default agent config causes metric cardinality explosion in Prometheus #229

Closed
ccmtaylor opened this issue Jul 18, 2022 · 4 comments

@ccmtaylor

We recently deployed a Vector agent based on the default configuration to a relatively busy Kubernetes cluster (~300 nodes, ~8000 pods). Some of its internal metrics have unbounded cardinality on certain tags.

In particular, the file-based metrics (vector_files_added_total, vector_files_unwatched_total) have a file tag, causing their cardinality to reach millions of time series over a couple of days. This had a noticeable performance impact on the overall observability infrastructure (based on Prometheus/Thanos).

As a workaround, we're including the following remap transform in our customConfig:

  transforms:
    reduce_metrics:
      type: remap
      inputs: [internal_metrics]
      # The file tag contains the (never re-used) file names. This causes a
      # high cardinality of "file added" and "file removed" counters, each at a
      # value of 1. Remove the tag so that we count each of these events into
      # the overall metric.
      source: 'del(.tags.file)'
  sinks:
    prom_exporter:
      type: prometheus_exporter
      inputs: [reduce_metrics]
      address: 0.0.0.0:9090

Would it make sense to include this in the default configuration?

@ccmtaylor
Author

Note: some other metrics (e.g. vector_component_received_event_bytes_total, vector_checksum_errors_total, vector_component_received_events_total) include the name of the pod that produced the logs as a tag. This technically also causes unbounded cardinality, though at a much lower rate, which is fine for our use case.
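
If that ever does become a problem, the same remap could drop the pod tag too. A rough sketch, assuming the tag is named pod_name (verify the tag names your Vector version actually attaches to these metrics):

  transforms:
    reduce_metrics:
      type: remap
      inputs: [internal_metrics]
      # Drop both the per-file tag and the (assumed) per-pod tag so events are
      # counted into a single series per metric instead of one per file/pod.
      source: |
        del(.tags.file)
        del(.tags.pod_name)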

@tuananhnguyen-ct
Contributor

No, this is being discussed in vectordotdev/vector#11995. Dropping the tags (with a transform) will not reduce the number of metrics Vector collects internally via internal_metrics; it only reduces the data sent out at the sink. A fix in the chart itself won't help in this case.

@spencergilbert
Contributor

Sorry - this clearly escaped my notice! @tuananhnguyen-ct is correct: dropping the tags (or using the tag_cardinality_limit transform) should protect your downstream Prometheus from cardinality issues, but Vector will still be tracking these series internally, and we'll need to solve this in a more complete fashion.
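
For reference, a rough sketch of the tag_cardinality_limit approach, placed between internal_metrics and the exporter (option values here are examples; check the transform docs for the exact option names and defaults in your version):

  transforms:
    limit_metric_tags:
      type: tag_cardinality_limit
      inputs: [internal_metrics]
      # Track tag values exactly and, once a tag exceeds value_limit distinct
      # values, drop that tag rather than dropping the whole event.
      mode: exact
      value_limit: 500
      limit_exceeded_action: drop_tag
  sinks:
    prom_exporter:
      type: prometheus_exporter
      inputs: [limit_metric_tags]
      address: 0.0.0.0:9090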

I'll close this as a duplicate of vectordotdev/vector#11995.

@ccmtaylor
Author

ccmtaylor commented Aug 4, 2022 via email
