OPA generates a lot of bundle metrics and floods system #4584

costimuraru · 2022-04-15T08:49:57Z

Short description

We have OPA deployed in Kubernetes as a standalone service (i.e. 3 pods). We generate a new bundle every 30 seconds (the bundle contains updated Rego policies), which OPA downloads. We also scrape the metrics on each OPA pod (via Prometheus) to monitor them. We've noticed that the number of metrics emitted by OPA increases dramatically: from ~400 metrics when OPA starts to 200,000+ after a few days. This is per pod. The increase seems to come from a series of metrics that relate to the bundle:

bundle_loading_duration_ns
last_success_bundle_activation
last_success_bundle_download
last_success_bundle_request
etc.

Each of these metrics has a label named active_revision, which I believe is the bundle id. Given that we load a new bundle every 30 seconds, the number of metrics increases fast.
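
To make the growth concrete, here is a minimal Python sketch (not OPA code) that models each distinct (metric name, active_revision) pair as one exported series. The `rev-N` revision ids and the `simulate_series_count` helper are made up for illustration; only the four metric names and the 30-second interval come from the report above.

```python
# Metric names quoted in the report; the list is not exhaustive ("etc.").
BUNDLE_METRICS = [
    "bundle_loading_duration_ns",
    "last_success_bundle_activation",
    "last_success_bundle_download",
    "last_success_bundle_request",
]


def simulate_series_count(num_bundles, metric_names):
    """Count distinct (metric, active_revision) series after num_bundles loads."""
    series = set()
    for i in range(num_bundles):
        revision = f"rev-{i}"  # hypothetical revision id, one per bundle
        for name in metric_names:
            # A new label value creates a new series; old ones are never removed.
            series.add((name, revision))
    return len(series)


bundles_per_day = 24 * 60 * 60 // 30  # one bundle every 30 seconds -> 2880
two_days = simulate_series_count(2 * bundles_per_day, BUNDLE_METRICS)
print(bundles_per_day, two_days)  # prints "2880 23040"
```

This is only a lower bound on the size of the /metrics output: if any of these metrics is a histogram, each series is exposed as several lines (one per bucket, plus a sum and a count), so the line count grows faster than the series count.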

You can see the output of OPA /metrics with 200k+ metrics in this Gist.

Examples:

OPA version: 0.39
Example query, input, data, and policy that OPA was given: none provided.
Example output that OPA returned:
curl http://opa-pod:8182/metrics | wc -l
229129
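
A line count shows that the output is large, but not which metrics are responsible. The following is a small sketch, assuming the endpoint returns the standard Prometheus text exposition format, that counts sample lines per metric name. The `count_series_by_metric` helper and the sample values are illustrative and not part of OPA.

```python
from collections import Counter


def count_series_by_metric(exposition_text):
    """Count sample lines per metric name in Prometheus text exposition output."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or at whitespace (value).
        name = line.split("{", 1)[0].split()[0]
        counts[name] += 1
    return counts


# Made-up sample in the shape described in this issue.
SAMPLE = """\
# TYPE last_success_bundle_request gauge
last_success_bundle_request{active_revision="rev-1"} 1.0
last_success_bundle_request{active_revision="rev-2"} 2.0
go_goroutines 12
"""

counts = count_series_by_metric(SAMPLE)
for name, n in counts.most_common():
    print(n, name)
```

To run it against a live pod, save the output of the curl command above to a file and pass the file contents to `count_series_by_metric`.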

Steps To Reproduce

Start OPA server, which downloads a new bundle every 30 seconds.
Check the metrics outputted by /metrics over time

Expected behavior

A constant number of metrics should be outputted over time.

Actual behavior

The number of metrics increases dramatically over time. It starts with 400 metrics and in 2 days, in our case, it reached 229,000 metrics. This floods our systems (Prometheus/Cortex), where we have a quota on the number of unique metrics scraped. The scraping itself also takes a lot of time (10-20 seconds).

Additional context


srenatus · 2022-04-15T10:01:14Z

Thanks for reporting! Indeed it sounds like having the active revision label is the problem. Would the metrics still be useful without that? 🤔

srenatus · 2022-04-15T10:19:32Z

Cc @rafaelreinert what do you think? I guess we could either make it a feature you'd need to enable, or drop the labels?

costimuraru · 2022-04-15T10:40:16Z

@srenatus in our case the metrics are valuable, even without the active_revision label. In fact, we already have recording rules in place that discard that label. The problem is that the number of metrics exposed by OPA increases constantly, to the point where Prometheus can’t scrape it anymore.
We want to know how the system behaves over time, regardless of which bundle actually misbehaved. If we detect an anomaly, say bundle_loading_duration > x seconds at dateY, we can always go to the logs and see what happened at that time, which bundle caused the issue, and so forth.
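
The following sketch illustrates why discarding the label on the Prometheus side does not help with the scrape itself. It models what an aggregation that drops one label does to a set of already-scraped samples; the `aggregate_without` helper, the revision ids, and the timestamps are all hypothetical.

```python
def aggregate_without(samples, dropped_label, combine=max):
    """Merge samples whose labels are equal once dropped_label is ignored."""
    merged = {}
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != dropped_label))
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


# Three scraped series for one metric, differing only in active_revision.
samples = [
    ({"active_revision": "rev-1"}, 1650000000),
    ({"active_revision": "rev-2"}, 1650000030),
    ({"active_revision": "rev-3"}, 1650000060),
]

aggregated = aggregate_without(samples, "active_revision")
print(len(samples), "scraped ->", len(aggregated), "stored after aggregation")
```

The aggregation only runs after every input series has been scraped, so the size of the /metrics response is unchanged; the label has to be removed at the source for the scrape to get smaller.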

rafaelreinert · 2022-04-18T12:34:13Z

Hey @srenatus and @costimuraru, about the flag: it already exists. To enable the status metrics on Prometheus, it must be configured, and the default value is false (see https://www.openpolicyagent.org/docs/latest/management-status/#prometheus-status-metrics). I did that precisely to avoid flooding the metrics in workloads like yours. I believe someone enabled it in your infrastructure. (Please let me know if the flag is not set; that may be a bug.)
About the high cardinality of the metrics: it depends on the workload. For example, in my infrastructure, OPA is deployed as a sidecar on every pod, and it is really important to monitor which version of the bundle each pod has (the bundle doesn’t change too often).
But I understand that this doesn’t fit well when a workload has too many bundles in a short period of time. One alternative may be to create a new flag on the status config that configures the level of the status metrics. Maybe:

full : all metrics with all bundles
simple: only global metrics

What do you think about this approach?

rafaelreinert · 2022-04-18T13:10:05Z

Hey @srenatus and @costimuraru, I was thinking again about this issue and I realized what may be the root cause. The problem is that this function, https://github.com/open-policy-agent/opa/blob/main/plugins/status/plugin.go#L478, is not resetting the older bundles' metrics. I had not seen that as an issue because my environment has few bundles and the OPA instance lifespan is less than 3 days. But in an environment with a long lifespan and many bundles, it becomes an issue because the number of exported metrics increases a lot. This week I will try to fix it and reset the metrics, because if a bundle is not used anymore, its metrics shouldn't be exported.
@costimuraru what do you think?

costimuraru · 2022-04-18T19:41:26Z

Thanks for the details on the rationale behind this behavior, @rafaelreinert.
I'm trying to understand your suggestion about resetting the metrics. Are you saying that each time a bundle is loaded, the metrics will be reset and will take their values from that latest bundle? I worry that this would not work well with counters. From the Prometheus docs:

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.

For instance, the bundle_failed_load_counter should (ideally) increase from the moment the OPA server starts until it gets terminated. By this logic, it should not get reset when a new bundle is loaded. This allows us to run Prometheus queries such as rate on this counter.
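
As a rough illustration of this concern, the sketch below computes the total increase of a counter from a list of scraped values, treating any drop as a restart from zero. It is a simplification written for this example (Prometheus's own rate and increase also extrapolate over the query window), and the sample values are invented.

```python
def counter_increase(samples):
    """Total increase across scraped counter values, treating drops as resets."""
    total = 0.0
    prev = None
    for value in samples:
        if prev is not None:
            if value >= prev:
                total += value - prev
            else:
                # The counter went down, so assume it restarted from zero.
                total += value
        prev = value
    return total


# Five failures observed through a counter that never resets.
monotonic = [0, 2, 3, 5]
# The same five failures, but the counter is reset when a new bundle loads:
# the third failure happens after the second scrape and is wiped by the
# reset before the next scrape can see it.
reset_on_load = [0, 2, 0, 2]

print(counter_increase(monotonic))      # prints 5.0
print(counter_increase(reset_on_load))  # prints 4.0
```

Increments that land between a scrape and a reset are never observed, which is one reason to keep the counter running for the lifetime of the process.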

For our use case, I think what we need is to just remove the active_revision label from these metrics. Here is a change that seems to work fine for our use case: https://github.com/open-policy-agent/opa/compare/main...costimuraru:fix-metrics-cardinality?expand=1 - let me know if you see issues with this, we plan to give it a try (in prod), to have a quick fix :-D. In this way we have an aggregated view on these bundle metrics, which is what we need (a global view as you mentioned).

Coming back to the metric types, the gauge is probably not that important, cause it retains the latest value, right?
As for the histogram, I think having one across all bundles is actually a good thing: we can infer things like "95% of the bundles took less than X seconds to load", which should be possible now that we don't have one histogram per bundle, no?
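
As a sketch of the kind of inference meant here, the function below estimates a quantile from a single cumulative histogram by interpolating linearly inside the bucket that contains the target rank, which is the general approach Prometheus's histogram_quantile takes. The bucket bounds (in seconds) and the counts are invented for illustration and are not OPA's actual buckets.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets.

    buckets must be sorted by upper bound and end with an infinite bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Cannot interpolate; fall back to the lower edge of the bucket.
                return prev_bound
            # Linear interpolation within the bucket that holds the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical load-time histogram aggregated across all bundles.
load_time_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, load_time_buckets)
print(f"95% of bundle loads took less than about {p95:.2f}s")
```

With one histogram per revision, each histogram would hold a single observation, so a per-series quantile would say little; aggregating across bundles is what makes the estimate meaningful.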

Your suggestion to have a flag which makes it possible to select between these 2 behaviors is probably best (metrics with and without the active_revision label).
Wdyt?

costimuraru · 2022-04-19T17:07:42Z

After stripping the activeRevision label, OPA is looking much better:

rafaelreinert · 2022-04-19T20:32:19Z

I have been thinking about that. Maybe the best solution is to remove the active_revision label from all metrics (as you have done) and create another metric (gauge), last_active_revision, with active_revision as a label, and reset it on each update.
With this solution, we can have the active_revision from every bundle and we eliminate the other metrics' high cardinality.
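
The proposal can be sketched with a toy labelled gauge. This is a stand-in written for this example, not the Prometheus client or OPA's status plugin; the `LabeledGauge` class, the `record_activation` helper, and the `rev-N` ids are hypothetical.

```python
class LabeledGauge:
    """Toy gauge with one label: stores one value per label value."""

    def __init__(self, name, label_name):
        self.name = name
        self.label_name = label_name
        self._values = {}

    def reset(self):
        """Drop every stored series."""
        self._values.clear()

    def set(self, label_value, value):
        self._values[label_value] = value

    def exposition_lines(self):
        """Render the stored series as text exposition lines."""
        return [
            f'{self.name}{{{self.label_name}="{lv}"}} {v}'
            for lv, v in self._values.items()
        ]


def record_activation(gauge, revision, timestamp):
    """Keep only the newest revision by resetting before setting."""
    gauge.reset()
    gauge.set(revision, timestamp)


gauge = LabeledGauge("last_active_revision", "active_revision")
for i in range(1000):
    record_activation(gauge, f"rev-{i}", i)

lines = gauge.exposition_lines()
print(len(lines), lines[0])
```

However many bundles are activated, the gauge exposes a single series, so the active revision stays visible per instance without the output growing.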

srenatus · 2022-04-20T10:25:08Z

@rafaelreinert that sounds reasonable. @costimuraru what do you think?

@rafaelreinert would you be able to pick up making this change? 😃 (I'll take care of it if it's too much on your plate right now.)

costimuraru · 2022-04-20T12:09:08Z

Thanks @srenatus.
I've incorporated the feedback from @rafaelreinert and opened #4600. I'm gonna test it today and tomorrow in our env and see how it works. (I plan to use Grafana annotations with the newly added metrics that @rafaelreinert suggested and see how it behaves).

Having one activeRevision label on each of the prometheus metrics emitted by the status plugin has proven to be problematic with a large number of bundles. So with this change,

1. we keep the activeRevision label (just on) the last_success_bundle_activation metric.
2. the gauge gets reset, so we only keep the last active_revision (instead of keeping them all and therefore avoiding the situation where the /metrics output grows indefinitely)

Fixes #4584.

Signed-off-by: cmuraru <cmuraru@adobe.com>


costimuraru added the bug label Apr 15, 2022

costimuraru mentioned this issue Apr 20, 2022

status: Remove activeRevision label and generate one dedicated metric with it #4600

Merged

srenatus closed this as completed in #4600 Apr 26, 2022
