Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPA generates a lot of bundle metrics and floods system #4584

Closed
costimuraru opened this issue Apr 15, 2022 · 10 comments · Fixed by #4600
Closed

OPA generates a lot of bundle metrics and floods system #4584

costimuraru opened this issue Apr 15, 2022 · 10 comments · Fixed by #4600
Labels

Comments

@costimuraru
Copy link
Contributor

Short description

We have OPA deployed in Kubernetes as a standalone service (ie. 3 pods). We generate a new bundle every 30 seconds (bundle contains updated rego policies), which OPA downloads. We also scrape the metrics on each OPA pod (via Prometheus) to monitor them. We've noticed that the number of metrics emitted by OPA increases dramatically - from ~400 metrics when OPA starts to around 200 000+ after a few days. This is per pod. It seems that the increase is attributed to a series of metrics that relate to the bundle:

  • bundle_loading_duration_ns
  • last_success_bundle_activation
  • last_success_bundle_download
  • last_success_bundle_request
  • etc.

Each of this metric has a label named active_revision, which I believe is the bundle id. Given that we load a new bundle every 30 seconds, the number of metrics increases fast.

You can see the output of OPA /metrics with 200k+ metrics in this Gist.

Examples:

  • OPA version
    0.39
  • Example query, input, data, and policy that OPA was given
  • Example output that OPA returned
    curl http://opa-pod:8182/metrics | wc -l
    229129

Steps To Reproduce

  1. Start OPA server, which downloads a new bundle every 30 seconds.
  2. Check the metrics outputted by /metrics over time

Expected behavior

A constant number of metrics should be outputted over time.

Actual behavior

The number of metrics increases over time dramatically. It starts with 400 metrics and in 2 days, in our case, it reached 229 000 metrics. This floods our systems (Prometheus/Cortex), where we have quota on the number of unique metrics scrapped. The scrapping itself also takes a lot of time (10-20 seconds).

Additional context

@srenatus
Copy link
Contributor

Thanks for reporting! Indeed it sounds like having the active revision label is the problem. Would the metrics still be useful without that? 🤔

@srenatus
Copy link
Contributor

Cc @rafaelreinert what do you think? I guess we could either make it a feature you'd need to enable, or drop the labels?

@costimuraru
Copy link
Contributor Author

@srenatus in our case the metrics are valuable, even without the active_revision label. In fact, we already have recording rules in place that discard that label. The problem is that the number of metrics on the OPA increases constantly, to the point where Prometheus can’t scrape it anymore.
We want to know how the system behaves over time, regardless of what was the actual bundle that misbehaved. If we detect an anomaly, say bundle_loading_duration > x seconds at dateY, we can always go to the logs and see what happened at that time, what was the bundle that generated the issue and so forth.

@rafaelreinert
Copy link
Contributor

Hey @srenatus and @costimuraru , about the flag it already exists, in order to enable the status metrics on Prometheus it must be configured and the default value is false ( see https://www.openpolicyagent.org/docs/latest/management-status/#prometheus-status-metrics). I've done that just to not overflow the metrics in workloads like yours. I really believe someone activated it in your infrastructure. (Please let me know if the flag is not set, maybe a bug).
About the high cardinality of the metrics, it depends on the workload. For example, on my infrastructure, the OPA is distributed as a sidecar on every pod and is really important for monitoring which version of the bundle each pod has (the bundle doesn’t change too often ).
But I understand that doesn’t fit well when a workload has too many bundles in a short period of time, One alternative maybe is to create a new flag on the status config that configure the level of the status metrics. Maybe:

  • full : all metrics with all bundles
  • simple: only global metrics

What do you think about this approach?

@rafaelreinert
Copy link
Contributor

Hey @srenatus and @costimuraru, I was thinking again about this issue and I realized what is maybe root cause, The problem is that, this function https://github.com/open-policy-agent/opa/blob/main/plugins/status/plugin.go#L478 is not resetting the older bundles metrics, I've not seen that as an issue because my environment has few bundles and the opa instance lifespan is less than 3 days. But in an environment that has a long lifespan and many bundles that become an issue because the number of metrics exported increase a lot. This week I will try to fix it and reset the metrics because if the bundle is not used anymore it doesn't be exported.
@costimuraru what do you think ?

@costimuraru
Copy link
Contributor Author

costimuraru commented Apr 18, 2022

Thanks for the details on the rationale behind this behavior, @rafaelreinert.
I'm trying to understand your suggestion with resetting the metrics. Are you saying that each time a bundle is loaded, the metrics will be reset and they will take the values based on that latest bundle? I worry that would not work fine with the counters. From the docs in Prometheus:

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.

For instance, the bundle_failed_load_counter should (ideally) increase from the moment the opa server starts till it gets terminated. By this logic, it should not get reset when a new bundle is loaded. This allows us to run Prometheus queries such rate on this counter.

For our use case, I think what we need is to just remove the active_revision label from these metrics. Here is a change that seems to work fine for our use case: https://github.com/open-policy-agent/opa/compare/main...costimuraru:fix-metrics-cardinality?expand=1 - let me know if you see issues with this, we plan to give it a try (in prod), to have a quick fix :-D. In this way we have an aggregated view on these bundle metrics, which is what we need (a global view as you mentioned).

Coming back to the metric types, the gauge is probably not that important, cause it retains the latest value, right?
As for the histogram, I think having one across all bundles is actually a good thing: we can infer things like "95% of the bundles took less than X seconds to load", which should be possible now that we don't have one histogram per bundle, no?

Your suggestion to have a flag which makes it possible to select between these 2 behaviors is probably best (metrics with and without the active_revision label).
Wdyt?

@costimuraru
Copy link
Contributor Author

After stripping the activeRevision label, OPA is looking much better:

Screenshot 2022-04-19 at 20 05 57

Screenshot 2022-04-19 at 20 06 03

Screenshot 2022-04-19 at 20 06 09

@rafaelreinert
Copy link
Contributor

I am thinking about that, maybe the best solution is to remove the active_revision for all metrics (as you have done) and create another metric (gauge) last_active_revision with the active_revision as label and reset it each update.
With this solution, we can have the active_revision from every bundle and we eliminate the other metrics' high cardinality.

@srenatus
Copy link
Contributor

srenatus commented Apr 20, 2022

@rafaelreinert that sounds reasonable. @costimuraru what do you think?

@rafaelreinert would you be able to pick up making this change? 😃 (I'll take care of it if it's too much on your plate right now.)

@costimuraru
Copy link
Contributor Author

Thanks @srenatus.
I've incorporated the feedback from @rafaelreinert and opened #4600. I'm gonna test it today and tomorrow in our env and see how it works. (I plan to use Grafana annotations with the newly added metrics that @rafaelreinert suggested and see how it behaves).

srenatus pushed a commit that referenced this issue Apr 26, 2022
Having one activeRevision label on each of the prometheus metrics emitted
by the status plugin has proven to be problematic with a large number of
bundles. So with this change,

1. we keep the activeRevision label (just on) the last_success_bundle_activation metric.
2. the gauge gets reset, so we only keep the last active_revision (instead of keeping
   them all and therefore avoiding the situation where the /metrics output grows indefinitely)

Fixes #4584.

Signed-off-by: cmuraru <cmuraru@adobe.com>
rokkiter pushed a commit to rokkiter/opa that referenced this issue Apr 26, 2022
…y-agent#4600)

Having one activeRevision label on each of the prometheus metrics emitted
by the status plugin has proven to be problematic with a large number of
bundles. So with this change,

1. we keep the activeRevision label (just on) the last_success_bundle_activation metric.
2. the gauge gets reset, so we only keep the last active_revision (instead of keeping
   them all and therefore avoiding the situation where the /metrics output grows indefinitely)

Fixes open-policy-agent#4584.

Signed-off-by: cmuraru <cmuraru@adobe.com>
damienjburks added a commit to damienjburks/opa that referenced this issue May 18, 2022
# This is the 1st commit message:

finalizing changes for formatting with sprintf

Signed-off-by: Damien Burks <damien@damienjburks.com>

# This is the commit message open-policy-agent#2:

updating changes to allow for multiple format strings

Signed-off-by: Damien Burks <damien@damienjburks.com>

# This is the commit message open-policy-agent#3:

fixing golint issues

Signed-off-by: Damien Burks <damien@damienjburks.com>

# This is the commit message open-policy-agent#4:

fixing golint issues

Signed-off-by: Damien Burks <damien@damienjburks.com>

# This is the commit message open-policy-agent#5:

making recommended change: package level variable

Signed-off-by: Damien Burks <damien@damienjburks.com>

# This is the commit message open-policy-agent#6:

adding support for explicit argument indexes

Signed-off-by: Damien Burks <damien@damienjburks.com>

# This is the commit message open-policy-agent#7:

format: don't add 'in' keyword import when 'every' is there (open-policy-agent#4607)

Also ensure that added imports have a location set.

Previously, `opa fmt` on the added test file would have panicked
because the import hadn't had a location.

Fixes open-policy-agent#4606.

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
# This is the commit message open-policy-agent#8:

ast+topdown+planner: allow for mocking built-in functions via "with" (open-policy-agent#4540)

With this change, we can replace calls to built-in functions via `with`. The replacement
can either be a value -- which will be used as the return value for every call to the
mocked built-in -- or a reference to a non-built-in function -- when the results need
to depend on the call's arguments.

Compiler, topdown, and planner have been adapted in this change. The included
docs changes describe the replacement options further.

Fixes first part of open-policy-agent#4449. (Missing are non-built-in functions as mock targets.)

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
# This is the commit message open-policy-agent#9:

build(deps): bump google.golang.org/grpc from 1.45.0 to 1.46.0 (open-policy-agent#4617)


# This is the commit message open-policy-agent#10:

docs/policy-testing: use assignment operator in mocks (open-policy-agent#4618)

Additionally, simplify one test example.

Signed-off-by: Anders Eknert <anders@eknert.com>
# This is the commit message open-policy-agent#11:

cmd/capabilities: expose capabilities through CLI (open-policy-agent#4588)

There is a new command argument "capabilities". With this, it is
possible to print the current capabilities version, show all
capabilities versions & print any capabilities version, without the need
of a file. Moreover, for the other commands which use the --capabilities
flag, it is possible to give only the version number, without specifying
a file. However, there are no breaking changes for those who use the
capabilities file as an input for the flag. Unit tests were also
written, in order to test the new argument and the changes made in ast.

Fixes: open-policy-agent#4236

Signed-off-by: IoannisMatzaris <matzarisioannis@gmail.com>
# This is the commit message open-policy-agent#12:

format,eval: don't use source locations when formatting PE output (open-policy-agent#4611)

* format: allow ignoreing source locations
* cmd/eval: format disregarding source locations for partial result

Before, we'd see this output:
```
$ opa eval -p -fsource 'time.clock(input.x)==time.clock(input.y)'
# Query 1
time.clock(time.clock(input.x), input.y)
```

Now, we get the proper answer: `time.clock(input.y, time.clock(input.x))`.

Note that it's a _display_ issue; the JSON output of PE has not been affected.

Fixes open-policy-agent#4609.

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
# This is the commit message open-policy-agent#13:

build(deps): bump github/codeql-action from 1 to 2 (open-policy-agent#4621)

Bumps [github/codeql-action](https://github.com/github/codeql-action) from 1 to 2.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@v1...v2)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# This is the commit message open-policy-agent#14:

status: Remove activeRevision label on all but one metric (open-policy-agent#4600)

Having one activeRevision label on each of the prometheus metrics emitted
by the status plugin has proven to be problematic with a large number of
bundles. So with this change,

1. we keep the activeRevision label (just on) the last_success_bundle_activation metric.
2. the gauge gets reset, so we only keep the last active_revision (instead of keeping
   them all and therefore avoiding the situation where the /metrics output grows indefinitely)

Fixes open-policy-agent#4584.

Signed-off-by: cmuraru <cmuraru@adobe.com>
# This is the commit message open-policy-agent#15:

website: add playground button to navbar (open-policy-agent#4622)

Addressing one tiny bit of open-policy-agent#4614.

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
# This is the commit message open-policy-agent#16:

topdown/net: require prefix length for IPv6 in net.cidr_merge (open-policy-agent#4613)

There are no default prefixes in IPv6, so if an IPv6 without a prefix is fed into
net.cidr_merge, we'll return a non-halt error now.

Before, we'd fail in various ways if a prefix-less IPv6 was fed into
`net.cidr_merge`. With only one, we'd return `[ "<nil>" ]`, with two,
we'd panic.

Fixes open-policy-agent#4596.

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
# This is the commit message open-policy-agent#17:

Dockerfile: add source annotation (open-policy-agent#4626)

`org.opencontainers.image.source` URL to get source code for building the image (string)

https://github.com/opencontainers/image-spec/blob/main/annotations.md

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
# This is the commit message open-policy-agent#18:

build(deps): bump github.com/fsnotify/fsnotify v1.5.2 -> v1.5.4 (open-policy-agent#4628)

https://github.com/fsnotify/fsnotify/releases/tag/v1.5.4

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
# This is the commit message open-policy-agent#19:

docs: update version in kubernetes examples (open-policy-agent#4627)

Signed-off-by: yongen.pan <yongen.pan@daocloud.io>
# This is the commit message open-policy-agent#20:

bundle/status: Include bundle type in status information

OPA has support for Delta Bundles. The status object already
contains valuable information such as last activation timestamp but
does not specify if the bundle was a canonical snapshot or delta.

This change updates the bundle.Status object to include the
bundle type string: either "snapshot" or "delta". This can be useful
for status endpoints to differentiate between the bundle types.

Issue: 4477

Signed-off-by: Bryan Fulton <bryan@styra.com>

# This is the commit message open-policy-agent#21:

ast+topdown+planner: replacement of non-built-in functions via 'with' (open-policy-agent#4616)

Follow-up to open-policy-agent#4540

We can now mock functions that are user-defined:

    package test

    f(_) = 1 {
        input.x = "x"
    }
    p = y {
        y := f(1) with f as 2
    }

...following the same scoping rules as laid out for built-in mocks.
The replacement can be a value (replacing all calls), or a built-in,
or another non-built-in function.

Also addresses bugs in the previous slice:
* topdown/evalCall: account for empty rules result from indexer
* topdown/eval: capture value replacement in PE could panic

Note: in PE, we now drop 'with' for function mocks of any kind:

These are always fully replaced in the saved support modules, so
this should be OK.

When keeping them, we'd also have to either copy the existing definitions
into the support module; or create a function stub in it.

Fixes open-policy-agent#4449.

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
# This is the commit message open-policy-agent#22:

format: keep whitespaces for multiple indented same-line withs (open-policy-agent#4635)

Fixes open-policy-agent#4634.

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
# This is the commit message open-policy-agent#23:

downloader: support for downloading bundles from an OCI registry (open-policy-agent#4558)

Initial support for open-policy-agent#4518.

Configuration uses the 'services' config for registries, via the "type: oci" field.
Bundles configured to pull from that service will then use OCI.

```
services:
  ghcr-registry:
    url: https://ghcr.io
    type: oci
bundles:
  authz:
    service: ghcr-registry
    resource: ghcr.io/${ORGANIZATION}/${REPOSITORY}:${TAG}
    persist: true
    polling:
      min_delay_seconds: 60
      max_delay_seconds: 120
persistence_directory: ${PERSISTENCE_PATH}
```

Service credentials are supported: if you want to pull from a private registry,
use
```
services:
  ghcr-registry:
    url: https://ghcr.io
    type: oci
    credentials:
      bearer:
        token: ${GH_PAT}
```

If no `persistence_directory` is configured, the data is stored in a directory under /tmp.

See docs/devel/OCI.md for manual steps to test this feature with some
OCI registry (like ghcr.io).

Signed-off-by: carabasdaniel <dani@aserto.com>
# This is the commit message open-policy-agent#24:

Prepare v0.40.0 Release (open-policy-agent#4631)

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
# This is the commit message open-policy-agent#25:

Prepare v0.41.0 development (open-policy-agent#4636)

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
# This is the commit message open-policy-agent#26:

docs: Adding example for `rego.metadata.role()` usage (open-policy-agent#4640)

Signed-off-by: Johan Fylling <johan.dev@fylling.se>
# This is the commit message open-policy-agent#27:

build(deps): bump oras.land/oras-go from 1.1.0 to 1.1.1 (open-policy-agent#4643)

Bumps [oras.land/oras-go](https://github.com/oras-project/oras-go) from 1.1.0 to 1.1.1.
- [Release notes](https://github.com/oras-project/oras-go/releases)
- [Commits](oras-project/oras-go@v1.1.0...v1.1.1)

---
updated-dependencies:
- dependency-name: oras.land/oras-go
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# This is the commit message open-policy-agent#28:

build(deps): bump OpenTelemetry 1.6.3 -> 1.7.0 (open-policy-agent#4649)

https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.7.0
https://github.com/open-telemetry/opentelemetry-go-contrib/releases/tag/v1.7.0

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
# This is the commit message open-policy-agent#29:

build(deps): bump github.com/containerd/containerd from 1.6.2 to 1.6.3 (open-policy-agent#4654)

Bumps [github.com/containerd/containerd](https://github.com/containerd/containerd) from 1.6.2 to 1.6.3.
- [Release notes](https://github.com/containerd/containerd/releases)
- [Changelog](https://github.com/containerd/containerd/blob/main/RELEASES.md)
- [Commits](containerd/containerd@v1.6.2...v1.6.3)

---
updated-dependencies:
- dependency-name: github.com/containerd/containerd
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# This is the commit message open-policy-agent#30:

Update k8s examples to the latest schema (open-policy-agent#4655)

Signed-off-by: Víctor Martínez Bevià <vicmarbev@gmail.com>
# This is the commit message open-policy-agent#31:

Fix incorrect padding claims (open-policy-agent#4657)

Signed-off-by: Anders Eknert <anders@eknert.com>
# This is the commit message open-policy-agent#32:

build(deps): bump github.com/containerd/containerd from 1.6.3 to 1.6.4 (open-policy-agent#4662)

Bumps [github.com/containerd/containerd](https://github.com/containerd/containerd) from 1.6.3 to 1.6.4.
- [Release notes](https://github.com/containerd/containerd/releases)
- [Changelog](https://github.com/containerd/containerd/blob/main/RELEASES.md)
- [Commits](containerd/containerd@v1.6.3...v1.6.4)

---
updated-dependencies:
- dependency-name: github.com/containerd/containerd
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# This is the commit message open-policy-agent#33:

build(deps): bump docker/setup-qemu-action from 1 to 2 (open-policy-agent#4668)


# This is the commit message open-policy-agent#34:

build(deps): bump docker/setup-buildx-action from 1 to 2 (open-policy-agent#4669)

Bumps [docker/setup-buildx-action](https://github.com/docker/setup-buildx-action) from 1 to 2.
- [Release notes](https://github.com/docker/setup-buildx-action/releases)
- [Commits](docker/setup-buildx-action@v1...v2)

---
updated-dependencies:
- dependency-name: docker/setup-buildx-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# This is the commit message open-policy-agent#35:

build(deps): github.com/bytecodealliance/wasmtime-go 0.35.0 -> 0.36.0 (open-policy-agent#4652)

* build(deps): bump wasmtime-go: 0.35.0 -> 0.36.0
* internal/wasm: adapt to using epoch-based interruption

Looks like we don't get frames for this.

Also, there is currentlty no better way than comparing the message,
as the trap code isn't surfaced (yet).

Fixes open-policy-agent#4663.

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
# This is the commit message open-policy-agent#36:

ecosystem: Add Sansshell (open-policy-agent#4674)


Signed-off-by: James Chacon <james.chacon@snowflake.com>
# This is the commit message open-policy-agent#37:

topdown: Add units.parse builtin (open-policy-agent#4676)

This function works on all base decimal and binary SI units of the set:

    m, K/Ki, M/Mi, G/Gi, T/Ti, P/Pi, and E/Ei

Note: Unlike `units.parse_bytes`, this function is case sensitive.

Fixes open-policy-agent#1802.

Signed-off-by: Philip Conrad <philipaconrad@gmail.com>
# This is the commit message open-policy-agent#38:

docs/contrib-code: Add capabilities step to built-in functions tutorial (open-policy-agent#4677)

Signed-off-by: Philip Conrad <philipaconrad@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants