Missing `juju_model` in alert_expr causes confusing alerts #182

przemeklal · 2025-02-05T16:19:49Z

Bug Description

Example nrpe alert rule:

name: CheckVaultHealthNrpeAlert
expr: avg_over_time(command_status{command="check_vault_health",juju_unit="vault/0"}[15m]) > 1 or (absent_over_time(command_status{command="check_vault_health",juju_unit="vault/0"}[10m]) == 1) or (absent_over_time(up{juju_unit="vault/0"}[10m]) == 1)
labels:
  juju_application: vault
  juju_model: kubernetes-dev
  juju_unit: vault/0
  nrpe_application: nrpe
  nrpe_unit: nrpe/8
  severity: {{ if eq $value 0.0 -}} info {{- else if eq $value 1.0 -}} warning {{- else if eq $value 2.0 -}} critical {{- else if eq $value 3.0 -}} error {{- end }}
annotations:
  description: Check provided by nrpe_exporter in model {{ $labels.juju_model }} is failing.
    Failing check = {{ $labels.command }}
    Unit = {{ $labels.juju_unit }}
    Value = {{ $value }}
    Legend:
      - StatusOK       = 0
      - StatusWarning  = 1
      - StatusCritical = 2
      - StatusUnknown  = 3
  summary: Unit {{ $labels.juju_unit }}: {{ $labels.command }} {{ $labels.severity }}.

This alert rule fires for vault/0 regardless of which model the failing unit is in, because the selectors in expr match on juju_unit (and command) but not on juju_model.
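To illustrate what a model-scoped expression could look like, here is a minimal Python sketch that prepends a juju_model matcher to every selector in the example expr. The helper name add_model_matcher is hypothetical and the regex is deliberately naive (it only handles selectors that already contain at least one matcher, as in the rule above); it is not how cos-proxy builds its rules.

```python
import re


def add_model_matcher(expr: str, model: str) -> str:
    """Naively prepend a juju_model matcher to every non-empty `{...}` selector.

    Hypothetical helper for illustration only; a real implementation should
    parse PromQL instead of using a regex.
    """
    matcher = f'juju_model="{model}"'
    # Match an opening brace that is followed by at least one existing matcher.
    return re.sub(r"\{(?=[^}])", lambda _m: "{" + matcher + ",", expr)


# The expr from the example rule, split across lines for readability.
expr = (
    'avg_over_time(command_status{command="check_vault_health",juju_unit="vault/0"}[15m]) > 1'
    ' or (absent_over_time(command_status{command="check_vault_health",juju_unit="vault/0"}[10m]) == 1)'
    ' or (absent_over_time(up{juju_unit="vault/0"}[10m]) == 1)'
)

scoped = add_model_matcher(expr, "kubernetes-dev")
print(scoped)
```

With the matcher in place, each of the three selectors only considers series from kubernetes-dev, so a failing vault/0 in another model would not satisfy this rule.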

We have multiple models, e.g. kubernetes-prod, which also has a vault/0 unit deployed. In one case vault/0 failed in kubernetes-dev, but because these generic alert rules were created for both -dev and -prod, both rules fired, resulting in:

multiple alerts for one failed unit
each alert added its own juju_model label, but the label did not match the problematic unit -- instead it was the model where the cos-proxy that created the alert rule is located

To Reproduce

model A:
vault/0 <-> nrpe <-> cos-proxy <---crm relation---> single COS prometheus

model B:
vault/0 <-> nrpe <-> cos-proxy <---crm relation---> the same COS prometheus

Vault is just an example; all NRPE alerts are affected in this kind of topology.

You'll see identical NRPE rules created for both models, so even if only one unit in one model fails, the alerts for both models fire: one with a "sensible" juju_model label, and the other with the label of the wrong model.
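The behaviour can be sketched with a small Python simulation of label matching. It is a simplification under stated assumptions: the failing series is assumed to carry a juju_model label, the threshold part of the expr is assumed to be already satisfied, and the function names are hypothetical. It relies on the Prometheus behaviour that labels set on a rule overwrite conflicting labels from the series.

```python
def matches(selector, series):
    """True when every matcher in the selector equals the series label."""
    return all(series.get(name) == value for name, value in selector.items())


def evaluate(rule, series_list):
    """Return one alert label set per matching series.

    Rule labels overwrite conflicting series labels, as Prometheus does.
    """
    return [{**s, **rule["labels"]} for s in series_list if matches(rule["selector"], s)]


def make_rule(model, scoped):
    """Build the rule a cos-proxy in `model` would create (simplified)."""
    selector = {"command": "check_vault_health", "juju_unit": "vault/0"}
    if scoped:
        selector["juju_model"] = model  # the matcher missing from the current expr
    return {"selector": selector, "labels": {"juju_model": model, "juju_unit": "vault/0"}}


# Assumption: only the kubernetes-dev unit is failing, and its series
# carries a juju_model label.
failing = [{"command": "check_vault_health", "juju_unit": "vault/0", "juju_model": "kubernetes-dev"}]
models = ["kubernetes-dev", "kubernetes-prod"]

unscoped_alerts = [a for m in models for a in evaluate(make_rule(m, scoped=False), failing)]
scoped_alerts = [a for m in models for a in evaluate(make_rule(m, scoped=True), failing)]

print([a["juju_model"] for a in unscoped_alerts])
print([a["juju_model"] for a in scoped_alerts])
```

Without the juju_model matcher, both rules match the single failing series and the -prod rule relabels it as kubernetes-prod; with the matcher, only the -dev rule fires.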

Environment

cos-proxy latest/stable 92

Relevant log output

n/a - it's a design issue

Additional context

No response


przemeklal added Status: Triage Type: Bug labels Feb 5, 2025
