Missing juju_model in alert_expr causes confusing alerts #182

Open
przemeklal opened this issue Feb 5, 2025 · 0 comments

Bug Description

Example nrpe alert rule:

name: CheckVaultHealthNrpeAlert
expr: avg_over_time(command_status{command="check_vault_health",juju_unit="vault/0"}[15m]) > 1 or (absent_over_time(command_status{command="check_vault_health",juju_unit="vault/0"}[10m]) == 1) or (absent_over_time(up{juju_unit="vault/0"}[10m]) == 1)
labels:
  juju_application: vault
  juju_model: kubernetes-dev
  juju_unit: vault/0
  nrpe_application: nrpe
  nrpe_unit: nrpe/8
  severity: {{ if eq $value 0.0 -}} info {{- else if eq $value 1.0 -}} warning {{- else if eq $value 2.0 -}} critical {{- else if eq $value 3.0 -}} error {{- end }}
annotations:
  description: Check provided by nrpe_exporter in model {{ $labels.juju_model }} is failing.
    Failing check = {{ $labels.command }}
    Unit = {{ $labels.juju_unit }}
    Value = {{ $value }}
    Legend:
      - StatusOK       = 0
      - StatusWarning  = 1
      - StatusCritical = 2
      - StatusUnknown  = 3
  summary: Unit {{ $labels.juju_unit }}: {{ $labels.command }} {{ $labels.severity }}.

This alert rule fires for any unit named vault/0, regardless of which model the failing unit is in, because the expression selects on juju_unit but not on juju_model.
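One possible direction, sketched here in Python, is to add a juju_model matcher next to every juju_unit matcher in the expression. This is only an illustration: `scope_to_model` is a hypothetical helper and not part of cos-proxy, and it assumes the scraped series actually carry a juju_model label.

```python
import re


def scope_to_model(expr: str, model: str) -> str:
    """Insert a juju_model matcher before every juju_unit matcher in expr.

    Hypothetical helper for illustration only; it is plain string rewriting
    and does not parse PromQL.
    """
    return re.sub(
        r'juju_unit="[^"]*"',
        lambda m: f'juju_model="{model}",{m.group(0)}',
        expr,
    )


# Expression copied from the rule above.
expr = (
    'avg_over_time(command_status{command="check_vault_health",juju_unit="vault/0"}[15m]) > 1'
    ' or (absent_over_time(command_status{command="check_vault_health",juju_unit="vault/0"}[10m]) == 1)'
    ' or (absent_over_time(up{juju_unit="vault/0"}[10m]) == 1)'
)

scoped_expr = scope_to_model(expr, "kubernetes-dev")
print(scoped_expr)
```

With the extra matcher, each of the three selectors only matches series from kubernetes-dev, so a vault/0 unit in another model would no longer satisfy this rule.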

We run multiple models, e.g. kubernetes-prod, which also has a vault/0 unit deployed. In one incident, vault/0 failed in kubernetes-dev, but because these generic alert rules were created for both -dev and -prod, both rules fired. This resulted in:

  • multiple alerts for a single failed unit
  • each alert carrying its own juju_model label, which did not match the problematic unit; instead, it was the model where the cos-proxy that created the alert rule is located

To Reproduce

model A:
vault/0 <-> nrpe <-> cos-proxy <---cross-model relation---> single COS prometheus

model B:
vault/0 <-> nrpe <-> cos-proxy <---cross-model relation---> the same COS prometheus

Vault is just an example; all NRPE alerts are affected in this kind of topology.

You'll see two identical NRPE rules created, one per model. Even if only one unit in one model fails, alerts for both models will fire: one with a "sensible" juju_model label and the other with the label of the wrong model.
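The behaviour described above can be modelled with a small Python simulation. This is a simplified sketch, not Prometheus itself: it ignores avg_over_time and the absent_over_time branches, treats the rule as a plain threshold on the latest value, and assumes the series carry a juju_model label. `Rule`, `evaluate`, and `make_rules` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    matchers: dict  # label matchers taken from the rule's selector
    labels: dict    # static labels the rule attaches to its alerts


def evaluate(rule, series, threshold=1.0):
    """Return one alert (a label dict) per selected series above threshold."""
    alerts = []
    for series_labels, value in series:
        selected = all(series_labels.get(k) == v for k, v in rule.matchers.items())
        if selected and value > threshold:
            # Static rule labels overwrite conflicting series labels on the alert.
            alerts.append({**series_labels, **rule.labels})
    return alerts


def make_rules(scoped):
    """Build one rule per model; optionally add a juju_model matcher."""
    rules = []
    for model in ("kubernetes-dev", "kubernetes-prod"):
        matchers = {"command": "check_vault_health", "juju_unit": "vault/0"}
        if scoped:
            matchers["juju_model"] = model
        rules.append(Rule(matchers, {"juju_model": model}))
    return rules


# vault/0 is critical (2) in kubernetes-dev and healthy (0) in kubernetes-prod.
series = [
    ({"command": "check_vault_health", "juju_unit": "vault/0",
      "juju_model": "kubernetes-dev"}, 2.0),
    ({"command": "check_vault_health", "juju_unit": "vault/0",
      "juju_model": "kubernetes-prod"}, 0.0),
]

broad_alerts = [a for r in make_rules(scoped=False) for a in evaluate(r, series)]
scoped_alerts = [a for r in make_rules(scoped=True) for a in evaluate(r, series)]

print([a["juju_model"] for a in broad_alerts])
print([a["juju_model"] for a in scoped_alerts])
```

Without the model matcher, both rules select the failing kubernetes-dev series, and the kubernetes-prod rule relabels its alert with the wrong model. With the matcher, only the kubernetes-dev rule fires.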

Environment

cos-proxy latest/stable 92

Relevant log output

n/a - it's a design issue

Additional context

No response
