Bug Description
Example nrpe alert rule:
name: CheckVaultHealthNrpeAlert
expr: avg_over_time(command_status{command="check_vault_health",juju_unit="vault/0"}[15m]) > 1 or (absent_over_time(command_status{command="check_vault_health",juju_unit="vault/0"}[10m]) == 1) or (absent_over_time(up{juju_unit="vault/0"}[10m]) == 1)
labels:
  juju_application: vault
  juju_model: kubernetes-dev
  juju_unit: vault/0
  nrpe_application: nrpe
  nrpe_unit: nrpe/8
  severity: {{ if eq $value 0.0 -}} info {{- else if eq $value 1.0 -}} warning {{- else if eq $value 2.0 -}} critical {{- else if eq $value 3.0 -}} error {{- end }}
annotations:
  description: Check provided by nrpe_exporter in model {{ $labels.juju_model }} is failing.
    Failing check = {{ $labels.command }}
    Unit = {{ $labels.juju_unit }}
    Value = {{ $value }}
    Legend:
    - StatusOK = 0
    - StatusWarning = 1
    - StatusCritical = 2
    - StatusUnknown = 3
  summary: Unit {{ $labels.juju_unit }}: {{ $labels.command }} {{ $labels.severity }}.
This alert rule will fire for vault/0 regardless of which model the unit is failing in, because the expression selects only on command and juju_unit.
We have multiple models, e.g. kubernetes-prod, which also has vault/0 deployed. In one incident vault/0 failed in kubernetes-dev, but because these generic alert rules were created for both -dev and -prod, both rules fired, resulting in:
- multiple alerts for a single failed unit
- each alert carrying its own juju_model label, which did not match the problematic unit; instead it was the model of the cos-proxy that created the alert rule
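The duplication can be reproduced with a small Python simulation. This is only a sketch of the label-matching behaviour, not Prometheus or cos-proxy code: the `source_model` field is hypothetical bookkeeping to track where a series really came from, and a single sample value stands in for the `avg_over_time(...[15m]) > 1` condition.

```python
# Hypothetical model of the situation: one failing series from kubernetes-dev
# and one healthy series from kubernetes-prod, both for juju_unit="vault/0".
series = [
    {"labels": {"command": "check_vault_health", "juju_unit": "vault/0"},
     "value": 2.0, "source_model": "kubernetes-dev"},   # StatusCritical
    {"labels": {"command": "check_vault_health", "juju_unit": "vault/0"},
     "value": 0.0, "source_model": "kubernetes-prod"},  # StatusOK
]

# Each cos-proxy generates the same selector; only the static rule labels differ.
rules = [
    {"selector": {"command": "check_vault_health", "juju_unit": "vault/0"},
     "static_labels": {"juju_model": "kubernetes-dev"}},
    {"selector": {"command": "check_vault_health", "juju_unit": "vault/0"},
     "static_labels": {"juju_model": "kubernetes-prod"}},
]


def matches(selector, labels):
    """Return True if every selector label equals the series label."""
    return all(labels.get(key) == value for key, value in selector.items())


def firing_alerts(rules, series, threshold=1.0):
    """Evaluate every rule against every series and collect firing alerts."""
    alerts = []
    for rule in rules:
        for s in series:
            if matches(rule["selector"], s["labels"]) and s["value"] > threshold:
                # Static rule labels are attached to the alert, so the alert
                # reports the rule's model, not the model the series came from.
                alert = {**s["labels"], **rule["static_labels"]}
                alert["source_model"] = s["source_model"]
                alerts.append(alert)
    return alerts


alerts = firing_alerts(rules, series)
for alert in alerts:
    print(alert["juju_model"], "<- series from", alert["source_model"])
```

Both rules match the single failing series, so two alerts fire, and one of them is labelled with a model in which nothing is actually failing.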
To Reproduce
model A:
vault/0 <-> nrpe <-> cos-proxy <---crm relation---> single COS prometheus
model B:
vault/0 <-> nrpe <-> cos-proxy <---crm relation---> the same COS prometheus
Vault is just an example; all NRPE alerts are affected in this kind of topology.
You'll see two identical NRPE rules, one created for each model. Even if only one unit in one model fails, the alerts for both models fire: one with a "sensible" juju_model label, and the other with the label of the wrong model.
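One possible direction, sketched below in Python, is to scope each generated expression to the model it belongs to. This is not how cos-proxy builds its rules; the `scoped_expr` helper is hypothetical, and it assumes the scraped `command_status` and `up` series actually carry a `juju_model` label, which this report does not confirm.

```python
def scoped_expr(command, unit, model):
    """Build the NRPE alert expression with a juju_model matcher on every selector.

    Hypothetical helper: assumes the scraped series carry a juju_model label.
    """
    check = f'command="{command}",juju_unit="{unit}",juju_model="{model}"'
    target = f'juju_unit="{unit}",juju_model="{model}"'
    return (
        f"avg_over_time(command_status{{{check}}}[15m]) > 1"
        f" or (absent_over_time(command_status{{{check}}}[10m]) == 1)"
        f" or (absent_over_time(up{{{target}}}[10m]) == 1)"
    )


dev_expr = scoped_expr("check_vault_health", "vault/0", "kubernetes-dev")
prod_expr = scoped_expr("check_vault_health", "vault/0", "kubernetes-prod")
print(dev_expr)
```

With the extra matcher the two rules are no longer identical, so a failure of vault/0 in kubernetes-dev would only satisfy the rule generated for kubernetes-dev.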
Environment
cos-proxy latest/stable 92
Relevant log output
n/a - it's a design issue
Additional context
No response