
Kube prometheus default alerts issue #4872

Open
vijaymailb opened this issue Sep 23, 2024 · 0 comments
Labels
bug Something isn't working

vijaymailb commented Sep 23, 2024

### Describe the bug

We are using kube-prometheus-stack version 61.3.2 with the default Prometheus rules for all of its components. Since we intend to use https://github.com/cloudflare/pint for linting and for identifying missing metrics, we found that many of the default Prometheus rules have linting issues.

### What's your helm version?

61.3.2

### What's your kubectl version?

1.28.11

### Which chart?

https://github.com/prometheus-community/helm-charts/edit/kube-prometheus-stack-61.3.2/

### What's the chart version?

61.3.2

### What happened?

The following default Prometheus rules have issues.

For example:

```
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterCrashlooping",owner="",problem="Template is using `job` label but the query removes it.",reporter="alerts/template",severity="bug"} 1
```

The flagged rule expression:

```yaml
  - alert: AlertmanagerClusterCrashlooping
    annotations:
      description: '{{ $value | humanizePercentage }} of Alertmanager instances within
        the {{$labels.job}} cluster have restarted at least 5 times in the last 10m.'
      runbook_url: https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerclustercrashlooping
      summary: Half or more of the Alertmanager instances within the same cluster
        are crashlooping.
    expr: |-
      (
        count by (namespace,service,cluster) (
          changes(process_start_time_seconds{job="prometheus-stack-alertmanager",namespace="namespace1"}[10m]) > 4
        )
      /
        count by (namespace,service,cluster) (
          up{job="prometheus-stack-alertmanager",namespace="namespace1"}
        )
      )
      >= 0.5
```

Further pint findings:

```
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-etcd-test-f4c8-4208-a6b8-57da78332911.yaml",kind="alerting",name="etcdHighNumberOfLeaderChanges",owner="",problem="Template is using `job` label but `absent()` is not passing it.",reporter="alerts/template",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterCrashlooping",owner="",problem="`prom` Prometheus server at http://localhost:9090 has `process_start_time_seconds` metric with `job` label but there are no series matching `{job=\"prometheus-stack-alertmanager\"}` in the last 1w.",reporter="promql/series",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterDown",owner="",problem="Template is using `job` label but the query removes it.",reporter="alerts/template",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterFailedToSendAlerts",owner="",problem="Template is using `job` label but the query removes it.",reporter="alerts/template",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterFailedToSendAlerts",owner="",problem="Unnecessary wildcard regexp, simply use `alertmanager_notifications_failed_total{job=\"prometheus-stack-alertmanager\", namespace=\"core-stack\", integration=\"\"}` if you want to match on all time series for `alertmanager_notifications_failed_total` without the `integration` label.",reporter="promql/regexp",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterFailedToSendAlerts",owner="",problem="Unnecessary wildcard regexp, simply use `alertmanager_notifications_failed_total{job=\"prometheus-stack-alertmanager\", namespace=\"core-stack\"}` if you want to match on all `integration` values.",reporter="promql/regexp",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterFailedToSendAlerts",owner="",problem="Unnecessary wildcard regexp, simply use `alertmanager_notifications_total{job=\"prometheus-stack-alertmanager\", namespace=\"core-stack\", integration=\"\"}` if you want to match on all time series for `alertmanager_notifications_total` without the `integration` label.",reporter="promql/regexp",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterFailedToSendAlerts",owner="",problem="Unnecessary wildcard regexp, simply use `alertmanager_notifications_total{job=\"prometheus-stack-alertmanager\", namespace=\"core-stack\"}` if you want to match on all `integration` values.",reporter="promql/regexp",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerConfigInconsistent",owner="",problem="Template is using `job` label but the query removes it.",reporter="alerts/template",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-node-exporter.rules-4a1ae679-a8b5-4bbf-9bfd-ed4f42728e9f.yaml",kind="recording",name="instance:node_load1_per_cpu:ratio",owner="",problem="This query will never return anything on `prom` Prometheus server at http://localhost:9090 because results from the right and the left hand side have different labels: `[container, endpoint, instance, job, namespace, node, pod, service]` != `[container, endpoint, instance, job, namespace, node, pod, receiver_opsgenie_admins, receiver_slack_cluster, service]`. Failing query: `node_load1{job=\"node-exporter\"} / instance:node_num_cpu:sum{job=\"node-exporter\"}`.",reporter="promql/vector_matching",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-kube-prometheus-node-recording.rules-7235730b-029a-4598-9d86-9729c424a8e2.yaml",kind="recording",name="cluster:node_cpu:ratio",owner="",problem="This query will never return anything on `prom` Prometheus server at http://localhost:9090 because results from the right and the left hand side have different labels: `[receiver_opsgenie_admins, receiver_slack_cluster]` != `[]`. Failing query: `cluster:node_cpu:sum_rate5m / count(sum by (instance, cpu) (node_cpu_seconds_total))`.",reporter="promql/vector_matching",severity="bug"} 1
```
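
For context on the first finding above ("Template is using `job` label but the query removes it"), here is a minimal sketch of one possible adjustment: keeping `job` in both aggregations so that `{{ $labels.job }}` in the description still resolves. This is an illustration only, not an upstream patch.

```yaml
# Sketch only: same alert as above, with `job` added to both by() clauses so
# the label survives aggregation and the description template can use it.
- alert: AlertmanagerClusterCrashlooping
  annotations:
    description: '{{ $value | humanizePercentage }} of Alertmanager instances within
      the {{ $labels.job }} cluster have restarted at least 5 times in the last 10m.'
    summary: Half or more of the Alertmanager instances within the same cluster are crashlooping.
  expr: |-
    (
      count by (namespace,service,cluster,job) (
        changes(process_start_time_seconds{job="prometheus-stack-alertmanager",namespace="namespace1"}[10m]) > 4
      )
    /
      count by (namespace,service,cluster,job) (
        up{job="prometheus-stack-alertmanager",namespace="namespace1"}
      )
    )
    >= 0.5
```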

Almost all alerts under `https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kubernetes-apps.yaml` are affected as well. For example:
```
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-kubernetes-apps-e679ee64-11ae-433d-820f-b5221857004e.yaml",kind="alerting",name="KubeStatefulSetUpdateNotRolledOut",owner="",problem="Unnecessary wildcard regexp, simply use `kube_statefulset_replicas{job=\"kube-state-metrics\"}` if you want to match on all `namespace` values.",reporter="promql/regexp",severity="bug"} 1
```

All of the above alerts have lint issues.
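
To make the `promql/regexp` findings concrete, a hedged before/after sketch (the recording-rule names below are hypothetical, not from the chart): with the default chart values the templated namespace matcher renders to a match-everything regexp, which pint reports as redundant.

```yaml
# Illustration only; rule names are hypothetical.
# Rendered by the chart with default values (namespace matcher becomes a wildcard):
- record: example:kube_statefulset_replicas:wildcard
  expr: kube_statefulset_replicas{job="kube-state-metrics",namespace=~".*"}
# Equivalent selector that pint suggests (drops the redundant regexp matcher):
- record: example:kube_statefulset_replicas:simplified
  expr: kube_statefulset_replicas{job="kube-state-metrics"}
```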

### What you expected to happen?

The default Prometheus rules need to be adjusted in order to get rid of these linting errors.
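
Until the upstream rules change, one possible workaround is to disable an affected bundled rule group and ship an adjusted copy. The sketch below assumes the usual kube-prometheus-stack values layout (`defaultRules.rules.<group>` and `additionalPrometheusRulesMap`); please verify the key names against the chart's own values.yaml.

```yaml
# values.yaml sketch (assumption, not part of the original report):
defaultRules:
  create: true
  rules:
    alertmanager: false          # disable the bundled alertmanager rule group
additionalPrometheusRulesMap:
  alertmanager-adjusted:         # hypothetical name for the replacement group
    groups:
      - name: alertmanager.rules
        rules: []                # adjusted copies of the affected alerts go here
```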

### How to reproduce it?

Run pint as a sidecar to Prometheus to surface the linting alerts; a sketch of one possible setup follows.
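
A hedged sketch of such a sidecar via the chart values. The image tag, flags and listen port are assumptions; check the pint documentation and `pint watch --help` for the exact invocation. The rule-files volume name and mount path mirror what the operator mounts into the Prometheus pod, as seen in the `pint_problem` filenames above.

```yaml
# Sketch only: inject pint in watch mode next to Prometheus so it exposes
# pint_problem metrics for the mounted rule files.
prometheus:
  prometheusSpec:
    containers:
      - name: pint
        image: ghcr.io/cloudflare/pint:latest   # assumed image location
        args:
          - watch
          - --listen=:8080
          - glob
          - /etc/prometheus/rules/*/*.yaml
        ports:
          - name: pint-metrics
            containerPort: 8080
        volumeMounts:
          - name: prometheus-prometheus-stack-prometheus-rulefiles-0
            mountPath: /etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0
            readOnly: true
```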

### Enter the changed values of values.yaml?

_No response_

### Enter the command that you execute and failing/misfunctioning.

There is no failing command as such; the default Prometheus rules shipped with the chart need to be adjusted.

### Anything else we need to know?

_No response_
vijaymailb added the bug label on Sep 23, 2024