[kube-prometheus-stack] Unexpected alerts firing - Why? #2720

Closed
jgagnon44 opened this issue Nov 23, 2022 · 6 comments · Fixed by #4891

jgagnon44 commented Nov 23, 2022

I have a 4-node K8s cluster set up via kubeadm on local VMs. I am using the following:

  • Kubernetes 1.24
  • Helm 3.10.0
  • kube-prometheus-stack Helm chart 41.7.4 (app version 0.60.1)

When I go into either Prometheus or Alertmanager, there are many alerts that are always firing. Another thing to note is that Alertmanager "cluster status" is reporting as "disabled". Not sure what bearing (if any) that may have on this. I have not added any new alerts of my own - everything was presumably deployed with the Helm chart.

I do not understand why these alerts are firing beyond what I can glean from their names. It does not seem like a good thing that they should be firing: either there is something seriously wrong with the cluster, or something is misconfigured in the alerting setup of the Helm chart. I'm leaning toward the second case, but will admit I really don't know.

Here is a listing of the firing alerts, along with label info:

etcdMembersDown
    alertname=etcdMembersDown, job=kube-etcd, namespace=kube-system, pod=etcd-gagnon-m1, service=prometheus-stack-kube-prom-kube-etcd, severity=critical
etcdInsufficientMembers
    alertname=etcdInsufficientMembers, endpoint=http-metrics, job=kube-etcd, namespace=kube-system, pod=etcd-gagnon-m1, service=prometheus-stack-kube-prom-kube-etcd, severity=critical
TargetDown
    alertname=TargetDown, job=kube-scheduler, namespace=kube-system, service=prometheus-stack-kube-prom-kube-scheduler, severity=warning
    alertname=TargetDown, job=kube-etcd, namespace=kube-system, service=prometheus-stack-kube-prom-kube-etcd, severity=warning
    alertname=TargetDown, job=kube-proxy, namespace=kube-system, service=prometheus-stack-kube-prom-kube-proxy, severity=warning
    alertname=TargetDown, job=kube-controller-manager, namespace=kube-system, service=prometheus-stack-kube-prom-kube-controller-manager, severity=warning
KubePodNotReady
    alertname=KubePodNotReady, namespace=monitoring, pod=prometheus-stack-grafana-759774797c-r44sb, severity=warning
KubeDeploymentReplicasMismatch
    alertname=KubeDeploymentReplicasMismatch, container=kube-state-metrics, deployment=prometheus-stack-grafana, endpoint=http, instance=192.168.42.19:8080, job=kube-state-metrics, namespace=monitoring, pod=prometheus-stack-kube-state-metrics-848f74474d-gp6pw, service=prometheus-stack-kube-state-metrics, severity=warning
KubeControllerManagerDown
    alertname=KubeControllerManagerDown, severity=critical
KubeProxyDown
    alertname=KubeProxyDown, severity=critical
KubeSchedulerDown
    alertname=KubeSchedulerDown, severity=critical
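
For context, on a cluster bootstrapped with kubeadm the kube-scheduler, kube-controller-manager, etcd and kube-proxy metrics endpoints listen only on 127.0.0.1 by default, so a Prometheus pod cannot reach them and those four targets typically show up as down; the Grafana-related KubePodNotReady / KubeDeploymentReplicasMismatch alerts look like a separate problem (the Grafana pod itself not becoming Ready). A minimal sketch of the bind-address changes that are usually needed, assuming the kubeadm v1beta3 config format and default ports (check the flag names against your kubeadm version):

# Sketch only: expose the control-plane metrics endpoints beyond 127.0.0.1.
# On an existing cluster the same flags can be edited directly in the static
# pod manifests under /etc/kubernetes/manifests/ on each control-plane node.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    bind-address: "0.0.0.0"   # kube-controller-manager metrics on :10257
scheduler:
  extraArgs:
    bind-address: "0.0.0.0"   # kube-scheduler metrics on :10259
etcd:
  local:
    extraArgs:
      listen-metrics-urls: "http://0.0.0.0:2381"   # plain-HTTP etcd metrics

kube-proxy is configured separately, via the kube-proxy ConfigMap in kube-system (restart its DaemonSet pods after editing):

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
metricsBindAddress: "0.0.0.0:10249"

Binding to 0.0.0.0 exposes these ports beyond localhost, so restrict access with a firewall or NetworkPolicy as appropriate.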

Here is my values.yaml:

defaultRules:
  create: true
  rules:
    alertmanager: true
    etcd: true
    configReloaders: true
    general: true
    k8s: true
    kubeApiserverAvailability: true
    kubeApiserverBurnrate: true
    kubeApiserverHistogram: true
    kubeApiserverSlos: true
    kubeControllerManager: true
    kubelet: true
    kubeProxy: true
    kubePrometheusGeneral: true
    kubePrometheusNodeRecording: true
    kubernetesApps: true
    kubernetesResources: true
    kubernetesStorage: true
    kubernetesSystem: true
    kubeSchedulerAlerting: true
    kubeSchedulerRecording: true
    kubeStateMetrics: true
    network: true
    node: true
    nodeExporterAlerting: true
    nodeExporterRecording: true
    prometheus: true
    prometheusOperator: true

prometheus:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - prometheus.<hidden>
    paths:
      - /
    pathType: ImplementationSpecific

grafana:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.<hidden>
    path: /
  persistence:
    enabled: true
    size: 10Gi

alertmanager:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - alerts.<hidden>
    paths:
      - /
    pathType: ImplementationSpecific
  config:
    global:
      slack_api_url: '<hidden>'
    route:
      receiver: "slack-default"
      group_by:
        - alertname
        - cluster
        - service
      group_wait: 30s
      group_interval: 5m # 5m
      repeat_interval: 2h # 4h
      routes:
        - receiver: "slack-warn-critical"
          matchers:
            - severity =~ "warning|critical"
          continue: true
    receivers:
      - name: "null"
      - name: "slack-default"
        slack_configs:
          - send_resolved: true # false
            channel: "#alerts-test"
      - name: "slack-warn-critical"
        slack_configs:
          - send_resolved: true # false
            channel: "#alerts-test"

  kubeControllerManager:
    service:
      enabled: true
      ports:
        http: 10257
      targetPorts:
        http: 10257
    serviceMonitor:
      https: true
      insecureSkipVerify: "true"

  kubeEtcd:
    serviceMonitor:
      scheme: https
      servername: <do I need it - don't know what this should be>
      cafile: <do I need it - don't know what this should be>
      certFile: <do I need it - don't know what this should be>
      keyFile: <do I need it - don't know what this should be>

  kubeProxy:
    serviceMonitor:
      https: true

  kubeScheduler:
    service:
      enabled: true
      ports:
        http: 10259
      targetPorts:
        http: 10259
    serviceMonitor:
      https: true
      insecureSkipVerify: "true"
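
One thing worth noting about the values as posted: the kubeControllerManager, kubeEtcd, kubeProxy and kubeScheduler sections are indented under alertmanager, but in the chart's own values.yaml they are top-level keys, so nested there they would simply be ignored. If the indentation above reflects the real file, the intended shape is roughly this (values abbreviated):

alertmanager:
  enabled: true
  # ... ingress and config as above ...

kubeControllerManager:
  service:
    enabled: true
    ports:
      http: 10257
    targetPorts:
      http: 10257
  serviceMonitor:
    https: true
    insecureSkipVerify: true

kubeScheduler:
  service:
    enabled: true
    ports:
      http: 10259
    targetPorts:
      http: 10259
  serviceMonitor:
    https: true
    insecureSkipVerify: true

kubeProxy:
  serviceMonitor:
    https: true

kubeEtcd:
  serviceMonitor:
    scheme: https
    # ... see the TLS sketch further down ...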

Is there something wrong with this configuration? Are there any Kubernetes objects that might be missing or misconfigured? It seems very odd that one could install this Helm chart and experience this many "failures". Is there perhaps a major problem with my cluster? I would think that if there were really something wrong with etcd, the kube-scheduler, or kube-proxy, I would be experiencing problems everywhere, but I am not.

If there is any other information I can pull from the cluster or related artifacts that might help, let me know and I will include them.

Here are some examples of the alerts:

[screenshots: examples of the firing alerts]

Here's another interesting piece of the picture. I opened Prometheus and went to the Targets tab; below is an example of what I found, and all of the unhealthy targets show this same type of problem.

It looks like a security issue; probably certificate information is missing. If that is true, how do I fix it?

[screenshot: an unhealthy target on the Prometheus Targets tab]
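
Regarding the kubeEtcd placeholders above and the failing targets: when etcd metrics are scraped over HTTPS, etcd requires mutual TLS, so Prometheus needs the etcd CA plus a client certificate and key. A sketch of how this is commonly wired up with this chart, assuming a secret named etcd-client-cert (a made-up name) created in the monitoring namespace from the kubeadm files /etc/kubernetes/pki/etcd/ca.crt, healthcheck-client.crt and healthcheck-client.key; note the camelCase key names (serverName, caFile, ...), which is how the chart's values.yaml spells them:

kubeEtcd:
  serviceMonitor:
    scheme: https
    insecureSkipVerify: false
    serverName: localhost   # must match a SAN on the etcd serving cert, or set insecureSkipVerify: true instead
    caFile: /etc/prometheus/secrets/etcd-client-cert/ca.crt
    certFile: /etc/prometheus/secrets/etcd-client-cert/healthcheck-client.crt
    keyFile: /etc/prometheus/secrets/etcd-client-cert/healthcheck-client.key

prometheus:
  prometheusSpec:
    secrets:
      # the operator mounts each listed secret at /etc/prometheus/secrets/<secret-name>/
      - etcd-client-cert

The simpler alternative on kubeadm is to expose etcd's plain-HTTP metrics port instead (listen-metrics-urls on 0.0.0.0:2381, as sketched earlier) and leave the kubeEtcd section at its defaults, which in recent chart versions scrape port 2381 over HTTP and need no client certificates.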

@jgagnon44 jgagnon44 changed the title Prometheus/Alertmanager - unexpected alerts firing - Why? [kube-prometheus-stack] - unexpected alerts firing - Why? Nov 23, 2022
@jgagnon44 jgagnon44 changed the title [kube-prometheus-stack] - unexpected alerts firing - Why? [kube-prometheus-stack] - Unexpected alerts firing - Why? Nov 23, 2022
@jgagnon44 jgagnon44 changed the title [kube-prometheus-stack] - Unexpected alerts firing - Why? [kube-prometheus-stack] Unexpected alerts firing - Why? Nov 23, 2022
stale bot commented Dec 23, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale bot commented Jan 16, 2023

This issue is being automatically closed due to inactivity.

@stale stale bot closed this as completed Jan 16, 2023
moxli commented Mar 16, 2023

@jgagnon44 did you manage to find a solution for this issue?

@Yashwant4u

Hi Team,
We are seeing the KubeClientCertificateExpiration alert firing even though the certificates are still valid; the cluster was set up on-prem with kubeadm. Even after renewing the kubeadm certificates the alert keeps firing, and on some masters it fires for the internal components' certificates before they are actually due to expire.

@elisaado

We are facing the same issue here, I will investigate when I have some free time

@sebastiangaiser
Contributor

@jkroepke I saw you were working in this area in #4460. Could you please reopen this issue? I think it is still valid and should be easy to solve by adding the pod label to the etcdInsufficientMembers alert, as described here: https://github.com/etcd-io/etcd/blob/1c22e7b36bc5d8543f1646212f2960f9fe503b8c/contrib/mixin/config.libsonnet#L13

I closed my previous PR because I recognized that the alerts are getting generated...
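
For reference, the change that was eventually merged (see the commit referenced below) effectively adds pod to the set of labels the generated etcd rules aggregate away, in line with the etcd_instance_labels hint in the linked config.libsonnet. Roughly, etcdInsufficientMembers then ends up shaped like this (a simplified sketch, not the verbatim rule the chart templates out):

# Simplified sketch, not the exact generated rule. The relevant part is the
# aggregation: dropping "pod" together with "instance" makes the rule count
# etcd members cluster-wide instead of per pod, so a single member being down
# no longer trips the quorum check by itself.
groups:
  - name: etcd
    rules:
      - alert: etcdInsufficientMembers
        expr: |
          sum without (instance, pod) (up{job=~".*etcd.*"} == bool 1)
            < ((count without (instance, pod) (up{job=~".*etcd.*"}) + 1) / 2)
        for: 3m
        labels:
          severity: critical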

jusch23 added a commit to jusch23/helm-charts that referenced this issue Sep 30, 2024
… alerts on downtime of one etcd member (prometheus-community#2720)

Signed-off-by: Julian Schreiner <20794518+jusch23@users.noreply.github.com>
jusch23 added a commit to jusch23/helm-charts that referenced this issue Oct 2, 2024
… alerts on downtime of one etcd member (prometheus-community#2720)

Signed-off-by: Julian Schreiner <20794518+jusch23@users.noreply.github.com>
@jkroepke jkroepke reopened this Oct 7, 2024
@stale stale bot removed the lifecycle/stale label Oct 7, 2024
QuentinBisson pushed a commit that referenced this issue Oct 8, 2024

* added "pod" prometheus label to etcd alerts to prevent false positive alerts on downtime of one etcd member (#2720)

Signed-off-by: Julian Schreiner <20794518+jusch23@users.noreply.github.com>

* update chart.yaml

Signed-off-by: Julian Schreiner <20794518+jusch23@users.noreply.github.com>

* added reference

Signed-off-by: Julian Schreiner <20794518+jusch23@users.noreply.github.com>

---------

Signed-off-by: Julian Schreiner <20794518+jusch23@users.noreply.github.com>