[kube-prometheus-stack] Unexpected alerts firing - Why? #2720

Closed
jgagnon44 opened this issue Nov 23, 2022 · 6 comments · Fixed by #4891

jgagnon44 commented Nov 23, 2022

I have a 4-node K8s cluster set up via kubeadm on local VMs. I am using the following:

  • Kubernetes 1.24
  • Helm 3.10.0
  • kube-prometheus-stack Helm chart 41.7.4 (app version 0.60.1)

When I go into either Prometheus or Alertmanager, there are many alerts that are always firing. Another thing to note is that Alertmanager "cluster status" is reporting as "disabled". Not sure what bearing (if any) that may have on this. I have not added any new alerts of my own - everything was presumably deployed with the Helm chart.

I do not understand why these alerts are firing beyond what I can glean from their names. It does not seem like a good thing that they should be firing: either there is something seriously wrong with the cluster, or something is misconfigured in the alerting setup of the Helm chart. I'm leaning toward the second case, but will admit I really don't know.

Here is a listing of the firing alerts, along with label info:

etcdMembersDown
    alertname=etcdMembersDown, job=kube-etcd, namespace=kube-system, pod=etcd-gagnon-m1, service=prometheus-stack-kube-prom-kube-etcd, severity=critical
etcdInsufficientMembers
    alertname=etcdInsufficientMembers, endpoint=http-metrics, job=kube-etcd, namespace=kube-system, pod=etcd-gagnon-m1, service=prometheus-stack-kube-prom-kube-etcd, severity=critical
TargetDown
    alertname=TargetDown, job=kube-scheduler, namespace=kube-system, service=prometheus-stack-kube-prom-kube-scheduler, severity=warning
    alertname=TargetDown, job=kube-etcd, namespace=kube-system, service=prometheus-stack-kube-prom-kube-etcd, severity=warning
    alertname=TargetDown, job=kube-proxy, namespace=kube-system, service=prometheus-stack-kube-prom-kube-proxy, severity=warning
    alertname=TargetDown, job=kube-controller-manager, namespace=kube-system, service=prometheus-stack-kube-prom-kube-controller-manager, severity=warning
KubePodNotReady
    alertname=KubePodNotReady, namespace=monitoring, pod=prometheus-stack-grafana-759774797c-r44sb, severity=warning
KubeDeploymentReplicasMismatch
    alertname=KubeDeploymentReplicasMismatch, container=kube-state-metrics, deployment=prometheus-stack-grafana, endpoint=http, instance=192.168.42.19:8080, job=kube-state-metrics, namespace=monitoring, pod=prometheus-stack-kube-state-metrics-848f74474d-gp6pw, service=prometheus-stack-kube-state-metrics, severity=warning
KubeControllerManagerDown
    alertname=KubeControllerManagerDown, severity=critical
KubeProxyDown
    alertname=KubeProxyDown, severity=critical
KubeSchedulerDown
    alertname=KubeSchedulerDown, severity=critical
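
For context, on a cluster bootstrapped with kubeadm the kube-scheduler, kube-controller-manager, etcd and kube-proxy metrics endpoints listen only on 127.0.0.1 by default, so a Prometheus pod cannot reach them and those four targets typically show up as down; the Grafana-related KubePodNotReady / KubeDeploymentReplicasMismatch alerts look like a separate problem (the Grafana pod itself not becoming Ready). A minimal sketch of the bind-address changes that are usually needed, assuming the kubeadm v1beta3 config format and default ports (check the flag names against your kubeadm version):

# Sketch only: expose the control-plane metrics endpoints beyond 127.0.0.1.
# On an existing cluster the same flags can be edited directly in the static
# pod manifests under /etc/kubernetes/manifests/ on each control-plane node.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    bind-address: "0.0.0.0"   # kube-controller-manager metrics on :10257
scheduler:
  extraArgs:
    bind-address: "0.0.0.0"   # kube-scheduler metrics on :10259
etcd:
  local:
    extraArgs:
      listen-metrics-urls: "http://0.0.0.0:2381"   # plain-HTTP etcd metrics

kube-proxy is configured separately, via the kube-proxy ConfigMap in kube-system (restart its DaemonSet pods after editing):

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
metricsBindAddress: "0.0.0.0:10249"

Binding to 0.0.0.0 exposes these ports beyond localhost, so restrict access with a firewall or NetworkPolicy as appropriate.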

Here is my values.yaml:

defaultRules:
  create: true
  rules:
    alertmanager: true
    etcd: true
    configReloaders: true
    general: true
    k8s: true
    kubeApiserverAvailability: true
    kubeApiserverBurnrate: true
    kubeApiserverHistogram: true
    kubeApiserverSlos: true
    kubeControllerManager: true
    kubelet: true
    kubeProxy: true
    kubePrometheusGeneral: true
    kubePrometheusNodeRecording: true
    kubernetesApps: true
    kubernetesResources: true
    kubernetesStorage: true
    kubernetesSystem: true
    kubeSchedulerAlerting: true
    kubeSchedulerRecording: true
    kubeStateMetrics: true
    network: true
    node: true
    nodeExporterAlerting: true
    nodeExporterRecording: true
    prometheus: true
    prometheusOperator: true

prometheus:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - prometheus.<hidden>
    paths:
      - /
    pathType: ImplementationSpecific

grafana:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.<hidden>
    path: /
  persistence:
    enabled: true
    size: 10Gi

alertmanager:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - alerts.<hidden>
    paths:
      - /
    pathType: ImplementationSpecific
  config:
    global:
      slack_api_url: '<hidden>'
    route:
      receiver: "slack-default"
      group_by:
        - alertname
        - cluster
        - service
      group_wait: 30s
      group_interval: 5m # 5m
      repeat_interval: 2h # 4h
      routes:
        - receiver: "slack-warn-critical"
          matchers:
            - severity =~ "warning|critical"
          continue: true
    receivers:
      - name: "null"
      - name: "slack-default"
        slack_configs:
          - send_resolved: true # false
            channel: "#alerts-test"
      - name: "slack-warn-critical"
        slack_configs:
          - send_resolved: true # false
            channel: "#alerts-test"

  kubeControllerManager:
    service:
      enabled: true
      ports:
        http: 10257
      targetPorts:
        http: 10257
    serviceMonitor:
      https: true
      insecureSkipVerify: "true"

  kubeEtcd:
    serviceMonitor:
      scheme: https
      servername: <do I need it - don't know what this should be>
      cafile: <do I need it - don't know what this should be>
      certFile: <do I need it - don't know what this should be>
      keyFile: <do I need it - don't know what this should be>

  kubeProxy:
    serviceMonitor:
      https: true

  kubeScheduler:
    service:
      enabled: true
      ports:
        http: 10259
      targetPorts:
        http: 10259
    serviceMonitor:
      https: true
      insecureSkipVerify: "true"
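
One thing worth noting about the values as posted: the kubeControllerManager, kubeEtcd, kubeProxy and kubeScheduler sections are indented under alertmanager, but in the chart's own values.yaml they are top-level keys, so nested there they would simply be ignored. If the indentation above reflects the real file, the intended shape is roughly this (values abbreviated):

alertmanager:
  enabled: true
  # ... ingress and config as above ...

kubeControllerManager:
  service:
    enabled: true
    ports:
      http: 10257
    targetPorts:
      http: 10257
  serviceMonitor:
    https: true
    insecureSkipVerify: true

kubeScheduler:
  service:
    enabled: true
    ports:
      http: 10259
    targetPorts:
      http: 10259
  serviceMonitor:
    https: true
    insecureSkipVerify: true

kubeProxy:
  serviceMonitor:
    https: true

kubeEtcd:
  serviceMonitor:
    scheme: https
    # ... see the TLS sketch further down ...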

Is there something wrong with this configuration? Are there any Kubernetes objects that might be missing or misconfigured? It seems very odd that one could install this Helm chart and experience this many "failures". Is there perhaps a major problem with my cluster? I would think that if there were really something wrong with etcd, the kube-scheduler, or kube-proxy, I would be experiencing problems everywhere, but I am not.

If there is any other information I can pull from the cluster or related artifacts that might help, let me know and I will include them.

Here are some examples of the alerts:

[screenshots: examples of the firing alerts]

Here's another interesting piece of the picture. I opened Prometheus and went to the Targets tab; below is an example of what I found, and all of the unhealthy targets show this same type of problem.

It looks like a security issue; probably certificate information is missing. If that is true, how do I fix it?

[screenshot: an unhealthy target on the Prometheus Targets tab]
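
Regarding the kubeEtcd placeholders above and the failing targets: when etcd metrics are scraped over HTTPS, etcd requires mutual TLS, so Prometheus needs the etcd CA plus a client certificate and key. A sketch of how this is commonly wired up with this chart, assuming a secret named etcd-client-cert (a made-up name) created in the monitoring namespace from the kubeadm files /etc/kubernetes/pki/etcd/ca.crt, healthcheck-client.crt and healthcheck-client.key; note the camelCase key names (serverName, caFile, ...), which is how the chart's values.yaml spells them:

kubeEtcd:
  serviceMonitor:
    scheme: https
    insecureSkipVerify: false
    serverName: localhost   # must match a SAN on the etcd serving cert, or set insecureSkipVerify: true instead
    caFile: /etc/prometheus/secrets/etcd-client-cert/ca.crt
    certFile: /etc/prometheus/secrets/etcd-client-cert/healthcheck-client.crt
    keyFile: /etc/prometheus/secrets/etcd-client-cert/healthcheck-client.key

prometheus:
  prometheusSpec:
    secrets:
      # the operator mounts each listed secret at /etc/prometheus/secrets/<secret-name>/
      - etcd-client-cert

The simpler alternative on kubeadm is to expose etcd's plain-HTTP metrics port instead (listen-metrics-urls on 0.0.0.0:2381, as sketched earlier) and leave the kubeEtcd section at its defaults, which in recent chart versions scrape port 2381 over HTTP and need no client certificates.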

@jgagnon44 jgagnon44 changed the title Prometheus/Alertmanager - unexpected alerts firing - Why? [kube-prometheus-stack] - unexpected alerts firing - Why? Nov 23, 2022
@jgagnon44 jgagnon44 changed the title [kube-prometheus-stack] - unexpected alerts firing - Why? [kube-prometheus-stack] - Unexpected alerts firing - Why? Nov 23, 2022
@jgagnon44 jgagnon44 changed the title [kube-prometheus-stack] - Unexpected alerts firing - Why? [kube-prometheus-stack] Unexpected alerts firing - Why? Nov 23, 2022
stale bot commented Dec 23, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale bot commented Jan 16, 2023

This issue is being automatically closed due to inactivity.

@stale stale bot closed this as completed Jan 16, 2023
moxli commented Mar 16, 2023

@jgagnon44 did you manage to find a solution for this issue?

@Yashwant4u

Hi Team,
We are seeing the KubeClientCertificateExpiration alert firing even though the certificates are still valid; the cluster was set up on-prem with kubeadm. Even after renewing the kubeadm certificates the alert keeps firing, and on some masters it fires for the internal components' certificates before they are actually due to expire.

@elisaado

We are facing the same issue here, I will investigate when I have some free time

@sebastiangaiser
Contributor

@jkroepke I saw you were working in this area in #4460. Could you please reopen this issue? I think it is still valid and should be easy to solve by adding the pod label to the etcdInsufficientMembers alert, as described here: https://github.com/etcd-io/etcd/blob/1c22e7b36bc5d8543f1646212f2960f9fe503b8c/contrib/mixin/config.libsonnet#L13

I closed my previous PR because I recognized that the alerts are getting generated...
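
For reference, the change that was eventually merged (see the commit referenced below) effectively adds pod to the set of labels the generated etcd rules aggregate away, in line with the etcd_instance_labels hint in the linked config.libsonnet. Roughly, etcdInsufficientMembers then ends up shaped like this (a simplified sketch, not the verbatim rule the chart templates out):

# Simplified sketch, not the exact generated rule. The relevant part is the
# aggregation: dropping "pod" together with "instance" makes the rule count
# etcd members cluster-wide instead of per pod, so a single member being down
# no longer trips the quorum check by itself.
groups:
  - name: etcd
    rules:
      - alert: etcdInsufficientMembers
        expr: |
          sum without (instance, pod) (up{job=~".*etcd.*"} == bool 1)
            < ((count without (instance, pod) (up{job=~".*etcd.*"}) + 1) / 2)
        for: 3m
        labels:
          severity: critical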

jusch23 added a commit to jusch23/helm-charts that referenced this issue Sep 30, 2024
… alerts on downtime of one etcd member (prometheus-community#2720)

Signed-off-by: Julian Schreiner <20794518+jusch23@users.noreply.github.com>
jusch23 added a commit to jusch23/helm-charts that referenced this issue Oct 2, 2024
… alerts on downtime of one etcd member (prometheus-community#2720)

Signed-off-by: Julian Schreiner <20794518+jusch23@users.noreply.github.com>
@jkroepke jkroepke reopened this Oct 7, 2024
@stale stale bot removed the lifecycle/stale label Oct 7, 2024
QuentinBisson pushed a commit that referenced this issue Oct 8, 2024

* added "pod" prometheus label to etcd alerts to prevent false positive alerts on downtime of one etcd member (#2720)

Signed-off-by: Julian Schreiner <20794518+jusch23@users.noreply.github.com>

* update chart.yaml

Signed-off-by: Julian Schreiner <20794518+jusch23@users.noreply.github.com>

* added reference

Signed-off-by: Julian Schreiner <20794518+jusch23@users.noreply.github.com>

---------

Signed-off-by: Julian Schreiner <20794518+jusch23@users.noreply.github.com>