Some metrics are missing. #3

Closed
reefland opened this issue Jun 24, 2022 · 24 comments · Fixed by #16
Comments

@reefland
Contributor

Beautiful dashboards. Some of the panels show no data, and I've seen this before (Kubernetes LENS). Reviewing the JSON query, it references attributes or keys that are not included in the cAdvisor metrics that I have. For example, your Global dashboard:

[Screenshot: grafana_missing_metrics]

When I look at the CPU Utilization by namespace panel and inspect the JSON query, it is based on container_cpu_usage_seconds_total. When I look in my Prometheus, the series do not have an image= label; here is a random one that was at the top of the query:

container_cpu_usage_seconds_total{cpu="total", endpoint="https-metrics", id="/kubepods/besteffort/pod03202a32-75a1-4a64-8692-1e73fd26eca3", instance="192.168.10.217:10250", job="kubelet", metrics_path="/metrics/cadvisor", namespace="democratic-csi", node="k3s03", pod="democratic-csi-nfs-node-sqxp9", service="kube-prometheus-stack-kubelet"}

I'm using K3s based on Kubernetes 1.23 on bare metal with containerd, no docker runtime. I have no idea if this is a containerd, kubelet, or cAdvisor issue, or just expected as part of life when you don't use the docker runtime.

If you have any suggestions, it would be much appreciated.

@reefland
Contributor Author

reefland commented Jun 24, 2022

If I modify the JSON query to use pod!="" instead of image!="", it renders data, though I'm not sure it is equivalent to what you intended.

[Screenshot: grafana_new_query_metrics]
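Roughly, the change is this (a sketch, assuming the panel sums the CPU rate by namespace as in the queries further down, not the exact panel JSON):

# Original filter on the image label, which returns nothing on my setup:
sum(rate(container_cpu_usage_seconds_total{image!=""}[2m])) by (namespace)

# Same query filtered on the pod label instead, which renders data:
sum(rate(container_cpu_usage_seconds_total{pod!=""}[2m])) by (namespace)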

@reefland
Contributor Author

Lastly, I updated the four panels' JSON queries to use pod!="" and it looks good, but I feel my setup is missing something that would provide the image key.

[Screenshot: grafana_updated_query_metrics]

@dotdc dotdc self-assigned this Jun 24, 2022
@dotdc dotdc added the bug Something isn't working label Jun 24, 2022
@dotdc
Owner

dotdc commented Jun 24, 2022

Hi @reefland,

I'm using them with kube-prometheus-stack and they work well with the image label.
I will try to reproduce on k3s next week to see if data is the same with the pod label.
Will update the issue to let you know.

Thank you for the feedback!

@reefland
Contributor Author

I'm using kube-prometheus-stack as well. k3s comes with containerd, but it's a limited version. I install an external containerd/runc from Ubuntu 20.04.4 LTS:

containerd/focal-updates,focal-security,now 1.5.9-0ubuntu1~20.04.4 amd64 [installed]
  daemon to control runC

runc/focal-updates,now 1.1.0-0ubuntu1~20.04.1 amd64 [installed,automatic]
  Open Container Project - runtime

To get k3s to use a different containerd, you just add a parameter to point it to the alternate socket.

--container-runtime-endpoint=unix:///run/containerd/containerd.sock

(The built-in containerd's overlay filesystem snapshotter does not support ZFS, so I can't even test it.)

@dotdc
Owner

dotdc commented Jun 26, 2022

Just had a quick look this morning and I think the image label is dropped to reduce cardinality.
I will need to dig a little bit more, but I think using the container label instead of image is the best option here.

@reefland Can you try replacing image!="" with container!="" and tell me if it works on your setup?

@reefland
Contributor Author

This returns an empty set:

sum(rate(container_cpu_usage_seconds_total{container!=""}[2m])) by (namespace)

This returns data:

sum(rate(container_cpu_usage_seconds_total{pod!=""}[2m])) by (namespace)

{namespace="longhorn-system"} | 0.3802682252044597
{namespace="unifi"} | 0.010644037623491361
{namespace="democratic-csi"} | 0.07559476845975506
{namespace="monitoring"} | 0.3340142603623991
{namespace="kube-system"} | 0.024392617969317708
{namespace="cert-manager"} | 0.004427970345607981
{namespace="mosquitto"} | 0.002106486317303942
{namespace="argocd"} | 0.10412260998344586
{namespace="traefik"} | 0.007761231914784795
{namespace="vpa"} | 0.0011662568742333156
{namespace="goldilocks"} | 0.00034527230630960895

@dotdc
Owner

dotdc commented Jun 27, 2022

Can you check whether you drop some labels in your prometheus/kube-prometheus-stack configuration/values?
If not, can you share more details on your setup, especially the kube-prometheus-stack version?

@reefland
Contributor Author

Using chart 36.2.0 of kube-prometheus-stack. It references image: 'quay.io/prometheus/prometheus:v2.36.1'.

I haven't done any relabeling or label drops (not sure how to even do that yet). That should all be "default" settings.

My Prometheus settings for the Helm values.yaml are:

      prometheusOperator:
        enabled: true

      # Prometheus values

      prometheus:
        enabled: true
        prometheusSpec:
          storageSpec:
            volumeClaimTemplate:
              spec:
                storageClassName: freenas-iscsi-csi
                accessModes: 
                  - ReadWriteOnce
                resources:
                  requests:
                    storage: 50Gi

          retention: 21d
          externalUrl: /prometheus

Grafana / Alertmanager settings are left out for brevity. As K3s does not deploy everything as a pod, I have some settings in the values.yaml describing how to reach them:

      kubeApiServer:
        enabled: true

      kubelet:
        enabled: true
        namespace: kube-system
        resource: true

      kubeControllerManager:
        enabled: true
        endpoints:
          - 192.168.10.215
          - 192.168.10.216
          - 192.168.10.217
        service:
          enabled: true
          port: 10257
          targetPort: 10257
        serviceMonitor:
          enabled: true
          https: true
          insecureSkipVerify: true

      coreDns:
        enabled: true

      kubeScheduler:
        enabled: true
        endpoints:
          - 192.168.10.215
          - 192.168.10.216
          - 192.168.10.217
        service:
          enabled: true
          port: 10259
          targetPort: 10259
        serviceMonitor:
          enabled: true
          https: true
          insecureSkipVerify: true

      kubeProxy:
        enabled: true
        endpoints:
          - 192.168.10.215
          - 192.168.10.216
          - 192.168.10.217

      kubeEtcd:
        enabled: true
        endpoints:
          - 192.168.10.215
          - 192.168.10.216
          - 192.168.10.217
        service:
          enabled: true
          port: 2381
          targetPort: 2381

      kubeStateMetrics:
        enabled: true

@i5Js

i5Js commented Jul 1, 2022

My cluster is built with VMs and K8s, and I'm missing some graphs too.

Example:
[Screenshot: 2022-07-01 at 20 53 23]

@dotdc
Owner

dotdc commented Jul 1, 2022

This issue is related to k3s; I still need to reproduce it (sorry @reefland, btw).
@i5Js You probably need to install the node_exporter to get the missing metrics.

@i5Js

i5Js commented Jul 1, 2022

@dotdc should I open a new ticket? Because I have it installed.

prometheus-node-exporter-ktzhd                   1/1     Running   0          10h
prometheus-node-exporter-mq6m9                   1/1     Running   0          10h

prometheus-node-exporter        ClusterIP  <ip>    <none>        9100/TCP   10h

Anyway, I'm going to investigate it further.

@i5Js

i5Js commented Jul 2, 2022

I've created a new ticket.

@reefland
Contributor Author

I upgraded to kube-prometheus-stack-37.2.0 and pretty much every workaround I did for my original issue no longer works. I tried your unedited versions: same issue.

I get an empty query result just trying to look at container_cpu_usage_seconds_total. Curious if you have tried the new version.

@dotdc
Owner

dotdc commented Jul 14, 2022

Hi @reefland,
This shouldn't be a problem for container_cpu_usage_seconds_total, but 37.x introduced a breaking change in this PR:

From 36.x to 37.x
This includes some default metric relabelings for cAdvisor and apiserver metrics to reduce cardinality. If you do not want these defaults, you will need to override the kubeApiServer.metricRelabelings and or kubelet.cAdvisorMetricRelabelings.

Anyway, something seems to block your access to the cAdvisor metrics; check the ServiceMonitors, the ServiceMonitor selectors, access to the Kubernetes API server...
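A quick sanity check is to confirm the cAdvisor endpoint is actually being scraped (a sketch, using the job and metrics_path target labels visible on the series earlier in this thread):

up{job="kubelet", metrics_path="/metrics/cadvisor"}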

Let me know

@reefland
Contributor Author

All my targets are up. None are reporting an error.

@dotdc
Owner

dotdc commented Jul 14, 2022

Can you try to deploy with an empty cAdvisorMetricRelabelings: []?
Just to override a possible side effect of prometheus-community/helm-charts@f18afff#diff-c0fdbc5c26d2f602485f168b5a55814cd73bd3347907c5097395120d64c2f445L958

@reefland
Contributor Author

Yeah, that gets me working again. I'll go through them one at a time and see which one breaks it.

@reefland
Contributor Author

I was able to add each of these back with no impact that I could find to any of my dashboards:

      # Drop less useful container CPU metrics.
      - sourceLabels: [__name__]
        action: drop
        regex: 'container_cpu_(cfs_throttled_seconds_total|load_average_10s|system_seconds_total|user_seconds_total)'
      # Drop less useful container / always zero filesystem metrics.
      - sourceLabels: [__name__]
        action: drop
        regex: 'container_fs_(io_current|io_time_seconds_total|io_time_weighted_seconds_total|reads_merged_total|sector_reads_total|sector_writes_total|writes_merged_total)'
      # Drop less useful / always zero container memory metrics.
      - sourceLabels: [__name__]
        action: drop
        regex: 'container_memory_(mapped_file|swap)'
      # Drop less useful container process metrics.
      - sourceLabels: [__name__]
        action: drop
        regex: 'container_(file_descriptors|tasks_state|threads_max)'
      # Drop container spec metrics that overlap with kube-state-metrics.
      - sourceLabels: [__name__]
        action: drop
        regex: 'container_spec.*'

For the last two, I'm trying to figure out the PromQL to use in Prometheus to review the impacted metrics:

      # Drop cgroup metrics with no pod.
      - sourceLabels: [id, pod]
        action: drop
        regex: '.+;'
      # Drop cgroup metrics with no container.
      - sourceLabels: [id, container]
        action: drop
        regex: '.+;'

Would that be something like {id!="", pod=""}? I think that means anything without a pod label. If so, thousands of metrics are being dropped, like container_blkio_device_usage_total, container_cpu_cfs_periods_total, container_cpu_usage_seconds_total, container_fs_inodes_free, container_fs_inodes_total, container_fs_limit_bytes, and dozens more.
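Something like the following should preview what those two rules drop, counted per metric name (a sketch; the selectors mirror the two .+; regexes above):

# Series with a cgroup id but no pod label:
count by (__name__) ({id!="", pod=""})

# Series with a cgroup id but no container label:
count by (__name__) ({id!="", container=""})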

@dotdc
Owner

dotdc commented Jul 15, 2022

I did the same tests this afternoon and had the same results. I'm opening an issue to discuss these two rules because, in my opinion, they are way too restrictive to be enabled by default.

@dotdc
Owner

dotdc commented Jul 15, 2022

Issue opened: prometheus-community/helm-charts#2279

@SuperQ
Contributor

SuperQ commented Jul 16, 2022

CPU by node should be derived from node_cpu_seconds_total, not container_cpu_usage_seconds_total.
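For example, per-node CPU utilization can be derived from the node_exporter metric like this (a sketch, not necessarily the exact expression used in the fix):

sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))
/
sum by (instance) (rate(node_cpu_seconds_total[2m]))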

SuperQ added a commit to SuperQ/grafana-dashboards-kubernetes that referenced this issue Jul 16, 2022
Use the node_exporter CPU metrics to get system level data.

Fixes: dotdc#3

Signed-off-by: SuperQ <superq@gmail.com>
@dotdc dotdc closed this as completed in #16 Jul 18, 2022
dotdc pushed a commit that referenced this issue Jul 18, 2022
Use the node_exporter CPU metrics to get system level data.

Fixes: #3

Signed-off-by: SuperQ <superq@gmail.com>
@zentavr

zentavr commented Aug 9, 2023

I have the same issue with the bitnami/kube-prometheus helm chart, which installs prometheus.

@zentavr

zentavr commented Aug 9, 2023

Seems like the issue is with docker, cri-docker and cAdvisor: it just does not populate the image label.

kubectl get --raw /api/v1/nodes/worker03.k8s.cti.local/proxy/metrics/cadvisor
...
...
container_threads{container="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode2654b15_9792_4666_b729_14f2c4315817.slice",image="",name="",namespace="ingress-nginx",pod="ingress-nginx-controller-74dd99b856-5tszr"} 283 1691549711297
...
...
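To quantify this in Prometheus, comparing the two counts below shows whether any cAdvisor series carry a non-empty image label at all (a sketch using the metric the dashboards filter on):

count(container_cpu_usage_seconds_total{image!=""})
count(container_cpu_usage_seconds_total{image=""})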

@zentavr

zentavr commented Aug 9, 2023

The workaround is found here
