Some metrics are missing. #3

Closed
reefland opened this issue Jun 24, 2022 · 24 comments · Fixed by #16
Comments

@reefland
Contributor

Beautiful dashboards. Some of the panels show no data, and I've seen this before (Kubernetes LENS). Reviewing the JSON query, it references attributes or keys that are not included in the cAdvisor metrics that I have. For example, your Global dashboard:

[Screenshot: grafana_missing_metrics]

When I look at the CPU Utilization by namespace panel and inspect the JSON query, it is based on container_cpu_usage_seconds_total. When I look in my Prometheus, the series do not have an image= label; here is a random one that was at the top of the query:

container_cpu_usage_seconds_total{cpu="total", endpoint="https-metrics", id="/kubepods/besteffort/pod03202a32-75a1-4a64-8692-1e73fd26eca3", instance="192.168.10.217:10250", job="kubelet", metrics_path="/metrics/cadvisor", namespace="democratic-csi", node="k3s03", pod="democratic-csi-nfs-node-sqxp9", service="kube-prometheus-stack-kubelet"}

I'm using K3s based on Kubernetes 1.23 on bare metal with containerd, no docker runtime. I have no idea if this is a containerd, kubelet, or cAdvisor issue, or just expected as part of life when you don't use the docker runtime.

If you have any suggestions, it would be much appreciated.

@reefland
Contributor Author

reefland commented Jun 24, 2022

If I modify the JSON query to use pod!="" instead of image!="", it renders data, though I'm not sure it is equivalent to what you intended.

[Screenshot: grafana_new_query_metrics]
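Roughly, the change is this (a sketch, assuming the panel sums the CPU rate by namespace as in the queries further down, not the exact panel JSON):

# Original filter on the image label, which returns nothing on my setup:
sum(rate(container_cpu_usage_seconds_total{image!=""}[2m])) by (namespace)

# Same query filtered on the pod label instead, which renders data:
sum(rate(container_cpu_usage_seconds_total{pod!=""}[2m])) by (namespace)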

@reefland
Contributor Author

Lastly, I updated the four panels' JSON queries to use pod!="" and it looks good, but I feel my setup is missing something that would provide the image key.

[Screenshot: grafana_updated_query_metrics]

@dotdc dotdc self-assigned this Jun 24, 2022
@dotdc dotdc added the bug Something isn't working label Jun 24, 2022
@dotdc
Owner

dotdc commented Jun 24, 2022

Hi @reefland,

I'm using them with kube-prometheus-stack and they work well with the image label.
I will try to reproduce on k3s next week to see if data is the same with the pod label.
Will update the issue to let you know.

Thank you for the feedback!

@reefland
Contributor Author

I'm using kube-prometheus-stack as well. k3s comes with containerd, but it's a limited version. I install an external containerd/runc from Ubuntu 20.04.4 LTS:

containerd/focal-updates,focal-security,now 1.5.9-0ubuntu1~20.04.4 amd64 [installed]
  daemon to control runC

runc/focal-updates,now 1.1.0-0ubuntu1~20.04.1 amd64 [installed,automatic]
  Open Container Project - runtime

To get k3s to use a different containerd, you just add a parameter to point it to the alternate socket.

--container-runtime-endpoint=unix:///run/containerd/containerd.sock

(The built-in containerd's overlay filesystem snapshotter does not support ZFS, so I can't even test it.)

@dotdc
Owner

dotdc commented Jun 26, 2022

Just had a quick look this morning and I think the image label is dropped to reduce cardinality.
I will need to dig a little bit more, but I think using the container label instead of image is the best option here.

@reefland Can you try replacing image!="" with container!="" and tell me if it works on your setup?

@reefland
Contributor Author

This returns an empty set:

sum(rate(container_cpu_usage_seconds_total{container!=""}[2m])) by (namespace)

This returns data:

sum(rate(container_cpu_usage_seconds_total{pod!=""}[2m])) by (namespace)

{namespace="longhorn-system"} | 0.3802682252044597
{namespace="unifi"} | 0.010644037623491361
{namespace="democratic-csi"} | 0.07559476845975506
{namespace="monitoring"} | 0.3340142603623991
{namespace="kube-system"} | 0.024392617969317708
{namespace="cert-manager"} | 0.004427970345607981
{namespace="mosquitto"} | 0.002106486317303942
{namespace="argocd"} | 0.10412260998344586
{namespace="traefik"} | 0.007761231914784795
{namespace="vpa"} | 0.0011662568742333156
{namespace="goldilocks"} | 0.00034527230630960895

@dotdc
Owner

dotdc commented Jun 27, 2022

Can you check whether you drop some labels in your prometheus/kube-prometheus-stack configuration/values?
If not, can you share more details on your setup, especially the kube-prometheus-stack version?

@reefland
Contributor Author

Using chart 36.2.0 of kube-prometheus-stack. It references image: 'quay.io/prometheus/prometheus:v2.36.1'.

I haven't done any relabeling or label drops (not sure how to even do that yet). That should all be "default" settings.

My Prometheus settings for the Helm values.yaml are:

      prometheusOperator:
        enabled: true

      # Prometheus values

      prometheus:
        enabled: true
        prometheusSpec:
          storageSpec:
            volumeClaimTemplate:
              spec:
                storageClassName: freenas-iscsi-csi
                accessModes: 
                  - ReadWriteOnce
                resources:
                  requests:
                    storage: 50Gi

          retention: 21d
          externalUrl: /prometheus

Grafana / Alertmanager settings are left out for brevity. As K3s does not deploy everything as a pod, I have some settings in the values.yaml describing how to reach them:

      kubeApiServer:
        enabled: true

      kubelet:
        enabled: true
        namespace: kube-system
        resource: true

      kubeControllerManager:
        enabled: true
        endpoints:
          - 192.168.10.215
          - 192.168.10.216
          - 192.168.10.217
        service:
          enabled: true
          port: 10257
          targetPort: 10257
        serviceMonitor:
          enabled: true
          https: true
          insecureSkipVerify: true

      coreDns:
        enabled: true

      kubeScheduler:
        enabled: true
        endpoints:
          - 192.168.10.215
          - 192.168.10.216
          - 192.168.10.217
        service:
          enabled: true
          port: 10259
          targetPort: 10259
        serviceMonitor:
          enabled: true
          https: true
          insecureSkipVerify: true

      kubeProxy:
        enabled: true
        endpoints:
          - 192.168.10.215
          - 192.168.10.216
          - 192.168.10.217

      kubeEtcd:
        enabled: true
        endpoints:
          - 192.168.10.215
          - 192.168.10.216
          - 192.168.10.217
        service:
          enabled: true
          port: 2381
          targetPort: 2381

      kubeStateMetrics:
        enabled: true

@i5Js

i5Js commented Jul 1, 2022

My cluster is built with VMs and K8s, and I'm missing some graphs too.

Example:
[Screenshot: 2022-07-01 at 20 53 23]

@dotdc
Owner

dotdc commented Jul 1, 2022

This issue is related to k3s; I still need to reproduce it (sorry @reefland, btw).
@i5Js You probably need to install the node_exporter to get the missing metrics.

@i5Js

i5Js commented Jul 1, 2022

@dotdc should I open a new ticket? Because I have it installed.

prometheus-node-exporter-ktzhd                   1/1     Running   0          10h
prometheus-node-exporter-mq6m9                   1/1     Running   0          10h

prometheus-node-exporter        ClusterIP  <ip>    <none>        9100/TCP   10h

Anyway, I'm going to investigate it further.

@i5Js

i5Js commented Jul 2, 2022

I've created a new ticket.

@reefland
Contributor Author

I upgraded to kube-prometheus-stack-37.2.0 and pretty much every workaround I did for my original issue no longer works. I tried your unedited versions: same issue.

I get an empty query result just trying to look at container_cpu_usage_seconds_total. Curious if you have tried the new version.

@dotdc
Owner

dotdc commented Jul 14, 2022

Hi @reefland,
This shouldn't be a problem for container_cpu_usage_seconds_total, but 37.x introduced a breaking change in this PR:

From 36.x to 37.x
This includes some default metric relabelings for cAdvisor and apiserver metrics to reduce cardinality. If you do not want these defaults, you will need to override the kubeApiServer.metricRelabelings and or kubelet.cAdvisorMetricRelabelings.

Anyway, something seems to block your access to the cAdvisor metrics; check the ServiceMonitors, the ServiceMonitor selectors, access to the Kubernetes API server...
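A quick sanity check is to confirm the cAdvisor endpoint is actually being scraped (a sketch, using the job and metrics_path target labels visible on the series earlier in this thread):

up{job="kubelet", metrics_path="/metrics/cadvisor"}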

Let me know

@reefland
Contributor Author

All my targets are up. None are reporting an error.

@dotdc
Owner

dotdc commented Jul 14, 2022

Can you try to deploy with an empty cAdvisorMetricRelabelings: []?
Just to override a possible side effect of prometheus-community/helm-charts@f18afff#diff-c0fdbc5c26d2f602485f168b5a55814cd73bd3347907c5097395120d64c2f445L958

@reefland
Contributor Author

Yeah, that gets me working again. I'll go through them one at a time and see which one breaks it.

@reefland
Contributor Author

I was able to add each of these back with no impact that I could find to any of my dashboards:

      # Drop less useful container CPU metrics.
      - sourceLabels: [__name__]
        action: drop
        regex: 'container_cpu_(cfs_throttled_seconds_total|load_average_10s|system_seconds_total|user_seconds_total)'
      # Drop less useful container / always zero filesystem metrics.
      - sourceLabels: [__name__]
        action: drop
        regex: 'container_fs_(io_current|io_time_seconds_total|io_time_weighted_seconds_total|reads_merged_total|sector_reads_total|sector_writes_total|writes_merged_total)'
      # Drop less useful / always zero container memory metrics.
      - sourceLabels: [__name__]
        action: drop
        regex: 'container_memory_(mapped_file|swap)'
      # Drop less useful container process metrics.
      - sourceLabels: [__name__]
        action: drop
        regex: 'container_(file_descriptors|tasks_state|threads_max)'
      # Drop container spec metrics that overlap with kube-state-metrics.
      - sourceLabels: [__name__]
        action: drop
        regex: 'container_spec.*'

For the last two, I'm trying to figure out the PromQL to use in Prometheus to review the impacted metrics:

      # Drop cgroup metrics with no pod.
      - sourceLabels: [id, pod]
        action: drop
        regex: '.+;'
      # Drop cgroup metrics with no container.
      - sourceLabels: [id, container]
        action: drop
        regex: '.+;'

Would that be something like {id!="", pod=""}? I think that means anything without a pod label. If so, thousands of metrics are being dropped, like container_blkio_device_usage_total, container_cpu_cfs_periods_total, container_cpu_usage_seconds_total, container_fs_inodes_free, container_fs_inodes_total, container_fs_limit_bytes, and dozens more.
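Something like the following should preview what those two rules drop, counted per metric name (a sketch; the selectors mirror the two .+; regexes above):

# Series with a cgroup id but no pod label:
count by (__name__) ({id!="", pod=""})

# Series with a cgroup id but no container label:
count by (__name__) ({id!="", container=""})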

@dotdc
Owner

dotdc commented Jul 15, 2022

I did the same tests this afternoon and had the same results. I'm opening an issue to discuss these two rules because, in my opinion, they are way too restrictive to be enabled by default.

@dotdc
Owner

dotdc commented Jul 15, 2022

Issue opened: prometheus-community/helm-charts#2279

@SuperQ
Contributor

SuperQ commented Jul 16, 2022

CPU by node should be derived from node_cpu_seconds_total, not container_cpu_usage_seconds_total.
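For example, per-node CPU utilization can be derived from the node_exporter metric like this (a sketch, not necessarily the exact expression used in the fix):

sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))
/
sum by (instance) (rate(node_cpu_seconds_total[2m]))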

SuperQ added a commit to SuperQ/grafana-dashboards-kubernetes that referenced this issue Jul 16, 2022
Use the node_exporter CPU metrics to get system level data.

Fixes: dotdc#3

Signed-off-by: SuperQ <superq@gmail.com>
@dotdc dotdc closed this as completed in #16 Jul 18, 2022
dotdc pushed a commit that referenced this issue Jul 18, 2022
Use the node_exporter CPU metrics to get system level data.

Fixes: #3

Signed-off-by: SuperQ <superq@gmail.com>
@zentavr

zentavr commented Aug 9, 2023

I have the same issue with the bitnami/kube-prometheus helm chart, which installs prometheus.

@zentavr

zentavr commented Aug 9, 2023

Seems like the issue is with docker, cri-docker and cAdvisor: it just does not populate the image label.

kubectl get --raw /api/v1/nodes/worker03.k8s.cti.local/proxy/metrics/cadvisor
...
...
container_threads{container="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode2654b15_9792_4666_b729_14f2c4315817.slice",image="",name="",namespace="ingress-nginx",pod="ingress-nginx-controller-74dd99b856-5tszr"} 283 1691549711297
...
...
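To quantify this in Prometheus, comparing the two counts below shows whether any cAdvisor series carry a non-empty image label at all (a sketch using the metric the dashboards filter on):

count(container_cpu_usage_seconds_total{image!=""})
count(container_cpu_usage_seconds_total{image=""})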

@zentavr

zentavr commented Aug 9, 2023

The workaround is found here
