
[receiver/kubeletstats] k8s.node.network.io metric is missing #33993

Closed
alita1991 opened this issue Jul 9, 2024 · 6 comments
Labels
bug · needs triage · receiver/kubeletstats · Stale

Comments


alita1991 commented Jul 9, 2024

Component(s)

receiver/kubeletstats

What happened?

Description

The k8s.node.network.io metric is not collected, while others are (k8s.node.memory.*, k8s.node.filesystem.*, etc.).

Steps to Reproduce

Provision the collector using the provided config in a K3S / OpenShift environment, with a ClusterRole granting full RBAC access.

Expected Result

k8s.node.network.io metric should be collected

Actual Result

k8s.node.network.io metric not found

Collector version

0.102.1

Environment information

3x AWS EC2 VMs + K3S (3 masters + 3 workers)
3x AWS EC2 VMs + OpenShift (3 masters + 3 workers)

OpenTelemetry Collector configuration

receivers:
  kubeletstats:
    templateEnabled: '{{ index .Values "mimir-distributed" "enabled" }}'
    collection_interval: 30s
    auth_type: "serviceAccount"
    endpoint: "${env:KUBELETSTATS_ENDPOINT}"
    extra_metadata_labels:
    - k8s.volume.type
    insecure_skip_verify: true
    metric_groups:
    - container
    - pod
    - volume
    - node
processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 50
    spike_limit_percentage: 10
  k8sattributes:
    auth_type: 'serviceAccount'
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.pod.start_time
        - k8s.pod.uid
        - k8s.deployment.name
        - k8s.node.name
  resourcedetection/env:
    detectors:
    - env
  resource/remove_container_id:
    attributes:
    - action: delete
      key: container.id
    - action: delete
      key: container_id
exporters:
  logging:
    verbosity: detailed
  otlp:
    endpoint: '{{ template "central.collector.address" $ }}'
    tls:
      insecure: true
service:
  telemetry:
    metrics:
      address: "0.0.0.0:8888"
      level: detailed
  pipelines:
    metrics/kubeletstats:
      templateEnabled: '{{ index .Values "mimir-distributed" "enabled" }}'
      receivers: [kubeletstats]
      processors: [k8sattributes, resourcedetection/env, resource/remove_container_id, memory_limiter, batch]
      exporters: [otlp]

Log output

No errors were found in the log

Additional context

Before opening this ticket I did some debugging, but I could not find any relevant information in debug mode. I'm trying to understand why this specific metric is not collected and how I can investigate the problem further.

It is important to mention that the k8s_pod_network_io_bytes_total metric was collected by the receiver.

@alita1991 added the bug and needs triage labels on Jul 9, 2024

github-actions bot commented Jul 9, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@ChrsMark

Hey @alita1991! I tried to reproduce this but wasn't able to on GKE or EKS.

I'm using the following Helm chart values:

mode: daemonset
presets:
  kubeletMetrics:
    enabled: true

config:
  exporters:
    debug:
      verbosity: normal
  receivers:
    kubeletstats:
      collection_interval: 10s
      auth_type: 'serviceAccount'
      endpoint: '${env:K8S_NODE_NAME}:10250'
      insecure_skip_verify: true
      metrics:
        k8s.node.network.io:
          enabled: true

  service:
    pipelines:
      metrics:
        receivers: [kubeletstats]
        processors: [batch]
        exporters: [debug]

And deploy the Collector with helm install daemonset open-telemetry/opentelemetry-collector --set image.repository="otel/opentelemetry-collector-k8s" --set image.tag="0.104.0" --values ds_k8s_metrics.yaml

GKE

v1.29.4-gke.1043004

> k logs -f daemonset-opentelemetry-collector-agent-24x6f | grep k8s.node.network.io
k8s.node.network.io{interface=eth0,direction=receive} 2508490408
k8s.node.network.io{interface=eth0,direction=transmit} 1329730075
k8s.node.network.io{interface=eth0,direction=receive} 2541570721
k8s.node.network.io{interface=eth0,direction=transmit} 1330038333
k8s.node.network.io{interface=eth0,direction=receive} 2541728902
k8s.node.network.io{interface=eth0,direction=transmit} 1330216803
k8s.node.network.io{interface=eth0,direction=receive} 2541792120
k8s.node.network.io{interface=eth0,direction=transmit} 1330323914
k8s.node.network.io{interface=eth0,direction=receive} 2541974411
k8s.node.network.io{interface=eth0,direction=transmit} 1330557979

EKS

v1.30.0-eks-036c24b

> k logs -f daemonset-opentelemetry-collector-agent-58csx | grep k8s.node.network.io
k8s.node.network.io{interface=eth0,direction=receive} 7511134123
k8s.node.network.io{interface=eth0,direction=transmit} 21146466749
k8s.node.network.io{interface=eth0,direction=receive} 7545084343
k8s.node.network.io{interface=eth0,direction=transmit} 21146550460
k8s.node.network.io{interface=eth0,direction=receive} 7545094892
k8s.node.network.io{interface=eth0,direction=transmit} 21146552331

I suggest you verify what the /stats/summary endpoint provides. I suspect it gives no values for this metric to be exported, or something similarly odd. You can run the following debug Pod to get this info. Note that you need to use the same service account that the Collector uses (if the Collector is already running) in order to get access to this endpoint (in my case it was named daemonset-opentelemetry-collector):

kubectl run my-shell --rm -i --tty --image=ubuntu --overrides='{ "apiVersion": "v1", "spec": { "serviceAccountName": "daemonset-opentelemetry-collector", "hostNetwork": true }  }' -- bash
apt update
apt-get install curl jq
export token=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) && curl -H "Authorization: Bearer $token" https://$HOSTNAME:10250/stats/summary --insecure

In my case it gave:

{
  "time": "2024-07-10T09:34:01Z",
  "name": "eth0",
  "rxBytes": 3234464903,
  "rxErrors": 0,
  "txBytes": 1197870852,
  "txErrors": 0,
  "interfaces": [
    {
      "name": "eth0",
      "rxBytes": 3234464903,
      "rxErrors": 0,
      "txBytes": 1197870852,
      "txErrors": 0
    }
  ]
}


alita1991 commented Jul 10, 2024

Hi,

I tested using your config and got 0 data points for k8s.node.network.io; what could it be? I don't see any RBAC-related issues in the logs.

kubectl logs daemonset-opentelemetry-collector-agent-dw7hf | grep k8s.node.network.io | wc -l
0

k8s.pod.network.io is working as expected:

kubectl logs daemonset-opentelemetry-collector-agent-dw7hf | grep k8s.pod.network.io | wc -l
2546

I also tested the scrape via curl; here is the result from one of the nodes:

"node":{
"network":{
"time":"2024-07-10T12:42:59Z",
"name":"",
"interfaces":[
{
"name":"ens5",
"rxBytes":481114242884,
"rxErrors":0,
"txBytes":715126064226,
"txErrors":0
},
{
"name":"ovs-system",
"rxBytes":0,
"rxErrors":0,
"txBytes":0,
"txErrors":0
},
{
"name":"ovn-k8s-mp0",
"rxBytes":5821168746,
"rxErrors":0,
"txBytes":47539598446,
"txErrors":0
},
{
"name":"genev_sys_6081",
"rxBytes":265742543652,
"rxErrors":0,
"txBytes":370984422928,
"txErrors":0
},


ChrsMark commented Jul 10, 2024

Thanks @alita1991 for checking this!

It seems that in your case the top-level info is missing compared to what I see:

{
  "time": "2024-07-10T09:34:01Z",
  "name": "eth0",
  "rxBytes": 3234464903,
  "rxErrors": 0,
  "txBytes": 1197870852,
  "txErrors": 0,
  "interfaces": [
    {
      "name": "eth0",
      "rxBytes": 3234464903,
      "rxErrors": 0,
      "txBytes": 1197870852,
      "txErrors": 0
    }
  ]
}

Also, removing these lines from the testing sample

"rxBytes": 948305524,
"rxErrors": 0,
"txBytes": 12542266,
"txErrors": 0,

makes the unit tests fail.

The missing information is about the default interface according to https://pkg.go.dev/k8s.io/kubelet@v0.29.3/pkg/apis/stats/v1alpha1#NetworkStats.

Indeed, checking the code, it seems that we only extract the top-level tx/rx metrics: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.104.0/receiver/kubeletstatsreceiver/internal/kubelet/network.go#L24-L42.

So the question here is whether we should consider this a bug/limitation and expand the receiver to collect metrics for all of the interfaces instead of just the default. Note that the Interfaces list includes the default, so just by iterating over it we would have the default interface's metrics included as well.
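
For illustration, here is a minimal sketch of what iterating the Interfaces list could look like, assuming the k8s.io/kubelet stats v1alpha1 types linked above; the interfaceIO helper and collectNetworkIO function are hypothetical, not the receiver's actual code:

package kubelet

import (
	stats "k8s.io/kubelet/pkg/apis/stats/v1alpha1"
)

// interfaceIO is a hypothetical helper holding per-interface byte counters.
type interfaceIO struct {
	Name    string
	RxBytes uint64
	TxBytes uint64
}

// collectNetworkIO walks s.Interfaces instead of relying only on the embedded
// default-interface fields (s.InterfaceStats), which can be empty on some CNIs
// such as OVN-Kubernetes. The default interface, when reported, is also part
// of the Interfaces list, so nothing is lost by iterating.
func collectNetworkIO(s *stats.NetworkStats) []interfaceIO {
	if s == nil {
		return nil
	}
	var out []interfaceIO
	for _, iface := range s.Interfaces {
		if iface.RxBytes == nil || iface.TxBytes == nil {
			continue // skip interfaces that report no byte counters
		}
		out = append(out, interfaceIO{
			Name:    iface.Name,
			RxBytes: *iface.RxBytes,
			TxBytes: *iface.TxBytes,
		})
	}
	return out
}

Since the data points in the logs above already carry an interface attribute, emitting one point per entry in Interfaces would fit the existing shape of the metric.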

I would like to hear what @TylerHelmuth and @dmitryax think here.

Update: I see this was already reported for pod metrics in #30196.

github-actions bot commented:

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Sep 11, 2024
@ChrsMark

This is covered by #30196. I'm going to close this one and we can continue on the other issue.
