[BUG]: Blank metrics endpoint in controller #2307

Closed
rtrevi opened this issue Jan 23, 2025 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


rtrevi commented Jan 23, 2025

/kind bug

What happened?
The metrics endpoints for the controller pods are returning empty responses with a 200 status code.
This is similar to the issue described in #1993.

The node pods do return metrics when their metrics endpoint is queried (a sketch of the query commands follows the output below):

# HELP aws_ebs_csi_nvme_collector_duration_seconds Histogram of NVMe collector scrape duration in seconds.
# TYPE aws_ebs_csi_nvme_collector_duration_seconds histogram
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.001"} 0
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.0025"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.005"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.01"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.025"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.05"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.1"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.25"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.5"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="1"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="2.5"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="5"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="10"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="+Inf"} 2
aws_ebs_csi_nvme_collector_duration_seconds_sum{instance_id="i-0a328a1b4c1506e6e"} 0.002813641
aws_ebs_csi_nvme_collector_duration_seconds_count{instance_id="i-0a328a1b4c1506e6e"} 2
# HELP aws_ebs_csi_nvme_collector_errors_total Total number of NVMe collector scrape errors.
# TYPE aws_ebs_csi_nvme_collector_errors_total counter
aws_ebs_csi_nvme_collector_errors_total{instance_id="i-0a328a1b4c1506e6e"} 0
# HELP aws_ebs_csi_nvme_collector_scrapes_total Total number of NVMe collector scrapes.
# TYPE aws_ebs_csi_nvme_collector_scrapes_total counter
aws_ebs_csi_nvme_collector_scrapes_total{instance_id="i-0a328a1b4c1506e6e"} 2
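
For reference, a sketch of one way to query a node pod's endpoint (assumes the DaemonSet patch from the reproduction steps below; the local port 8081 is arbitrary):

$ kubectl -n aws-ebs-csi port-forward daemonset/ebs-csi-node 8081:3302
$ curl localhost:8081/metrics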

What you expected to happen?
The metrics endpoint should return the metrics described at: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/metrics.md

How to reproduce it (as minimally and precisely as possible)?

  1. Fetch upstream manifests:
$ kustomize build 'github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/gcr?ref=release-1.38' > resources.yaml
  2. Enable the metrics endpoints through Kustomize (Helm is not an option):
$ cat <<EOF > kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: aws-ebs-csi
resources:
- resources.yaml

patches:
  ## Enabling metrics
  - target:
      kind: Deployment
      name: ebs-csi-controller
    patch: |
      - op: test
        path: /spec/template/spec/containers/0/name
        value: ebs-plugin
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --http-endpoint=0.0.0.0:3301
      - op: add
        path: /spec/template/spec/containers/0/ports/-
        value:
          name: http-metrics
          containerPort: 3301
          protocol: TCP
  - target:
      kind: DaemonSet
      name: ebs-csi-node
    patch: |
      - op: test
        path: /spec/template/spec/containers/0/name
        value: ebs-plugin
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --http-endpoint=0.0.0.0:3302
      - op: add
        path: /spec/template/spec/containers/0/ports/-
        value:
          name: http-metrics
          containerPort: 3302
          protocol: TCP
EOF
  3. Apply the manifests (a verification sketch follows the transcript below):
$ kubectl apply -k .
  4. Port-forward the controller pod:
$ kubectl -n aws-ebs-csi port-forward ebs-csi-controller-58447dfcb4-brm6z 8080:3301
  5. Query the metrics endpoint and receive a successful but empty response:
$ curl localhost:8080/metrics -v
*   Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /metrics HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: text/plain; version=0.0.4; charset=utf-8; escaping=underscores
< Date: Thu, 23 Jan 2025 01:16:36 GMT
< Content-Length: 0
< 
* Connection #0 to host localhost left intact
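
As a sanity check that the patches landed, a sketch (the jsonpath dumps the ebs-plugin container's args so you can confirm --http-endpoint made it in):

$ kubectl -n aws-ebs-csi get deployment ebs-csi-controller \
    -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep http-endpoint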

Environment

  • Kubernetes version (use kubectl version):
    Client Version: v1.31.5
    Kustomize Version: v5.4.2
    Server Version: v1.30.8-eks-2d5f260
  • Driver version: 1.38.1
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 23, 2025

AndrewSirenko commented Jan 24, 2025

Hi @rtrevi, thank you for the issue and the detailed reproduction steps. I'm sorry you ran into this pain. While I was able to get metrics working on my cluster, it's clear that we can still improve our metrics documentation.

I followed your instructions and then added two steps:

  1. Dynamically provision a volume
  2. Ensure I port-forwarded the leader ebs-csi-controller pod

Note that in EBS CSI Driver versions ≤ 1.39.0, metrics are not initialized to 0 on driver startup; instead, each metric is initialized just before its first increment. That means metrics won't start appearing until you take some action (like provisioning a volume). Also note that you must port-forward the pod running the leader EBS CSI Driver controller container (a sketch of both steps follows).
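
For example, a sketch of both steps (the PVC and StorageClass names are illustrative, and with a WaitForFirstConsumer StorageClass the volume is only created once a pod uses the claim):

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: metrics-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ebs-sc   # assumed EBS-backed StorageClass
  resources:
    requests:
      storage: 1Gi
EOF

To find the leader, inspect the leader-election Leases in the driver namespace; the HOLDER column should point at the current leader pod:

$ kubectl -n aws-ebs-csi get lease
$ kubectl -n aws-ebs-csi port-forward <leader-pod> 8080:3301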

With these two additional steps I was able to see metrics:

❯ curl localhost:8080/metrics -v
*   Trying 127.0.0.1:8080...
...
>
< HTTP/1.1 200 OK
....
<
# HELP aws_ebs_csi_api_request_errors_total ebs_csi_aws_com metric
# TYPE aws_ebs_csi_api_request_errors_total counter
aws_ebs_csi_api_request_errors_total{request="CreateVolume"} 16
# HELP cloudprovider_aws_api_request_errors ebs_csi_aws_com metric
# TYPE cloudprovider_aws_api_request_errors counter
cloudprovider_aws_api_request_errors{request="CreateVolume"} 16


I've created two backlog items for my team to ensure future users won't run into the same pain you did:

  1. Consider initializing controller ebs-plugin metrics at 0 to prevent broken dashboards and misfiring alerts.
  2. Improve metrics.md to explain how to correctly find the leader controller pod (the step as written did not work for me) and what the Helm value controller.enableMetrics actually does (so those who deploy with Kustomize are better supported).

Let me know if these two additional steps solve your issue. If not, you may need to share a few more details about your environment.

❯ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.32.0-eks-5ca49cb
...
Driver version 1.38.1


rtrevi commented Jan 24, 2025

Consider initializing controller ebs-plugin metrics at 0 to prevent broken dashboards and misfiring alerts.

This would be great; the current state is somewhat misleading. Publishing metrics even when no activity has been recorded would avoid triggering alerts for missing data points.
I'll close this issue and open a feature request for initializing the metrics at 0.

Improve metrics.md to explain how to correctly find the leader controller pod (the step as written did not work for me) and what the Helm value controller.enableMetrics actually does (so those who deploy with Kustomize are better supported).

Agreed, they didn't work for me either, so I just ignored them 😅
