[BUG]: Blank metrics endpoint in controller #2307

Closed
rtrevi opened this issue Jan 23, 2025 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


rtrevi commented Jan 23, 2025

/kind bug

What happened?
The metrics endpoints for the controller pods are returning empty responses with a 200 status code.
This is similar to the issue described in #1993.

The node pods do return metrics when their metrics endpoint is queried (a sketch of the query commands follows the output below):

# HELP aws_ebs_csi_nvme_collector_duration_seconds Histogram of NVMe collector scrape duration in seconds.
# TYPE aws_ebs_csi_nvme_collector_duration_seconds histogram
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.001"} 0
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.0025"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.005"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.01"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.025"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.05"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.1"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.25"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="0.5"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="1"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="2.5"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="5"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="10"} 2
aws_ebs_csi_nvme_collector_duration_seconds_bucket{instance_id="i-0a328a1b4c1506e6e",le="+Inf"} 2
aws_ebs_csi_nvme_collector_duration_seconds_sum{instance_id="i-0a328a1b4c1506e6e"} 0.002813641
aws_ebs_csi_nvme_collector_duration_seconds_count{instance_id="i-0a328a1b4c1506e6e"} 2
# HELP aws_ebs_csi_nvme_collector_errors_total Total number of NVMe collector scrape errors.
# TYPE aws_ebs_csi_nvme_collector_errors_total counter
aws_ebs_csi_nvme_collector_errors_total{instance_id="i-0a328a1b4c1506e6e"} 0
# HELP aws_ebs_csi_nvme_collector_scrapes_total Total number of NVMe collector scrapes.
# TYPE aws_ebs_csi_nvme_collector_scrapes_total counter
aws_ebs_csi_nvme_collector_scrapes_total{instance_id="i-0a328a1b4c1506e6e"} 2
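
For reference, a sketch of one way to query a node pod's endpoint (assumes the DaemonSet patch from the reproduction steps below; the local port 8081 is arbitrary):

$ kubectl -n aws-ebs-csi port-forward daemonset/ebs-csi-node 8081:3302
$ curl localhost:8081/metrics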

What you expected to happen?
The metrics endpoint should return the metrics described at: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/metrics.md

How to reproduce it (as minimally and precisely as possible)?

  1. Fetch upstream manifests:
$ kustomize build 'github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/gcr?ref=release-1.38' > resources.yaml
  2. Enable the metrics endpoints through Kustomize (Helm is not an option):
$ cat <<EOF > kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: aws-ebs-csi
resources:
- resources.yaml

patches:
  ## Enabling metrics
  - target:
      kind: Deployment
      name: ebs-csi-controller
    patch: |
      - op: test
        path: /spec/template/spec/containers/0/name
        value: ebs-plugin
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --http-endpoint=0.0.0.0:3301
      - op: add
        path: /spec/template/spec/containers/0/ports/-
        value:
          name: http-metrics
          containerPort: 3301
          protocol: TCP
  - target:
      kind: DaemonSet
      name: ebs-csi-node
    patch: |
      - op: test
        path: /spec/template/spec/containers/0/name
        value: ebs-plugin
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --http-endpoint=0.0.0.0:3302
      - op: add
        path: /spec/template/spec/containers/0/ports/-
        value:
          name: http-metrics
          containerPort: 3302
          protocol: TCP
EOF
  3. Apply the manifests (a verification sketch follows the transcript below):
$ kubectl apply -k .
  4. Port-forward the controller pod:
$ kubectl -n aws-ebs-csi port-forward ebs-csi-controller-58447dfcb4-brm6z 8080:3301
  5. Query the metrics endpoint and receive a successful but empty response:
$ curl localhost:8080/metrics -v
*   Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /metrics HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: text/plain; version=0.0.4; charset=utf-8; escaping=underscores
< Date: Thu, 23 Jan 2025 01:16:36 GMT
< Content-Length: 0
< 
* Connection #0 to host localhost left intact
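
As a sanity check that the patches landed, a sketch (the jsonpath dumps the ebs-plugin container's args so you can confirm --http-endpoint made it in):

$ kubectl -n aws-ebs-csi get deployment ebs-csi-controller \
    -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep http-endpoint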

Environment

  • Kubernetes version (use kubectl version):
    Client Version: v1.31.5
    Kustomize Version: v5.4.2
    Server Version: v1.30.8-eks-2d5f260
  • Driver version: 1.38.1
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 23, 2025

AndrewSirenko commented Jan 24, 2025

Hi @rtrevi, thank you for the issue and the detailed reproduction steps. I'm sorry you ran into this pain. While I was able to get metrics working on my cluster, it's clear that we can still improve our metrics documentation.

I followed your instructions and then added two steps:

  1. Dynamically provision a volume
  2. Ensure I port-forwarded the leader ebs-csi-controller pod

Note that in EBS CSI Driver versions ≤ 1.39.0, metrics are not initialized to 0 on driver startup; instead, each metric is initialized just before its first increment. That means metrics won't start appearing until you take some action (like provisioning a volume). Also note that you must port-forward the pod running the leader EBS CSI Driver controller container (a sketch of both steps follows).
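
For example, a sketch of both steps (the PVC and StorageClass names are illustrative, and with a WaitForFirstConsumer StorageClass the volume is only created once a pod uses the claim):

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: metrics-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ebs-sc   # assumed EBS-backed StorageClass
  resources:
    requests:
      storage: 1Gi
EOF

To find the leader, inspect the leader-election Leases in the driver namespace; the HOLDER column should point at the current leader pod:

$ kubectl -n aws-ebs-csi get lease
$ kubectl -n aws-ebs-csi port-forward <leader-pod> 8080:3301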

With these two additional steps I was able to see metrics:

❯ curl localhost:8080/metrics -v
*   Trying 127.0.0.1:8080...
...
>
< HTTP/1.1 200 OK
....
<
# HELP aws_ebs_csi_api_request_errors_total ebs_csi_aws_com metric
# TYPE aws_ebs_csi_api_request_errors_total counter
aws_ebs_csi_api_request_errors_total{request="CreateVolume"} 16
# HELP cloudprovider_aws_api_request_errors ebs_csi_aws_com metric
# TYPE cloudprovider_aws_api_request_errors counter
cloudprovider_aws_api_request_errors{request="CreateVolume"} 16


I've created two backlog items for my team to ensure future users won't run into the same pain you did:

  1. Consider initializing controller ebs-plugin metrics at 0 to prevent broken dashboards and misfiring alerts.
  2. Improve metrics.md to explain how to correctly find the leader controller pod (the step as written did not work for me) and what the Helm value controller.enableMetrics actually does (so those who deploy with Kustomize are better supported).

Let me know if these two additional steps solve your issue. If not, you may need to share a few more details about your environment.

❯ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.32.0-eks-5ca49cb
...
Driver version 1.38.1


rtrevi commented Jan 24, 2025

Consider initializing controller ebs-plugin metrics at 0 to prevent broken dashboards and misfiring alerts.

This would be great; the current state is somewhat misleading. Publishing metrics even when no activity has been recorded would avoid triggering alerts for missing data points.
I'll close this issue and open a feature request for initializing the metrics at 0.

Improve metrics.md to explain how to correctly find the leader controller pod (the step as written did not work for me) and what the Helm value controller.enableMetrics actually does (so those who deploy with Kustomize are better supported).

Agreed, they didn't work for me either, so I just ignored them 😅
