
EKS Metrics Server can't scrape pod/node metrics - Unauthorized 401 #963

Closed
mokhirashakira opened this issue Feb 18, 2022 · 15 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

@mokhirashakira commented Feb 18, 2022

What happened:
Metrics Server is not able to read metrics; the error is: metrics not available yet

  • HPAs can't read metrics
  • kubectl top pods/nodes returns error: metrics not available yet

What you expected to happen:

Metrics server to scrape all pods and nodes.

Anything else we need to know?:

Everything is in the details section.

Environment:

  • Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): EKS

  • Container Network Setup (flannel, calico, etc.): EKS VPC CNI

  • Kubernetes version (use kubectl version): 1.21.5-eks-bc4871b

  • Metrics Server manifest

spoiler for Metrics Server manifest:
  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    labels:
      rbac.authorization.k8s.io/aggregate-to-admin: "true"
      rbac.authorization.k8s.io/aggregate-to-edit: "true"
      rbac.authorization.k8s.io/aggregate-to-view: "true"
    name: system:aggregated-metrics-reader
  rules:
  - apiGroups:
    - metrics.k8s.io
    resources:
    - pods
    - nodes
    verbs:
    - get
    - list
    - watch
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRoleBinding
  metadata:
    name: metrics-server:system:auth-delegator
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: ClusterRole
    name: system:auth-delegator
  subjects:
  - kind: ServiceAccount
    name: metrics-server
    namespace: kube-system
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    name: metrics-server-auth-reader
    namespace: kube-system
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: Role
    name: extension-apiserver-authentication-reader
  subjects:
  - kind: ServiceAccount
    name: metrics-server
    namespace: kube-system
  ---
  apiVersion: apiregistration.k8s.io/v1
  kind: APIService
  metadata:
    name: v1beta1.metrics.k8s.io
  spec:
    group: metrics.k8s.io
    groupPriorityMinimum: 100
    insecureSkipTLSVerify: true
    service:
      name: metrics-server
      namespace: kube-system
      port: 443
    version: v1beta1
    versionPriority: 100
  ---
  apiVersion: v1
  kind: ServiceAccount
  metadata:
    name: metrics-server
    namespace: kube-system
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    labels:
      k8s-app: metrics-server
    name: metrics-server
    namespace: kube-system
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        k8s-app: metrics-server
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        creationTimestamp: null
        labels:
          k8s-app: metrics-server
        name: metrics-server
      spec:
        containers:
        - command:
          - /metrics-server
          - --v=2
          - --kubelet-preferred-address-types=InternalIP
          - --cert-dir=/tmp
          - --secure-port=4443
          image: private-repo:metrics-server-amd64-v0.3.6
          imagePullPolicy: IfNotPresent
          name: metrics-server
          ports:
          - containerPort: 4443
            name: main-port
            protocol: TCP
          resources: {}
          securityContext:
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 1000
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /tmp
            name: tmp-dir
        dnsPolicy: ClusterFirst
        nodeSelector:
          kubernetes.io/arch: amd64
          kubernetes.io/os: linux
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: metrics-server
        serviceAccountName: metrics-server
        terminationGracePeriodSeconds: 30
        volumes:
        - emptyDir: {}
          name: tmp-dir
  ---
  apiVersion: v1
  kind: Service
  metadata:
    labels:
      kubernetes.io/cluster-service: "true"
      kubernetes.io/name: Metrics-server
    name: metrics-server
    namespace: kube-system
  spec:
    ports:
    - port: 443
      protocol: TCP
      targetPort: main-port
    selector:
      k8s-app: metrics-server
    type: ClusterIP
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    name: system:metrics-server
  rules:
  - apiGroups:
    - ""
    resources:
    - pods
    - nodes
    - nodes/stats
    - namespaces
    - configmaps
    verbs:
    - get
    - list
    - watch
  • Metrics server logs:
spoiler for Metrics Server logs:
I0218 15:49:00.089765       1 manager.go:148] ScrapeMetrics: time: 30.029378925s, nodes: 0, pods: 0
E0218 15:49:00.089852       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-10-224-57-165.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-57-165.ec2.internal (10.224.57.165): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-57-153.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-57-153.ec2.internal (10.224.57.153): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-55-86.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-55-86.ec2.internal (10.224.55.86): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-51-140.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-51-140.ec2.internal (10.224.51.140): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-55-184.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-55-184.ec2.internal (10.224.55.184): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-57-40.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-57-40.ec2.internal (10.224.57.40): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-48-241.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-48-241.ec2.internal (10.224.48.241): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-53-70.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-53-70.ec2.internal (10.224.53.70): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-50-241.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-50-241.ec2.internal (10.224.50.241): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-57-184.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-57-184.ec2.internal (10.224.57.184): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-53-158.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-53-158.ec2.internal (10.224.53.158): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-51-42.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-51-42.ec2.internal (10.224.51.42): Get https://10.224.51.42:10250/stats/summary?only_cpu_and_memory=true: dial tcp 10.224.51.42:10250: i/o timeout]
E0218 15:49:00.093086       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.093086       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.093250       1 errors.go:77] Unauthorized
E0218 15:49:00.099175       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.099367       1 errors.go:77] Unauthorized
E0218 15:49:00.109152       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.109462       1 errors.go:77] Unauthorized
E0218 15:49:00.115605       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.115756       1 errors.go:77] Unauthorized
E0218 15:49:00.125091       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.125264       1 errors.go:77] Unauthorized
E0218 15:49:00.130884       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.130983       1 errors.go:77] Unauthorized
E0218 15:49:00.139747       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.139917       1 errors.go:77] Unauthorized
E0218 15:49:00.145794       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.145963       1 errors.go:77] Unauthorized
E0218 15:49:00.155290       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.155471       1 errors.go:77] Unauthorized
E0218 15:49:00.161147       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.161306       1 errors.go:77] Unauthorized
E0218 15:49:00.181555       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.181730       1 errors.go:77] Unauthorized
E0218 15:49:00.187571       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.187749       1 errors.go:77] Unauthorized
  • Status of Metrics API:
spoiler for Status of Metrics API:
kubectl describe apiservice v1beta1.metrics.k8s.io
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       <none>
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:            metrics-server
    Namespace:       kube-system
    Port:            443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2021-09-20T12:27:29Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:                    <none>

/kind bug

@k8s-ci-robot added the kind/bug label on Feb 18, 2022
@serathius (Contributor)

I'm not familiar with EKS; however, I remember a similar issue that was solved by adding some EKS-specific configuration. I wasn't able to find it just now, but will keep looking.

@stevehipwell (Contributor)

@mokhirashakira have you patched your kube-proxy config so that metrics are bound to 0.0.0.0:10249 instead of the default 127.0.0.1:10249 (ref)?
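
For reference, a hedged sketch of that change (the ConfigMap name is an assumption for EKS, where the kube-proxy configuration typically lives in kube-proxy-config in kube-system):

  # Open the kube-proxy configuration for editing
  kubectl -n kube-system edit configmap kube-proxy-config
  # In the KubeProxyConfiguration document, change
  #   metricsBindAddress: 127.0.0.1:10249
  # to
  #   metricsBindAddress: 0.0.0.0:10249
  # then restart kube-proxy so it picks up the change:
  kubectl -n kube-system rollout restart daemonset kube-proxy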

@serathius (Contributor)

@stevehipwell I think the issue you linked is about kube-proxy, which is unrelated to Metrics Server.

@stevehipwell (Contributor)

@serathius it's been a long time since I worked on this, so I might have misremembered which metrics component was impacted.

@mokhirashakira are you actually using Calico as your CNI?

@mokhirashakira (Author)

> @serathius it's been a long time since I worked on this, so I might have misremembered which metrics component was impacted.
>
> @mokhirashakira are you actually using Calico as your CNI?

Sorry, no, it's not Calico. It's the EKS VPC CNI: https://docs.aws.amazon.com/eks/latest/userguide/pod-networking.html

@stevehipwell (Contributor)

@mokhirashakira it looks like you're running v0.3.6 on an EKS v1.21 cluster. How are you installing MS? I know that the latest Helm chart works correctly on EKS v1.21.
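
For anyone comparing, a minimal sketch of a chart-based install from the upstream chart repository (release name and namespace are just examples):

  helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
  helm upgrade --install metrics-server metrics-server/metrics-server \
    --namespace kube-system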

@mokhirashakira (Author)

Not through Helm; it was installed using the following guide: https://docs.aws.amazon.com/eks/latest/userguide/metrics-server.html

@stevehipwell (Contributor)

@mokhirashakira have you tried re-running the apply step with an up-to-date manifest? The linked file is dynamic and needs to be kept current: the MS version in your report is v0.3.6, while the current MS version is v0.6.1. Looking at the release history, v0.3.6 is from October 2019 and targeted K8s v1.14.
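
For example, re-applying the current upstream manifest would look roughly like this (the URL is the upstream release asset rather than the AWS-hosted copy):

  kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml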

@serathius are the compatibilities on the README correct? Does MS just use a well-defined subset of client-go? Even if the binary is compatible with K8s v1.21, the manifests for v0.3.6 might not be?

@serathius (Contributor)

> are the compatibilities on the README correct?

Yes, to my knowledge; however, they are based on deprecation notices and we haven't done much testing.

> does MS just use a well-defined subset of client-go?

Not sure what you mean by "well-defined subset". Like any binary, we need to build with specific versions of dependencies. Before each release I try to make sure we pick up the latest client-go version; however, we cannot guarantee that it will be forward compatible forever.

> Even if the binary is compatible with K8s v1.21, the manifests for v0.3.6 might not be?

True. One example is that both the manifests and the binary pick specific API versions, like apps/v1 for Deployment. If those versions are no longer supported by K8s, either the manifests or the binary can break.
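
As an illustration, one quick way to check which of these group/versions a given cluster still serves:

  kubectl api-versions | grep -E 'apps|metrics.k8s.io'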

@stevehipwell (Contributor)

Thanks @serathius. I think we can say that MS is fairly compatible forwards and backwards, so the README compatibility matrix is correct on a best-effort basis.

> True. One example is that both the manifests and the binary pick specific API versions, like apps/v1 for Deployment. If those versions are no longer supported by K8s, either the manifests or the binary can break.

I think there could also be other aspects of the manifests that mean they apply correctly to an EKS v1.21 cluster but don't work, or break, when an older cluster is upgraded.

@yangjunmyfm192085 (Contributor)

@TBeijen commented Mar 24, 2022

I experienced a similar problem on EKS v1.21: v1beta1.metrics.k8s.io shown as unavailable via kubectl get apiservice, and HPAs not being able to scale.
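
For reference, the check in question; when the backing service is unreachable, the AVAILABLE column shows False with a failure reason (the exact reason string varies):

  kubectl get apiservice v1beta1.metrics.k8s.io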

CloudWatch logs from kube-controller-manager showed lines like:

E0322 09:35:26.782575      12 namespaced_resources_deleter.go:161] unable to get all supported resources from server: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
W0322 09:35:26.905136      12 garbagecollector.go:703] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
E0322 09:35:27.463353      12 memcache.go:196] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

In this case these were new clusters where some things differed from the clusters we already have running, where everything works fine:

The Terraform module by default gives the security group attached to the nodes rules like this (among others):

inbound TCP port 10250, source=<eks-cluster-sg-id>, "Cluster API to node kubelets"

After adding an SG rule matching the container port that metrics-server is configured with when using the latest Helm chart, everything worked:

inbound TCP port 4443, source=<eks-cluster-sg-id>, "Cluster API to metrics-server"
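
An equivalent sketch of that rule using the AWS CLI, with the same placeholder style as above for the group IDs:

  # Allow the cluster (control plane) SG to reach metrics-server's container
  # port on the nodes; both group IDs below are placeholders.
  aws ec2 authorize-security-group-ingress \
    --group-id <node-sg-id> \
    --protocol tcp \
    --port 4443 \
    --source-group <eks-cluster-sg-id>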

This does align with the endpoints of the metrics-server service:

$: kubectl -n kube-system get endpoints metrics-server
NAME             ENDPOINTS                               AGE
metrics-server   100.64.180.230:4443,100.64.5.211:4443   41d

Still wrapping my head around whether this makes sense; VPC CNI networking is not the easiest part of EKS.

Update: Reading the OP again, which really mentions a 401, my problem was obviously a different one. Comparing the ClusterRole rules in the OP, I notice a subtle difference between those and the ones installed via the Helm chart on EKS 1.21 (see the rules sketch after this list):

  • OP, and EKS 1.17 cluster: resource nodes/stats
  • EKS 1.21, using latest metrics-server helm chart: resource nodes/metrics
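
For comparison, the rules in the chart-installed ClusterRole look roughly like this (a sketch from memory of the v0.6.x manifest, not copied verbatim from a cluster):

  rules:
  - apiGroups:
    - ""
    resources:
    - nodes/metrics
    verbs:
    - get
  - apiGroups:
    - ""
    resources:
    - pods
    - nodes
    verbs:
    - get
    - list
    - watch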

@stevehipwell (Contributor)

@TBeijen your issue is/was separate and is specifically about the changes made in the v18 release of the EKS Terraform module, which dropped almost all SG rules. I'm not sure if the module docs have been updated, but it's covered in a number of issues. As an aside (and I'm sure you're aware of this), when using the AWS VPC CNI you don't need to run MS with host networking as long as your SGs are configured correctly.
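
If host networking were ever needed, the upstream chart exposes a toggle for it in its values (shown at the default here; the exact key is from the chart as I remember it):

  hostNetwork:
    # Only needed when pod networking can't reach the kubelets or the
    # control plane; not required with correctly configured SGs on the
    # AWS VPC CNI.
    enabled: false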

@TBeijen commented Mar 24, 2022

@stevehipwell Thanks for the confirmation. I was aware of the dropped SG rules from reading the v18 docs, so it is documented; I just failed to grasp the impact on extension API servers straight away.

This summarizes quite well how the node SG affects the extent to which the EKS API can access pods (https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html) and clarifies how the added SG rule indeed fixes things:

> By default, when new network interfaces are allocated for pods, ipamD uses the security groups and subnet of the node's primary network interface.

Sorry for the noise and for distracting from the OP's 401 issue.

@mokhirashakira (Author)

Upgrading the metrics server to the latest version helped fix the issue. Thank you!
