
EKS Metrics Server can't scrape pod/node metrics - Unauthorized 401 #963

Closed
mokhirashakira opened this issue Feb 18, 2022 · 15 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

@mokhirashakira commented Feb 18, 2022

What happened:
Metrics Server is not able to read metrics; the error is: metrics not available yet

  • HPAs can't read metrics
  • kubectl top pods/nodes returns error: metrics not available yet

What you expected to happen:

Metrics server to scrape all pods and nodes.

Anything else we need to know?:

Everything is in the details section.

Environment:

  • Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): EKS

  • Container Network Setup (flannel, calico, etc.): EKS VPC CNI

  • Kubernetes version (use kubectl version): 1.21.5-eks-bc4871b

  • Metrics Server manifest

spoiler for Metrics Server manifest:
  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    labels:
      rbac.authorization.k8s.io/aggregate-to-admin: "true"
      rbac.authorization.k8s.io/aggregate-to-edit: "true"
      rbac.authorization.k8s.io/aggregate-to-view: "true"
    name: system:aggregated-metrics-reader
  rules:
  - apiGroups:
    - metrics.k8s.io
    resources:
    - pods
    - nodes
    verbs:
    - get
    - list
    - watch
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRoleBinding
  metadata:
    name: metrics-server:system:auth-delegator
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: ClusterRole
    name: system:auth-delegator
  subjects:
  - kind: ServiceAccount
    name: metrics-server
    namespace: kube-system
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    name: metrics-server-auth-reader
    namespace: kube-system
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: Role
    name: extension-apiserver-authentication-reader
  subjects:
  - kind: ServiceAccount
    name: metrics-server
    namespace: kube-system
  ---
  apiVersion: apiregistration.k8s.io/v1
  kind: APIService
  metadata:
    name: v1beta1.metrics.k8s.io
  spec:
    group: metrics.k8s.io
    groupPriorityMinimum: 100
    insecureSkipTLSVerify: true
    service:
      name: metrics-server
      namespace: kube-system
      port: 443
    version: v1beta1
    versionPriority: 100
  ---
  apiVersion: v1
  kind: ServiceAccount
  metadata:
    name: metrics-server
    namespace: kube-system
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    labels:
      k8s-app: metrics-server
    name: metrics-server
    namespace: kube-system
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        k8s-app: metrics-server
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        creationTimestamp: null
        labels:
          k8s-app: metrics-server
        name: metrics-server
      spec:
        containers:
        - command:
          - /metrics-server
          - --v=2
          - --kubelet-preferred-address-types=InternalIP
          - --cert-dir=/tmp
          - --secure-port=4443
          image: private-repo:metrics-server-amd64-v0.3.6
          imagePullPolicy: IfNotPresent
          name: metrics-server
          ports:
          - containerPort: 4443
            name: main-port
            protocol: TCP
          resources: {}
          securityContext:
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 1000
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /tmp
            name: tmp-dir
        dnsPolicy: ClusterFirst
        nodeSelector:
          kubernetes.io/arch: amd64
          kubernetes.io/os: linux
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: metrics-server
        serviceAccountName: metrics-server
        terminationGracePeriodSeconds: 30
        volumes:
        - emptyDir: {}
          name: tmp-dir
  ---
  apiVersion: v1
  kind: Service
  metadata:
    labels:
      kubernetes.io/cluster-service: "true"
      kubernetes.io/name: Metrics-server
    name: metrics-server
    namespace: kube-system
  spec:
    ports:
    - port: 443
      protocol: TCP
      targetPort: main-port
    selector:
      k8s-app: metrics-server
    type: ClusterIP
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    name: system:metrics-server
  rules:
  - apiGroups:
    - ""
    resources:
    - pods
    - nodes
    - nodes/stats
    - namespaces
    - configmaps
    verbs:
    - get
    - list
    - watch
  • Metrics server logs:
spoiler for Metrics Server logs:
I0218 15:49:00.089765       1 manager.go:148] ScrapeMetrics: time: 30.029378925s, nodes: 0, pods: 0
E0218 15:49:00.089852       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-10-224-57-165.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-57-165.ec2.internal (10.224.57.165): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-57-153.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-57-153.ec2.internal (10.224.57.153): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-55-86.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-55-86.ec2.internal (10.224.55.86): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-51-140.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-51-140.ec2.internal (10.224.51.140): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-55-184.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-55-184.ec2.internal (10.224.55.184): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-57-40.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-57-40.ec2.internal (10.224.57.40): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-48-241.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-48-241.ec2.internal (10.224.48.241): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-53-70.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-53-70.ec2.internal (10.224.53.70): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-50-241.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-50-241.ec2.internal (10.224.50.241): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-57-184.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-57-184.ec2.internal (10.224.57.184): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-53-158.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-53-158.ec2.internal (10.224.53.158): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-10-224-51-42.ec2.internal: unable to fetch metrics from Kubelet ip-10-224-51-42.ec2.internal (10.224.51.42): Get https://10.224.51.42:10250/stats/summary?only_cpu_and_memory=true: dial tcp 10.224.51.42:10250: i/o timeout]
E0218 15:49:00.093086       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.093086       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.093250       1 errors.go:77] Unauthorized
E0218 15:49:00.099175       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.099367       1 errors.go:77] Unauthorized
E0218 15:49:00.109152       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.109462       1 errors.go:77] Unauthorized
E0218 15:49:00.115605       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.115756       1 errors.go:77] Unauthorized
E0218 15:49:00.125091       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.125264       1 errors.go:77] Unauthorized
E0218 15:49:00.130884       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.130983       1 errors.go:77] Unauthorized
E0218 15:49:00.139747       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.139917       1 errors.go:77] Unauthorized
E0218 15:49:00.145794       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.145963       1 errors.go:77] Unauthorized
E0218 15:49:00.155290       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.155471       1 errors.go:77] Unauthorized
E0218 15:49:00.161147       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.161306       1 errors.go:77] Unauthorized
E0218 15:49:00.181555       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.181730       1 errors.go:77] Unauthorized
E0218 15:49:00.187571       1 webhook.go:196] Failed to make webhook authorizer request: Unauthorized
E0218 15:49:00.187749       1 errors.go:77] Unauthorized
  • Status of Metrics API:
spoiler for Status of Metrics API:
kubectl describe apiservice v1beta1.metrics.k8s.io
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       <none>
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:            metrics-server
    Namespace:       kube-system
    Port:            443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2021-09-20T12:27:29Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:                    <none>

/kind bug

@k8s-ci-robot added the kind/bug label on Feb 18, 2022
@serathius (Contributor)

I'm not familiar with EKS; however, I remember a similar issue that was solved by adding some EKS-specific configuration. I wasn't able to find it just now, but will keep looking.

@stevehipwell (Contributor)

@mokhirashakira have you patched your kube-proxy config so that metrics are bound to 0.0.0.0:10249 instead of the default 127.0.0.1:10249 (ref)?
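
For reference, a hedged sketch of that change (the ConfigMap name is an assumption for EKS, where the kube-proxy configuration typically lives in kube-proxy-config in kube-system):

  # Open the kube-proxy configuration for editing
  kubectl -n kube-system edit configmap kube-proxy-config
  # In the KubeProxyConfiguration document, change
  #   metricsBindAddress: 127.0.0.1:10249
  # to
  #   metricsBindAddress: 0.0.0.0:10249
  # then restart kube-proxy so it picks up the change:
  kubectl -n kube-system rollout restart daemonset kube-proxy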

@serathius (Contributor)

@stevehipwell I think the issue you linked is about kube-proxy, which is unrelated to Metrics Server.

@stevehipwell (Contributor)

@serathius it's been a long time since I worked on this, so I might have misremembered which metrics component was impacted.

@mokhirashakira are you actually using Calico as your CNI?

@mokhirashakira (Author)

> @serathius it's been a long time since I worked on this, so I might have misremembered which metrics component was impacted.
>
> @mokhirashakira are you actually using Calico as your CNI?

Sorry, no, it's not Calico. It's the EKS VPC CNI: https://docs.aws.amazon.com/eks/latest/userguide/pod-networking.html

@stevehipwell (Contributor)

@mokhirashakira it looks like you're running v0.3.6 on an EKS v1.21 cluster. How are you installing MS? I know that the latest Helm chart works correctly on EKS v1.21.
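
For anyone comparing, a minimal sketch of a chart-based install from the upstream chart repository (release name and namespace are just examples):

  helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
  helm upgrade --install metrics-server metrics-server/metrics-server \
    --namespace kube-system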

@mokhirashakira (Author)

Not through Helm; it was installed using the following guide: https://docs.aws.amazon.com/eks/latest/userguide/metrics-server.html

@stevehipwell (Contributor)

@mokhirashakira have you tried re-running the apply step with an up-to-date manifest? The linked file is dynamic and needs to be kept current: the MS version in your report is v0.3.6, while the current MS version is v0.6.1. Looking at the release history, v0.3.6 is from October 2019 and targeted K8s v1.14.
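
For example, re-applying the current upstream manifest would look roughly like this (the URL is the upstream release asset rather than the AWS-hosted copy):

  kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml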

@serathius are the compatibilities on the README correct? Does MS just use a well-defined subset of client-go? Even if the binary is compatible with K8s v1.21, the manifests for v0.3.6 might not be?

@serathius (Contributor)

> are the compatibilities on the README correct?

Yes, to my knowledge; however, they are based on deprecation notices and we haven't done much testing.

> does MS just use a well-defined subset of client-go?

Not sure what you mean by "well-defined subset". Like any binary, we need to build with specific versions of dependencies. Before each release I try to make sure we pick up the latest client-go version; however, we cannot guarantee that it will be forward compatible forever.

> Even if the binary is compatible with K8s v1.21, the manifests for v0.3.6 might not be?

True. One example is that both the manifests and the binary pick specific API versions, like apps/v1 for Deployment. If those versions are no longer supported by K8s, either the manifests or the binary can break.
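
As an illustration, one quick way to check which of these group/versions a given cluster still serves:

  kubectl api-versions | grep -E 'apps|metrics.k8s.io'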

@stevehipwell (Contributor)

Thanks @serathius. I think we can say that MS is fairly compatible forwards and backwards, so the README compatibility matrix is correct on a best-effort basis.

> True. One example is that both the manifests and the binary pick specific API versions, like apps/v1 for Deployment. If those versions are no longer supported by K8s, either the manifests or the binary can break.

I think there could also be other aspects of the manifests that mean they apply correctly to an EKS v1.21 cluster but don't work, or break, when an older cluster is upgraded.

@yangjunmyfm192085 (Contributor)

@TBeijen commented Mar 24, 2022

I experienced a similar problem on EKS v1.21: v1beta1.metrics.k8s.io shown as unavailable via kubectl get apiservice, and HPAs not being able to scale.
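
For reference, the check in question; when the backing service is unreachable, the AVAILABLE column shows False with a failure reason (the exact reason string varies):

  kubectl get apiservice v1beta1.metrics.k8s.io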

CloudWatch logs from kube-controller-manager showed lines like:

E0322 09:35:26.782575      12 namespaced_resources_deleter.go:161] unable to get all supported resources from server: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
W0322 09:35:26.905136      12 garbagecollector.go:703] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
E0322 09:35:27.463353      12 memcache.go:196] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

In this case these were new clusters where some things differed from the clusters we already have running, where everything works fine:

The Terraform module by default gives the security group attached to the nodes rules like this (among others):

inbound TCP port 10250, source=<eks-cluster-sg-id>, "Cluster API to node kubelets"

After adding an SG rule matching the container port that metrics-server is configured with when using the latest Helm chart, everything worked:

inbound TCP port 4443, source=<eks-cluster-sg-id>, "Cluster API to metrics-server"
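
An equivalent sketch of that rule using the AWS CLI, with the same placeholder style as above for the group IDs:

  # Allow the cluster (control plane) SG to reach metrics-server's container
  # port on the nodes; both group IDs below are placeholders.
  aws ec2 authorize-security-group-ingress \
    --group-id <node-sg-id> \
    --protocol tcp \
    --port 4443 \
    --source-group <eks-cluster-sg-id>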

This does align with the endpoints of the metrics-server service:

$: kubectl -n kube-system get endpoints metrics-server
NAME             ENDPOINTS                               AGE
metrics-server   100.64.180.230:4443,100.64.5.211:4443   41d

Still wrapping my head around whether this makes sense; VPC CNI networking is not the easiest part of EKS.

Update: Reading the OP again, which really mentions a 401, my problem was obviously a different one. Comparing the ClusterRole rules in the OP, I notice a subtle difference between those and the ones installed via the Helm chart on EKS 1.21 (see the rules sketch after this list):

  • OP, and EKS 1.17 cluster: resource nodes/stats
  • EKS 1.21, using latest metrics-server helm chart: resource nodes/metrics
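
For comparison, the rules in the chart-installed ClusterRole look roughly like this (a sketch from memory of the v0.6.x manifest, not copied verbatim from a cluster):

  rules:
  - apiGroups:
    - ""
    resources:
    - nodes/metrics
    verbs:
    - get
  - apiGroups:
    - ""
    resources:
    - pods
    - nodes
    verbs:
    - get
    - list
    - watch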

@stevehipwell (Contributor)

@TBeijen your issue is/was separate and is specifically about the changes made in the v18 release of the EKS Terraform module, which dropped almost all SG rules. I'm not sure if the module docs have been updated, but it's covered in a number of issues. As an aside (and I'm sure you're aware of this), when using the AWS VPC CNI you don't need to run MS with host networking as long as your SGs are configured correctly.
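
If host networking were ever needed, the upstream chart exposes a toggle for it in its values (shown at the default here; the exact key is from the chart as I remember it):

  hostNetwork:
    # Only needed when pod networking can't reach the kubelets or the
    # control plane; not required with correctly configured SGs on the
    # AWS VPC CNI.
    enabled: false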

@TBeijen commented Mar 24, 2022

@stevehipwell Thanks for the confirmation. I was aware of the dropped SG rules from reading the v18 docs, so it is documented; I just failed to grasp the impact on extension API servers straight away.

This summarizes quite well how the node SG affects the extent to which the EKS API can access pods (https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html) and clarifies how the added SG rule indeed fixes things:

> By default, when new network interfaces are allocated for pods, ipamD uses the security groups and subnet of the node's primary network interface.

Sorry for the noise and for distracting from the OP's 401 issue.

@mokhirashakira (Author)

Upgrading the metrics server to the latest version helped fix the issue. Thank you!
