
EKS 1.26 Nodes are getting a taint even though they are healthy #1446

Closed
ardenercelik opened this issue Sep 29, 2023 · 6 comments

Comments

@ardenercelik

ardenercelik commented Sep 29, 2023

What happened:
We are trying to upgrade from EKS 1.25 to EKS 1.26. After upgrading the AMI from amazon-eks-node-1.25-v20230825 to amazon-eks-node-1.26-v20230919, the instances receive a taint even though the nodes appear to be healthy. I also noticed that the instance type shown in the console is empty, because some of the labels are not added to the node. We also verified from the system log that IMDS can be reached, and bootstrap.sh does not throw any errors.

In the attachments below you can see some relevant outputs.

Conditions and labels from Node - amazon-eks-node-1.26-v20230919

    "labels": {
      "beta.kubernetes.io/arch": "amd64",
      "beta.kubernetes.io/os": "linux",
      "eks.amazonaws.com/capacityType": "ON_DEMAND",
      "eks.amazonaws.com/nodegroup": "eks-st-managed-backend",
      "eks.amazonaws.com/nodegroup-image": "ami-03825bd685c9d66bb",
      "eks.amazonaws.com/sourceLaunchTemplateId": "lt-099bbf51078483fc8",
      "eks.amazonaws.com/sourceLaunchTemplateVersion": "26",
      "faro.com/backend": "1",
      "faro.com/frontend": "1",
      "faro.com/ingress": "1",
      "faro.com/monitoring": "1",
      "faro.com/worker": "1",
      "k8s.io/cloud-provider-aws": "355d0ea6f0ac02b37d5ad83235a4f0f2",
      "kubernetes.io/arch": "amd64",
      "kubernetes.io/hostname": "ip-10-0-33-238.eu-west-1.compute.internal",
      "kubernetes.io/os": "linux"
    }
---------
"conditions": [
      {
        "type": "MemoryPressure",
        "status": "False",
        "lastHeartbeatTime": "2023-09-27T12:49:00Z",
        "lastTransitionTime": "2023-09-27T12:47:57Z",
        "reason": "KubeletHasSufficientMemory",
        "message": "kubelet has sufficient memory available"
      },
      {
        "type": "DiskPressure",
        "status": "False",
        "lastHeartbeatTime": "2023-09-27T12:49:00Z",
        "lastTransitionTime": "2023-09-27T12:47:57Z",
        "reason": "KubeletHasNoDiskPressure",
        "message": "kubelet has no disk pressure"
      },
      {
        "type": "PIDPressure",
        "status": "False",
        "lastHeartbeatTime": "2023-09-27T12:49:00Z",
        "lastTransitionTime": "2023-09-27T12:47:57Z",
        "reason": "KubeletHasSufficientPID",
        "message": "kubelet has sufficient PID available"
      },
      {
        "type": "Ready",
        "status": "True",
        "lastHeartbeatTime": "2023-09-27T12:49:00Z",
        "lastTransitionTime": "2023-09-27T12:48:14Z",
        "reason": "KubeletReady",
        "message": "kubelet is posting ready status"
      }
    ],

Conditions and labels from Node - amazon-eks-node-1.25-v20230825

    "labels": {
      "beta.kubernetes.io/arch": "amd64",
      "beta.kubernetes.io/instance-type": "t3.medium",
      "beta.kubernetes.io/os": "linux",
      "eks.amazonaws.com/capacityType": "ON_DEMAND",
      "eks.amazonaws.com/nodegroup": "eks-st-managed-backend",
      "eks.amazonaws.com/nodegroup-image": "ami-03ed1b0118ecc804f",
      "eks.amazonaws.com/sourceLaunchTemplateId": "lt-099bbf51078483fc8",
      "eks.amazonaws.com/sourceLaunchTemplateVersion": "25",
      "failure-domain.beta.kubernetes.io/region": "eu-west-1",
      "failure-domain.beta.kubernetes.io/zone": "eu-west-1b",
      "faro.com/backend": "1",
      "faro.com/frontend": "1",
      "faro.com/ingress": "1",
      "faro.com/monitoring": "1",
      "faro.com/worker": "1",
      "k8s.io/cloud-provider-aws": "355d0ea6f0ac02b37d5ad83235a4f0f2",
      "kubernetes.io/arch": "amd64",
      "kubernetes.io/hostname": "ip-10-0-37-208.eu-west-1.compute.internal",
      "kubernetes.io/os": "linux",
      "node.kubernetes.io/instance-type": "t3.medium",
      "topology.ebs.csi.aws.com/zone": "eu-west-1b",
      "topology.kubernetes.io/region": "eu-west-1",
      "topology.kubernetes.io/zone": "eu-west-1b"
    }
----------------
    "conditions": [
      {
        "type": "MemoryPressure",
        "status": "False",
        "lastHeartbeatTime": "2023-09-27T12:50:41Z",
        "lastTransitionTime": "2023-09-27T10:03:53Z",
        "reason": "KubeletHasSufficientMemory",
        "message": "kubelet has sufficient memory available"
      },
      {
        "type": "DiskPressure",
        "status": "False",
        "lastHeartbeatTime": "2023-09-27T12:50:41Z",
        "lastTransitionTime": "2023-09-27T10:03:53Z",
        "reason": "KubeletHasNoDiskPressure",
        "message": "kubelet has no disk pressure"
      },
      {
        "type": "PIDPressure",
        "status": "False",
        "lastHeartbeatTime": "2023-09-27T12:50:41Z",
        "lastTransitionTime": "2023-09-27T10:03:53Z",
        "reason": "KubeletHasSufficientPID",
        "message": "kubelet has sufficient PID available"
      },
      {
        "type": "Ready",
        "status": "True",
        "lastHeartbeatTime": "2023-09-27T12:50:41Z",
        "lastTransitionTime": "2023-09-27T10:04:08Z",
        "reason": "KubeletReady",
        "message": "kubelet is posting ready status"
      }
    ],

How to reproduce it (as minimally and precisely as possible):
Change the 1.25 AMI to the 1.26 one.
Anything else we need to know?:

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): t3.medium
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.7"
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): "1.25"
  • AMI Version: amazon-eks-node-1.26-v20230919
  • Kernel (e.g. uname -a): Linux ip-10-0-32-107.ec2.internal 5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0963f2c76238b64d5"
BUILD_TIME="Tue Sep 19 17:51:08 UTC 2023"
BUILD_KERNEL="5.10.192-183.736.amzn2.x86_64"
ARCH="x86_64"

eks-command-outputs.txt
instance-type-blank
taints
ebs-csi-logs.txt

@cartermckinnon
Member

What taint is being applied?

@ardenercelik
Author

ardenercelik commented Sep 29, 2023

> What taint is being applied?
This is the taint in the console.

   "taints": [
      {
        "key": "node.cloudprovider.kubernetes.io/uninitialized",
        "value": "true",
        "effect": "NoSchedule"
      }
    ]

@cartermckinnon
Member

cartermckinnon commented Sep 29, 2023

TLDR: The taint is expected behavior when --cloud-provider=external is used for kubelet.

Some more info in the k8s docs: https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager

In the past, kubelet called cloud-provider APIs directly and had a bunch of cloud-provider-specific code compiled into it as a result. There has been an effort over many Kubernetes releases to remove this logic from kubelet, moving it to a control plane component (cloud-controller-manager) as needed. The kubelet applies this taint before joining the cluster, and cloud-controller-manager removes it once it fulfills its duties. This happens very quickly in most cases.
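
As a quick sanity check, you can confirm whether the taint is still on a node with kubectl; this is just a generic sketch, and the node name below is taken from the labels posted above:

    # Show any remaining taint keys across all nodes
    kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

    # Inspect the taints on a single node
    kubectl get node ip-10-0-33-238.eu-west-1.compute.internal -o jsonpath='{.spec.taints}'

If node.cloudprovider.kubernetes.io/uninitialized is still listed more than a few minutes after the node joins, the cloud-controller-manager has not finished initializing it.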

Are you seeing that the taint is not removed?

@cartermckinnon closed this as not planned on Oct 3, 2023
@ardenercelik
Author

Hello, yes, even after 24 hours the taint does not get removed, even though the instances are healthy and can reach the IMDS.

@cartermckinnon
Member

Please open a ticket with AWS Support; we'll have to look into your specific environment. 👍

@akshaypatidar1999

I was getting the same error. It turns out the issue was with the IAM role permissions: the cluster role did not have the DescribeAvailabilityZones permission.
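
In case it helps others: the action in question is ec2:DescribeAvailabilityZones, which the AWS-managed AmazonEKSClusterPolicy normally grants. A minimal sketch of attaching that policy to the cluster IAM role (the role name here is only a placeholder):

    # Attach the managed policy that includes ec2:DescribeAvailabilityZones
    # to the EKS cluster IAM role; replace the role name with your own.
    aws iam attach-role-policy \
      --role-name my-eks-cluster-role \
      --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy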
