Karpenter doesn't remove empty nodes #1914

Closed
itaibenyishai opened this issue Jun 9, 2022 · 5 comments
Labels
bug Something isn't working

Comments

itaibenyishai commented Jun 9, 2022

Version

Karpenter: using the snapshot image of commit 06bd428
Kubernetes: v1.19

Expected Behavior

Karpenter should remove empty nodes from the cluster after ttlSecondsAfterEmpty elapses.

Actual Behavior

Karpenter does not remove the empty nodes. Only after rolling back to 0.10.0 did it remove them and respect ttlSecondsAfterEmpty.

Steps to Reproduce the Problem

  1. Deploy the snapshot image of Karpenter:
kubectl set image deployment/karpenter -n karpenter controller=public.ecr.aws/karpenter-snapshots/controller:06bd4282b6cfd2419b5f6e31340b7d32d00d167e
  2. Scale a deployment up so that Karpenter adds nodes (example scale commands below).

  3. Scale the deployment to 0 and notice that after ttlSecondsAfterEmpty the empty nodes are still in the cluster.

  4. Roll back to 0.10.0:

kubectl set image deployment/karpenter -n karpenter controller=public.ecr.aws/karpenter/controller:v0.10.0@sha256:e27cc9fb91f80ed9c5c26202c984a9d7d871ce6008dd4f83f50f3516c9f2ce8e
  5. Notice that the nodes are removed. The 0.10.0 controller logs also contain entries about "applying ttl" and "removing empty nodes"; these logs did not appear with the snapshot image.
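
For reference, steps 2–3 can be exercised with something like the following (the replica count is illustrative; the inflate deployment is shown under Resource Specs and Logs below):

kubectl scale deployment/inflate --replicas=10   # Karpenter should provision nodes for the pending pods
kubectl scale deployment/inflate --replicas=0    # nodes become empty; ttlSecondsAfterEmpty should apply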

Resource Specs and Logs

Provisioner spec:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  ttlSecondsUntilExpired: 2592000
  ttlSecondsAfterEmpty: 30
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - g4dn.2xlarge
    - key: karpenter.sh/capacity-type
      operator: In
      values:
      - on-demand
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64      
  limits:
    resources:
      cpu: "25"
      memory: 98Gi
  provider:
    subnetSelector:
      karpenter.sh/discovery: 'eksworkshop-eksctl'
    securityGroupSelector:
      karpenter.sh/discovery: 'eksworkshop-eksctl'

The deployment we scaled to test Karpenter:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 6
              memory: 250Mi

Karpenter controller logs (snapshot image) when we experienced the bug:

2022-06-09T07:04:39.905Z	INFO	controller	Waiting for unschedulable pods	{"commit": "06bd428"}
2022-06-09T07:04:44.282Z	INFO	controller	Waiting for unschedulable pods	{"commit": "06bd428"}
2022-06-09T07:07:35.039Z	DEBUG	controller	Discovered 529 EC2 instance types	{"commit": "06bd428"}
2022-06-09T07:07:35.318Z	DEBUG	controller	Discovered subnets: [subnet-02356b6e6ea319f1d (us-east-1a) subnet-0c089b62cee3b65bd (us-east-1b) subnet-00cfe0e66b10b3821 (us-east-1c)]	{"commit": "06bd428"}
2022-06-09T07:07:35.552Z	DEBUG	controller	Discovered EC2 instance types zonal offerings	{"commit": "06bd428"}
2022-06-09T07:10:28.353Z	DEBUG	controller.aws.launchtemplate	Deleted launch template lt-08d15ea2c6b67fb4c	{"commit": "06bd428"}
2022-06-09T07:12:37.000Z	DEBUG	controller	Discovered 529 EC2 instance types	{"commit": "06bd428"}
2022-06-09T07:12:37.229Z	DEBUG	controller	Discovered subnets: [subnet-02356b6e6ea319f1d (us-east-1a) subnet-0c089b62cee3b65bd (us-east-1b) subnet-00cfe0e66b10b3821 (us-east-1c)]	{"commit": "06bd428"}
2022-06-09T07:12:37.464Z	DEBUG	controller	Discovered EC2 instance types zonal offerings	{"commit": "06bd428"}
2022-06-09T07:17:38.420Z	DEBUG	controller	Discovered 529 EC2 instance types	{"commit": "06bd428"}
2022-06-09T07:17:38.672Z	DEBUG	controller	Discovered subnets: [subnet-02356b6e6ea319f1d (us-east-1a) subnet-0c089b62cee3b65bd (us-east-1b) subnet-00cfe0e66b10b3821 (us-east-1c)]	{"commit": "06bd428"}
2022-06-09T07:17:38.868Z	DEBUG	controller	Discovered EC2 instance types zonal offerings	{"commit": "06bd428"}

No logs mention TTL or empty nodes (we waited over 15 minutes even though ttlSecondsAfterEmpty is 30 seconds).
After rolling back to 0.10.0, these are the logs:

2022-06-09T07:25:05.101Z	INFO	controller.controller.node	Starting workers	{"commit": "00661aa", "reconciler group": "", "reconciler kind": "Node", "worker count": 10}
2022-06-09T07:25:05.101Z	INFO	controller.node	Added TTL to empty node	{"commit": "00661aa", "node": "ip-192-168-61-113.ec2.internal"}
2022-06-09T07:25:05.106Z	INFO	controller.node	Added TTL to empty node	{"commit": "00661aa", "node": "ip-192-168-18-196.ec2.internal"}
2022-06-09T07:25:05.118Z	INFO	controller.node	Added TTL to empty node	{"commit": "00661aa", "node": "ip-192-168-51-180.ec2.internal"}
2022-06-09T07:25:05.143Z	INFO	controller.node	Added TTL to empty node	{"commit": "00661aa", "node": "ip-192-168-18-196.ec2.internal"}
2022-06-09T07:25:05.146Z	INFO	controller.node	Added TTL to empty node	{"commit": "00661aa", "node": "ip-192-168-51-180.ec2.internal"}
2022-06-09T07:25:35.001Z	INFO	controller.node	Triggering termination after 30s for empty node	{"commit": "00661aa", "node": "ip-192-168-61-113.ec2.internal"}
2022-06-09T07:25:35.002Z	INFO	controller.node	Triggering termination after 30s for empty node	{"commit": "00661aa", "node": "ip-192-168-51-180.ec2.internal"}
2022-06-09T07:25:35.002Z	INFO	controller.node	Triggering termination after 30s for empty node	{"commit": "00661aa", "node": "ip-192-168-18-196.ec2.internal"}
2022-06-09T07:25:35.043Z	INFO	controller.termination	Cordoned node	{"commit": "00661aa", "node": "ip-192-168-61-113.ec2.internal"}
2022-06-09T07:25:35.049Z	INFO	controller.termination	Cordoned node	{"commit": "00661aa", "node": "ip-192-168-18-196.ec2.internal"}
2022-06-09T07:25:35.058Z	INFO	controller.termination	Cordoned node	{"commit": "00661aa", "node": "ip-192-168-51-180.ec2.internal"}
2022-06-09T07:25:35.259Z	INFO	controller.termination	Deleted node	{"commit": "00661aa", "node": "ip-192-168-61-113.ec2.internal"}
2022-06-09T07:25:35.293Z	INFO	controller.termination	Deleted node	{"commit": "00661aa", "node": "ip-192-168-18-196.ec2.internal"}
2022-06-09T07:25:35.323Z	INFO	controller.termination	Deleted node	{"commit": "00661aa", "node": "ip-192-168-51-180.ec2.internal"}
itaibenyishai added the bug (Something isn't working) label on Jun 9, 2022
tzneal (Contributor) commented Jun 9, 2022

You can look at the initialized label to see if the nodes were ever initialized per Karpenter (kubelet reporting ready, any startup taints removed, and all extended resources registered):

kubectl get node -L karpenter.sh/initialized

Since you are using the g4dn instance types, you will also need to run the NVIDIA device plugin DaemonSet, which is responsible for registering the GPU extended resources. Without it, Karpenter waits for the NVIDIA GPU resource indefinitely.

https://github.com/NVIDIA/k8s-device-plugin
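
One common way to deploy the device plugin is via its Helm chart (a sketch based on the plugin's README; check there for the currently recommended version and values, and note the namespace below is just an example):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace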

itaibenyishai (Author)

Is the wait for the NVIDIA GPU resource new? In the previous 0.10.0 version Karpenter did remove the empty nodes.
Anyway, I changed the instance type to c5 and Karpenter did remove the empty nodes. Thank you, and sorry for the misunderstanding! 🙏

tzneal (Contributor) commented Jun 9, 2022

Yes, this is new behavior. kubelet zeros out extended resources on startup, which made launching nodes with GPU resources unreliable: Karpenter would place the resources on the nodes at creation, kubelet would zero them out, and then the pods would be evicted. If the device plugin responsible for registering the resources responded quickly enough, the replacement pods could be rebound. To work around this, Karpenter no longer binds pods to nodes and instead lets kube-scheduler perform the binding after the resources are registered by the device plugin.

If the extended resources never register, we don't consider the node empty, to allow it time to finish initializing. Keeping the node also makes configuration issues like this one (a missing device plugin) detectable, instead of launching another node that would fail the same way. Feel free to re-open if you have any other questions or experience any other odd behavior with the snapshot.
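
A quick way to check whether the GPU extended resource was ever registered on a node (the node name is a placeholder):

kubectl describe node <node-name> | grep -A 10 'Allocatable:'
# a fully initialized GPU node should list nvidia.com/gpu with a non-zero count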

tzneal closed this as completed on Jun 9, 2022
itaibenyishai (Author) commented Jun 12, 2022

Is there a way to limit how long Karpenter keeps nodes that never initialize?
And could logs be added for visibility, so it's clear why these nodes aren't removed?

In my case, because of my misconfiguration the nodes were up for over 24 hours; a configuration issue like this can get pretty costly if Karpenter keeps the nodes forever 😅

Thanks!

gorkemgoknar commented Jul 13, 2023

This new behaviour is odd and leaves nodes hanging, which is especially expensive for GPU nodes.

Currently no "work" pod is using these nodes, and according to the provisioner they should be deleted after 30 seconds; they should not have to wait for the expiry time (whose default is more than a year).

Karpenter should not depend on an external plugin for node cleanup (unless we specifically require it):
https://github.com/NVIDIA/k8s-device-plugin

Notes from the NVIDIA device plugin README: the plugin is currently lacking
  - comprehensive GPU health checking features
  - GPU cleanup features

  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 86400
ip-10-0-XX-135.ec2.internal            Ready,SchedulingDisabled   <none>   12h     v1.23.17-eks-0a21954         spot            dedicated-node                g4dn.xlarge     karpenter-gpu-fleet
ip-10-0-XX-73.ec2.internal             Ready,SchedulingDisabled   <none>   12h     v1.23.17-eks-0a21954         spot            dedicated-node                g4dn.xlarge     karpenter-gpu-fleet

Above are just two examples; more nodes like this fail to be descheduled after a scale-down.

Even though there is no request for a new node, nodes with a FailedConsistencyCheck event should be removed after some time:

Warning FailedConsistencyCheck 8m5s (x68 over 11h) karpenter expected resource "nvidia.com/gpu" didn't register on the node


Name:              NODE_NAME
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=g4dn.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1a
                    k8s.io/cloud-provider-aws=XXXXXXXXXXXXXX
                    karpenter.k8s.aws/instance-accelerator-count=1
                    karpenter.k8s.aws/instance-accelerator-manufacturer=nvidia
                    karpenter.k8s.aws/instance-accelerator-memory=16384
                    karpenter.k8s.aws/instance-accelerator-name=t4
                    karpenter.k8s.aws/instance-ami-id=ami-XXXXXXXXXXXXXX
                    karpenter.k8s.aws/instance-category=g
                    karpenter.k8s.aws/instance-cpu=4
                    karpenter.k8s.aws/instance-encryption-in-transit-supported=true
                    karpenter.k8s.aws/instance-family=g4dn
                    karpenter.k8s.aws/instance-generation=4
                    karpenter.k8s.aws/instance-gpu-count=1
                    karpenter.k8s.aws/instance-gpu-manufacturer=nvidia
                    karpenter.k8s.aws/instance-gpu-memory=16384
                    karpenter.k8s.aws/instance-gpu-name=t4
                    karpenter.k8s.aws/instance-hypervisor=nitro
                    karpenter.k8s.aws/instance-local-nvme=125
                    karpenter.k8s.aws/instance-memory=16384
                    karpenter.k8s.aws/instance-network-bandwidth=5000
                    karpenter.k8s.aws/instance-pods=29
                    karpenter.k8s.aws/instance-size=xlarge
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/provisioner-name=karpenter-gpu-fleet
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-XX-YYY.ec2.internal
                    kubernetes.io/os=linux
                    node.coqui.com/instance-type=gpu
                    node.coqui.com/node-type=dedicated-node
                    node.kubernetes.io/instance-type=g4dn.xlarge
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1a
Annotations:        alpha.kubernetes.io/provided-node-ip: XXXXXXX
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 13 Jul 2023 00:11:28 +0300
Taints:             node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  NODE_NAME
  AcquireTime:     <unset>
  RenewTime:       Thu, 13 Jul 2023 12:28:59 +0300
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Thu, 13 Jul 2023 12:27:13 +0300   Thu, 13 Jul 2023 00:12:07 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 13 Jul 2023 12:27:13 +0300   Thu, 13 Jul 2023 00:12:07 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 13 Jul 2023 12:27:13 +0300   Thu, 13 Jul 2023 00:12:07 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Thu, 13 Jul 2023 12:27:13 +0300   Thu, 13 Jul 2023 00:12:28 +0300   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   INTERNALIP
  Hostname:     NODENAME
  InternalDNS:  NODENAME
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         4
  ephemeral-storage:           73388012Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16078204Ki
  pods:                        29
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         3920m
  ephemeral-storage:           66560649924
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15388028Ki
  pods:                        29
System Info:
  Machine ID:                 XXXXXXXXXXXXXXXXXXXXX
  System UUID:                XXXXXXXXXXXXXXXXXXXXX
  Boot ID:                    XXXXXXXXXXXXXXXXXXXXX
  Kernel Version:             5.4.247-162.350.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.19
  Kubelet Version:            v1.23.17-eks-0a21954
  Kube-Proxy Version:         v1.23.17-eks-0a21954
ProviderID:                   XXXXXXXXXXXXXXXXXXXXX
Non-terminated Pods:          (9 in total)
  Namespace                   Name                                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                    ------------  ----------  ---------------  -------------  ---
  kube-system                 aws-cloudwatch-agent-aws-cloudwatch-metrics-58nqz       50m (1%)      100m (2%)   50Mi (0%)        200Mi (1%)     12h
  kube-system                 aws-for-fluent-bit-f8rnc                                50m (1%)      0 (0%)      50Mi (0%)        250Mi (1%)     12h
  kube-system                 aws-node-qwrgg                                          25m (0%)      0 (0%)      0 (0%)           0 (0%)         12h
  kube-system                 aws-node-termination-handler-ht9tx                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         12h
  kube-system                 ebs-csi-node-8cjz2                                      30m (0%)      0 (0%)      120Mi (0%)       768Mi (5%)     12h
  kube-system                 kube-proxy-tmvtx                                        100m (2%)     0 (0%)      0 (0%)           0 (0%)         12h
  monitoring                  kube-prometheus-stack-prometheus-node-exporter-fbw9b    0 (0%)        0 (0%)      0 (0%)           0 (0%)         12h
  monitoring                  nvidia-dcgm-exporter-l7wjm                              50m (1%)      100m (2%)   128Mi (0%)       128Mi (0%)     12h
  nvidia-device-plugin        nvidia-device-plugin-dsg7l                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         12h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests    Limits
  --------                    --------    ------
  cpu                         305m (7%)   200m (5%)
  memory                      348Mi (2%)  1346Mi (8%)
  ephemeral-storage           0 (0%)      0 (0%)
  hugepages-1Gi               0 (0%)      0 (0%)
  hugepages-2Mi               0 (0%)      0 (0%)
  attachable-volumes-aws-ebs  0           0
Events:
  Type     Reason                  Age                  From       Message
  ----     ------                  ----                 ----       -------
  Warning  FailedConsistencyCheck  8m5s (x68 over 11h)  karpenter  expected resource "nvidia.com/gpu" didn't register on the node
