We're using kops `1.10.0` and k8s `1.10.11`. We're using two separate instance groups (IGs), `nodes` (on-demand) and `spots` (spot), both spread across 3 availability zones. I've applied the appropriate nodeLabels and have defined the following in my k8s-spot-rescheduler deployment manifest:

The `nodes` IG has the `spot=false:PreferNoSchedule` taint so the `spots` IG is preferred (both IG specs are sketched below, after the question). I'm using the cluster autoscaler to autodiscover both IGs via `--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/kubernetes.metis.wtf`, and these tags exist on both IGs. I've confirmed that pods on most `nodes` nodes can be drained and moved to `spots` nodes, with one exception:
The `spots` IG was set to `minSize: 1` and `maxSize: 3`, and we had one `spots` node up and running in us-east-1c.
k8s-spot-rescheduler attempted to drain the pods on a `nodes` node but failed with:
I0117 02:16:49.099271 1 rescheduler.go:288] Considering ip-172-20-127-232.ec2.internal for removal
I0117 02:16:49.099797 1 rescheduler.go:293] Cannot drain node: pod metis-internal/rabbitmq-0 can't be rescheduled on any existing spot node
`metis-internal/rabbitmq-0` is a StatefulSet pod with a PVC.
The PVC resides in us-east-1a, so it makes sense that the pod couldn't be scheduled on the `spots` node in us-east-1c.
Why didn't the failure to schedule `metis-internal/rabbitmq-0` trigger the cluster autoscaler to keep provisioning new `spots` nodes until it created one in the same availability zone? I'm wondering whether, if k8s-spot-rescheduler had actually evicted the pod, the cluster autoscaler would have noticed that a pod needed to be scheduled and would have spun up a new node in the `spots` IG.
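For context, here's roughly what the two kops InstanceGroup specs described above look like. This is a hedged sketch rather than our actual manifests: the machine types, the `maxPrice`, and the exact nodeLabel keys are assumptions, while the names, sizes, zones, and autoscaler tags are taken from the description above.

```yaml
# Rough sketch of the two InstanceGroups -- machineType, maxPrice, and label keys are assumptions.
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: kubernetes.metis.wtf
  name: nodes
spec:
  role: Node
  machineType: m4.large                     # assumption
  minSize: 3                                # assumption
  maxSize: 6                                # assumption
  subnets: [us-east-1a, us-east-1b, us-east-1c]
  nodeLabels:
    lifecycle: OnDemand                     # assumption: whatever on-demand label the rescheduler matches
  taints:
    - spot=false:PreferNoSchedule
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    kubernetes.io/cluster/kubernetes.metis.wtf: ""
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: kubernetes.metis.wtf
  name: spots
spec:
  role: Node
  machineType: m4.large                     # assumption
  maxPrice: "0.10"                          # assumption: marks the group as spot
  minSize: 1
  maxSize: 3
  subnets: [us-east-1a, us-east-1b, us-east-1c]
  nodeLabels:
    lifecycle: Ec2Spot                      # assumption: whatever spot label the rescheduler matches
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    kubernetes.io/cluster/kubernetes.metis.wtf: ""
```

As a hedged aside on the question itself: as far as I understand, when a spot ASG spans all three AZs, the cluster autoscaler can ask the ASG for a new node but can't control which zone it comes up in, which is why the autoscaler docs generally recommend one node group per AZ when pods are pinned to zonal EBS volumes like this PVC.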
The taint can be added to the on-demand instance group rather than the spot-instance IG, like below:
labels = "kubernetes.io/role=common,lifecycle=OnDemand"
taints = "lifecycle=OnDemand:PreferNoSchedule"
This works for me.
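For anyone applying this with kops, a small sketch of how those values map onto an InstanceGroup spec; the label and taint values are just the ones quoted above:

```yaml
# Sketch: the suggested labels/taint expressed as kops InstanceGroup fields.
spec:
  nodeLabels:
    kubernetes.io/role: common
    lifecycle: OnDemand
  taints:
    - lifecycle=OnDemand:PreferNoSchedule
```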
In my experience the taint just tells the K8s scheduler to try scheduling any unscheduled pods onto an existing spot instance node, and it doesn't tell the cluster autoscaler to scale up on spot instances to make room if there aren't any spot instances available.
I'm having the same issue, so I was thinking of creating automation that checks whether an on-demand node is up in the environment and, if so, adds a few spot nodes so k8s-spot-rescheduler can move the pods to them and we can get rid of the on-demand node.
We could implement something similar in k8s-spot-rescheduler. I was thinking we could have a parameter that takes the name of the spot IG or ASG, and if we don't have spot capacity, scale that IG or ASG up (we could reuse CA's code for the scaling).
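To illustrate the idea only, the rescheduler Deployment could grow an argument along these lines. Note that `--spot-asg-name` and `--spot-scale-up-max` are hypothetical flags sketched for this proposal, and the image/tag shown are assumptions; none of this exists in k8s-spot-rescheduler today.

```yaml
# Hypothetical sketch of the proposed option -- these flags do not exist in k8s-spot-rescheduler today.
spec:
  containers:
    - name: k8s-spot-rescheduler
      image: quay.io/pusher/k8s-spot-rescheduler:v0.2.0   # assumption: image/tag are illustrative
      args:
        - --spot-asg-name=spots.kubernetes.metis.wtf      # hypothetical: the spot IG/ASG to scale up
        - --spot-scale-up-max=3                           # hypothetical: cap on how many spot nodes to add
```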