ProvisioningRequest booked capacity is scaled down (and shouldn't be) #6517

Closed
yaroslava-serdiuk opened this issue Feb 8, 2024 · 5 comments · Fixed by #6880
Labels: area/cluster-autoscaler · area/core-autoscaler (Denotes an issue that is related to the core autoscaler and is not specific to any provider.) · kind/feature (Categorizes issue or PR as related to a new feature.)

Comments

@yaroslava-serdiuk
Contributor

yaroslava-serdiuk commented Feb 8, 2024

Which component are you using?:
cluster-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
When a ProvisioningRequest of the check-capacity provisioning class is marked Provisioned, Cluster Autoscaler reserves this capacity for a short period of time, hiding it from other ProvisioningRequests, in order to make accurate decisions across multiple ProvisioningRequests.
However, this capacity is currently not protected from scale down, so we might end up in a situation where CA marks a ProvisioningRequest as Provisioned and then scales down the capacity, leaving the ProvisioningRequest without capacity.

Describe the solution you'd like:
In order not to scale down the booked capacity, I suggest to:

  • add the capacity-booking mechanism to scale down as well (i.e. schedule in-memory pods for Provisioned ProvisioningRequests into the clusterSnapshot; see the sketch after this list). However, in each loop the pods could be scheduled on different nodes, preventing those nodes from accumulating the unneeded time required for scale down, so
  • ignore the in-memory ProvisioningRequest pods when calculating node utilisation, so that the scale down simulation will try to schedule ProvisioningRequest pods on a different node.
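A minimal sketch of the booking step from the first bullet, assuming a hypothetical bookCapacity helper and podScheduler interface (the real ClusterSnapshot, ProvisioningRequest wrapper and scheduling code in cluster-autoscaler differ); only the idea of materialising booked capacity as deterministic in-memory pods is taken from this issue:

import (
    "fmt"

    apiv1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
)

// podScheduler stands in for the snapshot-backed scheduling CA already uses
// (e.g. the hinting simulator); the interface is an assumption for this sketch.
type podScheduler interface {
    TrySchedule(pod *apiv1.Pod) (nodeName string, err error)
}

// bookCapacity (hypothetical) builds in-memory pods for a Provisioned
// check-capacity ProvisioningRequest and schedules them into the cluster
// snapshot, so that the scale down simulation treats the booked capacity as
// occupied. template and count would come from the ProvisioningRequest's PodSets.
func bookCapacity(prNamespace, prName string, template apiv1.PodSpec, count int, sched podScheduler) error {
    for i := 0; i < count; i++ {
        pod := &apiv1.Pod{
            ObjectMeta: metav1.ObjectMeta{
                // Deterministic name and UID so the fake pod maps to the same
                // scheduling hint key in every autoscaler loop.
                Name:      fmt.Sprintf("%s-booking-%d", prName, i),
                Namespace: prNamespace,
                UID:       types.UID(fmt.Sprintf("%s/%s-booking-%d", prNamespace, prName, i)),
            },
            Spec: template,
        }
        if _, err := sched.TrySchedule(pod); err != nil {
            return err
        }
    }
    return nil
}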

Additional context:
Capacity booking is not limited to the check-capacity ProvisioningClass; this mechanism will be used for the atomicScaleUp ProvisioningClass as well.

@yaroslava-serdiuk added the kind/feature label on Feb 8, 2024
@yaroslava-serdiuk
Contributor Author

/cc @kisieland @mwielgus @x13n

@yaroslava-serdiuk changed the title from "ProvisioningRequest booked capacity is not scaled down" to "ProvisioningRequest booked capacity is scaled down (and shouldn't be)" on Feb 8, 2024
@x13n
Member

x13n commented Feb 9, 2024

I think hinting should be sufficient to stabilize fake pod -> node assignments as long as fake pod UIDs don't change between loops:

func (s *HintingSimulator) findNodeWithHints(clusterSnapshot clustersnapshot.ClusterSnapshot, pod *apiv1.Pod, isNodeAcceptable func(*schedulerframework.NodeInfo) bool) (string, error) {
    hk := HintKeyFromPod(pod)
    if hintedNode, hasHint := s.hints.Get(hk); hasHint {
        if err := s.predicateChecker.CheckPredicates(clusterSnapshot, pod, hintedNode); err == nil {
            s.hints.Set(hk, hintedNode)
            nodeInfo, err := clusterSnapshot.NodeInfos().Get(hintedNode)
            if err != nil {
                return "", err
            }
            if isNodeAcceptable(nodeInfo) {
                return hintedNode, nil
            }
        }
    }
    return "", nil
}
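For the stability mentioned here, the key point is that the hint key stays the same between loops. A small sketch of the assumed HintKeyFromPod behaviour (the actual code lives in cluster-autoscaler's simulator/scheduling package), keying on the pod UID and falling back to namespace/name:

// hintKey mirrors the assumed behaviour of HintKeyFromPod: key on the pod UID
// when it is set, otherwise on namespace/name (apiv1 is k8s.io/api/core/v1).
// As long as the fake ProvisioningRequest pods are recreated with the same UID
// (or name) every loop, the lookup above returns the same hinted node and the
// fake pod -> node assignment stays stable.
func hintKey(pod *apiv1.Pod) string {
    if pod.UID != "" {
        return string(pod.UID)
    }
    return pod.Namespace + "/" + pod.Name
}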

I'm not sure I understand this:

ignore the in-memory ProvisioningRequest pods when calculating node utilisation, so that the scale down simulation will try to schedule ProvisioningRequest pods on a different node.

If you schedule PR pods into the snapshot, they should already be considered in scale down simulation (and potentially moved to another node). What do you want to gain by additionally lowering utilization of such nodes?

@yaroslava-serdiuk
Contributor Author

Thanks!

If you schedule PR pods into the snapshot, they should already be considered in scale down simulation (and potentially moved to another node). What do you want to gain by additionally lowering utilisation of such nodes?

If the utilisation threshold is low, scheduling fake pods can make nodes appear not underutilised and so block them from scale down.
But you're probably right, this is not needed; the user should set an appropriate utilisation threshold.

@x13n
Member

x13n commented Feb 12, 2024

If the utilisation threshold is low, scheduling fake pods can make nodes appear not underutilised and so block them from scale down.

Isn't this fine? Scale down will make sure that after nodes are removed, there's still enough capacity for all the pods (including fake ones). Utilization threshold is really only a heuristic for scale down to check if a given node should be considered in scale down simulations at all.
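To illustrate why the threshold is only a pre-filter, here is a simplified sketch of the utilisation check, assuming utilisation is summed pod requests over node allocatable (the real calculation in cluster-autoscaler handles more cases, e.g. other resources and special pod types):

import (
    apiv1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// cpuUtilization returns the fraction of the node's allocatable CPU requested
// by the given pods (including any fake ProvisioningRequest pods scheduled
// into the snapshot). Simplified sketch; not the actual CA implementation.
func cpuUtilization(node *apiv1.Node, pods []*apiv1.Pod) float64 {
    requested := resource.Quantity{}
    for _, pod := range pods {
        for _, c := range pod.Spec.Containers {
            if cpu, ok := c.Resources.Requests[apiv1.ResourceCPU]; ok {
                requested.Add(cpu)
            }
        }
    }
    allocatable := node.Status.Allocatable[apiv1.ResourceCPU]
    if allocatable.IsZero() {
        return 0
    }
    return float64(requested.MilliValue()) / float64(allocatable.MilliValue())
}

// A node is skipped as a scale down candidate when its utilisation is at or
// above the threshold; nodes below it still go through a full scheduling
// simulation (fake pods included) before being removed:
//
//    if cpuUtilization(node, pods) >= scaleDownUtilizationThreshold {
//        // keep the node; don't even simulate its removal
//    }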

@towca added the area/cluster-autoscaler and area/core-autoscaler labels on Mar 21, 2024
@yaroslava-serdiuk
Contributor Author

/assign
