
Stop scale-down from considering registered upcoming nodes as candidates #5488

Merged — 2 commits merged into kubernetes:master on Feb 10, 2023

Conversation

@towca (Collaborator) commented Feb 8, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Stop scale-down from considering registered upcoming nodes as candidates.

Does this PR introduce a user-facing change?

Fixed a bug where Cluster Autoscaler would sometimes delete recently created nodes while they were still booting up.

…ir count

This does make us call len() in a bunch of places within CSR, but allows
for greater flexibility - it's possible to act on the sets of nodes determined
by Readiness.
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/cluster-autoscaler labels Feb 8, 2023
@k8s-ci-robot k8s-ci-robot requested review from feiskyer and x13n February 8, 2023 12:40
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2023
@towca towca changed the title Jtuznik/upcoming unready fix Stop scale-down from considering NotStarted nodes as candidates Feb 8, 2023
// The initial nodes have to be filtered out of the all nodes list so that scale-down can't consider them as candidates. Otherwise, with
// aggressive scale-down settings, we could be removing the nodes before they have a chance to first become ready (the duration
// of which should be unrelated to the scale-down settings).
allNodes = subtractNodesByName(allNodes, a.clusterStateRegistry.GetClusterReadiness().NotStarted)
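The filtering above relies on a simple name-based set subtraction. A minimal, self-contained sketch of such a helper (using a stand-in node type rather than the real *apiv1.Node, and not the actual Cluster Autoscaler implementation) could look like:

```go
package main

import "fmt"

// node is a stand-in for *apiv1.Node; only the name matters for filtering.
type node struct{ Name string }

// subtractNodesByName returns the nodes whose names are not in toRemove.
// Sketch of the helper used in the snippet above.
func subtractNodesByName(nodes []node, toRemove []string) []node {
	remove := make(map[string]bool, len(toRemove))
	for _, name := range toRemove {
		remove[name] = true
	}
	var result []node
	for _, n := range nodes {
		if !remove[n.Name] {
			result = append(result, n)
		}
	}
	return result
}

func main() {
	all := []node{{"ready-1"}, {"not-started-1"}, {"ready-2"}}
	filtered := subtractNodesByName(all, []string{"not-started-1"})
	fmt.Println(len(filtered)) // 2
}
```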
Contributor:

What you want to filter here is in my opinion upcoming nodes (since those are the nodes that would have templateNode added for them). I think that by having 2 different paths for providing the same information we're setting ourselves up for future bugs when those get out of sync.

Also - I don't think we should expose Readiness outside of CSR. I know it's technically public today, but so far it was only for the purpose of pushing it to metrics. My main reason is that readiness is a huge mess, where some values are additive and some are not (ResourceUnready is a subset of Unready, but other stuff can be added etc) and we keep changing semantics completely when we're fixing various bugs.

As an example - we used to have LongNotStarted for a long time, that would change the logic here.

Collaborator (author):

Upcoming as in "fake replacements" are not added to allNodes in the first place. Upcoming as in "the nodes that are being replaced" consist of Unregistered (for which there are no names so nothing to filter), and NotStarted - which have names and should be filtered out of allNodes. So it's only "registered upcoming" nodes that should be filtered out.

Good points about having one path and not leaking the details of CSR. I changed CSR so that it is the one determining which nodes should be filtered without leaking any details (apart from the lists being public, but we've had that with the ints as well), in the same method that determines upcoming node counts.
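The single-path shape described here, where CSR's upcoming-node computation hands back both the counts and the registered upcoming node names for callers to filter, could be sketched roughly like this (all type and field names are illustrative assumptions, not the actual ClusterStateRegistry API):

```go
package main

import "fmt"

// upcomingNodes is an illustrative result shape for a CSR method that
// determines upcoming nodes: per-group counts plus the names of registered
// upcoming (NotStarted) nodes that callers should filter out of allNodes.
type upcomingNodes struct {
	counts          map[string]int      // node group id -> upcoming node count
	registeredNames map[string][]string // node group id -> registered upcoming node names
}

// allRegisteredNames flattens the per-group name lists so the caller can
// filter the snapshot without looking at Readiness internals.
func (u upcomingNodes) allRegisteredNames() []string {
	var names []string
	for _, groupNames := range u.registeredNames {
		names = append(names, groupNames...)
	}
	return names
}

func main() {
	u := upcomingNodes{
		counts:          map[string]int{"ng-1": 2},
		registeredNames: map[string][]string{"ng-1": {"node-a", "node-b"}},
	}
	fmt.Println(len(u.allRegisteredNames())) // 2
}
```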

@towca towca force-pushed the jtuznik/upcoming-unready-fix branch from bb046ee to d81a5c5 Compare February 9, 2023 13:07
@towca towca changed the title Stop scale-down from considering NotStarted nodes as candidates Stop scale-down from considering registered upcoming nodes as candidates Feb 9, 2023
if newNodes <= 0 {
// Negative value is unlikely but theoretically possible.
continue
}
result[id] = newNodes
upcomingCounts[id] = newNodes
// newNodes should be roughly equal to readiness.NotStarted + readiness.Unregistered. NotStarted are nodes that are
Contributor:

I don't like this comment. 'roughly equal' doesn't really explain anything - is it equal? Or is it not?

I don't think it is equal and, even worse, the semantic here likely depends on cloudprovider implementation (ie. will work differently on different clouds). Different providers have different logic for 'placeholder' instances for nodes booting up - some return them, some don't (and I'm pretty sure some return them but only after some delay).

I think there can be "not yet unregistered" nodes, ie. ones where there aren't even placeholder instances. The exact semantic here depends on cloudprovider (some don't return placeholders for unregistered nodes at all and some need some time after scale-up to return them) and the handling here is probably very brittle (replacing "target - ready - unready - longUnregistered" with "notStarted + unregistered" would probably be a breaking change).

I think maybe the best solution would be to have a new "unseen" value in readiness (calculated as "target - registered - unregistered"?). That would make this method much cleaner with newNodes = notStarted+unregistered+unseen.

But the above would of course require bigger changes that are outside the scope of this PR and are, frankly, quite risky. So for now please just extend this comment to cover that it's not just notStarted+unregistered?
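For illustration, the "unseen" decomposition suggested above could be sketched as follows. The readiness fields here are stand-ins, not the actual CSR Readiness struct, and this ignores the per-cloudprovider placeholder quirks the comment warns about:

```go
package main

import "fmt"

// readiness loosely mirrors per-node-group readiness counts kept by CSR.
// Field names are illustrative only.
type readiness struct {
	Ready, Unready, NotStarted int // registered nodes, by state
	Unregistered               int // cloud-provider instances with no Node object yet
}

// upcomingCount applies the reviewer's suggested decomposition:
// upcoming = notStarted + unregistered + unseen,
// where unseen = target - registered - unregistered covers nodes for which
// the cloud provider has not even returned a placeholder instance yet.
func upcomingCount(target int, r readiness) int {
	registered := r.Ready + r.Unready + r.NotStarted
	unseen := target - registered - r.Unregistered
	if unseen < 0 {
		unseen = 0 // target may have been lowered; don't go negative
	}
	return r.NotStarted + r.Unregistered + unseen
}

func main() {
	// target=5: 2 ready, 1 not started, 1 unregistered -> 1 unseen, 3 upcoming.
	fmt.Println(upcomingCount(5, readiness{Ready: 2, NotStarted: 1, Unregistered: 1})) // 3
}
```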

Collaborator (author):

Yeah you're right, changed the comment.

err := a.ClusterSnapshot.RemoveNode(notStartedNodeName)
if err != nil {
klog.Errorf("Failed to remove NotStarted node %s from cluster snapshot: %v", notStartedNodeName, err)
return errors.ToAutoscalerError(errors.InternalError, err)
Contributor:

Should we really abort the loop on error here? Looking at snapshot implementation the only error we can get is NotFound. That should never happen [1], but if it did for some reason is it really fatal? Previously we would just continue the loop.
Maybe we can just log error for notFound and fail on other errors (actually failing to remove the node would sound pretty scary)? Or check that the node is in the snapshot before trying to remove it and log error if it's not?

[1] At least I don't see how it could happen, since we use the same list of nodes to initialize CSR and snapshot and there aren't any processor calls in the meantime where I'd expect anyone to do any shenanigans with snapshot (there is TemplateNodeInfoProvider call, but modifying snapshot there strikes me as particularly evil; CustomResourcesProcessor happens before CSR update or building the snapshot so any changes should be consistent).

Collaborator (author):

Agreed, all this part cares about is that the node is not in the snapshot - so it seems reasonable to just log and continue on NotFound. I had to export the error and rename a bunch of imports for this though.
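The agreed handling, logging and continuing on NotFound while still failing hard on any other error, could be sketched like this. ErrNodeNotFound and the toy snapshot are stand-ins for the exported snapshot error and ClusterSnapshot mentioned above, not the real API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNodeNotFound stands in for the snapshot error exported as part of this fix.
var ErrNodeNotFound = errors.New("node not found")

// snapshot is a toy stand-in for ClusterSnapshot.
type snapshot struct{ nodes map[string]bool }

func (s *snapshot) RemoveNode(name string) error {
	if !s.nodes[name] {
		return ErrNodeNotFound
	}
	delete(s.nodes, name)
	return nil
}

// removeNotStarted removes each NotStarted node from the snapshot. A node
// already being absent is the desired end state, so NotFound is only logged;
// any other removal error is treated as fatal.
func removeNotStarted(s *snapshot, names []string) error {
	for _, name := range names {
		if err := s.RemoveNode(name); err != nil {
			if errors.Is(err, ErrNodeNotFound) {
				fmt.Printf("NotStarted node %s not in snapshot, skipping\n", name)
				continue
			}
			return err
		}
	}
	return nil
}

func main() {
	s := &snapshot{nodes: map[string]bool{"n1": true}}
	fmt.Println(removeNotStarted(s, []string{"n1", "n2"})) // <nil>
}
```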

}
}

func TestStaticAutoscalerUpcomingScaleDownCandidates(t *testing.T) {
Contributor:

nit, no action expected: This is pretty well written for what it is, but I'm not very fond of those big tests of RunOnce. I'll let this one in, but I'd prefer a set of smaller tests that don't actually call the entire RunOnce. In this case snapshot filtering could be a separate function that could be tested in isolation and if we also wanted to test scale-down logic to make sure it doesn't add nodes that are not in snapshot to candidates (how could it though?), we could test this in scale-down.

Collaborator (author):

Full agreement here, I just tried to match the convention in this file/pkg.

@towca towca force-pushed the jtuznik/upcoming-unready-fix branch from d81a5c5 to 63379d6 Compare February 10, 2023 13:38
Without this, with aggressive settings, scale-down could be removing
registered upcoming nodes before they have a chance to become ready
(the duration of which should be unrelated to the scale-down settings).
@towca towca force-pushed the jtuznik/upcoming-unready-fix branch from 63379d6 to 7e67625 Compare February 10, 2023 13:46
@MaciekPytel (Contributor):

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2023
@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 9f04d4f into kubernetes:master Feb 10, 2023