
Stop scale-down from considering registered upcoming nodes as candidates #5488

Merged — 2 commits merged into kubernetes:master on Feb 10, 2023

Conversation

@towca (Collaborator) commented Feb 8, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Stop scale-down from considering registered upcoming nodes as candidates.

Does this PR introduce a user-facing change?

Fixed a bug where Cluster Autoscaler would sometimes delete recently created nodes while they were still booting up.

…ir count

This does make us call len() in a bunch of places within CSR, but allows
for greater flexibility - it's possible to act on the sets of nodes determined
by Readiness.
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/cluster-autoscaler labels Feb 8, 2023
@k8s-ci-robot k8s-ci-robot requested review from feiskyer and x13n February 8, 2023 12:40
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2023
@towca towca changed the title Jtuznik/upcoming unready fix Stop scale-down from considering NotStarted nodes as candidates Feb 8, 2023
// The initial nodes have to be filtered out of the all nodes list so that scale-down can't consider them as candidates. Otherwise, with
// aggressive scale-down settings, we could be removing the nodes before they have a chance to first become ready (the duration
// of which should be unrelated to the scale-down settings).
allNodes = subtractNodesByName(allNodes, a.clusterStateRegistry.GetClusterReadiness().NotStarted)
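The filtering above relies on a simple name-based set subtraction. A minimal, self-contained sketch of such a helper (using a stand-in node type rather than the real *apiv1.Node, and not the actual Cluster Autoscaler implementation) could look like:

```go
package main

import "fmt"

// node is a stand-in for *apiv1.Node; only the name matters for filtering.
type node struct{ Name string }

// subtractNodesByName returns the nodes whose names are not in toRemove.
// Sketch of the helper used in the snippet above.
func subtractNodesByName(nodes []node, toRemove []string) []node {
	remove := make(map[string]bool, len(toRemove))
	for _, name := range toRemove {
		remove[name] = true
	}
	var result []node
	for _, n := range nodes {
		if !remove[n.Name] {
			result = append(result, n)
		}
	}
	return result
}

func main() {
	all := []node{{"ready-1"}, {"not-started-1"}, {"ready-2"}}
	filtered := subtractNodesByName(all, []string{"not-started-1"})
	fmt.Println(len(filtered)) // 2
}
```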
Contributor:

What you want to filter here is in my opinion upcoming nodes (since those are the nodes that would have templateNode added for them). I think that by having 2 different paths for providing the same information we're setting ourselves up for future bugs when those get out of sync.

Also - I don't think we should expose Readiness outside of CSR. I know it's technically public today, but so far it was only for the purpose of pushing it to metrics. My main reason is that readiness is a huge mess, where some values are additive and some are not (ResourceUnready is a subset of Unready, but other stuff can be added etc) and we keep changing semantics completely when we're fixing various bugs.

As an example - we used to have LongNotStarted for a long time, that would change the logic here.

Collaborator (author):

Upcoming as in "fake replacements" are not added to allNodes in the first place. Upcoming as in "the nodes that are being replaced" consist of Unregistered (for which there are no names so nothing to filter), and NotStarted - which have names and should be filtered out of allNodes. So it's only "registered upcoming" nodes that should be filtered out.

Good points about having one path and not leaking the details of CSR. I changed CSR so that it is the one determining which nodes should be filtered without leaking any details (apart from the lists being public, but we've had that with the ints as well), in the same method that determines upcoming node counts.
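The single-path shape described here, where CSR's upcoming-node computation hands back both the counts and the registered upcoming node names for callers to filter, could be sketched roughly like this (all type and field names are illustrative assumptions, not the actual ClusterStateRegistry API):

```go
package main

import "fmt"

// upcomingNodes is an illustrative result shape for a CSR method that
// determines upcoming nodes: per-group counts plus the names of registered
// upcoming (NotStarted) nodes that callers should filter out of allNodes.
type upcomingNodes struct {
	counts          map[string]int      // node group id -> upcoming node count
	registeredNames map[string][]string // node group id -> registered upcoming node names
}

// allRegisteredNames flattens the per-group name lists so the caller can
// filter the snapshot without looking at Readiness internals.
func (u upcomingNodes) allRegisteredNames() []string {
	var names []string
	for _, groupNames := range u.registeredNames {
		names = append(names, groupNames...)
	}
	return names
}

func main() {
	u := upcomingNodes{
		counts:          map[string]int{"ng-1": 2},
		registeredNames: map[string][]string{"ng-1": {"node-a", "node-b"}},
	}
	fmt.Println(len(u.allRegisteredNames())) // 2
}
```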

@towca towca force-pushed the jtuznik/upcoming-unready-fix branch from bb046ee to d81a5c5 Compare February 9, 2023 13:07
@towca towca changed the title Stop scale-down from considering NotStarted nodes as candidates Stop scale-down from considering registered upcoming nodes as candidates Feb 9, 2023
if newNodes <= 0 {
// Negative value is unlikely but theoretically possible.
continue
}
result[id] = newNodes
upcomingCounts[id] = newNodes
// newNodes should be roughly equal to readiness.NotStarted + readiness.Unregistered. NotStarted are nodes that are
Contributor:

I don't like this comment. 'roughly equal' doesn't really explain anything - is it equal? Or is it not?

I don't think it is equal and, even worse, the semantic here likely depends on cloudprovider implementation (ie. will work differently on different clouds). Different providers have different logic for 'placeholder' instances for nodes booting up - some return them, some don't (and I'm pretty sure some return them but only after some delay).

I think there can be "not yet unregistered" nodes, ie. ones where there aren't even placeholder instances. The exact semantic here depends on cloudprovider (some don't return placeholders for unregistered nodes at all and some need some time after scale-up to return them) and the handling here is probably very brittle (replacing "target - ready - unready - longUnregistered" with "notStarted + unregistered" would probably be a breaking change).

I think maybe the best solution would be to have a new "unseen" value in readiness (calculated as "target - registered - unregistered"?). That would make this method much cleaner with newNodes = notStarted+unregistered+unseen.

But the above would of course require bigger changes that are outside the scope of this PR and are, frankly, quite risky. So for now please just extend this comment to cover that it's not just notStarted+unregistered?
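For illustration, the "unseen" decomposition suggested above could be sketched as follows. The readiness fields here are stand-ins, not the actual CSR Readiness struct, and this ignores the per-cloudprovider placeholder quirks the comment warns about:

```go
package main

import "fmt"

// readiness loosely mirrors per-node-group readiness counts kept by CSR.
// Field names are illustrative only.
type readiness struct {
	Ready, Unready, NotStarted int // registered nodes, by state
	Unregistered               int // cloud-provider instances with no Node object yet
}

// upcomingCount applies the reviewer's suggested decomposition:
// upcoming = notStarted + unregistered + unseen,
// where unseen = target - registered - unregistered covers nodes for which
// the cloud provider has not even returned a placeholder instance yet.
func upcomingCount(target int, r readiness) int {
	registered := r.Ready + r.Unready + r.NotStarted
	unseen := target - registered - r.Unregistered
	if unseen < 0 {
		unseen = 0 // target may have been lowered; don't go negative
	}
	return r.NotStarted + r.Unregistered + unseen
}

func main() {
	// target=5: 2 ready, 1 not started, 1 unregistered -> 1 unseen, 3 upcoming.
	fmt.Println(upcomingCount(5, readiness{Ready: 2, NotStarted: 1, Unregistered: 1})) // 3
}
```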

Collaborator (author):

Yeah you're right, changed the comment.

err := a.ClusterSnapshot.RemoveNode(notStartedNodeName)
if err != nil {
klog.Errorf("Failed to remove NotStarted node %s from cluster snapshot: %v", notStartedNodeName, err)
return errors.ToAutoscalerError(errors.InternalError, err)
Contributor:

Should we really abort the loop on error here? Looking at snapshot implementation the only error we can get is NotFound. That should never happen [1], but if it did for some reason is it really fatal? Previously we would just continue the loop.
Maybe we can just log error for notFound and fail on other errors (actually failing to remove the node would sound pretty scary)? Or check that the node is in the snapshot before trying to remove it and log error if it's not?

[1] At least I don't see how it could happen, since we use the same list of nodes to initialize CSR and snapshot and there aren't any processor calls in the meantime where I'd expect anyone to do any shenanigans with snapshot (there is TemplateNodeInfoProvider call, but modifying snapshot there strikes me as particularly evil; CustomResourcesProcessor happens before CSR update or building the snapshot so any changes should be consistent).

Collaborator (author):

Agreed, all this part cares about is that the node is not in the snapshot - so it seems reasonable to just log and continue on NotFound. I had to export the error and rename a bunch of imports for this though.
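The agreed handling, logging and continuing on NotFound while still failing hard on any other error, could be sketched like this. ErrNodeNotFound and the toy snapshot are stand-ins for the exported snapshot error and ClusterSnapshot mentioned above, not the real API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNodeNotFound stands in for the snapshot error exported as part of this fix.
var ErrNodeNotFound = errors.New("node not found")

// snapshot is a toy stand-in for ClusterSnapshot.
type snapshot struct{ nodes map[string]bool }

func (s *snapshot) RemoveNode(name string) error {
	if !s.nodes[name] {
		return ErrNodeNotFound
	}
	delete(s.nodes, name)
	return nil
}

// removeNotStarted removes each NotStarted node from the snapshot. A node
// already being absent is the desired end state, so NotFound is only logged;
// any other removal error is treated as fatal.
func removeNotStarted(s *snapshot, names []string) error {
	for _, name := range names {
		if err := s.RemoveNode(name); err != nil {
			if errors.Is(err, ErrNodeNotFound) {
				fmt.Printf("NotStarted node %s not in snapshot, skipping\n", name)
				continue
			}
			return err
		}
	}
	return nil
}

func main() {
	s := &snapshot{nodes: map[string]bool{"n1": true}}
	fmt.Println(removeNotStarted(s, []string{"n1", "n2"})) // <nil>
}
```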

}
}

func TestStaticAutoscalerUpcomingScaleDownCandidates(t *testing.T) {
Contributor:

nit, no action expected: This is pretty well written for what it is, but I'm not very fond of those big tests of RunOnce. I'll let this one in, but I'd prefer a set of smaller tests that don't actually call the entire RunOnce. In this case snapshot filtering could be a separate function that could be tested in isolation and if we also wanted to test scale-down logic to make sure it doesn't add nodes that are not in snapshot to candidates (how could it though?), we could test this in scale-down.

Collaborator (author):

Full agreement here, I just tried to match the convention in this file/pkg.

@towca towca force-pushed the jtuznik/upcoming-unready-fix branch from d81a5c5 to 63379d6 Compare February 10, 2023 13:38
Without this, with aggressive settings, scale-down could be removing
registered upcoming nodes before they have a chance to become ready
(the duration of which should be unrelated to the scale-down settings).
@towca towca force-pushed the jtuznik/upcoming-unready-fix branch from 63379d6 to 7e67625 Compare February 10, 2023 13:46
@MaciekPytel (Contributor):

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2023
@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 9f04d4f into kubernetes:master Feb 10, 2023