Stop scale-down from considering registered upcoming nodes as candidates #5488
Conversation
…ir count

This does make us call len() in a bunch of places within CSR, but allows for greater flexibility - it's possible to act on the sets of nodes determined by Readiness.
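A minimal sketch of the shape change this commit message describes - Readiness switching from per-state counts to the node names themselves, so counts come from len(). The field names here are illustrative assumptions, not the PR's exact code:

type Readiness struct {
	Ready            []string // was: Ready int
	Unready          []string // was: Unready int
	NotStarted       []string // was: NotStarted int
	Unregistered     []string // was: Unregistered int
	LongUnregistered []string // was: LongUnregistered int
}

func readyCount(r Readiness) int {
	// The "call len() in a bunch of places" the commit mentions: counts are
	// now derived from the name lists instead of being stored directly.
	return len(r.Ready)
}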
// The initial nodes have to be filtered out of the all nodes list so that scale-down can't consider them as candidates. Otherwise, with
// aggressive scale-down settings, we could be removing the nodes before they have a chance to first become ready (the duration
// of which should be unrelated to the scale-down settings).
allNodes = subtractNodesByName(allNodes, a.clusterStateRegistry.GetClusterReadiness().NotStarted)
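For reference, a name-based subtraction helper like the one the diff calls could look roughly like this (a sketch; the PR's actual subtractNodesByName may differ):

import apiv1 "k8s.io/api/core/v1"

// subtractNodesByName returns the nodes whose names are not in the given list.
func subtractNodesByName(nodes []*apiv1.Node, names []string) []*apiv1.Node {
	toRemove := make(map[string]bool, len(names))
	for _, name := range names {
		toRemove[name] = true
	}
	var filtered []*apiv1.Node
	for _, node := range nodes {
		if !toRemove[node.Name] {
			filtered = append(filtered, node)
		}
	}
	return filtered
}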
What you want to filter here is, in my opinion, upcoming nodes (since those are the nodes that would have a templateNode added for them). I think that by having two different paths providing the same information we're setting ourselves up for future bugs when those get out of sync.
Also - I don't think we should expose Readiness outside of CSR. I know it's technically public today, but so far that was only for the purpose of pushing it to metrics. My main reason is that readiness is a huge mess, where some values are additive and some are not (ResourceUnready is a subset of Unready, but other stuff can be added, etc.), and we keep changing the semantics completely when we're fixing various bugs.
As an example - we used to have LongNotStarted for a long time, which would have changed the logic here.
Upcoming as in "fake replacements" are not added to allNodes in the first place. Upcoming as in "the nodes that are being replaced" consist of Unregistered (for which there are no names, so nothing to filter) and NotStarted - which have names and should be filtered out of allNodes. So it's only "registered upcoming" nodes that should be filtered out.
Good points about having one path and not leaking the details of CSR. I changed CSR so that it is the one determining which nodes should be filtered without leaking any details (apart from the lists being public, but we've had that with the ints as well), in the same method that determines upcoming node counts.
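A hypothetical shape of that single CSR entry point - one method returning both the upcoming counts and the names of registered upcoming nodes to filter, so the two views can't drift apart. Types and field names follow the illustrative Readiness sketch above and are assumptions, not the PR's exact API:

type ClusterStateRegistry struct {
	perNodeGroupReadiness map[string]Readiness
	targetSizes           map[string]int
}

// GetUpcomingNodes returns the per-group count of upcoming nodes along with
// the names of the registered upcoming (NotStarted) ones, which callers should
// remove from the snapshot before scale-down runs.
func (csr *ClusterStateRegistry) GetUpcomingNodes() (counts map[string]int, registered map[string][]string) {
	counts = map[string]int{}
	registered = map[string][]string{}
	for id, readiness := range csr.perNodeGroupReadiness {
		newNodes := csr.targetSizes[id] - (len(readiness.Ready) + len(readiness.Unready) + len(readiness.LongUnregistered))
		if newNodes <= 0 {
			continue // negative is unlikely but theoretically possible
		}
		counts[id] = newNodes
		// NotStarted nodes are registered, so they have names and can be filtered.
		registered[id] = readiness.NotStarted
	}
	return counts, registered
}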
(force-pushed from bb046ee to d81a5c5)
if newNodes <= 0 {
	// Negative value is unlikely but theoretically possible.
	continue
}
result[id] = newNodes
upcomingCounts[id] = newNodes
// newNodes should be roughly equal to readiness.NotStarted + readiness.Unregistered. NotStarted are nodes that are
// already registered, but haven't become ready yet.
I don't like this comment. 'roughly equal' doesn't really explain anything - is it equal? Or is it not?
I don't think it is equal and, even worse, the semantics here likely depend on the cloudprovider implementation (i.e. it will work differently on different clouds). Different providers have different logic for 'placeholder' instances for nodes booting up - some don't return placeholders for unregistered nodes at all, and I'm pretty sure some return them only after some delay. So there can be "not yet unregistered" nodes, i.e. ones for which there isn't even a placeholder instance yet. The handling here is probably very brittle - replacing "target - ready - unready - longUnregistered" with "notStarted + unregistered" would probably be a breaking change.
I think maybe the best solution would be to have a new "unseen" value in readiness (calculated as "target - registered - unregistered"?). That would make this method much cleaner, with newNodes = notStarted + unregistered + unseen.
But the above would of course require bigger changes that are outside the scope of this PR and are, frankly, quite risky. So for now please just extend this comment to cover the fact that it's not just notStarted + unregistered?
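For illustration, the proposed "unseen" bucket could be computed like this (a sketch reusing the illustrative Readiness type from above, where "registered" means Ready + Unready + NotStarted):

func upcomingWithUnseen(target int, r Readiness) int {
	registered := len(r.Ready) + len(r.Unready) + len(r.NotStarted)
	// Nodes the cloud provider hasn't surfaced at all yet - not registered,
	// and without even a placeholder instance.
	unseen := target - registered - len(r.Unregistered)
	if unseen < 0 {
		unseen = 0
	}
	return len(r.NotStarted) + len(r.Unregistered) + unseen
}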
Yeah you're right, changed the comment.
err := a.ClusterSnapshot.RemoveNode(notStartedNodeName)
if err != nil {
	klog.Errorf("Failed to remove NotStarted node %s from cluster snapshot: %v", notStartedNodeName, err)
	return errors.ToAutoscalerError(errors.InternalError, err)
}
Should we really abort the loop on error here? Looking at the snapshot implementation, the only error we can get is NotFound. That should never happen [1], but if it did for some reason, is it really fatal? Previously we would just continue the loop.
Maybe we can just log error for notFound and fail on other errors (actually failing to remove the node would sound pretty scary)? Or check that the node is in the snapshot before trying to remove it and log error if it's not?
[1] At least I don't see how it could happen, since we use the same list of nodes to initialize CSR and snapshot and there aren't any processor calls in the meantime where I'd expect anyone to do any shenanigans with snapshot (there is TemplateNodeInfoProvider call, but modifying snapshot there strikes me as particularly evil; CustomResourcesProcessor happens before CSR update or building the snapshot so any changes should be consistent).
Agreed, all this part cares about is that the node is not in the snapshot - so it seems reasonable to just log and continue on NotFound. I had to export the error and rename a bunch of imports for this though.
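The resulting behavior might look roughly like this (a sketch; it assumes the exported sentinel is named ErrNodeNotFound and that the CA errors package is aliased as caerrors to make room for the stdlib errors import):

// imports assumed: "errors", "k8s.io/klog/v2", the snapshot package as
// clustersnapshot, and the cluster-autoscaler errors package as caerrors
if err := a.ClusterSnapshot.RemoveNode(notStartedNodeName); err != nil {
	if errors.Is(err, clustersnapshot.ErrNodeNotFound) {
		// All this code path needs is for the node to be absent from the
		// snapshot, so a missing node is not fatal - log and move on.
		klog.Warningf("Couldn't remove NotStarted node %s from cluster snapshot, node not found", notStartedNodeName)
	} else {
		klog.Errorf("Failed to remove NotStarted node %s from cluster snapshot: %v", notStartedNodeName, err)
		return caerrors.ToAutoscalerError(caerrors.InternalError, err)
	}
}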
func TestStaticAutoscalerUpcomingScaleDownCandidates(t *testing.T) { |
nit, no action expected: This is pretty well written for what it is, but I'm not very fond of these big tests of RunOnce. I'll let this one in, but I'd prefer a set of smaller tests that don't actually call the entire RunOnce. In this case, the snapshot filtering could be a separate function that could be tested in isolation, and if we also wanted to test the scale-down logic to make sure it doesn't add nodes that are not in the snapshot to candidates (how could it, though?), we could test that in scale-down.
Full agreement here, I just tried to match the convention in this file/pkg.
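For the record, the kind of isolated test the reviewer describes might look like this, assuming the filtering were extracted into a free function like the subtractNodesByName sketch above (imports assumed: testing, apiv1 "k8s.io/api/core/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"):

func TestSubtractNodesByName(t *testing.T) {
	nodes := []*apiv1.Node{
		{ObjectMeta: metav1.ObjectMeta{Name: "ready-1"}},
		{ObjectMeta: metav1.ObjectMeta{Name: "not-started-1"}},
	}
	// Only the registered upcoming (NotStarted) node should be removed.
	got := subtractNodesByName(nodes, []string{"not-started-1"})
	if len(got) != 1 || got[0].Name != "ready-1" {
		t.Errorf("expected only ready-1 to remain, got %d nodes", len(got))
	}
}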
(force-pushed from d81a5c5 to 63379d6)
Without this, with aggressive settings, scale-down could be removing registered upcoming nodes before they have a chance to become ready (the duration of which should be unrelated to the scale-down settings).
(force-pushed from 63379d6 to 7e67625)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel, towca

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Stop scale-down from considering registered upcoming nodes as candidates.
Does this PR introduce a user-facing change?