CA: refactor PredicateChecker into ClusterSnapshot #7497
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: towca. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from ed9232e to 27420ef
/hold
Force-pushed from e377759 to d84511f
/assign @BigDarkClown
Force-pushed from d84511f to d78b5d8
Force-pushed from d78b5d8 to e4d5002
@@ -56,6 +56,7 @@ func (p *filterOutExpendable) Process(context *context.AutoscalingContext, pods
// CA logic from before migration to scheduler framework. So let's keep it for now
func (p *filterOutExpendable) addPreemptingPodsToSnapshot(pods []*apiv1.Pod, ctx *context.AutoscalingContext) error {
	for _, p := range pods {
		// TODO(DRA): Figure out if/how to use the predicate-checking SchedulePod() here instead - otherwise this doesn't work with DRA pods.
is this post-v1.32 TODO work?
Yes, I listed all of the post-MVP work in #7530 description - this is the "Priority-based preempting pods using DRA" part.
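A purely hypothetical sketch of that post-MVP direction, for illustration only: the preempting pod would go through the predicate-checking SchedulePod() path (so the DRA plugin can compute allocations) instead of being force-added unconditionally. The ctx.ClusterSnapshot field, the method signatures, and the fallback behavior are all assumptions here, not code from this PR.

```go
for _, p := range pods {
	// Hypothetical: try the predicate-checking path first so DRA allocations get computed.
	if err := ctx.ClusterSnapshot.SchedulePod(p, p.Status.NominatedNodeName); err != nil {
		// Fall back to the current force-add behavior for pods the predicates would
		// reject - whether that's the right call for DRA pods is the open question.
		if forceErr := ctx.ClusterSnapshot.ForceAddPod(p, p.Status.NominatedNodeName); forceErr != nil {
			return forceErr
		}
	}
}
```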
@@ -223,7 +221,7 @@ func (r *RemovalSimulator) findPlaceFor(removedNode string, pods []*apiv1.Pod, n

	// remove pods from clusterSnapshot first
should we change this comment to "// unscheduled pods from clusterSnapshot first" ?
Good catch, updated the comment!
@@ -0,0 +1,126 @@
/*
Copyright 2016 The Kubernetes Authors.
nit: new files should read 2024 (let's hope we land this in 2024 :))
Done, I think I got them all
	continue
}

if !preFilterResult.AllNodes() && !preFilterResult.NodeNames.Has(nodeInfo.Node().Name) {
Is this an opportunity to rename the AllNodes method to something more descriptive like AllNodesAreEligible?
It also might be helpful for future maintainers to add a comment above these initial two if statements:
// Ensure that this node in the iteration fulfills the passed in nodeMatches filter func
// If only certain nodes are capable of running this pod,
// and if this node in the iteration isn't one of them, try the next node
We can't rename AllNodes as that's a part of the scheduler framework, but I certainly agree that we could use some comments here.
Added a bunch of comments, and it made me realize that for some reason we're not checking the PreFilter result in the single-node CheckPredicates/RunFiltersOnNode method. This seems like a bug, so I added the check. I also moved the checks around in RunFiltersUntilPassingNode so that the cheapest ones are checked earliest.
WDYT?
this all looks good
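For future readers, a hedged sketch of the resulting check ordering in RunFiltersUntilPassingNode as described above, cheapest checks first. nodeMatches, nodeInfosList and preFilterResult come from the snippets in this thread; passesFilters is a hypothetical stand-in for running the scheduler Filter plugins.

```go
for _, nodeInfo := range nodeInfosList {
	// 1. Cheapest check: does the caller-provided filter accept this node at all?
	if !nodeMatches(nodeInfo) {
		continue
	}
	// 2. PreFilter result: if only certain nodes are eligible for this pod and this
	//    node isn't one of them, skip it before running the more expensive filters.
	if !preFilterResult.AllNodes() && !preFilterResult.NodeNames.Has(nodeInfo.Node().Name) {
		continue
	}
	// 3. Most expensive: run the scheduler Filter plugins for this pod/node pair.
	if passesFilters(pod, nodeInfo) {
		return nodeInfo.Node().Name, nil
	}
}
```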
b.ResetTimer()
for i := 0; i < b.N; i++ {
	list := clusterSnapshot.data.buildNodeInfoList()
	if len(list) != tc.nodeCount+1000 {
Soooooo, if X != Y we assert that X == Y? I don't fully understand what is happening here, could you explain?
I have as much of an idea as you here, I'm just moving this from clustersnapshot_test.go 😅 Looking into the git history, it's been this way since the test was introduced; I assume this was just supposed to assert the list length. Removed the if and left just the assert, thanks for catching!
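For clarity, a minimal sketch of what the simplified benchmark body might look like after dropping the redundant if (assert.Equal usage assumed from the testify-style asserts used elsewhere in these tests):

```go
b.ResetTimer()
for i := 0; i < b.N; i++ {
	list := clusterSnapshot.data.buildNodeInfoList()
	// Assert the length unconditionally instead of only when it's already wrong.
	assert.Equal(b, tc.nodeCount+1000, len(list))
}
```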
BuildTestNode("n2", 3000, 10),
BuildTestNode("n3", 3000, 10),
BuildTestNode("n4", 3000, 10),
BuildTestNode("n5", 3000, 10),
Why did you need to update the test constants?
Hmm, this must be a leftover from an older iteration where InitializeClusterSnapshotOrDie used SchedulePod instead of ForceAddPod. It's not needed, reverted to reduce confusion.
@@ -127,7 +127,7 @@ func initializeDefaultOptions(opts *AutoscalerOptions, informerFactory informers
	opts.FrameworkHandle = fwHandle
}
if opts.ClusterSnapshot == nil {
	opts.ClusterSnapshot = predicate.NewPredicateSnapshot(base.NewBasicClusterSnapshot(), opts.FrameworkHandle)
	opts.ClusterSnapshot = predicate.NewPredicateSnapshot(base.NewBasicSnapshotBase(), opts.FrameworkHandle)
Same comment as to others - I think NewBasicClusterStorage would be better, or NewBasicClusterState. SnapshotBase does not really mean anything.
Done (NewBasicClusterStore()).
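With the agreed naming, the default wiring from the diff above would read roughly like this (the package name and exact constructor signature are assumptions):

```go
if opts.ClusterSnapshot == nil {
	opts.ClusterSnapshot = predicate.NewPredicateSnapshot(base.NewBasicClusterStore(), opts.FrameworkHandle)
}
```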
@@ -293,9 +278,9 @@ func TestDebugInfo(t *testing.T) {
	assert.NoError(t, err)

	// with default predicate checker
	defaultPredicateChecker, err := newTestPredicateChecker()
	defaultPluginnRunner, err := newTestPluginRunner(clusterSnapshot, nil)
typo: defaultPluginRunner
Done
)

// PredicateSnapshot implements ClusterSnapshot on top of a SnapshotBase by using
// SchedulerBasedPredicateChecker to check scheduler predicates.
type PredicateSnapshot struct {
	clustersnapshot.SnapshotBase
	predicateChecker *predicatechecker.SchedulerBasedPredicateChecker
	pluginRunner *SchedulerPluginRunner
nit: Maybe this could be using some abstract PluginRunner interface? Would make it easier to run unit tests.
I don't like the idea of mocking such a dependency, IMO it just makes tests less useful. Currently, unit tests can just do testsnapshot.NewTestSnapshotOrDie() to get a snapshot that will actually behave ~the same as in production code. If they were to mock the plugin runner under the snapshot instead, they would have to maintain the mock, and the snapshot could behave differently than in production code (e.g. SchedulePod() letting a Pod in that PredicateSnapshot normally wouldn't because of the predicates failing).
In general IMO we should mock/fake at the lowest possible level so that as much of the actual implementation as possible is tested. In this case, the framework.Handle level makes the most sense to me, as we can easily fake it and there's almost no actual logic there. NewTestSnapshotOrDie() just encapsulates that.
Ack
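A hedged sketch of the testing approach described in this thread: tests get a real PredicateSnapshot backed by a faked framework.Handle via the testsnapshot helper instead of mocking the plugin runner. The helper signature, the node-adding call, and the BuildTest* arguments are assumptions for illustration only.

```go
func TestScheduleWithRealPredicates(t *testing.T) {
	// Real snapshot running real Filter plugins, backed by a fake framework.Handle.
	snapshot := testsnapshot.NewTestSnapshotOrDie(t)

	// Hypothetical setup: add a node to the snapshot (exact API may differ).
	node := BuildTestNode("n1", 3000, 10)
	if err := snapshot.AddNodeInfo(framework.NewTestNodeInfo(node)); err != nil {
		t.Fatalf("adding node: %v", err)
	}

	// Because the snapshot runs the actual scheduler predicates, this behaves
	// ~the same as in production code - there is no mock to keep in sync.
	pod := BuildTestPod("p1", 100, 100)
	if err := snapshot.SchedulePod(pod, "n1"); err != nil {
		t.Fatalf("unexpected scheduling error: %v", err)
	}
}
```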
	return CreateTestPodsWithPrefix("p", n)
}

// AssignTestPodsToNodes assigns test pods to test nodes based on their index position.
I would probably rename this comment to
// AssignTestPodsToNodes distributes test pods evenly across test nodes
Done
"fmt"

"k8s.io/client-go/informers"
"k8s.io/kubernetes/pkg/scheduler/apis/config"
Yikes, having "k8s.io/kubernetes/pkg/scheduler/apis/config" and "k8s.io/kubernetes/pkg/scheduler/apis/config/latest" in the same import block. I think this is probably the best solution:
scheduler_config "k8s.io/kubernetes/pkg/scheduler/apis/config"
scheduler_config_latest "k8s.io/kubernetes/pkg/scheduler/apis/config/latest"
Done (but removed the underscores for consistency)
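With the underscores dropped, the aliased import block would look roughly like this (the other imports are kept as in the snippet above):

```go
import (
	"fmt"

	"k8s.io/client-go/informers"
	schedulerconfig "k8s.io/kubernetes/pkg/scheduler/apis/config"
	schedulerconfiglatest "k8s.io/kubernetes/pkg/scheduler/apis/config/latest"
)
```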
func (p *SchedulerPluginRunner) RunFiltersUntilPassingNode(pod *apiv1.Pod, nodeMatches func(*framework.NodeInfo) bool) (string, clustersnapshot.SchedulingError) {
	nodeInfosList, err := p.snapshotBase.ListNodeInfos()
	if err != nil {
		return "", clustersnapshot.NewSchedulingInternalError(pod, "ClusterSnapshot not provided")
Is "ClusterSnapshot not provided" the right message here for this error condition?
Nope, that's my bad. Changed to an appropriate message about listing node infos.
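A sketch of what the corrected error path might look like; the exact wording of the final message is an assumption, the point being that it describes the actual failure (listing NodeInfos) rather than a missing snapshot:

```go
nodeInfosList, err := p.snapshotBase.ListNodeInfos()
if err != nil {
	// Report what actually failed instead of "ClusterSnapshot not provided".
	return "", clustersnapshot.NewSchedulingInternalError(pod, fmt.Sprintf("error listing NodeInfos from the cluster snapshot: %v", err))
}
```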
if err != nil {
	return nil, 0, err
}

if nodeName == "" {
	nodeName = s.findNode(similarPods, clusterSnapshot, pod, loggingQuota, isNodeAcceptable)
	nodeName, err = s.trySchedule(similarPods, clusterSnapshot, pod, loggingQuota, isNodeAcceptable)
I'm curious what the upstream ramifications of this new error condition are. Would it ever make sense to have breakOnFailure set to false here?
My intention was for trySchedule and tryScheduleUsingHints to only fail on the "unexpected" errors (to cover the ForceAddPod error condition that was previously below, and now is a part of SchedulePod, which is called in trySchedule and tryScheduleUsingHints), and to return "", nil if the predicates don't pass. So if I got that right, breakOnFailure and TrySchedulePods in general should work the same as before (modulo the "truly unexpected" errors from SchedulerPluginRunner that I mention in the other comment).
@BigDarkClown Could you verify this one as well?
As I read it, the difference is that instead of the previous flow (find node using hints -> find node -> add pod) we moved to (find node using hints and schedule pod -> find node and schedule pod) in a single call. I think this is semantically the same, the only difference is that TrySchedulePods will exit earlier in its body than before, but still on the same errors (scheduling). This looks good to me.
I added comments to the trySchedule methods and refactored tryScheduleUsingHints a bit to make things clearer here.
comments help here, I don't love the density of stuff in these flows but that's all inherited, the changes here are appropriately evolutionary, thx!
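A hedged sketch of the flow described in this thread: each try* call now both finds a node and schedules the pod in the snapshot, so there is no separate "add pod" step afterwards, and only unexpected errors are propagated. The tryScheduleUsingHints argument list is assumed for illustration.

```go
nodeName, err := s.tryScheduleUsingHints(clusterSnapshot, pod, isNodeAcceptable)
if err != nil {
	// Unexpected (internal) error - surface it to the caller.
	return nil, 0, err
}
if nodeName == "" {
	// No hint worked; try scheduling the pod on any acceptable node.
	nodeName, err = s.trySchedule(similarPods, clusterSnapshot, pod, loggingQuota, isNodeAcceptable)
	if err != nil {
		return nil, 0, err
	}
}
// nodeName == "" here simply means the pod couldn't be scheduled anywhere;
// that is not an error, the pod just stays pending.
```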
	return ""
	return "", nil
} else if err != nil {
	// Unexpected error.
Do we want to classify SchedulingInternalError, FailingPredicateError as "Unexpected error" like this? The latter in particular seems like we might want to set to unschedulable like we're doing on L113 above.
Yeah, this check is definitely missing FailingPredicateError - my bad, added.
Changed this to check against SchedulingInternalError following Bartek's comment.
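A sketch of the classification agreed on here: only internal errors are treated as unexpected, while a failing predicate just means the pod can't be scheduled on that node. The Type() accessor on the returned SchedulingError is an assumption about its API.

```go
if schedErr != nil {
	if schedErr.Type() == clustersnapshot.SchedulingInternalError {
		// Truly unexpected error - abort and surface it.
		return "", schedErr
	}
	// FailingPredicateError: no node passed the predicates - not an error here.
	return "", nil
}
```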
…hecker

This decouples PredicateChecker from the Framework initialization logic, and allows creating multiple PredicateChecker instances while only initializing the framework once. This commit also fixes how CA integrates with Framework metrics: instead of Registering them, they're only Initialized, so that CA doesn't expose scheduler metrics. The initialization is also moved from multiple different places to the Handle constructor.
Force-pushed from 2a387dd to 0755451
To handle DRA properly, scheduling predicates will need to be run whenever Pods are scheduled in the snapshot. PredicateChecker always needs a ClusterSnapshot to work, and ClusterSnapshot scheduling methods need to run the predicates first. So it makes most sense to have PredicateChecker be a dependency for ClusterSnapshot implementations, and move the PredicateChecker methods to ClusterSnapshot.

This commit mirrors PredicateChecker methods in ClusterSnapshot (with the exception of FitsAnyNode, which isn't used anywhere and is trivial to do via FitsAnyNodeMatching). Further commits will remove the PredicateChecker interface and move the implementation under clustersnapshot. Dummy methods are added to current ClusterSnapshot implementations to get the tests to pass; further commits will actually implement them.

PredicateError is refactored into a broader SchedulingError so that the ClusterSnapshot methods can return a single error that the callers can use to distinguish between a failing predicate and other, unexpected errors.
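A hedged sketch of the SchedulingError shape this commit message describes: a single error type whose kind lets callers tell a failing predicate apart from an unexpected internal failure. The exact names and method set are assumptions, not the PR's definition.

```go
// SchedulingErrorType distinguishes the two broad failure categories.
type SchedulingErrorType int

const (
	// SchedulingInternalError covers unexpected failures (e.g. snapshot errors).
	SchedulingInternalError SchedulingErrorType = iota
	// FailingPredicateError means a scheduler predicate/filter rejected the pod.
	FailingPredicateError
)

// SchedulingError is the single error returned by ClusterSnapshot scheduling methods.
type SchedulingError interface {
	error
	// Type tells the caller which category the error falls into.
	Type() SchedulingErrorType
}
```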
PredicateSnapshot implements the ClusterSnapshot methods that need to run predicates on top of a ClusterSnapshotStore. testsnapshot pkg is introduced, providing functions abstracting away the snapshot creation for tests. ClusterSnapshot tests are moved near PredicateSnapshot, as it'll be the only "full" implementation.
…he ClusterSnapshotStore change
For DRA, this component will have to call the Reserve phase in addition to just checking predicates/filters. The new version also makes more sense in the context of PredicateSnapshot, which is the only context now. While refactoring, I noticed that CheckPredicates for some reason doesn't check the provided Node against the eligible Nodes returned from PreFilter (while FitsAnyNodeMatching does do that). This seems like a bug, so the check is added. The checks in FitsAnyNodeMatching are also reordered so that the cheapest ones are checked earliest.
Force-pushed from 0755451 to 054d5d2
/lgtm
// Check schedulability on only newly created node
if err := e.predicateChecker.CheckPredicates(e.clusterSnapshot, pod, estimationState.lastNodeName); err == nil {
// Try to schedule the pod on only newly created node.
if err := e.clusterSnapshot.SchedulePod(pod, estimationState.lastNodeName); err == nil {
this single imperative function call w/ err response is much cleaner now 👍
}
	// The pod can't be scheduled on the newly created node because of scheduling predicates.
nit: this comment would be better right above L177
my only open comment is a nit about comment placement, not worth holding /lgtm
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler.
To handle DRA properly, scheduling predicates/filters always need to be run whenever scheduling a pod to a node inside the snapshot (so that the DRA scheduler plugin can compute the necessary allocation). The way that the code is structured currently doesn't make this requirement obvious, and we risk future changes breaking DRA behavior (e.g. new logic that schedules pods inside the snapshot gets added, but doesn't check the predicates). This PR refactors the code so that running predicates is the default behavior when scheduling pods inside the snapshot.
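As a rough illustration of that split (a hedged sketch, not the exact interface from this PR; return types and the full method set are assumptions):

```go
type ClusterSnapshot interface {
	// SchedulePod runs the scheduler predicates/filters for the pod on the given
	// node and, only if they pass, records the pod as scheduled in the snapshot.
	SchedulePod(pod *apiv1.Pod, nodeName string) SchedulingError

	// CheckPredicates runs the predicates for the pod on the given node without
	// modifying the snapshot.
	CheckPredicates(pod *apiv1.Pod, nodeName string) SchedulingError

	// ForceAddPod records the pod in the snapshot without running any predicates;
	// skipping the checks has to be an explicit, clearly named choice.
	ForceAddPod(pod *apiv1.Pod, nodeName string) error

	// ... node management, forking, and other methods omitted from this sketch.
}
```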
Summary of changes:
Which issue(s) this PR fixes:
The CA/DRA integration is tracked in kubernetes/kubernetes#118612, this is just part of the implementation.
Special notes for your reviewer:
The first commit in the PR is just a squash of #7466 and #7479, and it shouldn't be a part of this review. The PR will be rebased on top of master after the others are merged.
This is intended to be a no-op refactor. It was extracted from #7350 after #7447, #7466, and #7479. This should be the last refactor PR, next ones will introduce actual DRA logic.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: