CA: refactor PredicateChecker into ClusterSnapshot #7497
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: towca. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from ed9232e to 27420ef
/hold
Force-pushed from e377759 to d84511f
/assign @BigDarkClown
Force-pushed from d84511f to d78b5d8
Force-pushed from d78b5d8 to e4d5002
@@ -56,6 +56,7 @@ func (p *filterOutExpendable) Process(context *context.AutoscalingContext, pods
// CA logic from before migration to scheduler framework. So let's keep it for now
func (p *filterOutExpendable) addPreemptingPodsToSnapshot(pods []*apiv1.Pod, ctx *context.AutoscalingContext) error {
	for _, p := range pods {
		// TODO(DRA): Figure out if/how to use the predicate-checking SchedulePod() here instead - otherwise this doesn't work with DRA pods.
is this post-v1.32 TODO work?
Yes, I listed all of the post-MVP work in #7530 description - this is the "Priority-based preempting pods using DRA" part.
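A purely hypothetical sketch of that post-MVP direction, for illustration only: the preempting pod would go through the predicate-checking SchedulePod() path (so the DRA plugin can compute allocations) instead of being force-added unconditionally. The ctx.ClusterSnapshot field, the method signatures, and the fallback behavior are all assumptions here, not code from this PR.

```go
for _, p := range pods {
	// Hypothetical: try the predicate-checking path first so DRA allocations get computed.
	if err := ctx.ClusterSnapshot.SchedulePod(p, p.Status.NominatedNodeName); err != nil {
		// Fall back to the current force-add behavior for pods the predicates would
		// reject - whether that's the right call for DRA pods is the open question.
		if forceErr := ctx.ClusterSnapshot.ForceAddPod(p, p.Status.NominatedNodeName); forceErr != nil {
			return forceErr
		}
	}
}
```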
@@ -223,7 +221,7 @@ func (r *RemovalSimulator) findPlaceFor(removedNode string, pods []*apiv1.Pod, n

	// remove pods from clusterSnapshot first
should we change this comment to "// unscheduled pods from clusterSnapshot first" ?
Good catch, updated the comment!
@@ -0,0 +1,126 @@
/*
Copyright 2016 The Kubernetes Authors.
nit: new files should read 2024 (let's hope we land this in 2024 :))
Done, I think I got them all
	continue
}

if !preFilterResult.AllNodes() && !preFilterResult.NodeNames.Has(nodeInfo.Node().Name) {
Is this an opportunity to rename the AllNodes method to something more descriptive like AllNodesAreEligible?
It also might be helpful for future maintainers to add a comment above these initial two if statements:
// Ensure that this node in the iteration fulfills the passed in nodeMatches filter func
// If only certain nodes are capable of running this pod,
// and if this node in the iteration isn't one of them, try the next node
We can't rename AllNodes as that's a part of the scheduler framework, but I certainly agree that we could use some comments here.
Added a bunch of comments, and it made me realize that for some reason we're not checking the PreFilter result in the single-node CheckPredicates/RunFiltersOnNode method. This seems like a bug, so I added the check. I also moved the checks around in RunFiltersUntilPassingNode so that the cheapest ones are checked earliest.
WDYT?
this all looks good
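For future readers, a hedged sketch of the resulting check ordering in RunFiltersUntilPassingNode as described above, cheapest checks first. nodeMatches, nodeInfosList and preFilterResult come from the snippets in this thread; passesFilters is a hypothetical stand-in for running the scheduler Filter plugins.

```go
for _, nodeInfo := range nodeInfosList {
	// 1. Cheapest check: does the caller-provided filter accept this node at all?
	if !nodeMatches(nodeInfo) {
		continue
	}
	// 2. PreFilter result: if only certain nodes are eligible for this pod and this
	//    node isn't one of them, skip it before running the more expensive filters.
	if !preFilterResult.AllNodes() && !preFilterResult.NodeNames.Has(nodeInfo.Node().Name) {
		continue
	}
	// 3. Most expensive: run the scheduler Filter plugins for this pod/node pair.
	if passesFilters(pod, nodeInfo) {
		return nodeInfo.Node().Name, nil
	}
}
```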
b.ResetTimer()
for i := 0; i < b.N; i++ {
	list := clusterSnapshot.data.buildNodeInfoList()
	if len(list) != tc.nodeCount+1000 {
Soooooo, if X != Y we assert that X == Y? I don't fully understand what is happening here, could you explain?
I have as much of an idea as you here, I'm just moving this from clustersnapshot_test.go 😅 Looking into the git history, it's been this way since the test was introduced; I assume this was just supposed to assert the list length. Removed the if and left just the assert, thanks for catching!
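For clarity, a minimal sketch of what the simplified benchmark body might look like after dropping the redundant if (assert.Equal usage assumed from the testify-style asserts used elsewhere in these tests):

```go
b.ResetTimer()
for i := 0; i < b.N; i++ {
	list := clusterSnapshot.data.buildNodeInfoList()
	// Assert the length unconditionally instead of only when it's already wrong.
	assert.Equal(b, tc.nodeCount+1000, len(list))
}
```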
BuildTestNode("n2", 3000, 10),
BuildTestNode("n3", 3000, 10),
BuildTestNode("n4", 3000, 10),
BuildTestNode("n5", 3000, 10),
Why did you need to update the test constants?
Hmm, this must be a leftover from an older iteration where InitializeClusterSnapshotOrDie used SchedulePod instead of ForceAddPod. It's not needed, reverted to reduce confusion.
@@ -127,7 +127,7 @@ func initializeDefaultOptions(opts *AutoscalerOptions, informerFactory informers
	opts.FrameworkHandle = fwHandle
}
if opts.ClusterSnapshot == nil {
	opts.ClusterSnapshot = predicate.NewPredicateSnapshot(base.NewBasicClusterSnapshot(), opts.FrameworkHandle)
	opts.ClusterSnapshot = predicate.NewPredicateSnapshot(base.NewBasicSnapshotBase(), opts.FrameworkHandle)
Same comment as to others - I think NewBasicClusterStorage would be better, or NewBasicClusterState. SnapshotBase does not really mean anything.
Done (NewBasicClusterStore()).
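With the agreed naming, the default wiring from the diff above would read roughly like this (the package name and exact constructor signature are assumptions):

```go
if opts.ClusterSnapshot == nil {
	opts.ClusterSnapshot = predicate.NewPredicateSnapshot(base.NewBasicClusterStore(), opts.FrameworkHandle)
}
```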
@@ -293,9 +278,9 @@ func TestDebugInfo(t *testing.T) {
	assert.NoError(t, err)

	// with default predicate checker
	defaultPredicateChecker, err := newTestPredicateChecker()
	defaultPluginnRunner, err := newTestPluginRunner(clusterSnapshot, nil)
typo: defaultPluginRunner
Done
)

// PredicateSnapshot implements ClusterSnapshot on top of a SnapshotBase by using
// SchedulerBasedPredicateChecker to check scheduler predicates.
type PredicateSnapshot struct {
	clustersnapshot.SnapshotBase
	predicateChecker *predicatechecker.SchedulerBasedPredicateChecker
	pluginRunner *SchedulerPluginRunner
nit: Maybe this could be using some abstract PluginRunner interface? Would make it easier to run unit tests.
I don't like the idea of mocking such a dependency, IMO it just makes tests less useful. Currently, unit tests can just do testsnapshot.NewTestSnapshotOrDie() to get a snapshot that will actually behave ~the same as in production code. If they were to mock the plugin runner under the snapshot instead, they would have to maintain the mock, and the snapshot could behave differently than in production code (e.g. SchedulePod() letting a Pod in that PredicateSnapshot normally wouldn't because of the predicates failing).
In general IMO we should mock/fake at the lowest possible level so that as much of the actual implementation as possible is tested. In this case, the framework.Handle level makes the most sense to me, as we can easily fake it and there's almost no actual logic there. NewTestSnapshotOrDie() just encapsulates that.
Ack
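A hedged sketch of the testing approach described in this thread: tests get a real PredicateSnapshot backed by a faked framework.Handle via the testsnapshot helper instead of mocking the plugin runner. The helper signature, the node-adding call, and the BuildTest* arguments are assumptions for illustration only.

```go
func TestScheduleWithRealPredicates(t *testing.T) {
	// Real snapshot running real Filter plugins, backed by a fake framework.Handle.
	snapshot := testsnapshot.NewTestSnapshotOrDie(t)

	// Hypothetical setup: add a node to the snapshot (exact API may differ).
	node := BuildTestNode("n1", 3000, 10)
	if err := snapshot.AddNodeInfo(framework.NewTestNodeInfo(node)); err != nil {
		t.Fatalf("adding node: %v", err)
	}

	// Because the snapshot runs the actual scheduler predicates, this behaves
	// ~the same as in production code - there is no mock to keep in sync.
	pod := BuildTestPod("p1", 100, 100)
	if err := snapshot.SchedulePod(pod, "n1"); err != nil {
		t.Fatalf("unexpected scheduling error: %v", err)
	}
}
```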
	return CreateTestPodsWithPrefix("p", n)
}

// AssignTestPodsToNodes assigns test pods to test nodes based on their index position.
I would probably rename this comment to
// AssignTestPodsToNodes distributes test pods evenly across test nodes
Done
"fmt"

"k8s.io/client-go/informers"
"k8s.io/kubernetes/pkg/scheduler/apis/config"
Yikes, having "k8s.io/kubernetes/pkg/scheduler/apis/config" and "k8s.io/kubernetes/pkg/scheduler/apis/config/latest" in the same import block. I think this is probably the best solution:
scheduler_config "k8s.io/kubernetes/pkg/scheduler/apis/config"
scheduler_config_latest "k8s.io/kubernetes/pkg/scheduler/apis/config/latest"
Done (but removed the underscores for consistency)
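With the underscores dropped, the aliased import block would look roughly like this (the other imports are kept as in the snippet above):

```go
import (
	"fmt"

	"k8s.io/client-go/informers"
	schedulerconfig "k8s.io/kubernetes/pkg/scheduler/apis/config"
	schedulerconfiglatest "k8s.io/kubernetes/pkg/scheduler/apis/config/latest"
)
```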
func (p *SchedulerPluginRunner) RunFiltersUntilPassingNode(pod *apiv1.Pod, nodeMatches func(*framework.NodeInfo) bool) (string, clustersnapshot.SchedulingError) {
	nodeInfosList, err := p.snapshotBase.ListNodeInfos()
	if err != nil {
		return "", clustersnapshot.NewSchedulingInternalError(pod, "ClusterSnapshot not provided")
Is "ClusterSnapshot not provided" the right message here for this error condition?
Nope, that's my bad. Changed to an appropriate message about listing node infos.
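A sketch of what the corrected error path might look like; the exact wording of the final message is an assumption, the point being that it describes the actual failure (listing NodeInfos) rather than a missing snapshot:

```go
nodeInfosList, err := p.snapshotBase.ListNodeInfos()
if err != nil {
	// Report what actually failed instead of "ClusterSnapshot not provided".
	return "", clustersnapshot.NewSchedulingInternalError(pod, fmt.Sprintf("error listing NodeInfos from the cluster snapshot: %v", err))
}
```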
if err != nil {
	return nil, 0, err
}

if nodeName == "" {
	nodeName = s.findNode(similarPods, clusterSnapshot, pod, loggingQuota, isNodeAcceptable)
	nodeName, err = s.trySchedule(similarPods, clusterSnapshot, pod, loggingQuota, isNodeAcceptable)
I'm curious what the upstream ramifications of this new error condition are. Would it ever make sense to have breakOnFailure set to false here?
My intention was for trySchedule and tryScheduleUsingHints to only fail on the "unexpected" errors (to cover the ForceAddPod error condition that was previously below, and now is a part of SchedulePod, which is called in trySchedule and tryScheduleUsingHints), and to return "", nil if the predicates don't pass. So if I got that right, breakOnFailure and TrySchedulePods in general should work the same as before (modulo the "truly unexpected" errors from SchedulerPluginRunner that I mention in the other comment).
@BigDarkClown Could you verify this one as well?
As I read it, the difference is that instead of the previous flow (find node using hints -> find node -> add pod) we moved to (find node using hints and schedule pod -> find node and schedule pod) in a single call. I think this is semantically the same, the only difference is that TrySchedulePods will exit earlier in its body than before, but still on the same errors (scheduling). This looks good to me.
I added comments to the trySchedule methods and refactored tryScheduleUsingHints a bit to make things clearer here.
comments help here, I don't love the density of stuff in these flows but that's all inherited, the changes here are appropriately evolutionary, thx!
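A hedged sketch of the flow described in this thread: each try* call now both finds a node and schedules the pod in the snapshot, so there is no separate "add pod" step afterwards, and only unexpected errors are propagated. The tryScheduleUsingHints argument list is assumed for illustration.

```go
nodeName, err := s.tryScheduleUsingHints(clusterSnapshot, pod, isNodeAcceptable)
if err != nil {
	// Unexpected (internal) error - surface it to the caller.
	return nil, 0, err
}
if nodeName == "" {
	// No hint worked; try scheduling the pod on any acceptable node.
	nodeName, err = s.trySchedule(similarPods, clusterSnapshot, pod, loggingQuota, isNodeAcceptable)
	if err != nil {
		return nil, 0, err
	}
}
// nodeName == "" here simply means the pod couldn't be scheduled anywhere;
// that is not an error, the pod just stays pending.
```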
	return ""
	return "", nil
} else if err != nil {
	// Unexpected error.
Do we want to classify SchedulingInternalError, FailingPredicateError as "Unexpected error" like this? The latter in particular seems like we might want to set to unschedulable like we're doing on L113 above.
Yeah, this check is definitely missing FailingPredicateError - my bad, added.
Changed this to check against SchedulingInternalError following Bartek's comment.
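A sketch of the classification agreed on here: only internal errors are treated as unexpected, while a failing predicate just means the pod can't be scheduled on that node. The Type() accessor on the returned SchedulingError is an assumption about its API.

```go
if schedErr != nil {
	if schedErr.Type() == clustersnapshot.SchedulingInternalError {
		// Truly unexpected error - abort and surface it.
		return "", schedErr
	}
	// FailingPredicateError: no node passed the predicates - not an error here.
	return "", nil
}
```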
…hecker

This decouples PredicateChecker from the Framework initialization logic, and allows creating multiple PredicateChecker instances while only initializing the framework once. This commit also fixes how CA integrates with Framework metrics: instead of Registering them, they're only Initialized, so that CA doesn't expose scheduler metrics. The initialization is also moved from multiple different places to the Handle constructor.
Force-pushed from 2a387dd to 0755451
To handle DRA properly, scheduling predicates will need to be run whenever Pods are scheduled in the snapshot. PredicateChecker always needs a ClusterSnapshot to work, and ClusterSnapshot scheduling methods need to run the predicates first. So it makes most sense to have PredicateChecker be a dependency for ClusterSnapshot implementations, and move the PredicateChecker methods to ClusterSnapshot.

This commit mirrors PredicateChecker methods in ClusterSnapshot (with the exception of FitsAnyNode, which isn't used anywhere and is trivial to do via FitsAnyNodeMatching). Further commits will remove the PredicateChecker interface and move the implementation under clustersnapshot. Dummy methods are added to current ClusterSnapshot implementations to get the tests to pass; further commits will actually implement them.

PredicateError is refactored into a broader SchedulingError so that the ClusterSnapshot methods can return a single error that the callers can use to distinguish between a failing predicate and other, unexpected errors.
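A hedged sketch of the SchedulingError shape this commit message describes: a single error type whose kind lets callers tell a failing predicate apart from an unexpected internal failure. The exact names and method set are assumptions, not the PR's definition.

```go
// SchedulingErrorType distinguishes the two broad failure categories.
type SchedulingErrorType int

const (
	// SchedulingInternalError covers unexpected failures (e.g. snapshot errors).
	SchedulingInternalError SchedulingErrorType = iota
	// FailingPredicateError means a scheduler predicate/filter rejected the pod.
	FailingPredicateError
)

// SchedulingError is the single error returned by ClusterSnapshot scheduling methods.
type SchedulingError interface {
	error
	// Type tells the caller which category the error falls into.
	Type() SchedulingErrorType
}
```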
PredicateSnapshot implements the ClusterSnapshot methods that need to run predicates on top of a ClusterSnapshotStore. testsnapshot pkg is introduced, providing functions abstracting away the snapshot creation for tests. ClusterSnapshot tests are moved near PredicateSnapshot, as it'll be the only "full" implementation.
…he ClusterSnapshotStore change
For DRA, this component will have to call the Reserve phase in addition to just checking predicates/filters. The new version also makes more sense in the context of PredicateSnapshot, which is the only context now. While refactoring, I noticed that CheckPredicates for some reason doesn't check the provided Node against the eligible Nodes returned from PreFilter (while FitsAnyNodeMatching does do that). This seems like a bug, so the check is added. The checks in FitsAnyNodeMatching are also reordered so that the cheapest ones are checked earliest.
Force-pushed from 0755451 to 054d5d2
/lgtm
// Check schedulability on only newly created node
if err := e.predicateChecker.CheckPredicates(e.clusterSnapshot, pod, estimationState.lastNodeName); err == nil {
// Try to schedule the pod on only newly created node.
if err := e.clusterSnapshot.SchedulePod(pod, estimationState.lastNodeName); err == nil {
this single imperative function call w/ err response is much cleaner now 👍
}
	// The pod can't be scheduled on the newly created node because of scheduling predicates.
nit: this comment would be better right above L177
my only open comment is a nit about comment placement, not worth holding /lgtm
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler.
To handle DRA properly, scheduling predicates/filters always need to be run whenever scheduling a pod to a node inside the snapshot (so that the DRA scheduler plugin can compute the necessary allocation). The way that the code is structured currently doesn't make this requirement obvious, and we risk future changes breaking DRA behavior (e.g. new logic that schedules pods inside the snapshot gets added, but doesn't check the predicates). This PR refactors the code so that running predicates is the default behavior when scheduling pods inside the snapshot.
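As a rough illustration of that split (a hedged sketch, not the exact interface from this PR; return types and the full method set are assumptions):

```go
type ClusterSnapshot interface {
	// SchedulePod runs the scheduler predicates/filters for the pod on the given
	// node and, only if they pass, records the pod as scheduled in the snapshot.
	SchedulePod(pod *apiv1.Pod, nodeName string) SchedulingError

	// CheckPredicates runs the predicates for the pod on the given node without
	// modifying the snapshot.
	CheckPredicates(pod *apiv1.Pod, nodeName string) SchedulingError

	// ForceAddPod records the pod in the snapshot without running any predicates;
	// skipping the checks has to be an explicit, clearly named choice.
	ForceAddPod(pod *apiv1.Pod, nodeName string) error

	// ... node management, forking, and other methods omitted from this sketch.
}
```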
Summary of changes:
Which issue(s) this PR fixes:
The CA/DRA integration is tracked in kubernetes/kubernetes#118612, this is just part of the implementation.
Special notes for your reviewer:
The first commit in the PR is just a squash of #7466 and #7479, and it shouldn't be a part of this review. The PR will be rebased on top of master after the others are merged.
This is intended to be a no-op refactor. It was extracted from #7350 after #7447, #7466, and #7479. This should be the last refactor PR, next ones will introduce actual DRA logic.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: