✨ Add MachineDrainRule "WaitCompleted" #11545

Open

wants to merge 1 commit into main from pod-drain-waitcompleted

Conversation

vincepri (Member) commented Dec 5, 2024

What this PR does / why we need it:

This PR adds the ability for drain to wait for the completion of specific Pods. This is useful in scenarios where drain is either handled outside the context of kubectl drain after a Node is cordoned, or for long-running batch Jobs that should be allowed to terminate on their own.
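
For illustration, a minimal sketch of how a workload could opt into the new behavior via the `cluster.x-k8s.io/drain: wait-completed` Pod label used in this PR (the helper name is made up for the example):

import (
	appsv1 "k8s.io/api/apps/v1"
)

// markPodsWaitCompleted labels a Deployment's Pod template so that Cluster API's drain
// applies the WaitCompleted behavior to its Pods instead of evicting them.
func markPodsWaitCompleted(deployment *appsv1.Deployment) {
	if deployment.Spec.Template.Labels == nil {
		deployment.Spec.Template.Labels = map[string]string{}
	}
	// Pods carrying this label are never evicted; drain waits for them to complete on their own.
	deployment.Spec.Template.Labels["cluster.x-k8s.io/drain"] = "wait-completed"
}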

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 5, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area PR is missing an area label size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 5, 2024
@vincepri vincepri force-pushed the pod-drain-waitcompleted branch 2 times, most recently from bab1c1b to 211839e Compare December 11, 2024 19:41
@vincepri vincepri added the area/machine Issues or PRs related to machine lifecycle management label Dec 12, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-area PR is missing an area label label Dec 12, 2024
@vincepri vincepri added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. do-not-merge/needs-area PR is missing an area label labels Dec 12, 2024
api/v1beta1/machinedrainrules_types.go (resolved)
api/v1beta1/machinedrainrules_types.go (outdated, resolved)
internal/controllers/machine/drain/drain.go (outdated, resolved)
internal/controllers/machine/drain/filters.go (outdated, resolved)
internal/controllers/machine/drain/filters.go (resolved)
internal/controllers/machine/drain/drain.go (resolved)
@@ -498,6 +507,10 @@ func (r EvictionResult) ConditionMessage(nodeDrainStartTime *metav1.Time) string
		conditionMessage = fmt.Sprintf("%s\nAfter above Pods have been removed from the Node, the following Pods will be evicted: %s",
			conditionMessage, PodListToString(r.PodsToTriggerEvictionLater, 3))
	}
	if len(r.PodsToWaitCompleted) > 0 {
Member:
@fabriziopandini Is this something that should be specifically handled on higher levels when the condition bubbles up? (like we handled PDBs etc.)

(could also be a follow-up PR)

Member:

Yes, you are correct, we have to handle the new use case in:

if deletingCondition.Reason == clusterv1.MachineDeletingDrainingNodeV1Beta2Reason && time.Since(machine.Status.Deletion.NodeDrainStartTime.Time) > 5*time.Minute {
	delayReasons := []string{}
	if strings.Contains(deletingCondition.Message, "cannot evict pod as it would violate the pod's disruption budget.") {
		delayReasons = append(delayReasons, "PodDisruptionBudgets")
	}
	if strings.Contains(deletingCondition.Message, "deletionTimestamp set, but still not removed from the Node") {
		delayReasons = append(delayReasons, "Pods not terminating")
	}
	if strings.Contains(deletingCondition.Message, "failed to evict Pod") {
		delayReasons = append(delayReasons, "Pod eviction errors")
	}
	if len(delayReasons) > 0 {
		msg += fmt.Sprintf(", delay likely due to %s", strings.Join(delayReasons, ", "))
	}
}

for _, machine := range machines {
	if !machine.GetDeletionTimestamp().IsZero() && time.Since(machine.GetDeletionTimestamp().Time) > time.Minute*15 {
		machineNames = append(machineNames, machine.GetName())
		deletingCondition := v1beta2conditions.Get(machine, clusterv1.MachineDeletingV1Beta2Condition)
		if deletingCondition != nil &&
			deletingCondition.Status == metav1.ConditionTrue &&
			deletingCondition.Reason == clusterv1.MachineDeletingDrainingNodeV1Beta2Reason &&
			machine.Status.Deletion != nil && time.Since(machine.Status.Deletion.NodeDrainStartTime.Time) > 5*time.Minute {
			if strings.Contains(deletingCondition.Message, "cannot evict pod as it would violate the pod's disruption budget.") {
				delayReasons.Insert("PodDisruptionBudgets")
			}
			if strings.Contains(deletingCondition.Message, "deletionTimestamp set, but still not removed from the Node") {
				delayReasons.Insert("Pods not terminating")
			}
			if strings.Contains(deletingCondition.Message, "failed to evict Pod") {
				delayReasons.Insert("Pod eviction errors")
			}
		}
	}
}
and a similar func for MD and KCP.
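
A rough sketch of how the new case could be folded into those checks; the drainDelayReasons helper and the matched "waiting for completion" substring are assumptions and have to follow whatever condition message format is finally chosen:

import (
	"strings"

	"k8s.io/apimachinery/pkg/util/sets"
)

// drainDelayReasons sketches how the delay-reason detection above could cover the new
// behavior. The existing substrings are copied from the snippets in this comment; the
// "waiting for completion" substring is an assumption tied to the condition message
// discussed further down in this review.
func drainDelayReasons(conditionMessage string) sets.Set[string] {
	delayReasons := sets.New[string]()
	if strings.Contains(conditionMessage, "cannot evict pod as it would violate the pod's disruption budget.") {
		delayReasons.Insert("PodDisruptionBudgets")
	}
	if strings.Contains(conditionMessage, "deletionTimestamp set, but still not removed from the Node") {
		delayReasons.Insert("Pods not terminating")
	}
	if strings.Contains(conditionMessage, "failed to evict Pod") {
		delayReasons.Insert("Pod eviction errors")
	}
	// New case for WaitCompleted Pods (assumed message substring).
	if strings.Contains(conditionMessage, "waiting for completion") {
		delayReasons.Insert("Pods waiting for completion")
	}
	return delayReasons
}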

Member:

@fabriziopandini This PR or follow-up? I lean towards follow-up, mostly just to not expand the scope of this PR further

api/v1beta1/machinedrainrules_types.go (outdated, resolved)
api/v1beta1/machinedrainrules_types.go (outdated, resolved)
internal/controllers/machine/drain/drain.go (outdated, resolved)
internal/controllers/machine/drain/filters.go (resolved)
internal/controllers/machine/drain/filters.go (outdated, resolved)
internal/controllers/machine/drain/filters.go (outdated, resolved)
internal/controllers/machine/drain/drain.go (resolved)
@vincepri vincepri changed the title WIP: ✨ Add MachineDrainRule "WaitCompleted" ✨ Add MachineDrainRule "WaitCompleted" Jan 7, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2025
@vincepri vincepri added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. and removed do-not-merge/needs-area PR is missing an area label labels Jan 7, 2025
@vincepri vincepri force-pushed the pod-drain-waitcompleted branch from 211839e to 533803e Compare January 7, 2025 19:02
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from vincepri. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 7, 2025
@vincepri vincepri removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 9, 2025
@vincepri vincepri force-pushed the pod-drain-waitcompleted branch 2 times, most recently from 8826d5b to e22bc4d Compare January 9, 2025 15:18
vincepri (Member, Author) commented Jan 9, 2025

/test ?

@k8s-ci-robot (Contributor):

@vincepri: The following commands are available to trigger required jobs:

/test pull-cluster-api-build-main
/test pull-cluster-api-e2e-blocking-main
/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-32-1-33-main
/test pull-cluster-api-test-main
/test pull-cluster-api-test-mink8s-main
/test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

/test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-apidiff-main
pull-cluster-api-build-main
pull-cluster-api-e2e-blocking-main
pull-cluster-api-test-main
pull-cluster-api-verify-main

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

vincepri (Member, Author) commented Jan 9, 2025

/test pull-cluster-api-e2e-main

@vincepri vincepri force-pushed the pod-drain-waitcompleted branch from e22bc4d to 1ebe3c2 Compare January 9, 2025 20:10
vincepri (Member, Author) commented Jan 9, 2025

/test pull-cluster-api-e2e-main

@vincepri vincepri force-pushed the pod-drain-waitcompleted branch from 1ebe3c2 to ba555de Compare January 9, 2025 20:31
vincepri (Member, Author) commented Jan 9, 2025

/test pull-cluster-api-e2e-main

@vincepri vincepri force-pushed the pod-drain-waitcompleted branch from ba555de to 6650be9 Compare January 9, 2025 21:36
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 9, 2025
Signed-off-by: Vince Prignano <vince@prigna.com>
@vincepri vincepri force-pushed the pod-drain-waitcompleted branch from 6650be9 to dbf82f8 Compare January 9, 2025 23:51
vincepri (Member, Author):

/test pull-cluster-api-e2e-main

@k8s-ci-robot (Contributor):

@vincepri: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-cluster-api-e2e-main
Commit: dbf82f8
Required: true
Rerun command: /test pull-cluster-api-e2e-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

// "Drain" means that the Pods to which this MachineDrainRule applies will be drained.
// If behavior is set to "Drain" the order in which Pods are drained can be configured
// with the order field. When draining Pods of a Node the Pods will be grouped by order
// and one group after another will be drained (by increasing order). Cluster API will
// wait until all Pods of a group are terminated / removed from the Node before starting
// with the next group.
// "Skip" means that the Pods to which this MachineDrainRule applies will be skipped during drain.
// "WaitCompleted" means that the pods to which this MachineDrainRule applies will never be evicted
Member:
Should we add a sentence that we assume Order 0 and that Order cannot be configured if behavior is WaitCompleted?
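
A minimal sketch of such a validation, kept deliberately generic (plain string/int32 parameters rather than the actual API types):

import "errors"

// validateDrainBehavior sketches the rule asked for above: order must not be set when
// behavior is WaitCompleted, and order 0 is assumed in that case. Illustrative only,
// not the actual webhook code.
func validateDrainBehavior(behavior string, order *int32) error {
	if behavior == "WaitCompleted" && order != nil {
		return errors.New("spec.drain.order must not be set when spec.drain.behavior is WaitCompleted (order 0 is assumed)")
	}
	return nil
}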

@@ -147,8 +147,8 @@ metadata:
  namespace: default
spec:
  drain:
    # behavior can be Drain or Skip
    behavior: Drain
    # behavior can be Drain, Skip, or WaitCompleted
Member:
Let's please add an additional user story above that covers WaitCompleted

@@ -259,6 +260,8 @@ The following describes the new algorithm that decides which Pods should be drai
  * => use `behavior: Skip`
* If the Pod has `cluster.x-k8s.io/drain: skip`
  * => use `behavior: Skip`
* If the Pod has `cluster.x-k8s.io/drain: wait-completed`
  * => use `behavior: WaitCompleted`
Member:
Suggested change
* => use `behavior: WaitCompleted`
* => use `behavior: WaitCompleted` and `order: 0`

@@ -498,6 +507,10 @@ func (r EvictionResult) ConditionMessage(nodeDrainStartTime *metav1.Time) string
		conditionMessage = fmt.Sprintf("%s\nAfter above Pods have been removed from the Node, the following Pods will be evicted: %s",
			conditionMessage, PodListToString(r.PodsToTriggerEvictionLater, 3))
	}
	if len(r.PodsToWaitCompleted) > 0 {
Member:
@fabriziopandini This PR or follow-up? I lean towards follow-up, mostly just to not expand the scope of this PR further

@@ -498,6 +512,10 @@ func (r EvictionResult) ConditionMessage(nodeDrainStartTime *metav1.Time) string
		conditionMessage = fmt.Sprintf("%s\nAfter above Pods have been removed from the Node, the following Pods will be evicted: %s",
			conditionMessage, PodListToString(r.PodsToTriggerEvictionLater, 3))
	}
Member:
I think we should move the if for PodsToWaitCompleted above the one for PodsToTriggerEvictionLater

The message for PodsToTriggerEvictionLater contains "After above Pods ..." and this only makes sense if a potential "Waiting for the following Pods..." message is actually above

Member:
Hm okay, this is a bit more complicated.

The idea is basically that this condition message surfaces where the drain is currently stuck (and which Pods we are going to evict next).

The problem with the current behavior is that "WaitCompleted" Pods sometimes block further progress and sometimes they don't (it depends on the order of the Pods that we have to evict).

I would suggest to treat WaitCompleted Pods similar to PodsToTriggerEviction.

This means splitting them up into podsToWaitCompletedNow and podsToWaitCompletedLater in EvictPods (based on whether ptr.Deref(pod.Status.DrainOrder, 0) == minDrainOrder).

I would then also update the log statement in l.303 accordingly.

Once that is done we can also write the condition message here accordingly. I would suggest:

	if len(r.PodsToWaitCompletedNow) > 0 {
		kind := "Pod"
		if len(r.PodsToWaitCompletedNow) > 1 {
			kind = "Pods"
		}
		// Note: the code computing stale warning for the machine deleting condition is making assumptions on the format/content of this message.
		// Same applies for other conditions where deleting is involved, e.g. MachineSet's Deleting and ScalingDown condition.
		conditionMessage = fmt.Sprintf("%s\n* %s %s: waiting for completion",
			conditionMessage, kind, PodListToString(r.PodsToWaitCompletedNow, 3))
	}

	if len(r.PodsToTriggerEvictionLater) > 0 {
		conditionMessage = fmt.Sprintf("%s\nAfter above Pods have been removed from the Node, the following Pods will be evicted: %s",
			conditionMessage, PodListToString(r.PodsToTriggerEvictionLater, 3))
	}

	if len(r.PodsToWaitCompletedLater) > 0 {
		conditionMessage = fmt.Sprintf("%s\nAfter above Pods have been removed from the Node, waiting for the following Pods to complete without eviction: %s",
			conditionMessage, PodListToString(r.PodsToWaitCompletedLater, 3))
	}
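
For reference, a minimal, self-contained sketch of that split; the podDelete shape here only approximates the drain package's internal type:

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"
)

// podDelete approximates the drain package's Pod-plus-status pairing; only the fields
// needed for the sketch are included.
type podDelete struct {
	Pod        *corev1.Pod
	DrainOrder *int32
}

// splitWaitCompleted groups WaitCompleted Pods into the ones gating the current drain
// group (order == minDrainOrder) and the ones that only become relevant later.
func splitWaitCompleted(pods []podDelete, minDrainOrder int32) (now, later []*corev1.Pod) {
	for _, pd := range pods {
		if ptr.Deref(pd.DrainOrder, 0) == minDrainOrder {
			now = append(now, pd.Pod)
		} else {
			later = append(later, pd.Pod)
		}
	}
	return now, later
}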

	Namespace:    "unevictable-workload",
	NodeSelector: map[string]string{nodeOwnerLabelKey: "KubeadmControlPlane-" + controlplane.Name},
	ModifyDeployment: func(deployment *appsv1.Deployment) {
		deployment.Spec.Template.Labels["cluster.x-k8s.io/drain"] = string(clusterv1.MachineDrainRuleDrainBehaviorWaitCompleted)
Member:
Suggested change
deployment.Spec.Template.Labels["cluster.x-k8s.io/drain"] = string(clusterv1.MachineDrainRuleDrainBehaviorWaitCompleted)
deployment.Spec.Template.Labels["cluster.x-k8s.io/drain"] = "wait-completed"

nit: I know both work, but maybe let's test the one that we document in the API (same in l.256)

@@ -207,7 +207,7 @@ func NodeDrainTimeoutSpec(ctx context.Context, inputGetter func() NodeDrainTimeo
}

By("Deploy Deployment with unevictable Pods on control plane and MachineDeployment Nodes.")
Member:
Please update the test docs in l.79++

This should be

// NodeDrainTimeoutSpec goes through the following steps:
// * Create cluster with 3 CP & 1 worker Machine
// * Ensure Node label is set & NodeDrainTimeout is set to 0 (wait forever)
// * Deploy MachineDrainRules
// * Deploy Deployment with unevictable Pods on CP & MD Nodes
// * Deploy Deployment with unevictable Pods with `wait-completed` label on CP & MD Nodes
// * Deploy Deployment with evictable Pods with finalizer on CP & MD Nodes
// * Deploy additional resources if defined in input
// * Trigger Node drain by scaling down the control plane to 1 and MachineDeployments to 0
// * Get draining control plane and MachineDeployment Machines
// * Verify drain of Deployments with order 1
// * Verify drain of Deployments with order 5
// * Verify skipped Pods are still there and don't have a deletionTimestamp
// * Verify wait-completed Pods are still there and don't have a deletionTimestamp
// * Verify Node drains for control plane and MachineDeployment Machines are blocked (by PDBs and WaitCompleted Pods)
// * Set NodeDrainTimeout to 1s to unblock Node drain
// * Verify machine deletion is blocked by waiting for volume detachment (only if VerifyNodeVolumeDetach is enabled)
// * Unblocks waiting for volume detachment (only if VerifyNodeVolumeDetach is enabled)
// * Verify scale down succeeded because Node drains were unblocked.

(we'll have to change the order values, see comment below)

Expect(skippedMDPods.Items).To(HaveLen(1))
Expect(skippedMDPods.Items[0].DeletionTimestamp.IsZero()).To(BeTrue())
}

By("Verify Node drains for control plane and MachineDeployment Machines are blocked (only by PDBs)")
Member:
Suggested change
By("Verify Node drains for control plane and MachineDeployment Machines are blocked (only by PDBs)")
By("Verify Node drains for control plane and MachineDeployment Machines are blocked (by PDBs and WaitCompleted Pods)")

@@ -233,6 +233,31 @@ func NodeDrainTimeoutSpec(ctx context.Context, inputGetter func() NodeDrainTimeo
WaitForDeploymentAvailableInterval: input.E2EConfig.GetIntervals(specName, "wait-deployment-available"),
})
}
By("Deploy an unevictable Deployment with `wait-completed` label.")
Member:
I think in general the test can work roughly as is.

But as we are now assuming order 0 for WaitCompleted Pods, we have to change the order of some other Pods to make sure they can be drained (currently the drain of these Pods is blocked by the WaitCompleted Pods).

tl;dr I think the Pods with drain order 1 should get order -5, the Pods with drain order 5 should get order -1

Then all of these Pods can be evicted / drained


// DeployPodAndWait will deploy a Deployment on a ControlPlane or MachineDeployment.
func DeployPodAndWait(ctx context.Context, input DeployPodAndWaitInput) {
	Expect(input.DeploymentName).ToNot(BeNil(), "Need a deployment name in DeployUnevictablePod")
Member:
Suggested change
Expect(input.DeploymentName).ToNot(BeNil(), "Need a deployment name in DeployUnevictablePod")
Expect(input.DeploymentName).ToNot(BeNil(), "Need a deployment name in DeployPodAndWait")

nit, same in the few lines below

Labels
area/machine - Issues or PRs related to machine lifecycle management
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
priority/important-soon - Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.