
[0.1] Machine controller: drain node before machine deletion #1103

Conversation

michaelgugino (Contributor)

What this PR does / why we need it:
Centralizes node-drain behavior into cluster-api instead of down-stream actuators.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #994

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:

action required: the machine-controller now attempts to cordon and drain nodes before deletion. Actuators should be updated to remove this behavior.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 2, 2019
@k8s-ci-robot k8s-ci-robot requested review from justinsb and vincepri July 2, 2019 18:47
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jul 2, 2019

// Drain node before deletion
// If a machine is not linked to a node, just delete the machine.
if _, exists := m.ObjectMeta.Annotations[clusterv1.ExcludeNodeDrainingAnnotation]; !exists && m.Status.NodeRef != nil {
michaelgugino (Contributor, Author):

Can't combine this check with the m.Status.NodeRef != nil check below, because we delete the instance before deleting the node.
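The guard quoted above can be sketched in isolation (a minimal sketch with hypothetical stand-in types; the annotation key string is illustrative, and the real controller uses clusterv1.Machine with m.ObjectMeta.Annotations and m.Status.NodeRef):

```go
package main

import "fmt"

// Hypothetical stand-in for clusterv1.ExcludeNodeDrainingAnnotation;
// the actual key string lives in the cluster-api types.
const excludeNodeDrainingAnnotation = "machine.cluster.k8s.io/exclude-node-draining"

// Machine is a simplified stand-in for the cluster-api Machine type.
type Machine struct {
	Annotations map[string]string
	NodeRef     *string // nil when the machine was never linked to a node
}

// shouldDrain mirrors the guard in the PR: drain only when the machine
// is linked to a node and the exclude annotation is absent. The NodeRef
// check is kept separate from the later node-deletion check because the
// instance is deleted before the node is.
func shouldDrain(m *Machine) bool {
	_, excluded := m.Annotations[excludeNodeDrainingAnnotation]
	return !excluded && m.NodeRef != nil
}

func main() {
	node := "node-1"
	fmt.Println(shouldDrain(&Machine{NodeRef: &node})) // linked node, no annotation: drain
	fmt.Println(shouldDrain(&Machine{}))               // no NodeRef: skip drain, just delete
}
```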

@michaelgugino michaelgugino force-pushed the nd-v0.1-2 branch 2 times, most recently from 44dd5ab to bc50543 Compare July 2, 2019 19:35
ncdc (Contributor) commented Jul 3, 2019

@michaelgugino does the OpenShift drain library respect pod disruption budgets? What happens if some pod sticks around and won't go away?

detiber (Member) commented Jul 3, 2019

/lgtm
/assign @vincepri

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 3, 2019
michaelgugino (Contributor, Author) replied:

> @michaelgugino does the OpenShift drain library respect pod disruption budgets? What happens if some pod sticks around and won't go away?

@ncdc yes, it does respect them. If a pod cannot be evicted due to disruption budgets, the drain cannot complete, and the request is requeued by the machine-controller. In cases where it can never complete (in our case, we have etcd quorum on 3 masters with a PDB of 1, so if you try to delete more than one master, you fail), it will just keep requeuing and not forcefully delete anything.
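The requeue-on-blocked-drain behavior described here can be sketched as follows (a minimal hypothetical sketch, not the actual cluster-api code; the error value and function shapes are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// errPDBViolation stands in for the eviction failure the drain library
// surfaces when a PodDisruptionBudget blocks an eviction.
var errPDBViolation = errors.New("cannot evict pod: disruption budget exhausted")

// reconcileDelete sketches the behavior described above: if the drain
// cannot complete, the error propagates so the controller requeues the
// request, and the instance is never force-deleted.
func reconcileDelete(drain, deleteInstance func() error) error {
	if err := drain(); err != nil {
		return fmt.Errorf("requeue after drain failure: %w", err)
	}
	return deleteInstance()
}

func main() {
	blocked := func() error { return errPDBViolation }
	ok := func() error { return nil }
	fmt.Println(reconcileDelete(blocked, ok)) // drain blocked: non-nil error, requeued
	fmt.Println(reconcileDelete(ok, ok))      // drain succeeded: instance deleted
}
```

The key property is that deleteInstance never runs while the drain keeps failing, matching the "keep requeuing, never force-delete" behavior described in the comment.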

@ncdc ncdc changed the title Machine controller: drain node before machine deletion [0.1] Machine controller: drain node before machine deletion Jul 5, 2019
k8s-ci-robot commented:

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 10, 2019
k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: michaelgugino
To complete the pull request process, please assign detiber
You can assign the PR to them by writing /assign @detiber in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

chuckha (Contributor) left a comment:

We should find an alternative library or pull in the drain functionality from kubectl if the imports aren't terrible (note: they probably are)


"github.com/go-log/log/info"
kubedrain "github.com/openshift/kubernetes-drain"
A contributor commented on the kubernetes-drain import:

I have some reservations about this import. The project itself looks abandoned: there is no README, the last commit was 10 months ago, and it is not structured like a standard Go library. It does not use any dependency management, and it is pulling in an additional logging library that we don't want.

We should avoid importing repos like this that are likely to go stale/are unmaintained. I don't want sig-SCL to adopt maintenance of this library, we already have plenty to keep up to date.

michaelgugino (Contributor, Author) replied:

It's not likely to go stale or unmaintained, because we ship it in OpenShift. In any case, this is no different from vendoring any number of dependencies that you see in the wild. Here's an example of said 'non-standard' type of library: https://github.com/petar/GoLLRB

It's not actually pulling in an additional logging library; that library is just a simple interface (discussed previously).

I've looked into adding support from the kubectl bits as well; we decided against that already, if you check out the resolved conversations.

If there's another library you can suggest, I'm happy to use that instead. Or we could vendor the code here so it doesn't go stale. Or we can be good consumers, consume from the community, and contribute fixes back.

ncdc (Contributor):

How hard would it be to vendor both the cmd and library portions from k/k?

michaelgugino (Contributor, Author) replied:

@ncdc I looked into that as well. The cmd portion is the tricky part: it imports a lot of other things from k/k, so it would need a total refactor. I started down this road, but it quickly became a mess.

I think we should either use it as-is, or we should copy the code into cluster-api. I'm in favor of the former; then, when the kubectl stuff gets pulled into its own project, we can consume that instead.

michaelgugino (Contributor, Author):

@ncdc I confirm this is a bug. It seems we're not actually waiting for the pod to be deleted, which is not what we want.

chuckha (Contributor) commented Jul 10, 2019

Fine with me to merge the OpenShift library for this version.

ncdc (Contributor) commented Jul 10, 2019

/hold

I am reviewing the openshift drain code for comparison with kubernetes/kubernetes. I'll remove the hold once I'm done.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 10, 2019
ncdc (Contributor) commented Jul 10, 2019

@michaelgugino do you know why this is ignoring the pod's namespace and switching to use the one defined in the drain options?

getPodFn := func(namespace, name string) (*corev1.Pod, error) {
	return client.CoreV1().Pods(options.Namespace).Get(name, metav1.GetOptions{})
}

https://github.com/openshift/kubernetes-drain/blob/c2e51be1758efa30d71a4d30dc4e2db86b70a4df/drain.go#L401-L403

michaelgugino (Contributor, Author) replied:

@ncdc based on the git-blame, it looks like it's a bug, but it's definitely the code we're using. I'll confirm.
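The namespace bug can be reproduced with a small closure sketch (hypothetical fake lookup standing in for client-go's client.CoreV1().Pods(namespace).Get; function names here are illustrative, not from the real code):

```go
package main

import (
	"errors"
	"fmt"
)

// pods is a fake pod store keyed by "namespace/name", standing in for
// the API server.
var pods = map[string]string{"kube-system/coredns": "coredns"}

// fakeGet stands in for client.CoreV1().Pods(namespace).Get.
func fakeGet(namespace, name string) (string, error) {
	if p, ok := pods[namespace+"/"+name]; ok {
		return p, nil
	}
	return "", errors.New("pod not found")
}

// optionsNamespace models options.Namespace from the drain options.
var optionsNamespace = "default"

// buggyGetPod mirrors the linked code: it ignores the namespace
// argument and always looks in options.Namespace.
func buggyGetPod(namespace, name string) (string, error) {
	return fakeGet(optionsNamespace, name)
}

// fixedGetPod uses the pod's own namespace, as the follow-up fix does.
func fixedGetPod(namespace, name string) (string, error) {
	return fakeGet(namespace, name)
}

func main() {
	_, err := buggyGetPod("kube-system", "coredns")
	fmt.Println("buggy lookup failed:", err != nil) // misses: wrong namespace
	_, err = fixedGetPod("kube-system", "coredns")
	fmt.Println("fixed lookup succeeded:", err == nil)
}
```

With the buggy closure, any pod outside options.Namespace appears "gone" immediately, which would explain the drain not actually waiting for pod deletion.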

ingvagabund and others added 3 commits July 12, 2019 08:38
The node draining code itself is imported from github.com/openshift/kubernetes-drain.
This commit updates kubernetes-drain to remove the namespace bug.

kubernetes-drain pr: openshift/kubernetes-drain#1
michaelgugino (Contributor, Author) commented:

@ncdc updated the kubernetes-drain library. PTAL.

k8s-ci-robot commented:

@michaelgugino: The following tests failed, say /retest to rerun them all:

Test name (commit e629901) — rerun command:
- pull-cluster-api-integration: /test pull-cluster-api-integration
- pull-cluster-api-make: /test pull-cluster-api-make
- pull-cluster-api-build: /test pull-cluster-api-build
- pull-cluster-api-test: /test pull-cluster-api-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot commented:

@michaelgugino: PR needs rebase.


@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 1, 2019
detiber (Member) commented Aug 21, 2019

Now that kubernetes/kubernetes#80045 has merged, can you please update to using a temporary copy of that implementation?

michaelgugino (Contributor, Author) replied:

> Now that kubernetes/kubernetes#80045 has merged, can you please update to using a temporary copy of that implementation?

@detiber will do. I'll try to get to it this afternoon.

ncdc (Contributor) commented Aug 23, 2019

@michaelgugino hi! Do you think you'll be able to get to this today or early next week?

ncdc (Contributor) commented Oct 7, 2019

We are no longer working on v0.1.x, and #1096 will add this to v0.2.x.
/close

k8s-ci-robot commented:

@ncdc: Closed this PR.

In response to this:

> We are no longer working on v0.1.x, and #1096 will add this to v0.2.x.
> /close

