
[0.1] Machine controller: drain node before machine deletion #1103

Conversation

michaelgugino (Contributor)

What this PR does / why we need it:
Centralizes node-drain behavior into cluster-api instead of down-stream actuators.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #994

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:

action required: the machine-controller now attempts to cordon and drain nodes before deletion. Actuators should be updated to remove this behavior.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 2, 2019
@k8s-ci-robot k8s-ci-robot requested review from justinsb and vincepri July 2, 2019 18:47
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jul 2, 2019

// Drain node before deletion
// If a machine is not linked to a node, just delete the machine.
if _, exists := m.ObjectMeta.Annotations[clusterv1.ExcludeNodeDrainingAnnotation]; !exists && m.Status.NodeRef != nil {
michaelgugino (Contributor, Author):

Can't combine this check with the m.Status.NodeRef != nil check below, because we delete the instance before deleting the node.
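The guard quoted above can be sketched in isolation (a minimal sketch with hypothetical stand-in types; the annotation key string is illustrative, and the real controller uses clusterv1.Machine with m.ObjectMeta.Annotations and m.Status.NodeRef):

```go
package main

import "fmt"

// Hypothetical stand-in for clusterv1.ExcludeNodeDrainingAnnotation;
// the actual key string lives in the cluster-api types.
const excludeNodeDrainingAnnotation = "machine.cluster.k8s.io/exclude-node-draining"

// Machine is a simplified stand-in for the cluster-api Machine type.
type Machine struct {
	Annotations map[string]string
	NodeRef     *string // nil when the machine was never linked to a node
}

// shouldDrain mirrors the guard in the PR: drain only when the machine
// is linked to a node and the exclude annotation is absent. The NodeRef
// check is kept separate from the later node-deletion check because the
// instance is deleted before the node is.
func shouldDrain(m *Machine) bool {
	_, excluded := m.Annotations[excludeNodeDrainingAnnotation]
	return !excluded && m.NodeRef != nil
}

func main() {
	node := "node-1"
	fmt.Println(shouldDrain(&Machine{NodeRef: &node})) // linked node, no annotation: drain
	fmt.Println(shouldDrain(&Machine{}))               // no NodeRef: skip drain, just delete
}
```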

@michaelgugino michaelgugino force-pushed the nd-v0.1-2 branch 2 times, most recently from 44dd5ab to bc50543 Compare July 2, 2019 19:35
ncdc (Contributor) commented Jul 3, 2019

@michaelgugino does the OpenShift drain library respect pod disruption budgets? What happens if some pod sticks around and won't go away?

detiber (Member) commented Jul 3, 2019

/lgtm
/assign @vincepri

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 3, 2019
michaelgugino (Contributor, Author) replied:

> @michaelgugino does the OpenShift drain library respect pod disruption budgets? What happens if some pod sticks around and won't go away?

@ncdc yes, it does respect them. If a pod cannot be evicted due to disruption budgets, the drain cannot complete, and the request is requeued by the machine-controller. In cases where it can never complete (in our case, we have etcd quorum on 3 masters with a PDB of 1, so if you try to delete more than one master, you fail), it will just keep requeuing and not forcefully delete anything.
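The requeue-on-blocked-drain behavior described here can be sketched as follows (a minimal hypothetical sketch, not the actual cluster-api code; the error value and function shapes are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// errPDBViolation stands in for the eviction failure the drain library
// surfaces when a PodDisruptionBudget blocks an eviction.
var errPDBViolation = errors.New("cannot evict pod: disruption budget exhausted")

// reconcileDelete sketches the behavior described above: if the drain
// cannot complete, the error propagates so the controller requeues the
// request, and the instance is never force-deleted.
func reconcileDelete(drain, deleteInstance func() error) error {
	if err := drain(); err != nil {
		return fmt.Errorf("requeue after drain failure: %w", err)
	}
	return deleteInstance()
}

func main() {
	blocked := func() error { return errPDBViolation }
	ok := func() error { return nil }
	fmt.Println(reconcileDelete(blocked, ok)) // drain blocked: non-nil error, requeued
	fmt.Println(reconcileDelete(ok, ok))      // drain succeeded: instance deleted
}
```

The key property is that deleteInstance never runs while the drain keeps failing, matching the "keep requeuing, never force-delete" behavior described in the comment.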

@ncdc ncdc changed the title Machine controller: drain node before machine deletion [0.1] Machine controller: drain node before machine deletion Jul 5, 2019
k8s-ci-robot commented:

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 10, 2019
k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: michaelgugino
To complete the pull request process, please assign detiber
You can assign the PR to them by writing /assign @detiber in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

chuckha (Contributor) left a comment:

We should find an alternative library or pull in the drain functionality from kubectl if the imports aren't terrible (note: they probably are)


"github.com/go-log/log/info"
kubedrain "github.com/openshift/kubernetes-drain"
A contributor commented on the kubernetes-drain import:

I have some reservations about this import. The project itself looks abandoned: there is no README, the last commit was 10 months ago, and it is not structured like a standard Go library. It does not use any dependency management, and it is pulling in an additional logging library that we don't want.

We should avoid importing repos like this that are likely to go stale/are unmaintained. I don't want sig-SCL to adopt maintenance of this library, we already have plenty to keep up to date.

michaelgugino (Contributor, Author) replied:

It's not likely to go stale or unmaintained, because we ship it in OpenShift. In any case, this is no different from vendoring any number of dependencies that you see in the wild. Here's an example of said 'non-standard' type of library: https://github.com/petar/GoLLRB

It's not actually pulling in an additional logging library; that library is just a simple interface (discussed previously).

I've looked into adding support from the kubectl bits as well; we decided against that already, if you check out the resolved conversations.

If there's another library you can suggest, I'm happy to use that instead. Or we could vendor the code here so it doesn't go stale. Or we can be good consumers, consume from the community, and contribute fixes back.

ncdc (Contributor):

How hard would it be to vendor both the cmd and library portions from k/k?

michaelgugino (Contributor, Author) replied:

@ncdc I looked into that as well. The cmd portion is the tricky part: it imports a lot of other things from k/k, so it would need a total refactor. I started down this road, but it quickly became a mess.

I think we should either use it as-is, or we should copy the code into cluster-api. I'm in favor of the former; then, when the kubectl stuff gets pulled into its own project, we can consume that instead.

michaelgugino (Contributor, Author):

@ncdc I confirm this is a bug. It seems we're not actually waiting for the pod to be deleted, which is not what we want.

chuckha (Contributor) commented Jul 10, 2019

Fine with me to merge the OpenShift library for this version.

ncdc (Contributor) commented Jul 10, 2019

/hold

I am reviewing the openshift drain code for comparison with kubernetes/kubernetes. I'll remove the hold once I'm done.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 10, 2019
ncdc (Contributor) commented Jul 10, 2019

@michaelgugino do you know why this is ignoring the pod's namespace and switching to use the one defined in the drain options?

getPodFn := func(namespace, name string) (*corev1.Pod, error) {
	return client.CoreV1().Pods(options.Namespace).Get(name, metav1.GetOptions{})
}

https://github.com/openshift/kubernetes-drain/blob/c2e51be1758efa30d71a4d30dc4e2db86b70a4df/drain.go#L401-L403

michaelgugino (Contributor, Author) replied:

@ncdc based on the git-blame, it looks like it's a bug, but it's definitely the code we're using. I'll confirm.
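The namespace bug can be reproduced with a small closure sketch (hypothetical fake lookup standing in for client-go's client.CoreV1().Pods(namespace).Get; function names here are illustrative, not from the real code):

```go
package main

import (
	"errors"
	"fmt"
)

// pods is a fake pod store keyed by "namespace/name", standing in for
// the API server.
var pods = map[string]string{"kube-system/coredns": "coredns"}

// fakeGet stands in for client.CoreV1().Pods(namespace).Get.
func fakeGet(namespace, name string) (string, error) {
	if p, ok := pods[namespace+"/"+name]; ok {
		return p, nil
	}
	return "", errors.New("pod not found")
}

// optionsNamespace models options.Namespace from the drain options.
var optionsNamespace = "default"

// buggyGetPod mirrors the linked code: it ignores the namespace
// argument and always looks in options.Namespace.
func buggyGetPod(namespace, name string) (string, error) {
	return fakeGet(optionsNamespace, name)
}

// fixedGetPod uses the pod's own namespace, as the follow-up fix does.
func fixedGetPod(namespace, name string) (string, error) {
	return fakeGet(namespace, name)
}

func main() {
	_, err := buggyGetPod("kube-system", "coredns")
	fmt.Println("buggy lookup failed:", err != nil) // misses: wrong namespace
	_, err = fixedGetPod("kube-system", "coredns")
	fmt.Println("fixed lookup succeeded:", err == nil)
}
```

With the buggy closure, any pod outside options.Namespace appears "gone" immediately, which would explain the drain not actually waiting for pod deletion.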

ingvagabund and others added 3 commits July 12, 2019 08:38
The node draining code itself is imported from github.com/openshift/kubernetes-drain.
This commit updates kubernetes-drain to remove the namespace bug.

kubernetes-drain pr: openshift/kubernetes-drain#1
michaelgugino (Contributor, Author) commented:

@ncdc updated the kubernetes-drain library. PTAL.

k8s-ci-robot commented:

@michaelgugino: The following tests failed, say /retest to rerun them all:

Test name (commit e629901) — rerun command:
- pull-cluster-api-integration: /test pull-cluster-api-integration
- pull-cluster-api-make: /test pull-cluster-api-make
- pull-cluster-api-build: /test pull-cluster-api-build
- pull-cluster-api-test: /test pull-cluster-api-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot commented:

@michaelgugino: PR needs rebase.


@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 1, 2019
detiber (Member) commented Aug 21, 2019

Now that kubernetes/kubernetes#80045 has merged, can you please update to using a temporary copy of that implementation?

michaelgugino (Contributor, Author) replied:

> Now that kubernetes/kubernetes#80045 has merged, can you please update to using a temporary copy of that implementation?

@detiber will do. I'll try to get to it this afternoon.

ncdc (Contributor) commented Aug 23, 2019

@michaelgugino hi! Do you think you'll be able to get to this today or early next week?

ncdc (Contributor) commented Oct 7, 2019

We are no longer working on v0.1.x, and #1096 will add this to v0.2.x.
/close

k8s-ci-robot commented:

@ncdc: Closed this PR.

In response to this:

> We are no longer working on v0.1.x, and #1096 will add this to v0.2.x.
> /close

