Consider critical-components-not-ready taint when machine joins first time #778

Merged
himanshu-kun merged 7 commits into gardener:master on Feb 17, 2023

Conversation

SimonKienzler
Contributor

@SimonKienzler SimonKienzler commented Feb 2, 2023

What this PR does / why we need it:

A node taint (node.gardener.cloud/critical-components-not-ready) is introduced in gardener/gardener#7406 as part of gardener/gardener#7117. This taint needs to be considered by the MCM when it checks the machine health, as a machine is only supposed to be Ready if the introduced taint is not present.
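As a rough illustration of the check described above, here is a minimal sketch of detecting the taint on a node. The `Taint` type is a simplified, hypothetical stand-in for the Kubernetes node taint type, and the helper name is made up for this example; it is not the code changed in this PR.

```go
package main

import "fmt"

// Taint is a simplified, hypothetical stand-in for a Kubernetes node taint.
// Only the key matters for this sketch.
type Taint struct {
	Key string
}

// criticalComponentsNotReadyKey is the taint key named in this PR.
const criticalComponentsNotReadyKey = "node.gardener.cloud/critical-components-not-ready"

// hasCriticalComponentsNotReadyTaint (hypothetical helper) reports whether
// any of the given taints carries the critical-components-not-ready key.
func hasCriticalComponentsNotReadyTaint(taints []Taint) bool {
	for _, t := range taints {
		if t.Key == criticalComponentsNotReadyKey {
			return true
		}
	}
	return false
}

func main() {
	tainted := []Taint{{Key: criticalComponentsNotReadyKey}}
	fmt.Println(hasCriticalComponentsNotReadyTaint(tainted)) // true
	fmt.Println(hasCriticalComponentsNotReadyTaint(nil))     // false
}
```

A health check could then treat the machine as not yet Ready while this helper returns true for the node's taints.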

Co-Authored-By: @timebertt

Which issue(s) this PR fixes:
Part of gardener/gardener#7117

Special notes for your reviewer:

The change has been tested locally in combination with machine-controller-manager-provider-local in order to verify correct behaviour.

Release note:

A Machine object won't move from the `Pending` to the `Running` state while the `node.gardener.cloud/critical-components-not-ready` taint is present on the corresponding node.

@SimonKienzler SimonKienzler requested a review from a team as a code owner February 2, 2023 15:43
@gardener-robot

@SimonKienzler Thank you for your contribution.

@gardener-robot gardener-robot added needs/review Needs review size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Feb 2, 2023
@gardener-robot-ci-3
Contributor

Thank you @SimonKienzler for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.

Contributor

@himanshu-kun himanshu-kun left a comment

From the gardener issue I understand that this taint is relevant only the first time the machine becomes Running. The current changes in the PR also let the taint affect the health mechanism of MCM. The health mechanism is only relevant after a machine has become Running once, so I have suggested changes that leave it unaffected.
Please let me know if you think otherwise.
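To make the distinction concrete, here is a minimal sketch of a phase decision in which the taint only gates the first Pending to Running transition. The `Phase` type and the `nextPhase` function are hypothetical names for this example, and the handling of unhealthy nodes is deliberately left out; this is not the actual MCM implementation.

```go
package main

import "fmt"

// Phase is a simplified, hypothetical stand-in for the machine phase.
type Phase string

const (
	Pending Phase = "Pending"
	Running Phase = "Running"
	Unknown Phase = "Unknown"
)

// nextPhase (hypothetical) decides the phase of a machine from its current
// phase, the node's health, and whether the node still carries the
// critical-components-not-ready taint.
func nextPhase(current Phase, nodeHealthy, hasTaint bool) Phase {
	if !nodeHealthy {
		// Unhealthy nodes are out of scope here; the sketch leaves the
		// phase unchanged instead of modelling health timeouts.
		return current
	}
	if current == Pending && hasTaint {
		// First join: stay Pending until the taint is removed.
		return Pending
	}
	// After the machine has been Running once (e.g. Unknown or Running),
	// the taint no longer influences the phase.
	return Running
}

func main() {
	fmt.Println(nextPhase(Pending, true, true))  // Pending
	fmt.Println(nextPhase(Pending, true, false)) // Running
	fmt.Println(nextPhase(Unknown, true, true))  // Running
}
```

With this shape, the regular health mechanism for machines that have already been Running stays independent of the taint.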

pkg/util/provider/machinecontroller/machine_util.go (three outdated, resolved review threads)
@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Feb 3, 2023
@SimonKienzler
Contributor Author

@himanshu-kun thanks for your feedback, I gave it a go - please let me know what you think.

@himanshu-kun himanshu-kun added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Feb 6, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Feb 6, 2023
Contributor

@himanshu-kun himanshu-kun left a comment

Could you also add 3 test cases in a different Context block under ##General Machine Health Reconciliation in pkg/util/provider/machinecontroller/machine_util_test.go:

  • "Pending machine which is healthy but node with the taint shouldn't be marked Running"
  • "Pending machine which is healthy and node without the taint should be marked Running"
  • "Unknown machine which is healthy, and node with the taint should be marked Running"

pkg/util/provider/machinecontroller/machine_util.go (outdated, resolved review thread)
@SimonKienzler
Contributor Author

@himanshu-kun I added the test cases. ##General Machine Health Reconciliation is a DescribeTable test, I'm not aware of a way to subdivide the entries into different Contexts. Hope it's alright like this.

@himanshu-kun himanshu-kun self-assigned this Feb 8, 2023
@himanshu-kun himanshu-kun added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Feb 8, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Feb 8, 2023
Contributor

@himanshu-kun himanshu-kun left a comment

Asked for one test case. Otherwise looks good.

@himanshu-kun himanshu-kun added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Feb 9, 2023
@gardener-robot-ci-3 gardener-robot-ci-3 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Feb 9, 2023
@himanshu-kun
Contributor

@SimonKienzler could you give me editing rights to this PR? I have some changes which I'd like to make.

@SimonKienzler
Contributor Author

@himanshu-kun Unfortunately, I cannot (see https://github.com/orgs/community/discussions/5634). Can we make it work in another way?

@timebertt
Member

/assign
I will take over this PR while @SimonKienzler is unavailable.

@timebertt
Member

@himanshu-kun I cherry-picked your commits. I think this PR should be ready to merge :)

@gardener-robot gardener-robot added size/l Size of pull request is large (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else and removed size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Feb 15, 2023
@himanshu-kun
Contributor

We discussed internally and have a common concern: should the node.gardener.cloud/critical-components-not-ready taint be deciding the Phase of a machine at all?
A Running machine itself means that the node is healthy, with all the relevant node conditions true, kubelet running, etc. A taint is just a constraint from the Kubernetes side for the scheduler. From what I can think of, keeping the machine Pending because of the taint has only one benefit: the EveryNodeReady condition of the shoot would not turn green until critical shoot system components are ready. But then we also have the SystemComponentsRunning condition for all system components.
@timebertt do you see any other benefits which I might have missed?

@timebertt
Member

I would say this is necessary to prevent rolling updates from tumbling. Otherwise, MCM would continue the rolling update even if the customer workload cannot be scheduled on the node yet because it is not fully ready.

From gardener/gardener#7117:

make machine-controller-manager aware of the new checks and only continue rolling updates once all node-critical pods got ready in addition to waiting for the Ready condition (already done today)
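
The effect on rolling updates can be sketched as follows. This is a simplified model, assuming a rollout that only removes the next old machine while enough machines are Running; the `Machine` type, `canReplaceNext`, and the `maxUnavailable` arithmetic are illustrative assumptions, not MCM's actual rollout code.

```go
package main

import "fmt"

// Machine is a simplified, hypothetical stand-in for a machine object.
type Machine struct {
	Name  string
	Phase string
}

// countRunning returns how many machines are in the Running phase.
func countRunning(machines []Machine) int {
	n := 0
	for _, m := range machines {
		if m.Phase == "Running" {
			n++
		}
	}
	return n
}

// canReplaceNext (hypothetical) reports whether the rollout may remove one
// more Running machine without dropping below desired-maxUnavailable.
func canReplaceNext(machines []Machine, desired, maxUnavailable int) bool {
	minAvailable := desired - maxUnavailable
	return countRunning(machines)-1 >= minAvailable
}

func main() {
	machines := []Machine{
		{Name: "old-1", Phase: "Running"},
		{Name: "old-2", Phase: "Running"},
		// New machine whose node still carries the taint, so it stays Pending.
		{Name: "new-1", Phase: "Pending"},
	}
	fmt.Println(canReplaceNext(machines, 3, 1)) // false

	// Once the taint is removed, the machine becomes Running.
	machines[2].Phase = "Running"
	fmt.Println(canReplaceNext(machines, 3, 1)) // true
}
```

Because a tainted node keeps its machine Pending, the rollout in this model pauses until node-critical pods are ready, which is the behaviour requested in the quoted issue.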

@himanshu-kun himanshu-kun added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Feb 16, 2023
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Feb 16, 2023
@timebertt
Member

@himanshu-kun please tell me if there is anything missing to merge this PR :)

@himanshu-kun
Contributor

No, there isn't, @timebertt. Thanks for the explanation.

Contributor

@himanshu-kun himanshu-kun left a comment

/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/changes Needs (more) changes needs/review Needs review needs/second-opinion Needs second review by someone else labels Feb 17, 2023
@himanshu-kun himanshu-kun merged commit cfede93 into gardener:master Feb 17, 2023
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Feb 17, 2023
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Feb 17, 2023
@himanshu-kun himanshu-kun changed the title Consider critical-components-not-ready taint when checking machine health Consider critical-components-not-ready taint when machine joins first time Feb 17, 2023
@timebertt timebertt deleted the feature/wait-node-critical branch February 17, 2023 06:50
@himanshu-kun himanshu-kun added this to the v0.49 milestone Mar 20, 2023
himanshu-kun added a commit that referenced this pull request Apr 18, 2023
Labels
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) reviewed/lgtm Has approval for merging reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) size/l Size of pull request is large (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
7 participants