Turn `CrashloopBackoff` machines to `Running` quicker #806

rishabh-11 · 2023-04-12T10:40:32Z

What this PR does / why we need it:
This PR moves machines in the `CrashLoopBackOff` (CLBF) phase to `Pending` if the `driver.CreateMachine` call succeeded in the previous reconciliation. This lets `reconcileMachineHealth` move the machine from `Pending` to `Running` when the machine is reconciled because of the node object's creation.
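
As a rough illustration of the transition described above, here is a minimal, self-contained sketch. It is not the code from this PR: `Phase` is a simplified stand-in for the real `v1alpha1` phase type, and `nextPhaseAfterCreate` and its `createSucceeded` parameter are hypothetical names used only to show the intended rule (a successful create moves an empty or `CrashLoopBackOff` phase to `Pending`).

```go
package main

// Phase is a simplified stand-in for the machine phase type (assumption).
type Phase string

const (
	PhasePending          Phase = "Pending"
	PhaseRunning          Phase = "Running"
	PhaseCrashLoopBackOff Phase = "CrashLoopBackOff"
)

// nextPhaseAfterCreate is a hypothetical helper. createSucceeded stands in
// for "driver.CreateMachine returned no error" in the last reconciliation.
func nextPhaseAfterCreate(current Phase, createSucceeded bool) Phase {
	if !createSucceeded {
		// Simplification: a failed create keeps the machine in the backoff phase.
		return PhaseCrashLoopBackOff
	}
	if current == "" || current == PhaseCrashLoopBackOff {
		// A fresh or previously crash-looping machine becomes Pending, so the
		// health reconciliation can later promote it to Running.
		return PhasePending
	}
	// Any other phase is left untouched by this sketch.
	return current
}
```

With this rule, a machine whose VM was finally created no longer waits in `CrashLoopBackOff`; the node-creation event finds it in `Pending` and the health check can take over.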

Which issue(s) this PR fixes:
Fixes #805

Special notes for your reviewer:

Release note:

`CrashloopBackoff` machines will turn to `Running` quicker

rishabh-11 · 2023-04-12T10:40:53Z

/invite @himanshu-kun

himanshu-kun

Thanks for the quick PR

Two points:

- Kindly open a draft PR for these kinds of reviews.
- Address these review comments first, and then I'll let you know about the test cases to add.

himanshu-kun · 2023-04-12T10:42:30Z

pkg/util/provider/app/options/options.go

@@ -109,6 +110,8 @@ func (s *MCServer) AddFlags(fs *pflag.FlagSet) {
 	fs.StringVar(&s.NodeConditions, "node-conditions", s.NodeConditions, "List of comma-separated/case-sensitive node-conditions which when set to True will change machine to a failed state after MachineHealthTimeout duration. It may further be replaced with a new machine if the machine is backed by a machine-set object.")
 	fs.StringVar(&s.BootstrapTokenAuthExtraGroups, "bootstrap-token-auth-extra-groups", s.BootstrapTokenAuthExtraGroups, "Comma-separated list of groups to set bootstrap token's \"auth-extra-groups\" field to")

+	logs.AddFlags(fs)


Why is this needed? Didn't we deal with this in an earlier PR?

This is needed for the provider to work. We didn't handle it in the earlier PR. I didn't think a separate PR was needed for this, so I added it here.

OK, in that case could you add a small comment above it saying that it adds the --v flags?

OK, I will add a comment.

himanshu-kun · 2023-04-12T10:50:48Z

pkg/util/provider/machinecontroller/machine_util.go

+			} else if clone.Status.CurrentStatus.Phase == v1alpha1.MachineCrashLoopBackOff {
+				return machineutils.ShortRetry, fmt.Errorf("node object not yet created for Machine %s", machine.Name)
 			}


Instead of this, we could update the if condition where reconcileMachineHealth is called so that it ignores CrashLoopBackOff machines.
This will save us a ShortRetry, as triggerCreationFlow will be called directly.
It makes sense, as reconcileMachineHealth should only be entered for a machine in the Unknown, Pending, or Running phase.
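
The suggested guard could look roughly like the sketch below. It is an assumption-laden illustration, not the controller's actual code: `Phase` is a simplified stand-in for the real phase type, and `shouldReconcileHealth` is a hypothetical helper name for the condition around the reconcileMachineHealth call.

```go
package main

// Phase is a simplified stand-in for the machine phase type (assumption).
type Phase string

const (
	PhaseUnknown          Phase = "Unknown"
	PhasePending          Phase = "Pending"
	PhaseRunning          Phase = "Running"
	PhaseCrashLoopBackOff Phase = "CrashLoopBackOff"
)

// shouldReconcileHealth is a hypothetical predicate: health reconciliation is
// entered only for the three phases named in the review comment. Everything
// else, including CrashLoopBackOff, skips it and falls through to creation.
func shouldReconcileHealth(p Phase) bool {
	switch p {
	case PhaseUnknown, PhasePending, PhaseRunning:
		return true
	default:
		return false
	}
}
```

Expressing the rule as an allow-list rather than "everything except CrashLoopBackOff" keeps any other phase out of the health check by default, which is a design choice of this sketch.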

Let's say we don't call reconcileMachineHealth. Even then, we will have a LongRetry from the machine reconciliation flow:

machine-controller-manager/pkg/util/provider/machinecontroller/machine.go

Line 187 in f873486

return machineutils.LongRetry, nil

Yes, but this is the behaviour that was there before as well.
When the node is created, the event for it should send the machine for reconciliation again, because the machine object will now be in the Pending state.

It won't be in the Pending state. Even if the machine creation is successful, the object will still be in the CrashLoopBackOff state. This is because of the following check:

machine-controller-manager/pkg/util/provider/machinecontroller/machine.go

Line 500 in f873486

if machine.Status.CurrentStatus.Phase == "" {

We will have to change triggerCreationFlow to update the status, or add an explicit condition in the machine reconciliation flow to use a ShortRetry for CLBF machines.

The reason is that the condition status.Node != nodeName was removed here. This condition enabled the CrashLoopBackOff -> Pending transition when VM creation is successful.

So we should fix this.
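
To make the effect of the removed condition concrete, here is a small sketch comparing the two checks. It rests on assumptions: that the removed `status.Node != nodeName` comparison was combined with the empty-phase check using a logical OR, and that a crash-looping machine has no node name recorded yet. The `machineStatus` struct and both function names are hypothetical.

```go
package main

// Phase is a simplified stand-in for the machine phase type (assumption).
type Phase string

const PhaseCrashLoopBackOff Phase = "CrashLoopBackOff"

// machineStatus holds only the two fields relevant to this sketch.
type machineStatus struct {
	Node  string
	Phase Phase
}

// pendingWithNodeCheck models the earlier behaviour (assumed shape): the
// machine is moved to Pending when the recorded node differs from the node
// name returned by creation, or when no phase is set yet.
func pendingWithNodeCheck(s machineStatus, nodeName string) bool {
	return s.Node != nodeName || s.Phase == ""
}

// pendingWithoutNodeCheck models the check quoted in the thread, where only
// an empty phase triggers the move to Pending.
func pendingWithoutNodeCheck(s machineStatus) bool {
	return s.Phase == ""
}
```

For a machine in `CrashLoopBackOff` with an empty `Node`, the first predicate is true for any non-empty node name while the second is false, which matches the stuck behaviour discussed in this thread.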

Kindly also add a test case for the same.

Update the docstring for the CrashLoopBackOff state, explicitly stating that this state means there are no infra resources.

himanshu-kun · 2023-04-12T10:51:36Z

pkg/util/provider/machinecontroller/machine_util.go

+			} else if clone.Status.CurrentStatus.Phase == v1alpha1.MachineCrashLoopBackOff {
+				return machineutils.ShortRetry, fmt.Errorf("node object not Ready for Machine %s", machine.Name)
 			}


same as above

himanshu-kun

/lgtm

himanshu-kun · 2023-04-17T10:55:37Z

/needs cherry-pick rel-v0.48

gardener-robot · 2023-04-17T10:55:42Z

@himanshu-kun Label needs/rel-v0.48 does not exist.

…unning` quicker (#807) * fix reconcileMachineHealth for clbf machines * address review comments * add unit test for clbf to pending machine

fix reconcileMachineHealth for clbf machines

f873486

rishabh-11 requested a review from a team as a code owner April 12, 2023 10:40

gardener-robot added needs/review Needs review size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Apr 12, 2023

gardener-robot requested a review from himanshu-kun April 12, 2023 10:40

rishabh-11 assigned himanshu-kun Apr 12, 2023

himanshu-kun suggested changes Apr 12, 2023

gardener-robot added the needs/changes Needs (more) changes label Apr 12, 2023

himanshu-kun assigned rishabh-11 and unassigned himanshu-kun Apr 12, 2023

himanshu-kun added this to the v0.49 milestone Apr 13, 2023

rishabh-11 added 2 commits April 14, 2023 11:22

address review comments

87eae38

make generate

44f6939

rishabh-11 requested a review from himanshu-kun April 14, 2023 05:53

gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 14, 2023

update docstring for clbf

cd25cae

gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 14, 2023

rishabh-11 added the needs/cherry-pick Needs to be cherry-picked to older version label Apr 14, 2023

add comment

29e130e

gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 17, 2023

gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 17, 2023

add unit test for clbf to pending machine

4f1363c

gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 17, 2023

himanshu-kun approved these changes Apr 17, 2023

gardener-robot added reviewed/lgtm Has approval for merging and removed needs/changes Needs (more) changes needs/review Needs review needs/second-opinion Needs second review by someone else labels Apr 17, 2023

gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 17, 2023

himanshu-kun merged commit 29d8222 into gardener:master Apr 17, 2023

gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Apr 17, 2023

rishabh-11 mentioned this pull request Apr 17, 2023

Automated cherry pick of #806: Turn CrashloopBackoff machines to Running quicker #807

Merged

himanshu-kun removed the needs/cherry-pick Needs to be cherry-picked to older version label May 25, 2023
