Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Turn CrashloopBackoff machines to Running quicker #806

Merged
merged 6 commits into from
Apr 17, 2023

Conversation

rishabh-11
Copy link
Contributor

@rishabh-11 rishabh-11 commented Apr 12, 2023

What this PR does / why we need it:
This PR turns CLBF machines to Pending if the driver.CreateMachine call was successful in the previous reconciliation. This will help the reconcileMachineHealth to turn the Pending to the Running phase when the machine is reconciled due to node object creation.

Which issue(s) this PR fixes:
Fixes #805

Special notes for your reviewer:

Release note:

`CrashloopBackoff` machines will turn to `Running` quicker

@rishabh-11 rishabh-11 requested a review from a team as a code owner April 12, 2023 10:40
@gardener-robot gardener-robot added needs/review Needs review size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Apr 12, 2023
@rishabh-11
Copy link
Contributor Author

/invite @himanshu-kun

@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 12, 2023
Copy link
Contributor

@himanshu-kun himanshu-kun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick PR

2 points:

  • kindly open draft PR for these kind of reviews
  • address these review comments first and then I'll let you know about the test cases to add.

@@ -109,6 +110,8 @@ func (s *MCServer) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&s.NodeConditions, "node-conditions", s.NodeConditions, "List of comma-separated/case-sensitive node-conditions which when set to True will change machine to a failed state after MachineHealthTimeout duration. It may further be replaced with a new machine if the machine is backed by a machine-set object.")
fs.StringVar(&s.BootstrapTokenAuthExtraGroups, "bootstrap-token-auth-extra-groups", s.BootstrapTokenAuthExtraGroups, "Comma-separated list of groups to set bootstrap token's \"auth-extra-groups\" field to")

logs.AddFlags(fs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed , didn't we deal with this in an earlier PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the provider to work. We didn't handle this in the earlier PR. I thought a separate PR is not needed for this, so I added it here only

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok in that case could you add a small comment above it saying it adds --v flags

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I will add a comment

Comment on lines 616 to 618
} else if clone.Status.CurrentStatus.Phase == v1alpha1.MachineCrashLoopBackOff {
return machineutils.ShortRetry, fmt.Errorf("node object not yet created for Machine %s", machine.Name)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this , we could update the if condition where reconcileMachineHealth is called , to ignore CrashloopBackoff machine.
This will save us a shortRetry , as the triggerCreationFlow will be called directly
It makes sense as we reconcileMachineHealth should only be entered for a machine with Unknown or Pending or Running status machine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say we don't call reconcileMachineHealth, then also we will have a longRetry from the machine reconciliation flow

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but this is the behaviour which used to be there before also
When node is created, then the event for it should again send the machine for reconciliation because now machine obj will be in Pending state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't be in a Pending state. Even if the machine creation is successful, the object will still be in a CrashLoopBackOff state. This is because of

if machine.Status.CurrentStatus.Phase == "" {
. We will have to make a change in triggerCreationFlow to change the status or add an explicit condition in machine reconciliation flow to have a shortRetry for CLBF machines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is here the condition status.Node != nodeName was removed. This condition enabled transition of CrashLoopBackoff -> Pending if VM creation is successful.

  • So we should fix this.
  • Kindly also add a test case for the same
  • update the docstring for CrashLoopBackoff state, explicitly telling that this states means there are no infra resources

Comment on lines 684 to 686
} else if clone.Status.CurrentStatus.Phase == v1alpha1.MachineCrashLoopBackOff {
return machineutils.ShortRetry, fmt.Errorf("node object not Ready for Machine %s", machine.Name)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Apr 12, 2023
@himanshu-kun himanshu-kun added this to the v0.49 milestone Apr 13, 2023
@gardener-robot gardener-robot added kind/api-change API change with impact on API users needs/second-opinion Needs second review by someone else size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) and removed size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Apr 14, 2023
@rishabh-11 rishabh-11 requested a review from himanshu-kun April 14, 2023 05:53
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 14, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 14, 2023
@rishabh-11 rishabh-11 added the needs/cherry-pick Needs to be cherry-picked to older version label Apr 14, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 17, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 17, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 17, 2023
Copy link
Contributor

@himanshu-kun himanshu-kun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/changes Needs (more) changes needs/review Needs review needs/second-opinion Needs second review by someone else labels Apr 17, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 17, 2023
@himanshu-kun
Copy link
Contributor

/needs cherry-pick rel-v0.48

@gardener-robot
Copy link

@himanshu-kun Label needs/rel-v0.48 does not exist.

@himanshu-kun himanshu-kun merged commit 29d8222 into gardener:master Apr 17, 2023
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Apr 17, 2023
himanshu-kun pushed a commit that referenced this pull request Apr 17, 2023
…unning` quicker (#807)

* fix reconcileMachineHealth for clbf machines

* address review comments

* add unit test for clbf to pending machine
@himanshu-kun himanshu-kun removed the needs/cherry-pick Needs to be cherry-picked to older version label May 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/api-change API change with impact on API users needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) reviewed/lgtm Has approval for merging reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Regression] CrashloopBackoff don't turn to Running quickly
5 participants