Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid the deletion of the machines in CrashLoopBackoff state by the safety controller #589

Merged
merged 1 commit into from
Jan 29, 2021

Conversation

AxiomSamarth
Copy link
Contributor

@AxiomSamarth AxiomSamarth commented Jan 21, 2021

What this PR does / why we need it:
This PR will avoid the deletion of the machines in CrashLoopBackoff state by the safety controller

Which issue(s) this PR fixes:
Fixes #583

Special notes for your reviewer:
More UTs shall be added to test this scenario in the near future. This one is an immediate solution to mitigate the issue.

Release note:

Avoid the deletion of the machines in CrashLoopBackoff state by the safety controller

@AxiomSamarth AxiomSamarth requested review from ggaurav10 and a team as code owners January 21, 2021 07:30
@gardener-robot gardener-robot added needs/review Needs review size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Jan 21, 2021
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 21, 2021
@gardener-robot-ci-3 gardener-robot-ci-3 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 21, 2021
@prashanth26
Copy link
Contributor

Hi @AxiomSamarth ,

We would also need this fix and test on the OOT code path here - https://github.com/gardener/machine-controller-manager/blob/master/pkg/util/provider/machinecontroller/machine_safety.go#L222.

@AxiomSamarth
Copy link
Contributor Author

Sure, @prashanth26 I will make the changes and update the PR. Thank you :)

@AxiomSamarth
Copy link
Contributor Author

AxiomSamarth commented Jan 21, 2021

@amshuman-kr @prashanth26 For writing this UT, I had to introduce a new method Add() in the driver interface. Should it be retained or now that we understand how to handle the scenario, should we remove the Add() method from the interface and of course driver implementations?

I also see a similar thing to do in the out of tree code as well. I will have to add a new method in the interface. Am I doing it correctly or is there a different approach?

@amshuman-kr
Copy link

amshuman-kr commented Jan 21, 2021

@AxiomSamarth Thanks for the unit test!

@amshuman-kr @prashanth26 For writing this UT, I had to introduce a new method Add() in the driver interface. Should it be retained or now that we understand how to handle the scenario, should we remove the Add() method from the interface and of course driver implementations?

I don't think it is a good idea to enhance the Driver interface for a functionality that is only used in the unit tests. That leads to nop implementations in proper drivers which is not desirable.

You can enhance just the FakeDriver to have an additional add function field and also make it implement an Add function which delegates to the add function field (but without adding the Add function to the Driver interface). You can also enhance the NewFakeDriver to take the additional function parameter. Then in the unit tests you can typecast the Driver instance returned by the NewFakeDriver to the FakeDriver struct type and call the Add function. This way you get the feature needed by the unit tests without disturbing other Driver implementations.

I also see a similar thing to do in the out of tree code as well. I will have to add a new method in the interface. Am I doing it correctly or is there a different approach?

You can use pretty much the same approach as I mentioned above for OOT as well. Or you can chose to even generate mocks for the OOT interface and simulate the responses that way. I am fine either way.

@amshuman-kr
Copy link

You can use pretty much the same approach as I mentioned above for OOT as well. Or you can chose to even generate mocks for the OOT interface and simulate the responses that way. I am fine either way.

I think the mock approach is more suitable for OOT.

@prashanth26 prashanth26 changed the title Race condition Avoid the deletion of the machines in CrashLoopBackoff state by the safety controller Jan 22, 2021
@gardener-robot gardener-robot added size/l Size of pull request is large (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else and removed size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Jan 25, 2021
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 25, 2021
@AxiomSamarth
Copy link
Contributor Author

The recommended changes are made and pushed to this PR branch.

  • The Driver Interface is restored to as it was before.
  • FakeDriver struct implements the necessary methods for the unit tests for both in-tree and out-of-tree.
  • The proposed logic by @prashanth26 to avoid the machines from being deleted by the safety controller is also included in this for both in-tree and out-of-tree safety controller code.

Looking forward for review comments/suggestions.

Copy link

@amshuman-kr amshuman-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for a small suggestion to move the fakeVMs as a private field in the FakeDriver struct for the OoT case.

pkg/util/provider/driver/fake.go Outdated Show resolved Hide resolved
@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Jan 25, 2021
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 25, 2021
Copy link

@amshuman-kr amshuman-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the instance construction inside the NewFakeDriver func. The func can take what parameters are required.

pkg/util/provider/driver/fake.go Outdated Show resolved Hide resolved
pkg/util/provider/driver/fake.go Outdated Show resolved Hide resolved
pkg/util/provider/machinecontroller/ms_test.go Outdated Show resolved Hide resolved
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 25, 2021
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 25, 2021
@amshuman-kr
Copy link

amshuman-kr commented Jan 27, 2021

@AxiomSamarth Thanks for the changes. I am not sure there is a good mapping between the in-tree and the out-of-tree the way VMs are added and verified. The in-tree test case is clear to read. The way the fake driver in OoT stores, nodeName etcd in the struct, is the problem I think. I would suggest to improve the fake driver struct and the implementation to be more general (handle multiple VMs/nodes etc.)

@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 28, 2021
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 28, 2021
@AxiomSamarth
Copy link
Contributor Author

Hi Amshu,

I have not really made changes to the structure of the FakeDriver struct. Probably, because I could not think of any better way. However, I have modified the AddMachine method and also employed this AddMachine method in the NewFakeDriver struct to support handling multiple fakeVMs.

Also, with this change, I could add multiple fakeVMs and see through the test results that the machine in crashloopBackOff state remains from deletion while the other test machine in running state gets deleted.

Copy link

@amshuman-kr amshuman-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes. LGTM :-)

Copy link
Contributor

@prashanth26 prashanth26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. Just a minor change. Also, squash all the commits once you are done.

pkg/controller/ms_test.go Outdated Show resolved Hide resolved
pkg/util/provider/machinecontroller/ms_test.go Outdated Show resolved Hide resolved
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 28, 2021
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 28, 2021
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 28, 2021
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 28, 2021
Copy link
Contributor

@prashanth26 prashanth26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/changes Needs (more) changes needs/review Needs review needs/second-opinion Needs second review by someone else labels Jan 29, 2021
@AxiomSamarth AxiomSamarth merged commit 6eb440c into gardener:master Jan 29, 2021
@AxiomSamarth AxiomSamarth deleted the race-condition branch January 29, 2021 06:05
@prashanth26
Copy link
Contributor

This fix should also resolve https://github.com/gardener/machine-controller-manager/issues/544 according to colleagues from MS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) reviewed/lgtm Has approval for merging size/l Size of pull request is large (see gardener-robot robot/bots/size.py)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race condition while creating machines in CrashLoopBackoff state
7 participants