Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow deletion to proceed in case of VM initialization error #928

Merged
merged 3 commits into from
Jul 15, 2024

Conversation

rishabh-11
Copy link
Contributor

What this PR does / why we need it:
This PR fixes the triggerDeletionFlow, specifically the getVMStatus function to allow deletion of Unitialized VMs to proceed

Which issue(s) this PR fixes:
Fixes #926

Special notes for your reviewer:

Release note:

Fixed a bug where the `Unitialised` error code was blocking machine deletion

@rishabh-11 rishabh-11 requested a review from a team as a code owner July 10, 2024 11:19
@gardener-robot gardener-robot added needs/review Needs review size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Jul 10, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 10, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 10, 2024
@kon-angelo
Copy link
Contributor

@rishabh-11 what changes are required on the provider e.g. mcm-provider-aws to return the appropriate error ?

@rishabh-11
Copy link
Contributor Author

No change required on the provider side. The problem was with the triggerDeletionFlow in MCM. It is not handling the Unitialised error code correctly.

@rishabh-11
Copy link
Contributor Author

Once this PR merges, I'll make a patch release of MCM. The providers will need to be updated with this patch version of MCM

@kon-angelo
Copy link
Contributor

kon-angelo commented Jul 10, 2024

What about this scenario: User is doing some experiment and changes the something like the srcAndDstCheck (sorry don't remember the exact flag name), after a successful initialization. You still end up in a scenario where the VM can't be deleted by the MCM.

I am positing that checking the VM status during deletion and blocking the deletion only on that may not be the best thing to do. Especially when these post-init checks are involved in healthchecks of the VM. Or maybe we can have 2 "versions" of the check depending on if we are deleting the VM.

@rishabh-11
Copy link
Contributor Author

What about this scenario: User is doing some experiment and changes the something like the srcAndDstCheck (sorry don't remember the exact flag name), after a successful initialization. You still end up in a scenario where the VM can't be deleted by the MCM.

This case will be handled in this PR. Let's consider that the user is experimenting and changes something on the VM, causing the GetMachineStatus to return a Unitialized error. In this case the deletion will not be blocked and still proceed.

@rishabh-11
Copy link
Contributor Author

Consider the case for errors apart from Unitialized for GetMachineStatus. This can happen if the validation of providerSpec fails or if there are errors in fetching the VM itself. I agree that the providerSpec validation done in GetMachineStatus is incorrect as it is a full-fledged check which is not required and can cause problems like the one we saw with GCP. This can be corrected in the provider. But if I cannot get the VM for reasons apart from NotFound errors, shouldn't the deletion be blocked?

@elankath
Copy link
Contributor

elankath commented Jul 11, 2024

But if I cannot get the VM for reasons apart from NotFound errors, shouldn't the deletion be blocked?

Point raised by @kon-angelo feels right. Perhaps it shouldn't be blocked ? (who cares about health at this point?). Ideally, we should just go ahead with driain->deletion in all circumstances except for NotFound. But for now, this fix addresses the current problem.

@kon-angelo
Copy link
Contributor

kon-angelo commented Jul 11, 2024

@rishabh-11 Maybe first, I know that the PR solves the issue for provider-aws - so in that regard it is a /lgtm.

I still do not think that it is a nice experience to not be able to delete a VM for something silly like flipping a boolean flag on the VM. @elankath summarised perfectly: why do the full blown healthcheck in this case. You particularly only care if the machine exists or not.

his can happen if the validation of providerSpec fails or if there are errors in fetching the VM itself.

Our implementations already have their own validations before doing a delete call.

This can be corrected in the provider

How exactly ? In this case the provider does not know if it should do the "full-check". You could change the interface to note if the machine is in deletion if you want to go that route.

But if I cannot get the VM for reasons apart from NotFound errors, shouldn't the deletion be blocked?

Take this as an anecdote, but for provider-openstack I do not implement GetMachineStatus (ref]) because there are also post-creation operations to be done (and there was not InitializeMachine until recently). With that the MCM flow just goes ahead and calls the deleteMachine, which generally works and I haven't seen a provider where this wouldn't work.

I don't really see a case where the delete call itself cannot and should not handle this case. Either the delete would fail, or maybe the delete call must do a get check beforehand - but in either case everything necessary would be handled by the delete call.

@rishabh-11
Copy link
Contributor Author

@kon-angelo I agree that getVMStatus might not be required in the deletion flow, and we can discuss removing it and raise a separate PR for that.

But, the GetMachineStatus method is still required in the creation flow for VM. The creation flow uses this method to check if the VM is already present in the cloud provider and create one only if it is not present. It works in OpenStack because it relies on the nodeLabel that we put on the machine object once the VM is created. It can work like this for other providers as well, but i don't think relying on a label instead of the cloud provider is the right way to go. wdyt?

How exactly ? In this case the provider does not know if it should do the "full-check".

The "check" here is a check of the providerSpec in the machine class. I meant that we should only check those fields in the providerSpec that are needed for the particular driver method to work and not the entire providerSpec for every call.

@kon-angelo
Copy link
Contributor

But, the GetMachineStatus method is still required in the creation flow for VM. The creation flow uses this method to check if the VM is already present in the cloud provider and create one only if it is not present

Sure. Technically without the GetMachineStatus MCM will just call CreateMachine and the provider has to implement more of a "reconcile" function and for openstack the equivalent of GetMachineStatus is just handled internally in the create call to make it idempotent. But i am not arguing on the functionality of GetMachineStatus on reconcile/create - it does make things easier. My argument was more just in the delete case.

The "check" here is a check of the providerSpec in the machine class.

Maybe I misunderstood the point above. The issue with provider-aws currently is not that the spec validation does not match - what happened with gcp and the "static" spec checking. The machine is indeed "unhealthy" because the VM in the hyperscaler does not match its expected spec. The healthcheck is correct - but you don't need a "healthy" machine to move over to delete. There is no way from the GetMachineStatus to know if you need to perform a healthcheck or just say that the machine exists.

Anyway, we can have this discussion offline. I think we all agree that we should merge the PR and go ahead with the fix and optimisations can come later.

Copy link
Contributor

@kon-angelo kon-angelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/review Needs review labels Jul 11, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 11, 2024
@rishabh-11
Copy link
Contributor Author

Maybe I misunderstood the point above. The issue with provider-aws currently is not that the spec validation does not match - what happened with gcp and the "static" spec checking. The machine is indeed "unhealthy" because the VM in the hyperscaler does not match its expected spec. The healthcheck is correct - but you don't need a "healthy" machine to move over to delete. There is no way from the GetMachineStatus to know if you need to perform a healthcheck or just say that the machine exists.

The idea behind GetMachineStatus should not be to perform a health check, but I think it has evolved that way. We can relook at that offline.

@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 12, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 12, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 12, 2024
@elankath
Copy link
Contributor

elankath commented Jul 12, 2024

Test Log Before Fix.

  1. Get Instance ID of running machine
k get mc                                                                               
NAME                                    STATUS    AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-n2mjz   Running   22m   ip-10-180-20-144.eu-west-1.compute.internal

aws ec2 describe-instances --query "Reservations[*].Instances[*].{Name:Tags[?Key=='Name']|[0].Value,InstanceId:InstanceId}" --output table | grep aw1-w1-z1-54c64-n2mjz
|  i-020497d720e898696 |  shoot--i034796--aw1-w1-z1-54c64-n2mjz                      |
  1. Enable Source/Dest check.
aws ec2 describe-instances --instance-ids i-020497d720e898696 >| /tmp/i1.json
aws ec2 modify-instance-attribute --source-dest-check --instance-id i-020497d720e898696
  1. Check describe-instances differences
aws ec2 describe-instances --instance-ids i-020497d720e898696 >| /tmp/i2.json
diff -U0 -u /tmp/i1.json /tmp/i2.json

--- /tmp/i1.json	2024-07-12 15:15:45
+++ /tmp/i2.json	2024-07-12 15:18:30
@@ -82 +82 @@
-                            "SourceDestCheck": false,
+                            "SourceDestCheck": true,
@@ -97 +97 @@
-                    "SourceDestCheck": false,
+                    "SourceDestCheck": true,
  1. Force Delete the Machine
k label mc shoot--i034796--aw1-w1-z1-54c64-n2mjz  force-deletion=true
k delete mc shoot--i034796--aw1-w1-z1-54c64-n2mjz  force-deletion=true
  1. Machine is stuck in terminating phase.

k get mc shoot--i034796--aw1-w1-z1-54c64-n2mjz                                         
NAME                                    STATUS        AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-n2mjz   Terminating   29m   ip-10-180-20-144.eu-west-1.compute.internal


k get mc shoot--i034796--aw1-w1-z1-54c64-n2mjz -oyaml | grep -C2 -i error             
    phase: Terminating
  lastOperation:
    description: 'Error occurred with decoding machine error status while getting
      VM status, aborting without retry. machine code: machine codes error: code =
      [Uninitialized] message = [VM "i-020497d720e898696" associated with machine
      "shoot--i034796--aw1-w1-z1-54c64-n2mjz" has SourceDestCheck=true despite providerSpec.SrcAndDstChecksEnabled=false]

k get mc shoot--i034796--aw1-w1-z1-54c64-n2mjz                                         
NAME                                    STATUS        AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-n2mjz   Terminating   29m   ip-10-180-20-144.eu-west-1.compute.internal


k get mc shoot--i034796--aw1-w1-z1-54c64-n2mjz -oyaml | grep -C2 -i error             
    phase: Terminating
  lastOperation:
    description: 'Error occurred with decoding machine error status while getting
      VM status, aborting without retry. machine code: machine codes error: code =
      [Uninitialized] message = [VM "i-020497d720e898696" associated with machine
      "shoot--i034796--aw1-w1-z1-54c64-n2mjz" has SourceDestCheck=true despite providerSpec.SrcAndDstChecksEnabled=false]

@elankath
Copy link
Contributor

Test Log Post Fix

  1. Get Instance ID of running machine
k get mc                                                                               
NAME                                    STATUS    AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-84gm8   Running   32m   ip-10-180-1-180.eu-west-1.compute.internal

 aws ec2 describe-instances --query "Reservations[*].Instances[*].{Name:Tags[?Key=='Name']|[0].Value,InstanceId:InstanceId}" --output text | grep aw1-w1-z1-54c64-84gm8
i-01dfe71e6bea2d80a	shoot--i034796--aw1-w1-z1-54c64-84gm8
  1. Enable Source/Dest check.
aws ec2 describe-instances --instance-ids i-01dfe71e6bea2d80a >| /tmp/i1.json
aws ec2 modify-instance-attribute --source-dest-check --instance-id i-01dfe71e6bea2d80a
  1. Check describe-instances differences
aws ec2 describe-instances --instance-ids i-01dfe71e6bea2d80a >| /tmp/i2.json

diff -U0 -u /tmp/i1.json /tmp/i2.json                                         
--- /tmp/i1.json	2024-07-12 16:00:22
+++ /tmp/i2.json	2024-07-12 16:01:02
@@ -82 +82 @@
-                            "SourceDestCheck": false,
+                            "SourceDestCheck": true,
@@ -97 +97 @@
-                    "SourceDestCheck": false,
+                    "SourceDestCheck": true,
  1. Force Delete the Machine
k label mc shoot--i034796--aw1-w1-z1-54c64-84gm8 force-deletion=true                   
machine.machine.sapcloud.io/shoot--i034796--aw1-w1-z1-54c64-84gm8 labeled

k delete mc shoot--i034796--aw1-w1-z1-54c64-84gm8                                     
machine.machine.sapcloud.io "shoot--i034796--aw1-w1-z1-54c64-84gm8" deleted

  1. Machine goes to Terminating phase, Uninitialized is ignored, goes to node drain.
k get mc shoot--i034796--aw1-w1-z1-54c64-84gm8                                        
NAME                                    STATUS        AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-84gm8   Terminating   38m   ip-10-180-1-180.eu-west-1.compute.internal

k get mc shoot--i034796--aw1-w1-z1-54c64-84gm8 -oyaml | grep -C1 description           git:fix-del-flow
  lastOperation:
    description: VM instance was not initalized. Moving forward to node drain. Initiate node drain

  1. SUCCESS: VM is deleted and Machine obj disappears
k get mc shoot--i034796--aw1-w1-z1-54c64-84gm8                                         git:fix-del-flow
Error from server (NotFound): machines.machine.sapcloud.io "shoot--i034796--aw1-w1-z1-54c64-84gm8" not found

@elankath
Copy link
Contributor

elankath commented Jul 12, 2024

Test Post-Fix with Kubelet Crash Simulation (checking that Failed flow is unaffected by fix)

  1. See Node Status
 k get no                                                                      
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-180-14-52.eu-west-1.compute.internal   Ready    <none>   29m   v1.29.4
  1. Disable and stop the gardener-node-agent and kubelet services after SSH'ing into node. (use a privileged pod spec for this)
kubectl exec -it priv-pod -- chroot /host /bin/bash
root@ip-10-180-14-52:/# systemctl disable gardener-node-agent.service
Removed "/etc/systemd/system/multi-user.target.wants/gardener-node-agent.service".
root@ip-10-180-14-52:/# systemctl disable kubelet
systemctl stop gardener-node-agent
systemctl stop kubelet
  1. Node goes into NotReady status, Machine goes into Unknown Phase
k get no                                                                      
NAME                                         STATUS     ROLES    AGE   VERSION
ip-10-180-14-52.eu-west-1.compute.internal   NotReady   <none>   33m   v1.29.4
   kubectl exec -it priv-pod -- chroot /host /bin/bash

k get mc                                                                               
NAME                                    STATUS   AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-dp49r   Failed   45m   ip-10-180-14-52.eu-west-1.compute.internal
  1. Machine goes to Failed state after machine-health-timeout and then goes into Terminating and then disappears
k get mc                                                                               
NAME                                    STATUS   AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-dp49r   Failed   47m   ip-10-180-14-52.eu-west-1.compute.internal

k get mc shoot--i034796--aw1-w1-z1-54c64-dp49r                                        
Error from server (NotFound): machines.machine.sapcloud.io "shoot--i034796--aw1-w1-z1-54c64-dp49r" not found
   

@elankath
Copy link
Contributor

@rishabh-11 Tests complete. Please merge and release whenever ready.

@rishabh-11 rishabh-11 merged commit 58eddb2 into gardener:master Jul 15, 2024
8 checks passed
@rishabh-11 rishabh-11 deleted the fix-del-flow branch July 15, 2024 04:49
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jul 15, 2024
rishabh-11 added a commit to rishabh-11/machine-controller-manager that referenced this pull request Jul 15, 2024
…r#928)

* allow deletion to proceed in case of VM initialization error

* omit tool binaries

* set_makefile_env: addec CONTROL_NAMESPACE, LEADER_ELECT

---------

Co-authored-by: elankath <tarun.ramakrishna.elankath@sap.com>
rishabh-11 added a commit that referenced this pull request Jul 15, 2024
…931)

* allow deletion to proceed in case of VM initialization error

* omit tool binaries

* set_makefile_env: addec CONTROL_NAMESPACE, LEADER_ELECT

---------

Co-authored-by: elankath <tarun.ramakrishna.elankath@sap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) reviewed/lgtm Has approval for merging size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GetMachineStatus prevents machine deletion in case of Unitialised error code
6 participants