Allow deletion to proceed in case of VM initialization error #928

rishabh-11 · 2024-07-10T11:19:30Z

What this PR does / why we need it:
This PR fixes the triggerDeletionFlow, specifically the getVMStatus function, to allow deletion of Uninitialized VMs to proceed.

Which issue(s) this PR fixes:
Fixes #926

Special notes for your reviewer:

Release note:

Fixed a bug where the `Uninitialized` error code was blocking machine deletion.

kon-angelo · 2024-07-10T12:08:18Z

@rishabh-11 what changes are required on the provider, e.g. mcm-provider-aws, to return the appropriate error?

rishabh-11 · 2024-07-10T12:35:09Z

No change is required on the provider side. The problem was with the triggerDeletionFlow in MCM: it is not handling the Uninitialized error code correctly.

rishabh-11 · 2024-07-10T12:36:04Z

Once this PR merges, I'll make a patch release of MCM. The providers will need to be updated with this patch version of MCM

kon-angelo · 2024-07-10T13:14:32Z

What about this scenario: a user is experimenting and changes something like the srcAndDstCheck (sorry, I don't remember the exact flag name) after a successful initialization. You still end up in a scenario where the VM can't be deleted by the MCM.

I am positing that checking the VM status during deletion, and blocking the deletion on that alone, may not be the best thing to do, especially when these post-init checks are involved in health checks of the VM. Or maybe we can have two "versions" of the check, depending on whether we are deleting the VM.

rishabh-11 · 2024-07-10T13:43:05Z

> What about this scenario: a user is experimenting and changes something like the srcAndDstCheck (sorry, I don't remember the exact flag name) after a successful initialization. You still end up in a scenario where the VM can't be deleted by the MCM.

This case will be handled in this PR. Let's say the user is experimenting and changes something on the VM, causing GetMachineStatus to return an Uninitialized error. In this case the deletion will not be blocked and will still proceed.

rishabh-11 · 2024-07-10T13:47:32Z

Consider the case of errors other than Uninitialized from GetMachineStatus. This can happen if the validation of providerSpec fails or if there are errors in fetching the VM itself. I agree that the providerSpec validation done in GetMachineStatus is incorrect, as it is a full-fledged check which is not required and can cause problems like the one we saw with GCP. This can be corrected in the provider. But if I cannot get the VM for reasons apart from NotFound errors, shouldn't the deletion be blocked?

elankath · 2024-07-11T03:47:10Z

> But if I cannot get the VM for reasons apart from NotFound errors, shouldn't the deletion be blocked?

The point raised by @kon-angelo feels right. Perhaps it shouldn't be blocked? (Who cares about health at this point?) Ideally, we should just go ahead with drain->deletion in all circumstances except for NotFound. But for now, this fix addresses the current problem.

kon-angelo · 2024-07-11T07:29:07Z

@rishabh-11 First of all, I know that the PR solves the issue for provider-aws, so in that regard it is a /lgtm.

I still do not think that it is a nice experience to be unable to delete a VM because of something silly like flipping a boolean flag on the VM. @elankath summarised it perfectly: why do the full-blown health check in this case? You really only care whether the machine exists or not.

> This can happen if the validation of providerSpec fails or if there are errors in fetching the VM itself.

Our implementations already have their own validations before doing a delete call.

> This can be corrected in the provider

How exactly? In this case the provider does not know whether it should do the "full check". You could change the interface to indicate whether the machine is in deletion, if you want to go that route.

> But if I cannot get the VM for reasons apart from NotFound errors, shouldn't the deletion be blocked?

Take this as an anecdote, but for provider-openstack I do not implement GetMachineStatus (ref) because there are also post-creation operations to be done (and there was no InitializeMachine until recently). With that, the MCM flow just goes ahead and calls deleteMachine, which generally works, and I haven't seen a provider where this wouldn't work.

I don't really see a case where the delete call itself cannot and should not handle this case. Either the delete would fail, or maybe the delete call must do a get check beforehand - but in either case everything necessary would be handled by the delete call.

rishabh-11 · 2024-07-11T08:36:09Z

@kon-angelo I agree that getVMStatus might not be required in the deletion flow, and we can discuss removing it and raise a separate PR for that.

But the GetMachineStatus method is still required in the creation flow for the VM. The creation flow uses this method to check whether the VM is already present in the cloud provider and creates one only if it is not present. It works in OpenStack because it relies on the nodeLabel that we put on the machine object once the VM is created. It can work like this for other providers as well, but I don't think relying on a label instead of the cloud provider is the right way to go. wdyt?

> How exactly ? In this case the provider does not know if it should do the "full-check".

The "check" here is a check of the providerSpec in the machine class. I meant that we should only check those fields in the providerSpec that are needed for the particular driver method to work and not the entire providerSpec for every call.
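A minimal sketch of per-method validation as described above, assuming a made-up `ProviderSpec` with three fields. The field names and the split into `validateForDelete` and `validateForCreate` are hypothetical and are not taken from any provider.

```go
package main

import (
	"errors"
	"fmt"
)

// ProviderSpec is a hypothetical, trimmed-down provider spec.
type ProviderSpec struct {
	Region      string
	ImageID     string
	MachineType string
}

// validateForDelete checks only what a delete call needs in this sketch:
// enough information to locate the VM.
func validateForDelete(s ProviderSpec) error {
	if s.Region == "" {
		return errors.New("region is required")
	}
	return nil
}

// validateForCreate checks everything needed to launch a new VM.
func validateForCreate(s ProviderSpec) error {
	if err := validateForDelete(s); err != nil {
		return err
	}
	if s.ImageID == "" {
		return errors.New("imageID is required")
	}
	if s.MachineType == "" {
		return errors.New("machineType is required")
	}
	return nil
}

func main() {
	// A spec that is incomplete for creation but sufficient for deletion.
	spec := ProviderSpec{Region: "eu-west-1"}
	fmt.Println("create:", validateForCreate(spec))
	fmt.Println("delete:", validateForDelete(spec))
}
```

A spec that has drifted or is partially invalid then fails creation but no longer stands in the way of deleting an existing VM.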

kon-angelo · 2024-07-11T08:56:39Z

> But, the GetMachineStatus method is still required in the creation flow for VM. The creation flow uses this method to check if the VM is already present in the cloud provider and create one only if it is not present

Sure. Technically, without GetMachineStatus, MCM will just call CreateMachine, and the provider has to implement more of a "reconcile" function. For openstack, the equivalent of GetMachineStatus is just handled internally in the create call to make it idempotent. But I am not arguing about the functionality of GetMachineStatus on reconcile/create; it does make things easier. My argument was only about the delete case.
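An idempotent create of the kind described above could look roughly like this. It is a sketch under assumptions: `fakeCloud` and `createMachine` are hypothetical, and looking the VM up by machine name is an illustrative choice, not a statement about how provider-openstack does it.

```go
package main

import "fmt"

// fakeCloud is a hypothetical in-memory stand-in for a cloud provider API.
type fakeCloud struct {
	byName map[string]string // machine name -> VM ID
	next   int
}

// createMachine returns the existing VM for name if there is one and only
// launches a new VM otherwise. The bool reports whether a VM was created.
func (c *fakeCloud) createMachine(name string) (string, bool) {
	if id, ok := c.byName[name]; ok {
		return id, false // the internal "get" found it: nothing to do
	}
	c.next++
	id := fmt.Sprintf("vm-%d", c.next)
	c.byName[name] = id
	return id, true
}

func main() {
	cloud := &fakeCloud{byName: map[string]string{}}
	id1, created1 := cloud.createMachine("machine-a")
	id2, created2 := cloud.createMachine("machine-a")
	fmt.Println(id1, created1)
	fmt.Println(id2, created2)
}
```

The second call returns the same ID without creating anything, so a caller can retry the create without a separate status lookup.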

> The "check" here is a check of the providerSpec in the machine class.

Maybe I misunderstood the point above. The issue with provider-aws currently is not that the spec validation does not match - what happened with gcp and the "static" spec checking. The machine is indeed "unhealthy" because the VM in the hyperscaler does not match its expected spec. The healthcheck is correct - but you don't need a "healthy" machine to move over to delete. There is no way from the GetMachineStatus to know if you need to perform a healthcheck or just say that the machine exists.

Anyway, we can have this discussion offline. I think we all agree that we should merge the PR and go ahead with the fix and optimisations can come later.

kon-angelo

/lgtm

rishabh-11 · 2024-07-11T09:37:31Z

> Maybe I misunderstood the point above. The issue with provider-aws currently is not that the spec validation does not match - what happened with gcp and the "static" spec checking. The machine is indeed "unhealthy" because the VM in the hyperscaler does not match its expected spec. The healthcheck is correct - but you don't need a "healthy" machine to move over to delete. There is no way from the GetMachineStatus to know if you need to perform a healthcheck or just say that the machine exists.

The idea behind GetMachineStatus should not be to perform a health check, but I think it has evolved that way. We can relook at that offline.

elankath · 2024-07-12T10:00:01Z

Test Log Before Fix.

Get Instance ID of running machine

k get mc                                                                               
NAME                                    STATUS    AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-n2mjz   Running   22m   ip-10-180-20-144.eu-west-1.compute.internal

aws ec2 describe-instances --query "Reservations[*].Instances[*].{Name:Tags[?Key=='Name']|[0].Value,InstanceId:InstanceId}" --output table | grep aw1-w1-z1-54c64-n2mjz
|  i-020497d720e898696 |  shoot--i034796--aw1-w1-z1-54c64-n2mjz                      |

Enable Source/Dest check.

aws ec2 describe-instances --instance-ids i-020497d720e898696 >| /tmp/i1.json
aws ec2 modify-instance-attribute --source-dest-check --instance-id i-020497d720e898696

Check describe-instances differences

aws ec2 describe-instances --instance-ids i-020497d720e898696 >| /tmp/i2.json
diff -U0 -u /tmp/i1.json /tmp/i2.json

--- /tmp/i1.json	2024-07-12 15:15:45
+++ /tmp/i2.json	2024-07-12 15:18:30
@@ -82 +82 @@
-                            "SourceDestCheck": false,
+                            "SourceDestCheck": true,
@@ -97 +97 @@
-                    "SourceDestCheck": false,
+                    "SourceDestCheck": true,

Force Delete the Machine

k label mc shoot--i034796--aw1-w1-z1-54c64-n2mjz  force-deletion=true
k delete mc shoot--i034796--aw1-w1-z1-54c64-n2mjz

Machine is stuck in terminating phase.


k get mc shoot--i034796--aw1-w1-z1-54c64-n2mjz                                         
NAME                                    STATUS        AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-n2mjz   Terminating   29m   ip-10-180-20-144.eu-west-1.compute.internal


k get mc shoot--i034796--aw1-w1-z1-54c64-n2mjz -oyaml | grep -C2 -i error             
    phase: Terminating
  lastOperation:
    description: 'Error occurred with decoding machine error status while getting
      VM status, aborting without retry. machine code: machine codes error: code =
      [Uninitialized] message = [VM "i-020497d720e898696" associated with machine
      "shoot--i034796--aw1-w1-z1-54c64-n2mjz" has SourceDestCheck=true despite providerSpec.SrcAndDstChecksEnabled=false]


elankath · 2024-07-12T10:44:25Z

Test Log Post Fix

Get Instance ID of running machine

k get mc                                                                               
NAME                                    STATUS    AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-84gm8   Running   32m   ip-10-180-1-180.eu-west-1.compute.internal

 aws ec2 describe-instances --query "Reservations[*].Instances[*].{Name:Tags[?Key=='Name']|[0].Value,InstanceId:InstanceId}" --output text | grep aw1-w1-z1-54c64-84gm8
i-01dfe71e6bea2d80a	shoot--i034796--aw1-w1-z1-54c64-84gm8

Enable Source/Dest check.

aws ec2 describe-instances --instance-ids i-01dfe71e6bea2d80a >| /tmp/i1.json
aws ec2 modify-instance-attribute --source-dest-check --instance-id i-01dfe71e6bea2d80a

Check describe-instances differences

aws ec2 describe-instances --instance-ids i-01dfe71e6bea2d80a >| /tmp/i2.json

diff -U0 -u /tmp/i1.json /tmp/i2.json                                         
--- /tmp/i1.json	2024-07-12 16:00:22
+++ /tmp/i2.json	2024-07-12 16:01:02
@@ -82 +82 @@
-                            "SourceDestCheck": false,
+                            "SourceDestCheck": true,
@@ -97 +97 @@
-                    "SourceDestCheck": false,
+                    "SourceDestCheck": true,

Force Delete the Machine

k label mc shoot--i034796--aw1-w1-z1-54c64-84gm8 force-deletion=true                   
machine.machine.sapcloud.io/shoot--i034796--aw1-w1-z1-54c64-84gm8 labeled

k delete mc shoot--i034796--aw1-w1-z1-54c64-84gm8                                     
machine.machine.sapcloud.io "shoot--i034796--aw1-w1-z1-54c64-84gm8" deleted

Machine goes to Terminating phase, Uninitialized is ignored, goes to node drain.

k get mc shoot--i034796--aw1-w1-z1-54c64-84gm8                                        
NAME                                    STATUS        AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-84gm8   Terminating   38m   ip-10-180-1-180.eu-west-1.compute.internal

k get mc shoot--i034796--aw1-w1-z1-54c64-84gm8 -oyaml | grep -C1 description
  lastOperation:
    description: VM instance was not initalized. Moving forward to node drain. Initiate node drain

SUCCESS: VM is deleted and Machine obj disappears

k get mc shoot--i034796--aw1-w1-z1-54c64-84gm8
Error from server (NotFound): machines.machine.sapcloud.io "shoot--i034796--aw1-w1-z1-54c64-84gm8" not found

elankath · 2024-07-12T11:43:32Z

Test Post-Fix with Kubelet Crash Simulation (checking that `Failed` flow is unaffected by fix)

See Node Status

 k get no                                                                      
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-180-14-52.eu-west-1.compute.internal   Ready    <none>   29m   v1.29.4

Disable and stop the gardener-node-agent and kubelet services after SSH'ing into node. (use a privileged pod spec for this)

kubectl exec -it priv-pod -- chroot /host /bin/bash
root@ip-10-180-14-52:/# systemctl disable gardener-node-agent.service
Removed "/etc/systemd/system/multi-user.target.wants/gardener-node-agent.service".
root@ip-10-180-14-52:/# systemctl disable kubelet
systemctl stop gardener-node-agent
systemctl stop kubelet

Node goes into NotReady status, Machine goes into Unknown Phase

k get no                                                                      
NAME                                         STATUS     ROLES    AGE   VERSION
ip-10-180-14-52.eu-west-1.compute.internal   NotReady   <none>   33m   v1.29.4

k get mc                                                                               
NAME                                    STATUS   AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-dp49r   Failed   45m   ip-10-180-14-52.eu-west-1.compute.internal

Machine goes to Failed state after machine-health-timeout and then goes into Terminating and then disappears

k get mc                                                                               
NAME                                    STATUS   AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-dp49r   Failed   47m   ip-10-180-14-52.eu-west-1.compute.internal

k get mc shoot--i034796--aw1-w1-z1-54c64-dp49r                                        
Error from server (NotFound): machines.machine.sapcloud.io "shoot--i034796--aw1-w1-z1-54c64-dp49r" not found

elankath · 2024-07-12T11:47:11Z

@rishabh-11 Tests complete. Please merge and release whenever ready.

…r#928) * allow deletion to proceed in case of VM initialization error * omit tool binaries * set_makefile_env: addec CONTROL_NAMESPACE, LEADER_ELECT --------- Co-authored-by: elankath <tarun.ramakrishna.elankath@sap.com>

…931) * allow deletion to proceed in case of VM initialization error * omit tool binaries * set_makefile_env: addec CONTROL_NAMESPACE, LEADER_ELECT --------- Co-authored-by: elankath <tarun.ramakrishna.elankath@sap.com>

allow deletion to proceed in case of VM initialization error

eef33c9

rishabh-11 requested a review from a team as a code owner July 10, 2024 11:19

rishabh-11 assigned elankath Jul 10, 2024

gardener-robot added needs/review Needs review size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Jul 10, 2024

gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 10, 2024

gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 10, 2024

kon-angelo approved these changes Jul 11, 2024


gardener-robot added reviewed/lgtm Has approval for merging and removed needs/review Needs review labels Jul 11, 2024

gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 11, 2024

omit tool binaries

b3c4e73

gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 12, 2024

set_makefile_env: addec CONTROL_NAMESPACE, LEADER_ELECT

1467f1a

gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 12, 2024

gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 12, 2024

rishabh-11 merged commit 58eddb2 into gardener:master Jul 15, 2024
8 checks passed

rishabh-11 deleted the fix-del-flow branch July 15, 2024 04:49

gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jul 15, 2024

thiyyakat mentioned this pull request Aug 14, 2024

Fix gardener_local_setup script to use correct control cluster namespace #935

Merged

rishabh-11 mentioned this pull request Oct 7, 2024

GetMachineStatus prevents machine deletion, if SrcAndDstChecks is disabled gardener/machine-controller-manager-provider-aws#165

Closed
