fix: Report error if machine's associated node is not found during termination #24

Merged
merged 3 commits into from
Oct 4, 2023

@Fei-Guo Fei-Guo commented Oct 4, 2023

In gpu-provisioner, it takes quite a while to populate the provider ID back into the machine CR. If we delete the machine CR before the node shows up, the current code removes the finalizer even while the node is still being created.

In this change, we report an error if the machine's associated node is not found during termination, so that termination keeps being reconciled until the node shows up and we do not leave any dangling node.
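For illustration, the termination check can be sketched as below. This is a minimal sketch, not the code in this PR: the `machine` type, the `finalize` function, the `nodesByProviderID` lookup, and the finalizer string are all hypothetical stand-ins. Only the error message is taken from the manual-test log.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotFound reuses the message seen in the manual-test log.
// The variable name itself is hypothetical.
var errNodeNotFound = errors.New("The machine has not been associated with any node yet")

// machine is a hypothetical, stripped-down stand-in for the Machine CR.
type machine struct {
	Name       string
	ProviderID string
	Finalizers []string
}

// finalize decides whether termination may proceed.
// nodesByProviderID is a hypothetical lookup from provider ID to node name.
func finalize(m *machine, nodesByProviderID map[string]string) error {
	_, ok := nodesByProviderID[m.ProviderID]
	if m.ProviderID == "" || !ok {
		// Returning an error keeps the finalizer in place and lets the
		// controller retry the termination later.
		return errNodeNotFound
	}
	// The node is known, so cleanup can run and the finalizer can be dropped.
	m.Finalizers = nil
	return nil
}

func main() {
	m := &machine{Name: "regular-vms", Finalizers: []string{"example.com/termination"}}
	nodes := map[string]string{}

	// No provider ID yet: an error is returned and the finalizer stays.
	fmt.Println(finalize(m, nodes), len(m.Finalizers))

	// Once the provider ID is populated and the node exists, termination proceeds.
	m.ProviderID = "azure:///example/virtualMachines/0"
	nodes[m.ProviderID] = "node-0"
	fmt.Println(finalize(m, nodes), len(m.Finalizers))
}
```

Returning an error rather than silently succeeding is what keeps the machine CR around until the node can actually be cleaned up.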

One caveat is that if the createAgentPool call fails, the machine also has no provider ID. We leverage the GC controller to remove the finalizer of the to-be-deleted machine if it has not been launched for more than 10 minutes (at that point it will most likely never succeed, even with retries).
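The GC-side decision can be sketched roughly as follows. This is an assumption-laden illustration, not the GC controller's actual code: the `machine` fields, the `launchTimeout` name, and `shouldForceRemoveFinalizer` are hypothetical, and the sketch assumes the 10 minutes are measured from the machine's creation.

```go
package main

import (
	"fmt"
	"time"
)

// launchTimeout reflects the 10-minute threshold from the description.
// The constant name is hypothetical.
const launchTimeout = 10 * time.Minute

// machine is a hypothetical, stripped-down stand-in for the Machine CR.
type machine struct {
	Name              string
	ProviderID        string
	DeletionRequested bool
}

// shouldForceRemoveFinalizer reports whether the GC controller may drop the
// finalizer: the machine is being deleted, it never got a provider ID, and it
// has been around for longer than launchTimeout (age is assumed to be the
// time since the machine was created).
func shouldForceRemoveFinalizer(m machine, age time.Duration) bool {
	return m.DeletionRequested && m.ProviderID == "" && age > launchTimeout
}

func main() {
	m := machine{Name: "regular-vms", DeletionRequested: true}

	// Still within the launch window: keep waiting for the node.
	fmt.Println(shouldForceRemoveFinalizer(m, 3*time.Minute))

	// Past the window with no provider ID: safe to give up on the launch.
	fmt.Println(shouldForceRemoveFinalizer(m, 11*time.Minute))
}
```

Keeping this check in the GC controller means the termination controller can keep returning an error unconditionally, while a failed launch still gets cleaned up eventually.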

Manual test

2023-10-04T19:54:45.581Z        DEBUG   controller.machine.lifecycle    creating Agent pool regularvms (Standard_D4ls_v5)       {"machine": "regular-vms", "provisioner": "default"}
2023-10-04T19:54:45.581Z        INFO    controller      createAgentPool {"agentpool": "regularvms"}
**Delete the machine at this point**
2023-10-04T19:55:42.908Z        ERROR   controller      Reconciler error        {"controller": "machine.termination", "controllerGroup": "karpenter.sh", "controllerKind": "Machine", "Machine": {"name":"regular-vms"}, "namespace": "", "name": "regular-vms", "reconcileID": "76435098-f492-4cf0-8fb6-7d41f66ca658", "error": "The machine has not been associated with any node yet"}
2023-10-04T19:55:43.909Z        ERROR   controller      Reconciler error        {"controller": "machine.termination", "controllerGroup": "karpenter.sh", "controllerKind": "Machine", "Machine": {"name":"regular-vms"}, "namespace": "", "name": "regular-vms", "reconcileID": "c5c0396f-dc28-450b-b8cd-0bf65fac3d05", "error": "The machine has not been associated with any node yet"}
2023-10-04T19:55:45.910Z        ERROR   controller      Reconciler error        {"controller": "machine.termination", "controllerGroup": "karpenter.sh", "controllerKind": "Machine", "Machine": {"name":"regular-vms"}, "namespace": "", "name": "regular-vms", "reconcileID": "936457f0-f8c4-4403-80a6-3932edde2600", "error": "The machine has not been associated with any node yet"}
2023-10-04T19:55:46.032Z        DEBUG   controller.machine.garbagecollection    Update heartbeat for 0 out of 0 machines
2023-10-04T19:55:49.911Z        ERROR   controller      Reconciler error        {"controller": "machine.termination", "controllerGroup": "karpenter.sh", "controllerKind": "Machine", "Machine": {"name":"regular-vms"}, "namespace": "", "name": "regular-vms", "reconcileID": "c8139e95-0f87-4e1f-8efb-0a28ec7fe70d", "error": "The machine has not been associated with any node yet"}
2023-10-04T19:55:57.912Z        ERROR   controller      Reconciler error        {"controller": "machine.termination", "controllerGroup": "karpenter.sh", "controllerKind": "Machine", "Machine": {"name":"regular-vms"}, "namespace": "", "name": "regular-vms", "reconcileID": "96b0c8c5-0101-45c7-9fd8-4d5cf26f881b", "error": "The machine has not been associated with any node yet"}
2023-10-04T19:56:13.912Z        ERROR   controller      Reconciler error        {"controller": "machine.termination", "controllerGroup": "karpenter.sh", "controllerKind": "Machine", "Machine": {"name":"regular-vms"}, "namespace": "", "name": "regular-vms", "reconcileID": "0a13f93f-892d-49f8-8226-9e32779faecf", "error": "The machine has not been associated with any node yet"}
2023-10-04T19:56:45.913Z        ERROR   controller      Reconciler error        {"controller": "machine.termination", "controllerGroup": "karpenter.sh", "controllerKind": "Machine", "Machine": {"name":"regular-vms"}, "namespace": "", "name": "regular-vms", "reconcileID": "4a4843c5-7b26-4602-9a88-c79c9c140ea6", "error": "The machine has not been associated with any node yet"}
2023-10-04T19:57:45.914Z        ERROR   controller      Reconciler error        {"controller": "machine.termination", "controllerGroup": "karpenter.sh", "controllerKind": "Machine", "Machine": {"name":"regular-vms"}, "namespace": "", "name": "regular-vms", "reconcileID": "fe733e02-a56d-4b2f-afc1-5860e6376a4a", "error": "The machine has not been associated with any node yet"}
2023-10-04T19:57:46.843Z        DEBUG   controller.machine.garbagecollection    Update heartbeat for 0 out of 0 machines
2023-10-04T19:57:52.530Z        DEBUG   controller.machine.lifecycle    created agent pool /subscriptions/ff05f55d-22b5-44a7-b704-f9a8efd493ed/resourcegroups/fei-test/providers/Microsoft.ContainerService/managedClusters/fei-test-vk-permission/agentPools/regularvms   {"machine": "regular-vms", "provisioner": "default"}
2023-10-04T19:57:52.530Z        INFO    controller.machine.lifecycle    launched machine        {"machine": "regular-vms", "provisioner": "default", "provider-id": "azure:///subscriptions/ff05f55d-22b5-44a7-b704-f9a8efd493ed/resourceGroups/mc_fei-test_fei-test-vk-permission_eastus/providers/Microsoft.Compute/virtualMachineScaleSets/aks-regularvms-16521606-vmss/virtualMachines/0", "instance-type": "Standard_D4ls_v5", "zone": "", "capacity-type": "Regular", "allocatable": {"cpu":"3820m","ephemeral-storage":"17852516352","memory":"6254120960","pods":"110"}}
**From this point on, the machine and the created agent pool are deleted**
2023-10-04T19:58:45.929Z        INFO    controller      Delete  {"machine": {"name":"regular-vms"}}
2023-10-04T19:58:45.929Z        INFO    controller      Instance.Delete {"id": "azure:///subscriptions/ff05f55d-22b5-44a7-b704-f9a8efd493ed/resourceGroups/mc_fei-test_fei-test-vk-permission_eastus/providers/Microsoft.Compute/virtualMachineScaleSets/aks-regularvms-16521606-vmss/virtualMachines/0"}
2023-10-04T19:58:45.929Z        INFO    controller      deleteAgentPool {"agentpool": "regularvms"}
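
The "Reconciler error" entries in the log above are spaced roughly 1s, 2s, 4s, 8s, 16s, 32s, and then 60s apart, which is consistent with a capped exponential backoff between retries. A minimal sketch of such a schedule, assuming a 1s base and a 60s cap inferred only from these timestamps (the names `baseDelay`, `maxDelay`, and `backoff` are hypothetical, and the controller's real rate limiter may be configured differently):

```go
package main

import (
	"fmt"
	"time"
)

// Both values are inferred from the log timestamps, not read from the code.
const (
	baseDelay = 1 * time.Second
	maxDelay  = 60 * time.Second
)

// backoff returns the delay before the next retry after the given number of
// earlier failures: the delay doubles each time and is capped at maxDelay.
func backoff(failures int) time.Duration {
	d := baseDelay
	for i := 0; i < failures; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	// Print the first few retry delays of the assumed schedule.
	for n := 0; n < 8; n++ {
		fmt.Println(n, backoff(n))
	}
}
```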

@Fei-Guo Fei-Guo enabled auto-merge (squash) October 4, 2023 21:42
@Fei-Guo Fei-Guo merged commit d6eb193 into main Oct 4, 2023
@Fei-Guo Fei-Guo deleted the fguo-dev1 branch October 4, 2023 22:52