
Machine is never initialised if Driver.InitializeMachine returns NotFound error code for VM #933

Closed
rishabh-11 opened this issue Aug 9, 2024 · 6 comments
Labels
area/robustness (Robustness, reliability, resilience related), kind/bug (Bug), needs/planning (Needs (more) planning with other MCM maintainers), priority/1 (Priority (lower number equals higher priority)), status/closed (Issue is closed (either delivered or triaged))

Comments


rishabh-11 commented Aug 9, 2024

How to categorize this issue?

/area robustness
/kind bug
/priority 1

What happened:
According to https://github.com/gardener/machine-controller-manager/blob/ff8261398277c3e5a481f06cfb57c417dfd07754/pkg/util/provider/machinecontroller/machine.go#L609-L611, if a NotFound error code is returned by Driver.InitializeMachine, the initialisation of the VM is skipped. This can lead to problems in the following case:

2024-08-08 07:00:21 | {"log":"Creating a VM for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\", please wait!","pid":"1","severity":"INFO","source":"machine.go:392"}
2024-08-08 07:00:21 | {"log":"Machine creation request has been recieved for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"core.go:82"}
2024-08-08 07:00:22 | {"log":"Waiting for VM with Provider-ID \"aws:///eu-west-1/i-0cf8cb1aa8a61dc8d\", for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" to be visible to all AWS endpoints","pid":"1","severity":"INFO","source":"core.go:238"}
2024-08-08 07:00:22 | {"log":"VM with Provider-ID \"aws:///eu-west-1/i-0cf8cb1aa8a61dc8d\", for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" should be visible to all AWS endpoints now","pid":"1","severity":"INFO","source":"core.go:249"}
2024-08-08 07:00:22 | {"log":"VM with Provider-ID: \"aws:///eu-west-1/i-0cf8cb1aa8a61dc8d\" created for Machine: \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"core.go:250"}
2024-08-08 07:00:22 | {"log":"Created new VM for machine: \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" with ProviderID: \"aws:///eu-west-1/i-0cf8cb1aa8a61dc8d\" and backing node: \"\"","pid":"1","severity":"INFO","source":"machine.go:405"}
2024-08-08 07:00:22 | {"log":"Initializing VM instance for Machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"machine.go:596"}
2024-08-08 07:00:22 | {"log":"Error occurred while initializing VM instance for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\": machine codes error: code = [NotFound] message = [AWS plugin is returning no VM instances backing this machine object]","pid":"1","severity":"ERR","source":"machine.go:604"}
2024-08-08 07:00:22 | {"log":"No VM instance found for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\". Skipping VM instance initialization.","pid":"1","severity":"WARN","source":"machine.go:610"}
2024-08-08 07:00:22 | {"log":"Machine labels/annotations UPDATE for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"machine.go:552"}
2024-08-08 07:00:22 | {"log":"reconcileClusterMachine: Stop for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"machine.go:178"}

Here, for machine shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm, the VM was successfully created, but initialisation failed because the VM could not be found a moment later. We know this is an issue on the cloud provider side, but it can happen. In this case, initialisation is skipped, and the machine object is still updated with the providerID and the NodeName label.

In the next reconciliation, GetMachineStatus hits the same transient issue on the cloud provider, and the VM is still not found. But because the node label was set in the previous reconciliation, the body of

if _, present := machine.Labels[v1alpha1.NodeLabelKey]; !present {

is never executed, and hence the VM is never initialised; the machine is nevertheless moved to the Pending state (and eventually to Running once the Node is registered):

2024-08-08 07:00:22 | {"log":"reconcileClusterMachine: Start for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" with phase:\"\", description:\"\"","pid":"1","severity":"INFO","source":"machine.go:116"}
2024-08-08 07:00:23 | {"log":"Get request has been recieved for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"core.go:411"}
2024-08-08 07:00:23 | {"log":"For machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\", obtained VM error status as: machine codes error: code = [NotFound] message = [AWS plugin is returning no VM instances backing this machine object]","pid":"1","severity":"WARN","source":"machine.go:382"}
2024-08-08 07:00:23 | {"log":"Machine/status UPDATE for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" during creation","pid":"1","severity":"INFO","source":"machine.go:578"}

This is a problem because settings like sourceDestCheck are enabled/disabled during VM initialisation. If initialisation is not done, pods running on the node can go into CrashLoopBackOff (CLBF), as seen in canary issue no. 5533.

Another problem is that even if the Driver.InitializeMachine method keeps failing, the kubelet can still run on the created VM and register the corresponding Node object. The scheduler will then see the node and schedule pods on it, and those pods will go into CLBF because the VM has not been properly initialised.
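The failure mode above can be illustrated with a minimal Go sketch. This is not the actual MCM code: the `machine` struct, `nodeLabelKey` constant, and `reconcile` function are simplified stand-ins for the real controller, and the `notFound` flag mimics the transient provider error seen in the logs.

```go
package main

import "fmt"

// nodeLabelKey stands in for v1alpha1.NodeLabelKey.
const nodeLabelKey = "node"

// machine is a simplified stand-in for the Machine object.
type machine struct {
	Labels      map[string]string
	Initialized bool
}

// reconcile sketches the reconcile flow described above; notFound mimics
// Driver.InitializeMachine returning a NotFound error.
func reconcile(m *machine, notFound bool) {
	if _, present := m.Labels[nodeLabelKey]; !present {
		if notFound {
			// Current behaviour: initialization is skipped on NotFound...
			fmt.Println("No VM instance found. Skipping VM instance initialization.")
		} else {
			m.Initialized = true
		}
		// ...but the node label is set regardless, so later reconciliations
		// never re-enter this branch and the VM stays uninitialized forever.
		m.Labels[nodeLabelKey] = "node-name"
	}
}

func main() {
	m := &machine{Labels: map[string]string{}}
	reconcile(m, true)  // first reconcile: provider returns NotFound
	reconcile(m, false) // provider recovered, but the label is already set
	fmt.Println("initialized:", m.Initialized) // initialization was never retried
}
```

Once the label is present, `reconcile` becomes a no-op for initialization, which is exactly the gap this issue describes.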

What you expected to happen:
VM initialisation should be retried in case of NotFound errors, and pods should not be scheduled on the node until initialisation of the corresponding VM has succeeded.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

@rishabh-11 rishabh-11 added the kind/bug Bug label Aug 9, 2024
@gardener-robot gardener-robot added area/robustness Robustness, reliability, resilience related priority/1 Priority (lower number equals higher priority) labels Aug 9, 2024
@rishabh-11 rishabh-11 changed the title Machine is never initialised if Driver.Initialized machine returns NotFound error code for VM Machine is never initialised if Driver.InitializeMachine returns NotFound error code for VM Aug 9, 2024

elankath commented Aug 9, 2024

ok, so we need to work around bugs where a cloud provider says a VM was not found even after it was successfully created. 😅


unmarshall commented Aug 11, 2024

Yes, right from Neo days we have learnt the hard way that none of the infra providers get the distributed cache implementation right. Even when the resource has been created and confirmed by the infra provider, the subsequent provider GET call does not return that instance. We saw the same issue in Azure as well.
@rishabh-11 and I discussed this and we have a proposal to improve this holistically. We can discuss this in a dedicated meeting.

@rishabh-11 (Contributor, Author)

After discussing with @ScheererJ, we have decided to move forward with the following solution:

  1. Add a taint representing vm-not-initialised to the kubelet configuration. The Node object will then be created with the taint, and no components will be scheduled on it until the taint is removed.
  2. Adapt MCM to remove the above-mentioned taint once Driver.InitializeMachine has run successfully (or returns the Unimplemented error code).
  3. Adapt MCM to add the node label before we initialise the VM. This ensures we do not create multiple VMs for the same machine object if GetMachineStatus returns a NotFound error.
  4. Change the implementation of InitializeMachine in provider-aws to always return only the Uninitialized error code.
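Step 1 could be wired up via the kubelet's `registerWithTaints` setting, which makes the kubelet register its Node with the given taints. A hypothetical sketch of the configuration fragment (the taint key shown is illustrative, not the key actually chosen):

```yaml
# Hypothetical KubeletConfiguration fragment; the taint key is an
# assumption for illustration, not the final name.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
- key: node.machine.sapcloud.io/vm-not-initialised
  effect: NoSchedule
```

With `NoSchedule`, the scheduler keeps pods off the node until MCM removes the taint after successful initialisation (step 2).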

After the MCM changes are done, providers will have to upgrade their MCM dependency and cut new releases. After the provider releases, the corresponding GEP will have to be updated with the correct image. Once all GEPs are released, we can make the g/g change to add the taint.

@rishabh-11 rishabh-11 added the needs/planning Needs (more) planning with other MCM maintainers label Aug 16, 2024
@rfranzke (Member)

  1. Add a taint representing vm-not-initialised to the kubelet configuration. This will create the Node object with the taint and none of the components will get scheduled on it till this taint is removed.

How will this work when machines are not managed via MCM (e.g., in the context of gardener/gardener#2906)?


unmarshall commented Aug 16, 2024

How will this work when machines are not managed via MCM (e.g., in the context of gardener/gardener#2906)?

@rfranzke That is a valid question. Do you have clarity on who will manage virtual machines in an autonomous cluster?

@elankath elankath self-assigned this Aug 19, 2024
@elankath (Contributor)

We found out that the DescribeInstancesInput is constructed differently in Driver.GetMachineStatus, which filters on the machine name, versus Driver.CreateMachine, which uses the VM instance ID directly. This leads to the VM instance unfortunately being found by AWS in one case but not the other, despite existing. We will now revise the logic in Driver.GetMachineStatus to fall back to fetching the VM instance directly by instance ID in DescribeInstancesInput.
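The fallback described above can be sketched as follows. This is a simplified illustration, not the real provider-aws code: `describeInstancesInput`, `filter`, and `describeFn` are hypothetical stand-ins for the AWS SDK types and the EC2 DescribeInstances call, and the `tag:Name` filter is an assumption about how the name-based lookup works.

```go
package main

import "fmt"

// Simplified stand-ins for the AWS SDK types (not the real aws-sdk-go API).
type filter struct{ Name, Value string }

type describeInstancesInput struct {
	InstanceIDs []string
	Filters     []filter
}

// describeFn abstracts the EC2 DescribeInstances call for illustration;
// it returns the IDs of the instances found.
type describeFn func(describeInstancesInput) ([]string, error)

// getMachineStatus sketches the proposed fix: query by the name filter first
// (the existing behaviour), and if nothing comes back, retry directly by
// instance ID, which is what CreateMachine already uses successfully.
func getMachineStatus(describe describeFn, machineName, instanceID string) (string, error) {
	byName := describeInstancesInput{
		Filters: []filter{{Name: "tag:Name", Value: machineName}},
	}
	if ids, err := describe(byName); err == nil && len(ids) > 0 {
		return ids[0], nil
	}
	// Fallback: the tag-based filter can lag behind due to eventual
	// consistency, so look the instance up by its ID directly.
	byID := describeInstancesInput{InstanceIDs: []string{instanceID}}
	ids, err := describe(byID)
	if err != nil || len(ids) == 0 {
		return "", fmt.Errorf("no VM instance found for machine %q", machineName)
	}
	return ids[0], nil
}

func main() {
	// Fake provider that only answers direct instance-ID lookups,
	// mimicking the inconsistency described in this issue.
	describe := func(in describeInstancesInput) ([]string, error) {
		if len(in.InstanceIDs) > 0 {
			return in.InstanceIDs, nil
		}
		return nil, nil // the name filter finds nothing
	}
	id, err := getMachineStatus(describe, "worker-z1-9j7pm", "i-0cf8cb1aa8a61dc8d")
	fmt.Println(id, err) // → i-0cf8cb1aa8a61dc8d <nil>
}
```

The key point is that the direct instance-ID path succeeds even while the filter-based path transiently returns nothing, so GetMachineStatus no longer reports NotFound for a VM that demonstrably exists.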
