
Machine is never initialised if Driver.InitializeMachine returns NotFound error code for VM #933

Closed
rishabh-11 opened this issue Aug 9, 2024 · 6 comments
Labels
area/robustness (Robustness, reliability, resilience related), kind/bug (Bug), needs/planning (Needs (more) planning with other MCM maintainers), priority/1 (Priority (lower number equals higher priority)), status/closed (Issue is closed (either delivered or triaged))

Comments


rishabh-11 commented Aug 9, 2024

How to categorize this issue?

/area robustness
/kind bug
/priority 1

What happened:
According to https://github.com/gardener/machine-controller-manager/blob/ff8261398277c3e5a481f06cfb57c417dfd07754/pkg/util/provider/machinecontroller/machine.go#L609-L611, if a NotFound error code is returned by Driver.InitializeMachine, the initialisation of the VM is skipped. This can lead to problems in the following case:

2024-08-08 07:00:21 | {"log":"Creating a VM for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\", please wait!","pid":"1","severity":"INFO","source":"machine.go:392"}
2024-08-08 07:00:21 | {"log":"Machine creation request has been recieved for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"core.go:82"}
2024-08-08 07:00:22 | {"log":"Waiting for VM with Provider-ID \"aws:///eu-west-1/i-0cf8cb1aa8a61dc8d\", for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" to be visible to all AWS endpoints","pid":"1","severity":"INFO","source":"core.go:238"}
2024-08-08 07:00:22 | {"log":"VM with Provider-ID \"aws:///eu-west-1/i-0cf8cb1aa8a61dc8d\", for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" should be visible to all AWS endpoints now","pid":"1","severity":"INFO","source":"core.go:249"}
2024-08-08 07:00:22 | {"log":"VM with Provider-ID: \"aws:///eu-west-1/i-0cf8cb1aa8a61dc8d\" created for Machine: \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"core.go:250"}
2024-08-08 07:00:22 | {"log":"Created new VM for machine: \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" with ProviderID: \"aws:///eu-west-1/i-0cf8cb1aa8a61dc8d\" and backing node: \"\"","pid":"1","severity":"INFO","source":"machine.go:405"}
2024-08-08 07:00:22 | {"log":"Initializing VM instance for Machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"machine.go:596"}
2024-08-08 07:00:22 | {"log":"Error occurred while initializing VM instance for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\": machine codes error: code = [NotFound] message = [AWS plugin is returning no VM instances backing this machine object]","pid":"1","severity":"ERR","source":"machine.go:604"}
2024-08-08 07:00:22 | {"log":"No VM instance found for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\". Skipping VM instance initialization.","pid":"1","severity":"WARN","source":"machine.go:610"}
2024-08-08 07:00:22 | {"log":"Machine labels/annotations UPDATE for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"machine.go:552"}
2024-08-08 07:00:22 | {"log":"reconcileClusterMachine: Stop for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"machine.go:178"}

Here, for machine shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm, the VM was successfully created, but initialisation failed because the VM could not be found a moment later. We know this is an issue on the cloud provider side, but it can happen. In this case, initialisation is skipped, and the machine object is still updated with the providerID and the NodeName label.

In the next reconciliation, GetMachineStatus hits the same transient issue on the cloud provider, and the VM is still not found. But because the node label was set in the previous reconciliation, the body of

if _, present := machine.Labels[v1alpha1.NodeLabelKey]; !present {

is never executed, and hence the VM is never initialised; the machine is nevertheless moved to the Pending state (and eventually to Running once the Node is registered):

2024-08-08 07:00:22 | {"log":"reconcileClusterMachine: Start for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" with phase:\"\", description:\"\"","pid":"1","severity":"INFO","source":"machine.go:116"}
2024-08-08 07:00:23 | {"log":"Get request has been recieved for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"core.go:411"}
2024-08-08 07:00:23 | {"log":"For machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\", obtained VM error status as: machine codes error: code = [NotFound] message = [AWS plugin is returning no VM instances backing this machine object]","pid":"1","severity":"WARN","source":"machine.go:382"}
2024-08-08 07:00:23 | {"log":"Machine/status UPDATE for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" during creation","pid":"1","severity":"INFO","source":"machine.go:578"}

This is a problem because settings like sourceDestCheck are enabled/disabled during VM initialisation. If initialisation is not done, pods running on the node can go into CrashLoopBackOff (CLBF), as seen in canary issue no. 5533.

Another problem is that even if the Driver.InitializeMachine method keeps failing, the kubelet can still run on the created VM and register the corresponding Node object. The scheduler will then see the node and schedule pods on it, and those pods will go into CLBF because the VM has not been properly initialised.
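The failure mode above can be illustrated with a minimal Go sketch. This is not the actual MCM code: the `machine` struct, `nodeLabelKey` constant, and `reconcile` function are simplified stand-ins for the real controller, and the `notFound` flag mimics the transient provider error seen in the logs.

```go
package main

import "fmt"

// nodeLabelKey stands in for v1alpha1.NodeLabelKey.
const nodeLabelKey = "node"

// machine is a simplified stand-in for the Machine object.
type machine struct {
	Labels      map[string]string
	Initialized bool
}

// reconcile sketches the reconcile flow described above; notFound mimics
// Driver.InitializeMachine returning a NotFound error.
func reconcile(m *machine, notFound bool) {
	if _, present := m.Labels[nodeLabelKey]; !present {
		if notFound {
			// Current behaviour: initialization is skipped on NotFound...
			fmt.Println("No VM instance found. Skipping VM instance initialization.")
		} else {
			m.Initialized = true
		}
		// ...but the node label is set regardless, so later reconciliations
		// never re-enter this branch and the VM stays uninitialized forever.
		m.Labels[nodeLabelKey] = "node-name"
	}
}

func main() {
	m := &machine{Labels: map[string]string{}}
	reconcile(m, true)  // first reconcile: provider returns NotFound
	reconcile(m, false) // provider recovered, but the label is already set
	fmt.Println("initialized:", m.Initialized) // initialization was never retried
}
```

Once the label is present, `reconcile` becomes a no-op for initialization, which is exactly the gap this issue describes.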

What you expected to happen:
VM initialisation should be retried in case of NotFound errors, and pods should not be scheduled on the node until initialisation of the corresponding VM has succeeded.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

@rishabh-11 rishabh-11 added the kind/bug Bug label Aug 9, 2024
@gardener-robot gardener-robot added area/robustness Robustness, reliability, resilience related priority/1 Priority (lower number equals higher priority) labels Aug 9, 2024
@rishabh-11 rishabh-11 changed the title Machine is never initialised if Driver.Initialized machine returns NotFound error code for VM Machine is never initialised if Driver.InitializeMachine returns NotFound error code for VM Aug 9, 2024

elankath commented Aug 9, 2024

ok, so we need to work around bugs where a cloud provider says a VM was not found even after it was successfully created. 😅


unmarshall commented Aug 11, 2024

Yes, right from Neo days we have learnt the hard way that none of the infra providers get the distributed cache implementation right. Even when the resource has been created and confirmed by the infra provider, the subsequent provider GET call does not return that instance. We saw the same issue in Azure as well.
@rishabh-11 and I discussed this and we have a proposal to improve this holistically. We can discuss this in a dedicated meeting.

@rishabh-11 (Contributor, Author)

After discussing with @ScheererJ, we have decided to move forward with the following solution:

  1. Add a taint representing vm-not-initialised to the kubelet configuration. The Node object will then be created with the taint, and no components will be scheduled on it until the taint is removed.
  2. Adapt MCM to remove the above-mentioned taint once Driver.InitializeMachine has run successfully (or returns the Unimplemented error code).
  3. Adapt MCM to add the node label before we initialise the VM. This ensures we do not create multiple VMs for the same machine object if GetMachineStatus returns a NotFound error.
  4. Change the implementation of InitializeMachine in provider-aws to always return only the Uninitialized error code.
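Step 1 could be wired up via the kubelet's `registerWithTaints` setting, which makes the kubelet register its Node with the given taints. A hypothetical sketch of the configuration fragment (the taint key shown is illustrative, not the key actually chosen):

```yaml
# Hypothetical KubeletConfiguration fragment; the taint key is an
# assumption for illustration, not the final name.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
- key: node.machine.sapcloud.io/vm-not-initialised
  effect: NoSchedule
```

With `NoSchedule`, the scheduler keeps pods off the node until MCM removes the taint after successful initialisation (step 2).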

After the MCM changes are done, providers will have to upgrade their MCM dependency and cut new releases. After the provider releases, the corresponding GEP will have to be updated with the correct image. Once all GEPs are released, we can make the g/g change to add the taint.

@rishabh-11 rishabh-11 added the needs/planning Needs (more) planning with other MCM maintainers label Aug 16, 2024
@rfranzke (Member)

  1. Add a taint representing vm-not-initialised to the kubelet configuration. This will create the Node object with the taint and none of the components will get scheduled on it till this taint is removed.

How will this work when machines are not managed via MCM (e.g., in the context of gardener/gardener#2906)?


unmarshall commented Aug 16, 2024

How will this work when machines are not managed via MCM (e.g., in the context of gardener/gardener#2906)?

@rfranzke That is a valid question. Do you have clarity on who will manage virtual machines in an autonomous cluster?

@elankath elankath self-assigned this Aug 19, 2024
@elankath (Contributor)

We found out that the DescribeInstancesInput is constructed differently in Driver.GetMachineStatus, which filters on the machine name, versus Driver.CreateMachine, which uses the VM instance ID directly. This leads to the VM instance unfortunately being found by AWS in one case but not the other, despite existing. We will now revise the logic in Driver.GetMachineStatus to fall back to fetching the VM instance directly by instance ID in DescribeInstancesInput.
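The fallback described above can be sketched as follows. This is a simplified illustration, not the real provider-aws code: `describeInstancesInput`, `filter`, and `describeFn` are hypothetical stand-ins for the AWS SDK types and the EC2 DescribeInstances call, and the `tag:Name` filter is an assumption about how the name-based lookup works.

```go
package main

import "fmt"

// Simplified stand-ins for the AWS SDK types (not the real aws-sdk-go API).
type filter struct{ Name, Value string }

type describeInstancesInput struct {
	InstanceIDs []string
	Filters     []filter
}

// describeFn abstracts the EC2 DescribeInstances call for illustration;
// it returns the IDs of the instances found.
type describeFn func(describeInstancesInput) ([]string, error)

// getMachineStatus sketches the proposed fix: query by the name filter first
// (the existing behaviour), and if nothing comes back, retry directly by
// instance ID, which is what CreateMachine already uses successfully.
func getMachineStatus(describe describeFn, machineName, instanceID string) (string, error) {
	byName := describeInstancesInput{
		Filters: []filter{{Name: "tag:Name", Value: machineName}},
	}
	if ids, err := describe(byName); err == nil && len(ids) > 0 {
		return ids[0], nil
	}
	// Fallback: the tag-based filter can lag behind due to eventual
	// consistency, so look the instance up by its ID directly.
	byID := describeInstancesInput{InstanceIDs: []string{instanceID}}
	ids, err := describe(byID)
	if err != nil || len(ids) == 0 {
		return "", fmt.Errorf("no VM instance found for machine %q", machineName)
	}
	return ids[0], nil
}

func main() {
	// Fake provider that only answers direct instance-ID lookups,
	// mimicking the inconsistency described in this issue.
	describe := func(in describeInstancesInput) ([]string, error) {
		if len(in.InstanceIDs) > 0 {
			return in.InstanceIDs, nil
		}
		return nil, nil // the name filter finds nothing
	}
	id, err := getMachineStatus(describe, "worker-z1-9j7pm", "i-0cf8cb1aa8a61dc8d")
	fmt.Println(id, err) // → i-0cf8cb1aa8a61dc8d <nil>
}
```

The key point is that the direct instance-ID path succeeds even while the filter-based path transiently returns nothing, so GetMachineStatus no longer reports NotFound for a VM that demonstrably exists.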
