Machine is never initialised if `Driver.InitializeMachine` returns `NotFound` error code for VM
#933
Comments
OK, so we need to work around bugs where a cloud provider says a VM was not found even after it was successfully created. 😅
Yes, right from the Neo days we have learnt the hard way that none of the infra providers get their distributed cache implementation right. Even when the resource has been created and confirmed by the infra provider, a subsequent GET call to the provider may not return that instance. We saw the same issue in Azure as well.
After discussing with @ScheererJ, we have decided to move forward with the following solution:
Once the MCM changes are done, the providers will have to upgrade their MCM dependency and be released. After the provider releases, the corresponding GEP will have to be updated with the correct image. Once all GEPs are released, we can make the g/g change to add the taint.
How will this work when machines are not managed via MCM (e.g., in the context of gardener/gardener#2906)?
@rfranzke That is a valid question. Do you have clarity on who will manage virtual machines in an autonomous cluster?
We found out that the |
How to categorize this issue?
/area robustness
/kind bug
/priority 1
What happened:
According to https://github.com/gardener/machine-controller-manager/blob/ff8261398277c3e5a481f06cfb57c417dfd07754/pkg/util/provider/machinecontroller/machine.go#L609-#L611, if a `NotFound` error code is returned by `driver.InitializeMachine`, the initialisation of the VM is skipped. This can lead to problems in the following case:
Here, for machine `shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm`, the VM was successfully created, but the initialisation failed because the VM was not found a moment later. We know that this is an issue on the cloud provider side, but it can happen. In this case, the initialisation is skipped, and the machine object is updated with the `providerID` and the `NodeName` label.
In the next reconciliation, `GetMachineStatus` hits the same transient issue on the cloud provider, and the VM is still not found. But because the node label was set in the previous reconciliation (machine-controller-manager/pkg/util/provider/machinecontroller/machine.go, line 390 in ff82613), the machine moves to the `Pending` state (and eventually to `Running` once the Node is registered).
This leads to a problem because settings like `sourceDestCheck` can be enabled/disabled during the initialisation of the VM. If the initialisation is not done, the pods running on the node can go into CrashLoopBackOff (CLBF), as seen in canary issue no. 5533.
Another problem is that if the `driver.InitializeMachine` method keeps failing, the kubelet can still run on the created VM and register the corresponding Node object. The scheduler then sees the node and schedules pods on it, and these pods go into CLBF because the VM has not been properly initialised.
What you expected to happen:
VM initialisation should be retried in case of `NotFound` errors, and pods should not be scheduled on the node until the initialisation of the corresponding VM is successful.
How to reproduce it (as minimally and precisely as possible):
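No reproduction steps were provided, and the real failure depends on a transient provider issue. As a rough aid, the scenario described under "What happened" can be approximated with a small self-contained simulation. This is a sketch, not MCM code: `fakeDriver`, `machine`, `reconcile`, `simulate` and `errNotFound` are hypothetical stand-ins, VM creation is assumed to always succeed, and the provider is assumed to answer `NotFound` for a fixed number of calls.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the NotFound error code returned by a provider.
var errNotFound = errors.New("NotFound")

// fakeDriver is a hypothetical provider whose next notFoundCalls calls
// report NotFound, even though the VM exists.
type fakeDriver struct {
	notFoundCalls int
	initialized   bool
}

// lookup simulates a provider GET that may hit a stale view of the VM.
func (d *fakeDriver) lookup() error {
	if d.notFoundCalls > 0 {
		d.notFoundCalls--
		return errNotFound
	}
	return nil
}

func (d *fakeDriver) GetMachineStatus() error { return d.lookup() }

func (d *fakeDriver) InitializeMachine() error {
	if err := d.lookup(); err != nil {
		return err
	}
	d.initialized = true
	return nil
}

// machine holds only the two fields the scenario depends on.
type machine struct {
	nodeLabelSet bool
	phase        string
}

// reconcile is a simplified model of the flow described in this issue.
func reconcile(m *machine, d *fakeDriver) {
	statusErr := d.GetMachineStatus()
	if errors.Is(statusErr, errNotFound) && !m.nodeLabelSet {
		// VM creation is assumed to succeed; only initialisation is modelled.
		if err := d.InitializeMachine(); err != nil && !errors.Is(err, errNotFound) {
			return // other errors leave the machine untouched for a retry
		}
		// NotFound from InitializeMachine is skipped and the label is set anyway.
		m.nodeLabelSet = true
		return
	}
	// With the label present, a NotFound status no longer blocks progress.
	m.phase = "Pending"
}

// simulate runs the given number of reconciliations and reports the outcome.
func simulate(notFoundCalls, passes int) (phase string, initialized bool) {
	d := &fakeDriver{notFoundCalls: notFoundCalls}
	m := &machine{}
	for i := 0; i < passes; i++ {
		reconcile(m, d)
	}
	return m.phase, d.initialized
}

func main() {
	phase, initialized := simulate(3, 2)
	fmt.Printf("phase=%s initialized=%t\n", phase, initialized)
}
```

With three consecutive `NotFound` answers and two reconciliations, the simulated machine ends up in `Pending` while `initialized` is still `false`, which mirrors the reported behaviour.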
Anything else we need to know?:
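The expected behaviour can be sketched as follows. This is only an illustration under stated assumptions, not the agreed MCM change: `tryInitialize`, `outcome`, `flakyInit` and `attemptsUntilSchedulable` are hypothetical names, and the `Schedulable` flag merely stands in for whatever mechanism (for example a taint) keeps pods off the node until initialisation succeeds.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the NotFound error code returned by a provider.
var errNotFound = errors.New("NotFound")

// outcome describes what a single initialisation attempt decided.
type outcome struct {
	Initialized bool // VM initialisation succeeded
	Schedulable bool // pods may be placed on the node
	Requeue     bool // reconcile again later
}

// tryInitialize calls the (hypothetical) initialize function once.
func tryInitialize(initialize func() error) (outcome, error) {
	err := initialize()
	switch {
	case err == nil:
		return outcome{Initialized: true, Schedulable: true}, nil
	case errors.Is(err, errNotFound):
		// Treat NotFound as transient: retry later and keep pods away.
		return outcome{Requeue: true}, nil
	default:
		return outcome{Requeue: true}, err
	}
}

// flakyInit returns an initialize function that reports NotFound
// for its first `failures` calls and succeeds afterwards.
func flakyInit(failures int) func() error {
	return func() error {
		if failures > 0 {
			failures--
			return errNotFound
		}
		return nil
	}
}

// attemptsUntilSchedulable retries up to maxAttempts times and returns the
// attempt on which the node became schedulable, or 0 if it never did.
func attemptsUntilSchedulable(initialize func() error, maxAttempts int) int {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		out, err := tryInitialize(initialize)
		if err != nil {
			return 0
		}
		if out.Schedulable {
			return attempt
		}
	}
	return 0
}

func main() {
	fmt.Println(attemptsUntilSchedulable(flakyInit(2), 5))
}
```

Here two transient `NotFound` answers delay scheduling until the third attempt, and the node is never marked schedulable while initialisation keeps failing.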