Modify Machine Creation flow to make sure node label is updated before initialization of VM. Modify Deletion flow to call DeleteMachine even if VM is not found. #940

Merged: 13 commits, Sep 13, 2024

Conversation

@thiyyakat thiyyakat commented Sep 5, 2024

What this PR does / why we need it:

The PR changes triggerCreationFlow to update machine labels before proceeding to initialization. This was done to make sure that the labels are always updated even in the case of initialization failures.
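
A minimal sketch of the ordering described above, not the controller's actual code: the `machine` type, the `nodeLabelKey` value, and the helper names are hypothetical stand-ins, and the label update is done in memory rather than through the API server. It only illustrates that when labels are written before initialization runs, a failing initializer can no longer prevent the label from being set.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeLabelKey is an assumed placeholder for v1alpha1.NodeLabelKey.
const nodeLabelKey = "node"

// machine is a hypothetical, minimal stand-in for the Machine object.
type machine struct {
	Name   string
	Labels map[string]string
}

// updateMachineLabels records the node name on the machine (in memory only).
func updateMachineLabels(m *machine, nodeName string) {
	if m.Labels == nil {
		m.Labels = map[string]string{}
	}
	m.Labels[nodeLabelKey] = nodeName
}

// alwaysFail simulates a provider whose initialization call fails.
func alwaysFail(*machine) error {
	return errors.New("simulated provider failure")
}

// createFlow updates labels first and only then runs initialization,
// so the label is present even if initialization returns an error.
func createFlow(m *machine, nodeName string, initialize func(*machine) error) error {
	updateMachineLabels(m, nodeName)
	if err := initialize(m); err != nil {
		return fmt.Errorf("initialization of machine %q failed: %w", m.Name, err)
	}
	return nil
}

func main() {
	m := &machine{Name: "machine-0"}
	err := createFlow(m, "node-0", alwaysFail)
	fmt.Println("error:", err)
	fmt.Println("node label:", m.Labels[nodeLabelKey])
}
```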

It also changes getVMStatus to update the labels in the following way:

  1. List all the nodes and look for one whose machine-name label matches the machine (effective once #919, "Introduce a feature to propagate the machine name to user data and as label to the node", and gardener#10014, "Introduce node-agent-authorizer webhook to authenticate gardener-node-agents in shoots", are released).
  2. If the above does not find a node name, do a GetMachineStatus call and populate the node name.
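
The two lookup steps above could be sketched roughly as follows. This is an illustration under assumptions, not the merged implementation: `node`, `resolveNodeName`, and `statusReturning` are hypothetical names, and the provider's GetMachineStatus call is represented by a plain function value. Only the label key is taken from the release note of this PR.

```go
package main

import "fmt"

// machineNameLabel is the node label named in this PR's release note.
const machineNameLabel = "node.gardener.cloud/machine-name"

// node is a hypothetical, minimal stand-in for a Kubernetes Node.
type node struct {
	Name   string
	Labels map[string]string
}

// findNodeByMachineName returns the first node whose machine-name label
// equals machineName. An empty machineName never matches, so unlabelled
// nodes are not picked up by accident.
func findNodeByMachineName(nodes []node, machineName string) (string, bool) {
	if machineName == "" {
		return "", false
	}
	for _, n := range nodes {
		if n.Labels[machineNameLabel] == machineName {
			return n.Name, true
		}
	}
	return "", false
}

// statusReturning builds a fake status call that always yields nodeName.
func statusReturning(nodeName string) func(string) (string, error) {
	return func(string) (string, error) { return nodeName, nil }
}

// resolveNodeName tries the node list first (step 1) and falls back to the
// status call (step 2) only when no labelled node is found.
func resolveNodeName(nodes []node, machineName string, getStatus func(string) (string, error)) (string, error) {
	if name, ok := findNodeByMachineName(nodes, machineName); ok {
		return name, nil
	}
	name, err := getStatus(machineName)
	if err != nil {
		return "", fmt.Errorf("node name for machine %q not found: %w", machineName, err)
	}
	return name, nil
}

func main() {
	nodes := []node{{Name: "node-a", Labels: map[string]string{machineNameLabel: "machine-a"}}}
	fmt.Println(resolveNodeName(nodes, "machine-a", statusReturning("unused")))
	fmt.Println(resolveNodeName(nodes, "machine-b", statusReturning("node-b")))
}
```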

In addition, getVMStatus will always redirect to initiateDrain.

A change was also made to remove error logs when InitializeMachine is not implemented by a provider.

Which issue(s) this PR fixes:
Fixes #934 #936
Fixes part of #933

Special notes for your reviewer:

The changes in the PR were tested by doing the following:
Manually returned an error from machine-controller-manager-provider-azure's CreateMachine code after NIC is created, such that the VM is not created. The machine was then marked for deletion. The logs showed that the NIC was deleted successfully.

Manually returned codes.Uninitialized error code from InitializeMachine for AWS. On triggering creation of machine, found that initialization is tried in a loop after shortRetry.

Manually returned codes.Uninitialized from initializeMachine after the call to driver.initializeMachine. The logs show that the machine state update is successful. In the next reconciliation, since the machine is found to be initialized, machine goes to running.

Release note:

`getVMStatus` always redirects to `InitiateDrain`. It also populates the node label on the machine object by checking `node.gardener.cloud/machine-name` label on the nodes. 
Fixed a bug where failure of machine initialization caused label updates to not happen. 

@thiyyakat thiyyakat requested a review from a team as a code owner September 5, 2024 06:59
@gardener-robot gardener-robot added the needs/review Needs review label Sep 5, 2024
@gardener-robot

@thiyyakat Thank you for your contribution.

@gardener-robot gardener-robot added the size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) label Sep 5, 2024
@gardener-robot-ci-2

Thank you @thiyyakat for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.

@thiyyakat thiyyakat added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Sep 5, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Sep 5, 2024
@gardener-robot gardener-robot added the needs/rebase Needs git rebase label Sep 6, 2024
@gardener-robot

@thiyyakat You need to rebase this pull request onto the latest master branch. Please check.

@rishabh-11 rishabh-11 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Sep 6, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Sep 6, 2024
return retryPeriod, err
}
// Return error even when machine object is updated
err = fmt.Errorf("Machine creation in process. Machine UPDATE successful")

Is this error message correct? If the initializeMachine call has succeeded without an error, this error is still created, with a message that is not very clear. What is intended here?

Contributor Author

If we look at the call stack, this error message is propagated to reconcileClusterMachineKey which calls enqueueMachineAfter where this message is used to print a log specifying the reason for the enqueue operation.

This is not something new that has been introduced; we are preserving the original behaviour. Changing the message to "Machine creation in process. Machine initialization (if required) and label update successful".
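
To illustrate the pattern described in this reply, here is a rough sketch in which the returned error doubles as the human-readable reason for re-enqueueing, even when the step itself succeeded. The function names, the retry period, and the log wording are hypothetical and are not taken from the controller code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// reconcile returns a retry period plus an error. The error is non-nil in
// both branches because the caller uses it as the reason for the requeue.
func reconcile(labelsUpdated bool) (time.Duration, error) {
	if !labelsUpdated {
		return 5 * time.Second, errors.New("machine label update failed")
	}
	return 5 * time.Second, errors.New("machine creation in process, label update successful")
}

// requeueMessage builds the log line written when the key is re-enqueued.
func requeueMessage(key string, after time.Duration, reason error) string {
	return fmt.Sprintf("adding machine %q to the queue after %s, reason: %v", key, after, reason)
}

func main() {
	for _, ok := range []bool{false, true} {
		after, reason := reconcile(ok)
		fmt.Println(requeueMessage("ns/machine-0", after, reason))
	}
}
```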

}
}
if uninitializedMachine {
retryPeriod, err := c.initializeMachine(ctx, createMachineRequest.Machine, createMachineRequest.MachineClass, createMachineRequest.Secret)

Before calling initializeMachine we now update the labels, as we discussed. If an update is done, the generation of the resource changes, right? Later you take the machine from the request object (which is now stale) and pass it into initializeMachine, which also attempts to update the status by cloning the now-stale object in case of errors. This could result in a conflict. Have you tested this?
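
The concern can be reproduced with a toy model of optimistic concurrency. The sketch below assumes the usual Kubernetes behaviour that a write is rejected when the submitted copy carries an outdated resourceVersion; the `store` and `object` types and the phase value are hypothetical and only stand in for the API server and the Machine object.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: object has been modified")

// object is a hypothetical stand-in for a stored API object.
type object struct {
	ResourceVersion int
	NodeLabel       string
	Phase           string
}

// store accepts an update only if the caller's copy carries the currently
// stored resource version, and bumps the version on every accepted write.
type store struct{ current object }

func (s *store) update(o object) (object, error) {
	if o.ResourceVersion != s.current.ResourceVersion {
		return object{}, errConflict
	}
	o.ResourceVersion++
	s.current = o
	return o, nil
}

func main() {
	s := &store{current: object{ResourceVersion: 1}}
	stale := s.current // copy held by the request object

	labelled := stale
	labelled.NodeLabel = "node-0"
	fresh, err := s.update(labelled) // label update succeeds, version moves on
	fmt.Println(fresh.ResourceVersion, err)

	stale.Phase = "Pending"
	_, err = s.update(stale) // status update from the old copy is rejected
	fmt.Println(err)

	fresh.Phase = "Pending"
	_, err = s.update(fresh) // the object returned by the first update works
	fmt.Println(err)
}
```

Passing along the object returned by the label update, rather than the one from the request, avoids the rejected write in this model.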

nodes, err = c.nodeLister.List(labels.Everything())
if err == nil {
for _, node := range nodes {
if node.Labels["node.gardener.cloud/machine-name"] == getMachineStatusRequest.Machine.Name {

in Oliver's PR is there a constant created for this label?

// Node label is required for drain of node, therefore we try to update machine object before proceeding to drain.
isNodeLabelUpdated := false
//check if node name label is already present in machine object
nodeName := getMachineStatusRequest.Machine.Labels[v1alpha1.NodeLabelKey]

declare this as a variable instead of short assign.

// get all nodes and check if any node has the machine name as label
var nodes []*v1.Node
nodes, err = c.nodeLister.List(labels.Everything())
if err == nil {

if there is an error it is totally ignored. I would suggest that you at least log the error for better debuggability.

} else {
// Figure out node label either by checking all nodes for label matching machine name or retrieving it using GetMachineStatus
// get all nodes and check if any node has the machine name as label
var nodes []*v1.Node

extract the logic to look for nodeName matching the machine name into a separate function. That function should just find the matching nodeName and the update should happen in the caller.

@@ -944,77 +944,94 @@ func (c *controller) setMachineTerminationStatus(ctx context.Context, deleteMach
return machineutils.ShortRetry, err
}

- // getVMStatus tries to retrive VM status backed by machine
+ // getVMStatus tries to retrieve VM status backed by machine
func (c *controller) getVMStatus(ctx context.Context, getMachineStatusRequest *driver.GetMachineStatusRequest) (machineutils.RetryPeriod, error) {

the function is named getVMStatus but it does not return any status. This is a bit weird. Shouldn't it be called updateMachineStatusAndNodeLabel instead?

isNodeLabelUpdated := false
//check if node name label is already present in machine object
nodeName := getMachineStatusRequest.Machine.Labels[v1alpha1.NodeLabelKey]
if isValidNodeName(nodeName) {

this function is actually quite useless. It only checks if the node Name is empty. I would just remove this function.

}
}

if !isNodeLabelUpdated {

It would be better to rewrite this function.
An idea:

func (c *controller) updateMachineNodeLabelAndStatus(ctx context.Context, machine *v1alpha1.Machine) {
   // calls getNodeName
   // if there is no error then updates the node label
   // if there is an error either in calling getNodeName or updating the node label then updates the status accordingly

}

func (c *controller) getNodeName(ctx context.Context, machine *v1alpha1.Machine) (string, error) {
	nodeName := machine.Labels[v1alpha1.NodeLabelKey]
	if len(strings.TrimSpace(nodeName)) > 0 {
		return nodeName, nil
	}
	matchingNode, err := c.fetchMatchingNode(ctx, machine.Name)	
	if err == nil && matchingNode != nil {
		return matchingNode.Name, nil
	}
	if err != nil {
		klog.Errorf("Error trying to get node matching machine %s: %v. Will try to get the node name by calling driver.GetMachineStatus instead.", machine.Name, err)
	}
	// call GetMachineStatus to get the node name	
}
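
One possible shape for the caller half of this idea is sketched below. It is a simplified, in-memory illustration: the `machine` type, the `LastOperation` field, the `nodeLabelKey` value, and the lookup helpers are hypothetical, and the real function would persist the label and status through the API client, which adds a second error path that this sketch does not model.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// nodeLabelKey is an assumed placeholder for v1alpha1.NodeLabelKey.
const nodeLabelKey = "node"

// machine is a hypothetical, minimal stand-in for v1alpha1.Machine.
type machine struct {
	Name          string
	Labels        map[string]string
	LastOperation string
}

// nodeNameFunc stands in for the suggested getNodeName method.
type nodeNameFunc func(m *machine) (string, error)

// fixedNodeName returns a lookup that always succeeds with name.
func fixedNodeName(name string) nodeNameFunc {
	return func(*machine) (string, error) { return name, nil }
}

// failingLookup simulates a lookup where every source failed.
func failingLookup(*machine) (string, error) {
	return "", errors.New("no node found and status call failed")
}

// updateMachineNodeLabelAndStatus resolves the node name, sets the label on
// success, and records the failure in the status field otherwise.
func updateMachineNodeLabelAndStatus(m *machine, getNodeName nodeNameFunc) {
	nodeName, err := getNodeName(m)
	if err == nil && strings.TrimSpace(nodeName) == "" {
		err = errors.New("lookup returned an empty node name")
	}
	if err != nil {
		m.LastOperation = fmt.Sprintf("could not determine node name: %v", err)
		return
	}
	if m.Labels == nil {
		m.Labels = map[string]string{}
	}
	m.Labels[nodeLabelKey] = nodeName
	m.LastOperation = "node label updated"
}

func main() {
	ok := &machine{Name: "machine-0"}
	updateMachineNodeLabelAndStatus(ok, fixedNodeName("node-0"))
	fmt.Println(ok.Labels[nodeLabelKey], "|", ok.LastOperation)

	failed := &machine{Name: "machine-1"}
	updateMachineNodeLabelAndStatus(failed, failingLookup)
	fmt.Println(failed.Labels[nodeLabelKey], "|", failed.LastOperation)
}
```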

@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Sep 9, 2024
@thiyyakat thiyyakat changed the title from "Change Machine Creation and Deletion to Handle Transient Issues in Cloud Provider Calls" to "Modify Machine Creation flow to make sure node label is updated before initialization of VM. Modify Deletion flow to call DeleteMachine even if VM is not found." Sep 9, 2024
return machineutils.LongRetry, nil
}

func (c *controller) updateLabelsAndInitializeMachine(ctx context.Context, createMachineRequest *driver.CreateMachineRequest, nodeName, providerID string, shouldInitializeMachine bool) (retryPeriod machineutils.RetryPeriod, err error) {
_, machineNodeLabelPresent := createMachineRequest.Machine.Labels[v1alpha1.NodeLabelKey]

you could have used:

metav1.HasAnnotation(createMachineRequest.Machine.ObjectMeta, v1alpha1.NodeLabelKey)
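
One caveat worth checking: the line under review reads `Machine.Labels`, while `metav1.HasAnnotation` inspects the annotations map. Labels and annotations are separate maps on the object metadata, so the label counterpart (to my knowledge, `metav1.HasLabel` in the same apimachinery package) is presumably what is wanted. The sketch below uses local stand-in types and helpers that only mirror those two functions, to show that the two checks give different answers for the same key.

```go
package main

import "fmt"

// objectMeta is a hypothetical stand-in for metav1.ObjectMeta.
type objectMeta struct {
	Labels      map[string]string
	Annotations map[string]string
}

// hasLabel reports whether key is present in the labels map.
func hasLabel(obj objectMeta, key string) bool {
	_, found := obj.Labels[key]
	return found
}

// hasAnnotation reports whether key is present in the annotations map.
func hasAnnotation(obj objectMeta, key string) bool {
	_, found := obj.Annotations[key]
	return found
}

func main() {
	meta := objectMeta{Labels: map[string]string{"node": "node-0"}}
	fmt.Println(hasLabel(meta, "node"))      // true
	fmt.Println(hasAnnotation(meta, "node")) // false
}
```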

}
}
if shouldInitializeMachine {
retryPeriod, err = c.initializeMachine(ctx, clone, createMachineRequest.MachineClass, createMachineRequest.Secret)

Why call initializeMachine from here? I would have just renamed this method to updateMachineLabels, since we already have initializeMachine. Then just call these functions from triggerCreationFlow. This also removes the need to pass shouldInitializeMachine as a method argument.

return retryPeriod, err
}
// Return error even when machine object is updated
err = fmt.Errorf("Machine creation in process. Machine initialization (if required) is successful.")

Messages given to fmt.Errorf should not start with a capital letter and should not end with a dot. You will see warnings shown in the IDE. Please correct it. There are other places as well where I see this issue. Can you at least correct it in the functions that you are touching?
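
The reason behind this Go convention is that error strings are usually embedded in a longer message, where a capital letter or trailing dot reads badly. A small sketch with hypothetical names (`errNotInitialized`, `wrapInitError`) shows how a lowercase, unpunctuated message composes with wrapping:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotInitialized is a hypothetical sentinel error.
var errNotInitialized = errors.New("machine not initialized")

// wrapInitError adds context in lowercase and without trailing punctuation,
// so the result still reads well when a caller prefixes it again.
func wrapInitError(machineName string, err error) error {
	return fmt.Errorf("initialization of machine %q failed: %w", machineName, err)
}

func main() {
	err := wrapInitError("machine-0", errNotInitialized)
	fmt.Println("reconcile aborted:", err)
	fmt.Println(errors.Is(err, errNotInitialized)) // true
}
```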

@rishabh-11 rishabh-11 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Sep 12, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Sep 12, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added the needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Sep 12, 2024
@unmarshall unmarshall left a comment

/lgtm

@@ -497,6 +492,8 @@ func (c *controller) triggerCreationFlow(ctx context.Context, createMachineReque
case codes.Uninitialized:
uninitializedMachine = true
klog.Infof("VM instance associated with machine %s was created but not initialized.", machine.Name)
nodeName = getMachineStatusResponse.NodeName

We discussed that we will go with this even though it is quite dirty. We will add a comment to improve this further.
The easier way to do this would be to add a bool pointer indicating the initialization status. If it is nil, then no initialization needs to be done (ignored). If it is false/true then we act upon it. Essentially, do not return an error, as initialization becomes part of the VM status.
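
A rough sketch of the suggested tri-state follows. The `vmStatus` type and its `Initialized` field are hypothetical and are not part of the driver's actual response type; they only show how a nil pointer can mean "no initialization step" while false and true are acted upon.

```go
package main

import "fmt"

// vmStatus is a hypothetical status response carrying the tri-state flag.
type vmStatus struct {
	NodeName    string
	Initialized *bool // nil: provider has no initialization step
}

// boolPtr returns a pointer to b, for building test values.
func boolPtr(b bool) *bool { return &b }

// needsInitialization is true only when the provider explicitly reports
// that the VM exists but has not been initialized yet.
func needsInitialization(s vmStatus) bool {
	return s.Initialized != nil && !*s.Initialized
}

func main() {
	fmt.Println(needsInitialization(vmStatus{NodeName: "node-0"}))                              // false
	fmt.Println(needsInitialization(vmStatus{NodeName: "node-0", Initialized: boolPtr(false)})) // true
	fmt.Println(needsInitialization(vmStatus{NodeName: "node-0", Initialized: boolPtr(true)}))  // false
}
```

With this shape the status call would not need an error code to signal an uninitialized VM, since the flag travels with the status itself.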

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/changes Needs (more) changes needs/rebase Needs git rebase needs/review Needs review labels Sep 13, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Sep 13, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Sep 13, 2024
@rishabh-11 rishabh-11 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Sep 13, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Sep 13, 2024
@rishabh-11 rishabh-11 merged commit 9bded0e into gardener:master Sep 13, 2024
8 checks passed
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Sep 13, 2024
Successfully merging this pull request may close these issues.

Improve error handling when InitializeMachine method is not yet implement by a provider
7 participants