Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve launch template diff detection for MachinePools #4194

Closed

Conversation

cnmcavoy
Copy link
Contributor

@cnmcavoy cnmcavoy commented Apr 3, 2023

What type of PR is this?
/kind bug

What this PR does / why we need it:
Improve launch template diff detection for MachinePools:

  • Store tags related to the launch template sooner (unnecessary churn occurs when reconciliation errors part-way)
  • Handle missing ASG case better (presently 2 launch template versions are created for the initial asg)
  • Defaulting webhook should set necessary defaults for MachinePool launch templates (unbounded churn due to missing defaults)

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4070

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

Release note:

Improve launch template diff detection for MachinePools to reduce unnecessary launch template versions. Logging around launch template diffs has also been improved to highlight why new versions are created.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 3, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign richardcase for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 3, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @cnmcavoy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cnmcavoy cnmcavoy force-pushed the cnmcavoy/launch-template-diffs branch 2 times, most recently from a69873f to 7db8dc5 Compare April 3, 2023 20:56
@Ankitasw
Copy link
Member

Ankitasw commented Apr 4, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 4, 2023
@cnmcavoy cnmcavoy force-pushed the cnmcavoy/launch-template-diffs branch from 7db8dc5 to b212135 Compare April 4, 2023 15:45
@cnmcavoy
Copy link
Contributor Author

cnmcavoy commented Apr 4, 2023

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@dlipovetsky
Copy link
Contributor

Thanks for working on this, @cnmcavoy!

A small nit: There are 3 separate improvements. I think it would have helped reviewers if they were each at least a separate commit, but perhaps even a separate PR, because they are really distinct pieces of work.

@cnmcavoy
Copy link
Contributor Author

cnmcavoy commented May 2, 2023

For additional context, the reason these all ended up in a single PR is that this was already a split-off from my work in #4184. I wasn't trying to fix these bugs, but encountered them in the course of that other work and they were in my way. I am open to split this further up, if we feel it's necessary for a proper review, but the tradeoff is it slows down getting this fixed.

@cnmcavoy cnmcavoy force-pushed the cnmcavoy/launch-template-diffs branch from 2e52570 to 9c02ddd Compare May 5, 2023 21:09
@cnmcavoy cnmcavoy requested a review from dlipovetsky May 11, 2023 14:40
@cnmcavoy
Copy link
Contributor Author

anything left that I missed on this PR? @dlipovetsky

@dlipovetsky
Copy link
Contributor

I didn't realize you had addressed the default issue as well. I'll review again.

@cnmcavoy cnmcavoy force-pushed the cnmcavoy/launch-template-diffs branch from 9c02ddd to 70f62a9 Compare May 11, 2023 23:00
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 12, 2023
  * Store tags related to the launch template sooner
  * Handle missing ASG case better
  * Defaulting webhook should set necessary defaults for MachinePool launch templates
@cnmcavoy cnmcavoy force-pushed the cnmcavoy/launch-template-diffs branch from 70f62a9 to 178172b Compare May 12, 2023 15:12
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 12, 2023
@cnmcavoy cnmcavoy requested a review from dlipovetsky May 12, 2023 15:33
@Skarlso
Copy link
Contributor

Skarlso commented May 12, 2023

@Skarlso take a look at this

@dlipovetsky
Copy link
Contributor

I still need to review the reconcile loop changes. I'll get to that today.

}

if !cmp.Equal(mixedInstancesPolicy, existingASG.MixedInstancesPolicy) {
machinePoolScope.Info("MixedInstancesPolicy diff detected", "incoming", spew.Sprintf("%#v", mixedInstancesPolicy), "existing", spew.Sprintf("%#v", existingASG.MixedInstancesPolicy))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never use spew in the logs. I'm pretty sure this would look super ugly. :D Can you print individual values instead or somehow make a prettier output?

Copy link
Contributor Author

@cnmcavoy cnmcavoy May 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not without expanding the footprint of the code there significantly. MixedInstancesPolicy is optional (could be nil on either object) and has further optional nil-able inner struct fields, and golang lacks a safe navigator operator :(

The output is the struct object. %#v shows the struct + field names, so it's human readable.

I originally explored implementing Stringer interfaces on these structs, so that printing the objects might work, but that touched many unrelated areas, so it seemed less desirable than the spew calls here.

Easiest answer is probably to remove the log line here. But it will make debugging launch templates harder again. :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our goal is to help the user understand the diff, if one occurs. For representing the diff, I think https://pkg.go.dev/github.com/google/go-cmp/cmp#Diff is very readable. Had you already considered it?

For example, this output

2023/05/16 14:58:17 (-existing, +actual)   &v1beta2.MixedInstancesPolicy{
  	InstancesDistribution: &v1beta2.InstancesDistribution{
- 		OnDemandAllocationStrategy:          "",
+ 		OnDemandAllocationStrategy:          "prioritized",
- 		SpotAllocationStrategy:              "lowest-price",
+ 		SpotAllocationStrategy:              "",
  		OnDemandBaseCapacity:                nil,
  		OnDemandPercentageAboveBaseCapacity: nil,
  	},
  	Overrides: nil,
  }

is the result of the example

package main

import (
	"log"

	"github.com/google/go-cmp/cmp"

	expinfrav1 "sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2"
)

func main() {
	existing := &expinfrav1.MixedInstancesPolicy{
		InstancesDistribution: &expinfrav1.InstancesDistribution{
			SpotAllocationStrategy: expinfrav1.SpotAllocationStrategyLowestPrice,
		},
		Overrides: nil,
	}

	actual := &expinfrav1.MixedInstancesPolicy{
		InstancesDistribution: &expinfrav1.InstancesDistribution{
			OnDemandAllocationStrategy: expinfrav1.OnDemandAllocationStrategyPrioritized,
		},
		Overrides: nil,
	}
	diff := cmp.Diff(existing, actual)
	log.Println("(-existing, +actual)", diff)
}

However, the output is readable because it's multiline, and our logs are structured, so I think the newlines will be escaped and everything fit into a single line, which will be very hard to read.

Have you considered logging that the diff occurred, but writing the diff itself to an Event?

Copy link
Contributor Author

@cnmcavoy cnmcavoy May 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our goal is to help the user understand the diff, if one occurs. For representing the diff, I think https://pkg.go.dev/github.com/google/go-cmp/cmp#Diff is very readable. Had you already considered it?

I'm not very familiar with that library, so no. But that functionality looks like it could produce the output we want.

@dlipovetsky what if we use inspiration from your previous suggestion about how to refactor the mixedInstancesPolicy to take a different approach, using cmp.Diff to detect the differences in the ASG, and then use that output for the event?

It avoids all the complex if/else flows and logging pitfalls as far as I can tell.

// detectASGDrift compares incoming AWSMachinePool and compares against existing ASG.
func detectASGDrift(machinePoolScope *scope.MachinePoolScope, existingASG *expinfrav1.AutoScalingGroup) string {
	detectedMachinePoolSpec := machinePoolScope.MachinePool.Spec.DeepCopy()

	if !scope.ReplicasExternallyManaged(machinePoolScope.MachinePool) {
		detectedMachinePoolSpec.Replicas = existingASG.DesiredCapacity
	}
	if diff := cmp.Diff(machinePoolScope.MachinePool.Spec, *detectedMachinePoolSpec); diff != "" {
		return diff
	}

	detectedAWSMachinePoolSpec := machinePoolScope.AWSMachinePool.Spec.DeepCopy()
	detectedAWSMachinePoolSpec.MaxSize = existingASG.MaxSize
	detectedAWSMachinePoolSpec.MinSize = existingASG.MinSize
	detectedAWSMachinePoolSpec.CapacityRebalance = existingASG.CapacityRebalance
	{
		mixedInstancesPolicy := machinePoolScope.AWSMachinePool.Spec.MixedInstancesPolicy
		// InstancesDistribution is optional, and the default values come from AWS, so
		// they are not set by the AWSMachinePool defaulting webhook. If InstancesDistribution is
		// not set, we use the AWS values for the purpose of comparison.
		if mixedInstancesPolicy != nil && mixedInstancesPolicy.InstancesDistribution == nil {
			mixedInstancesPolicy = machinePoolScope.AWSMachinePool.Spec.MixedInstancesPolicy.DeepCopy()
			mixedInstancesPolicy.InstancesDistribution = existingASG.MixedInstancesPolicy.InstancesDistribution
		}

		if !cmp.Equal(mixedInstancesPolicy, existingASG.MixedInstancesPolicy) {
			detectedAWSMachinePoolSpec.MixedInstancesPolicy = existingASG.MixedInstancesPolicy
		}
	}

	return cmp.Diff(machinePoolScope.AWSMachinePool.Spec, *detectedAWSMachinePoolSpec)
}

And then in updatePool use that:

	asgDiff := detectASGDrift(machinePoolScope, existingASG)
	if asgDiff != "" {
		r.Recorder.Eventf(machinePoolScope.AWSMachinePool, corev1.EventTypeNormal, "ASGDiffDetected", asgDiff)
	}
	if asgDiff != "" || subnetChanges {
		machinePoolScope.Info("updating AutoScalingGroup")
...

I updated the PR for this approach. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that sounds really nice. I'll take a look at this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it!

@cnmcavoy cnmcavoy requested a review from Skarlso May 18, 2023 19:38
@cnmcavoy
Copy link
Contributor Author

Any final concerns on this PR?

Comment on lines 431 to 436
r.Recorder.Eventf(machinePoolScope.AWSMachinePool, corev1.EventTypeNormal, "ASGDiffDetected", subnetChanges)
}

if asgNeedsUpdates(machinePoolScope, existingASG) || subnetChanges {
asgDiff := detectASGDrift(machinePoolScope, existingASG)
if asgDiff != "" {
r.Recorder.Eventf(machinePoolScope.AWSMachinePool, corev1.EventTypeNormal, "ASGDiffDetected", asgDiff)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having thought more about it, I'm not sure an Event is what we want. Events have a 1 hour TTL by default.

Is our goal to record this information always, or only if the user is asking for more verbose logs?

If it's the latter, then we could use machinePoolScope.Debug, as we do in other places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, Kubernetes drops events if the queue is full. So events should be something that is okay to be missed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I don't have strong opinions here, so I swapped back to machinePoolScope.Debug

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

runPostLaunchTemplateUpdateOperation := func() error {
launchTemplateID := machinePoolScope.GetLaunchTemplateIDStatus()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't all this happen after the skipping instance refresh check? If instance refresh is completely disabled, none of these things should happen shouldn't they? Like, updating the tags.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, never mind. Of course, the ASG is using the LaunchTemplate. So we assure that it's up to date. Right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must reconcile tags after updating the launch template, which is why this logic was moved into the beginning of runPostLaunchTemplateUpdateOperation - because we need to update our record of the tags on the launch template regardless of whether or not we are doing a refresh. The tags are a property of launch templates, not asgs.

I didn't write this logic, only relocated it from elsewhere in the codepath. Previously we only reconciled tags if the asg could be created, which is too late, because an error could have occurred.

conditions.MarkUnknown(machinePoolScope.AWSMachinePool, expinfrav1.ASGReadyCondition, expinfrav1.ASGNotFoundReason, err.Error())
return ctrl.Result{}, err
}

if asg == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I'm a bit confused here. We reconcile the launch template which does check for an ASG, but we only do the if asg doesn't exist create and requeue AFTER we reconciled the launch template which was looking for an ASG. I mean it's obvious that the ASG won't be there, so why not, if the ASG doesn't exists, create it, then requeue and proceed. I'm not sure I get the order of things in here. :D Can you educate me please? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ASG creation depends on the launch template creation, and the ASG instance refresh depends on the launch template being modified. I think these two dependencies are coupled in the implementation and causing confusion.

My understanding is that we can't move the ASG creation before ec2Svc.ReconcileLaunchTemplate is invoked, because we need that to succeed to ensure a launch template exists to use for the ASG.

However, that is also the function that detects if we modified the launch template and runs the post-update callback.

Maybe we should open a separate issue about clarifying these codepaths? I'm hesitant to add more to this PR or else it will never get merged...

@cnmcavoy cnmcavoy force-pushed the cnmcavoy/launch-template-diffs branch from 4593ae5 to c9b28d3 Compare May 22, 2023 21:09
@Skarlso
Copy link
Contributor

Skarlso commented May 23, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 23, 2023
@Ankitasw
Copy link
Member

/lgtm

There were few comments by @dlipovetsky, so I would defer to him for final approval 😅

@dlipovetsky
Copy link
Contributor

I haven't had time to look recently, but will do so this week.

@cnmcavoy
Copy link
Contributor Author

cnmcavoy commented Jul 5, 2023

Split up into 3 PRs for ease of review.

@cnmcavoy cnmcavoy closed this Jul 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
5 participants