Fix/aws asg unsafe decommission 5829 #6911

Merged
6 commits merged into kubernetes:master on Jun 26, 2024

Conversation

@ravisinha0506 (Contributor) commented Jun 10, 2024

This PR resolves an issue in the Kubernetes Cluster Autoscaler where actual instances within AWS Auto Scaling Groups (ASGs) were incorrectly decommissioned instead of placeholders. The updated logic ensures that only placeholders are targeted for scale-down when recent scaling activity has failed. This prevents the accidental termination of active nodes and improves the reliability of the autoscaler in AWS environments.
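
For readers skimming the diff, the heart of the change is in the AWS provider's asgCache.DeleteInstances path: instance refs that are placeholders (capacity the ASG never actually launched) are handled separately from real instances, and the live ASG state is consulted before any desired-capacity decrement. The Go sketch below only illustrates that idea; it is not this PR's code, and the trimmed-down types, the deleteInstances helper, and the "i-placeholder" naming convention are simplifications assumed for the example.

```go
// Minimal, self-contained sketch of the intended DeleteInstances behaviour.
// Illustration only: the real provider works on asgCache/AwsInstanceRef and the
// EC2 Auto Scaling API; the types and helper below are simplified stand-ins.
package main

import (
	"fmt"
	"strings"
)

type instanceRef struct{ Name string }

type asg struct {
	name          string
	desiredSize   int
	liveInstances []string // what re-describing the ASG would report as actually running
}

// isPlaceholder assumes a convention of prefixing unfulfilled capacity with "i-placeholder".
func isPlaceholder(ref instanceRef) bool {
	return strings.HasPrefix(ref.Name, "i-placeholder")
}

func deleteInstances(group *asg, refs []instanceRef) {
	var placeholders, realRefs []instanceRef
	for _, r := range refs {
		if isPlaceholder(r) {
			placeholders = append(placeholders, r)
		} else {
			realRefs = append(realRefs, r)
		}
	}

	// Placeholders only ever shrink the desired capacity, and only by the amount
	// of capacity the ASG truly failed to launch (checked against live state,
	// not the cache). Placeholders that were actually fulfilled are left alone;
	// the next cache refresh will reconcile them.
	if len(placeholders) > 0 {
		unfulfilled := group.desiredSize - len(group.liveInstances)
		if unfulfilled > len(placeholders) {
			unfulfilled = len(placeholders)
		}
		if unfulfilled > 0 {
			group.desiredSize -= unfulfilled
		}
	}

	// Real instances are terminated individually (in the provider this is
	// TerminateInstanceInAutoScalingGroup with ShouldDecrementDesiredCapacity=true).
	for _, r := range realRefs {
		for i, name := range group.liveInstances {
			if name == r.Name {
				group.liveInstances = append(group.liveInstances[:i], group.liveInstances[i+1:]...)
				break
			}
		}
		group.desiredSize--
	}

	fmt.Printf("asg %s: desired=%d, running=%v\n", group.name, group.desiredSize, group.liveInstances)
}

func main() {
	g := &asg{
		name:          "test-asg",
		desiredSize:   10,
		liveInstances: []string{"i-0000", "i-0001", "i-0002", "i-0003", "i-0004", "i-0005"},
	}
	toDelete := []instanceRef{
		{"i-0000"}, {"i-0001"},
		{"i-placeholder-3"}, {"i-placeholder-4"}, {"i-placeholder-5"}, {"i-placeholder-6"},
		{"i-placeholder-7"}, {"i-placeholder-8"}, {"i-placeholder-9"},
	}
	deleteInstances(g, toDelete)
	// Prints: asg test-asg: desired=4, running=[i-0002 i-0003 i-0004 i-0005]
}
```

The same numbers come up again in the test scenario discussed further down in the review.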

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR prevents the Kubernetes Cluster Autoscaler from erroneously decommissioning actual nodes during scale-down operations in AWS environments, which could lead to unintended service disruptions.

Which issue(s) this PR fixes:

Fixes #5829

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix an issue in the Kubernetes Cluster Autoscaler where actual AWS instances could be incorrectly scaled down instead of placeholders.

@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. kind/bug Categorizes issue or PR as related to a bug. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/deprecation Categorizes issue or PR as related to a feature/enhancement marked for deprecation. labels Jun 10, 2024

linux-foundation-easycla bot commented Jun 10, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Jun 10, 2024
@k8s-ci-robot k8s-ci-robot added the kind/flake Categorizes issue or PR as related to a flaky test. label Jun 10, 2024
@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler kind/regression Categorizes issue or PR as related to a regression from a prior release. area/provider/aws Issues or PRs related to aws provider cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 10, 2024
@k8s-ci-robot (Contributor)

Hi @ravisinha0506. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 10, 2024
@drmorr0 (Contributor) commented Jun 10, 2024

Hi, can you please keep all of the changes in a single PR? It's very difficult in GH to compare changes and responses to feedback (or even remember what the feedback was) if you open a new PR for every change.

@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@ravisinha0506 (Contributor, PR author)

> Hi, can you please keep all of the changes in a single PR? It's very difficult in GH to compare changes and responses to feedback (or even remember what the feedback was) if you open a new PR for every change.

The PR you are referring to (#6818) was authored by a different person, so I have opened a new pull request for this change.

@drmorr0 (Contributor) commented Jun 11, 2024

I don't understand. Multiple people can commit things to a single PR, and this change is clearly based on the previous one. It's going to be very difficult to review if a new PR gets opened every single time a change is made (right now I don't even know which one is the correct one to review, since they are both still open).

I can add some comments here if this is going to be the PR to reference in the future, but in that case can we please close the other one? Also please make sure we have a CLA signed and fix the issues with the linter in test-and-verify.

@kmsarabu

Hi @drmorr0, sorry for the multiple submissions. We have now closed the old PR and will use this PR for the review going forward. Please review and provide feedback. Thanks.

placeHolderInstancesCount, commonAsg.Name)

// Retrieve the most recent scaling activity to determine its success state.
isRecentScalingActivitySuccess, err = m.getMostRecentScalingActivity(commonAsg)
Contributor:

In the meeting, we talked about not getting the scaling activity at all, just getting the list of instances in the ASG and seeing if the number of placeholders was incorrect. I think we can just get rid of this function entirely.

Contributor Author:

Sometimes scaling activities can take longer to complete, so I wanted to ensure that scaling had indeed failed. But since we are now looking at the total count of active EC2 instances, I think it should be okay to remove this logic.
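
To make the thread above concrete: instead of querying scaling-activity history, the placeholder branch can re-describe the group and compare how many instances are actually running against the cached desired capacity. The sketch below is only an illustration of that check, not the merged code; the callback signature is an assumption made to keep the example self-contained, and in the provider the data would come from something like m.awsService.getAutoscalingGroupsByNames (the helper the next review comment points to).

```go
// Sketch of the check that replaces the scaling-activity lookup: count what the
// ASG is actually running and only reclaim the capacity that was never delivered.
package main

import "fmt"

// placeholdersToDecrement returns how many of the placeholders scheduled for
// deletion correspond to capacity the ASG never actually launched. Only that
// many slots are safe to remove from the desired capacity; placeholders that
// turned into real instances are left for the next cache refresh.
func placeholdersToDecrement(
	describe func(asgName string) (desiredCapacity int64, runningInstances int, err error),
	asgName string,
	placeholderCount int,
) (int, error) {
	desired, running, err := describe(asgName)
	if err != nil {
		return 0, fmt.Errorf("describing ASG %s: %w", asgName, err)
	}
	unfulfilled := int(desired) - running
	if unfulfilled <= 0 {
		// Everything requested was launched: the cached placeholders are stale,
		// and deleting them must not shrink the group.
		return 0, nil
	}
	if unfulfilled < placeholderCount {
		return unfulfilled, nil
	}
	return placeholderCount, nil
}

func main() {
	// Toy stand-in for re-describing the group: desired capacity 10, 6 instances running.
	describe := func(string) (int64, int, error) { return 10, 6, nil }
	n, err := placeholdersToDecrement(describe, "test-asg", 7)
	fmt.Println(n, err) // 4 <nil>: only the four unfulfilled slots come off the desired capacity
}
```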

placeHolderInstancesCount, commonAsg.Name)
return nil
} else {
asgDetail, err := m.getDescribeAutoScalingGroupResults(commonAsg)
Contributor:

Please call m.awsService.getAutoscalingGroupsByNames here

Contributor Author:

Done

@@ -352,6 +406,33 @@ func (m *asgCache) DeleteInstances(instances []*AwsInstanceRef) error {
return nil
}

func (m *asgCache) getDescribeAutoScalingGroupResults(commonAsg *asg) (*autoscaling.Group, error) {
Contributor:

Please remove, we don't need this function.

Contributor Author:

Done. Replaced with the existing getAutoscalingGroupByName API.

@@ -624,3 +705,55 @@ func (m *asgCache) buildInstanceRefFromAWS(instance *autoscaling.Instance) AwsIn
func (m *asgCache) Cleanup() {
close(m.interrupt)
}

func (m *asgCache) getMostRecentScalingActivity(asg *asg) (bool, error) {
Contributor:

Same as above.

Contributor Author:

Done

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 14, 2024
@dims (Member) commented Jun 17, 2024

/ok-to-test
/release-note-none

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 17, 2024
@dims (Member) commented Jun 17, 2024

/assign @drmorr0

@k8s-ci-robot k8s-ci-robot removed kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/deprecation Categorizes issue or PR as related to a feature/enhancement marked for deprecation. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. kind/regression Categorizes issue or PR as related to a regression from a prior release. labels Jun 17, 2024
@drmorr0 (Contributor) commented Jun 18, 2024

/remove-area vertical-pod-autoscaler

@drmorr0 (Contributor) left a comment

I am trying to think through the implications of this again. In the current version of the code, the ASG size got decremented each time a placeholder was removed, which led to (potentially) some real instances getting terminated. This was problematic for the reasons we've discussed, but one "positive" outcome is that the ASG size ended up being the "correct" size from the cluster autoscaler's perspective.

With this change, we will no longer delete real instances, but if the cluster autoscaler asks the cloud provider to delete 10 instances, there is a possibility that fewer of them will be deleted than requested. As best I can tell, the cluster autoscaler doesn't actually care. It looks as though the AWS provider schedules a refresh of the ASG cache on the next main loop iteration, so at that point it should be up to date, and as far as I can tell nothing else really depends on "all the nodes" being deleted before the start of the next main loop.

But my point is, I'd like a second opinion on this change -- @gjtempleton, are you available to take a look?

@@ -603,7 +602,7 @@ func TestDeleteNodesWithPlaceholder(t *testing.T) {
 	err = asgs[0].DeleteNodes([]*apiv1.Node{node})
 	assert.NoError(t, err)
 	a.AssertNumberOfCalls(t, "SetDesiredCapacity", 1)
-	a.AssertNumberOfCalls(t, "DescribeAutoScalingGroupsPages", 1)
+	a.AssertNumberOfCalls(t, "DescribeAutoScalingGroupsPages", 2)
@drmorr0 (Contributor) commented Jun 18, 2024

I think we need another test here that verifies the behaviour we're expecting -- specifically, create an ASG with (say) 3 instances and a desired capacity of (say) 10, and then call DeleteNodes with 7 placeholders and 2 real nodes. Then we should see that the target size of the ASG goes down to 4.

For example, say the ASG initially has nodes [i-0000, i-0001, i-0002] and we set the desired capacity from 3 to 10. Cluster Autoscaler will update its asgCache to include [i-0000, i-0001, i-0002, placeholder-3, placeholder-4, placeholder-5, placeholder-6, placeholder-7, placeholder-8, placeholder-9]. AWS is able to satisfy three of the requests to get a "real" state of [i-0000, i-0001, i-0002, i-0003, i-0004, i-0005, placeholder-6, placeholder-7, placeholder-8, placeholder-9]. However, CA hasn't updated its cache. Then it calls DeleteInstances with the following list of instances: [i-0000, i-0001, placeholder-3, placeholder-4, placeholder-5, placeholder-6, placeholder-7, placeholder-8, placeholder-9]. The result of this DeleteInstances call should be that the ASG has nodes [i-0002, i-0003, i-0004, i-0005] and a desired capacity of 4.

Contributor Author:

Done. I have added context on the unit test to avoid any confusion.
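
For anyone reading along, the expected outcome of the scenario described above reduces to simple arithmetic. The sketch below encodes it as a stand-alone test; it is not the TestDeleteNodesWithPlaceholder change in this PR (which exercises the real asgCache against mocked AWS API calls), and the package and constant names are invented for the example.

```go
// Sketch only: encodes the expected arithmetic of the reviewer's scenario.
package placeholderdemo

import "testing"

func TestPlaceholderScaleDownArithmetic(t *testing.T) {
	const (
		initialDesired     = 10 // ASG scaled from 3 to 10
		launchedByAWS      = 6  // i-0000 ... i-0005 actually running
		placeholdersKilled = 7  // placeholder-3 ... placeholder-9 passed to DeleteNodes
		realNodesKilled    = 2  // i-0000 and i-0001 passed to DeleteNodes
	)

	// Deleting placeholders may only reclaim capacity AWS never delivered.
	unfulfilled := initialDesired - launchedByAWS // 4
	if unfulfilled > placeholdersKilled {
		unfulfilled = placeholdersKilled
	}

	// Real nodes are terminated with ShouldDecrementDesiredCapacity, so each of
	// them removes one more slot from the desired capacity.
	finalDesired := initialDesired - unfulfilled - realNodesKilled

	if finalDesired != 4 {
		t.Fatalf("expected final desired capacity 4, got %d", finalDesired)
	}
	if remaining := launchedByAWS - realNodesKilled; remaining != 4 {
		t.Fatalf("expected 4 running instances (i-0002..i-0005), got %d", remaining)
	}
}
```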

@kmsarabu

/test all

@k8s-ci-robot (Contributor)

@kmsarabu: No jobs can be run with /test all.
The following commands are available to trigger optional jobs:

  • /test pull-cluster-autoscaler-e2e-azure
  • /test pull-kubernetes-e2e-autoscaling-vpa-full

In response to this:

/test all

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@drmorr0 (Contributor) commented Jun 24, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 24, 2024
@drmorr0 (Contributor) commented Jun 24, 2024

/assign @gjtempleton

@drmorr0 (Contributor) commented Jun 24, 2024

I think this addresses my concerns and is a better state than what we have now, but I'd still like a second set of eyes on it before we merge.

@gjtempleton (Member) left a comment

One small nit, but I'm generally happy enough: the trade-off mentioned by @drmorr0 is only potentially going to result in instances hanging around for one extra iteration, so only a small increase in cost.

Happy to merge once the log line's been updated.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2024
@gjtempleton (Member)

Thanks!
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2024
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gjtempleton, ravisinha0506

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2024
@k8s-ci-robot k8s-ci-robot merged commit 92258fb into kubernetes:master Jun 26, 2024
6 checks passed
kmsarabu added a commit to kmsarabu/autoscaler that referenced this pull request Jul 8, 2024
kmsarabu added a commit to kmsarabu/autoscaler that referenced this pull request Jul 8, 2024
Labels
  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/cluster-autoscaler
  • area/provider/aws - Issues or PRs related to aws provider.
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • kind/bug - Categorizes issue or PR as related to a bug.
  • lgtm - "Looks good to me", indicates that a PR is ready to be merged.
  • ok-to-test - Indicates a non-member PR verified by an org member that is safe to test.
  • size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[AWS] Unsafe decommissioning of nodes when ASGs are out of instances
7 participants