
feat(node/controller): allow to set updateStrategy #740

Merged

Conversation

@lefterisALEX (Contributor) commented Jul 15, 2022

Is this a bug fix or adding new feature?
(New Feature)
This PR allows defining updateStrategy for the controller deployment and the node daemonset.

What is this PR about? / Why do we need it?
With the default update strategy, it takes too long to update all the pods of the daemonset if you have many worker nodes. This results in a Helm timeout (after 300 seconds).

We would like the option to tune the updateStrategy if needed, either by selecting a different updateStrategy type or by tuning the default rollingUpdate parameters.

PS: I see there is already an open PR about this, but since it is a bit old, I am not sure if someone is still working on it. If that one can be merged, I can close mine.
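For context, a values override of the kind this PR enables might look like the following. This is an illustrative sketch only; the key names follow the chart's values.yaml as changed in this PR, and the specific numbers are examples, not recommendations.

```yaml
# Illustrative Helm values override for the aws-efs-csi-driver chart.
controller:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # example tuning; not a recommended default

node:
  updateStrategy:
    # OnDelete replaces pods only when they are deleted (e.g. when a node is
    # recycled), so an upgrade does not have to roll every worker node at once.
    type: OnDelete
```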

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 15, 2022
@k8s-ci-robot (Contributor)

Welcome @lefterisALEX!

It looks like this is your first PR to kubernetes-sigs/aws-efs-csi-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/aws-efs-csi-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

Hi @lefterisALEX. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 15, 2022
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jul 15, 2022
@lefterisALEX (Author)

/assign @jqmichael

@pierluigilenoci (Contributor)

@jqmichael @wongma7 any news about this?

@pierluigilenoci (Contributor)

@jonathanrainer please take a look

@pierluigilenoci (Contributor)

@Ashley-wenyizha @RomanBednar can you please take a look?

@lefterisALEX (Author)

@wongma7 @jqmichael any news about it?

@pierluigilenoci (Contributor)

@Ashley-wenyizha @RomanBednar @jonathanrainer @makomatic could you please take a look?

@@ -66,6 +66,11 @@ controller:
# cpu: 100m
# memory: 128Mi
nodeSelector: {}
updateStrategy: {}
# type: RollingUpdate
@RomanBednar (Contributor):

Why would we add a comment with a strategy that is the default in Kubernetes? We should probably add a comment with a good explanation, something like: "override default strategy (RollingUpdate) to speed up large deployments", and below it add a comment/example of the strategy that gives better results at scale. Same for the node strategy below.

@lefterisALEX (Author):

@RomanBednar I added the comment you suggested. I'm not sure what to add as an example for large-scale deployments, since that might depend on how many worker nodes you have and also on the node type (spot or on-demand).

@@ -110,6 +115,11 @@ node:
# cpu: 100m
# memory: 128Mi
nodeSelector: {}
updateStrategy:
type: RollingUpdate
@RomanBednar (Contributor):

We should not change the default strategy for the node daemonset, only add the possibility to change it, since that is exactly what this PR says. So this should be updateStrategy: {} with the rest commented out.

@lefterisALEX (Author):

done

@@ -66,6 +66,11 @@ controller:
# cpu: 100m
# memory: 128Mi
nodeSelector: {}
updateStrategy: {}
# type: RollingUpdate
# rollingUpdate:
@RomanBednar (Contributor) Sep 6, 2022:

@lefterisALEX What strategy did you use to speed up your deployments? Did you test both and compare the results? Could you maybe share it in the PR description, please?

@lefterisALEX (Author):

I haven't done any tests comparing speed between different strategies.
With the default strategy the rollout takes too long and Helm times out after 300 seconds.
We would like to use OnDelete; we are using only spot instances and nodes are replaced frequently.
(I will update the description as well.)
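A sketch of the node values described here, assuming the keys added by this PR:

```yaml
node:
  updateStrategy:
    # With spot instances, nodes are replaced often, so OnDelete lets the
    # daemonset pods update as nodes churn instead of forcing a full rollout
    # that can exceed Helm's default 300s timeout.
    type: OnDelete
```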

@RomanBednar (Contributor):

Now I understand it better: you tested the DaemonSet OnDelete strategy only to get past the Helm timeout. And the problem does not necessarily come from having a huge number of nodes, but could be caused by the instance type of the worker nodes, as you observed.

Then I think leaving the controller deployment without any comment is OK, because at this point we don't know what alternative to suggest (and the controller pods were not causing the issue anyway); instead it should go under the node values - see my suggestions.

@lefterisALEX (Author):

OK, then I removed the comments on those lines and left only updateStrategy: {}.

@pierluigilenoci (Contributor)

@lefterisALEX could you please integrate/review @RomanBednar's suggestions for the PR? And maybe rebase your branch.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 6, 2022
@pierluigilenoci (Contributor)

@RomanBednar your reviews have been integrated, can you take a look again?

@@ -66,6 +66,8 @@ controller:
# cpu: 100m
# memory: 128Mi
nodeSelector: {}
updateStrategy: {}
# override default strategy (RollingUpdate) to speed up large deployments"
@RomanBednar (Contributor):

Suggested change (remove the line):
- # override default strategy (RollingUpdate) to speed up large deployments"

@@ -110,6 +112,7 @@ node:
# cpu: 100m
# memory: 128Mi
nodeSelector: {}
updateStrategy: {}
@RomanBednar (Contributor):

Suggested change:
- updateStrategy: {}
+ updateStrategy: {}
+ # Override default strategy (RollingUpdate) to speed up deployment.
+ # This can be useful if helm timeouts are observed.
+ # type: OnDelete

@lefterisALEX (Author):

updated

@RomanBednar (Contributor):

here lgtm


@lefterisALEX (Author)

@RomanBednar can you please take a look again to the latest commit?

@RomanBednar (Contributor)

/lgtm

@wongma7 Can you please review and approve?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 7, 2022
@RomanBednar (Contributor)

> @RomanBednar can you please take a look again to the latest commit?

Thank you for the patch, I added lgtm label but we still need someone who can approve/merge changes.

@dawid-remitly
> Not sure why order is backwards

I believe it's because it's sorted lexically.
Is there anything else this PR is waiting on? I miss this feature so much... and it seems to be a simple change...

@lefterisALEX (Author)

I'm also not sure why the order is not kept, but since the indentation does not change, is that a blocking issue?

@pierluigilenoci (Contributor)

@mskanth972 @Ashley-wenyizha @markapruett could someone of you help this PR to reach the finish line?

@@ -11,6 +11,10 @@ spec:
app: efs-csi-node
app.kubernetes.io/name: {{ include "aws-efs-csi-driver.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- with .Values.node.updateStrategy }}
strategy:
@617m4rc:

This is incorrect. The field name for DaemonSets is updateStrategy, not strategy

@lefterisALEX (Author):

Good catch. Indeed, strategy is for Deployments; for a DaemonSet it should be updateStrategy. Will fix it soon.
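The fix being discussed, sketched against the template hunk shown above. This uses Helm's common with/toYaml/nindent idiom; the exact indentation in the chart may differ.

```yaml
# templates/node-daemonset.yaml (sketch)
spec:
  {{- with .Values.node.updateStrategy }}
  updateStrategy:
    {{- toYaml . | nindent 4 }}
  {{- end }}
```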

@lefterisALEX (Author):

@617m4rc fixed, can you please review and resolve if you agree?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2023
@lefterisALEX (Author)

Pushed a new commit to fix the issue @617m4rc spotted. I rebased as well.

#740 (comment)

@lefterisALEX lefterisALEX requested review from RomanBednar and 617m4rc and removed request for wongma7, jqmichael, RomanBednar and 617m4rc February 22, 2023 14:02
@lefterisALEX (Author)

/assign @jqmichael

@lefterisALEX (Author)

/assign @wongma7

@lefterisALEX (Author)

/assign @mskanth972

@mskanth972 (Contributor)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 22, 2023
@RyanStan (Contributor)

/retest

@mskanth972 (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 23, 2023
@mskanth972 (Contributor)

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lefterisALEX, mskanth972, pierluigilenoci

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 23, 2023
@k8s-ci-robot k8s-ci-robot merged commit 0d5059d into kubernetes-sigs:master Feb 23, 2023