
feat(node/controller): allow to set updateStrategy #740

Merged

Conversation

@lefterisALEX (Contributor) commented Jul 15, 2022

Is this a bug fix or adding new feature?
(New Feature)
This PR allows defining updateStrategy for the controller deployment and the node daemonset.

What is this PR about? / Why do we need it?
With the default update strategy, it takes too long to update all the pods of the daemonset if you have many worker nodes. This results in a Helm timeout (after 300 seconds).

We would like the option to tune the updateStrategy if needed, either by selecting a different updateStrategy type or by tuning the default rollingUpdate parameters.

PS: I see there is already an open PR about this, but since it is a bit old, I am not sure if someone is still working on it. If that one can be merged, I can close mine.
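For context, a values override of the kind this PR enables might look like the following. This is an illustrative sketch only; the key names follow the chart's values.yaml as changed in this PR, and the specific numbers are examples, not recommendations.

```yaml
# Illustrative Helm values override for the aws-efs-csi-driver chart.
controller:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # example tuning; not a recommended default

node:
  updateStrategy:
    # OnDelete replaces pods only when they are deleted (e.g. when a node is
    # recycled), so an upgrade does not have to roll every worker node at once.
    type: OnDelete
```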

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 15, 2022
@k8s-ci-robot (Contributor)

Welcome @lefterisALEX!

It looks like this is your first PR to kubernetes-sigs/aws-efs-csi-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/aws-efs-csi-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

Hi @lefterisALEX. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 15, 2022
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jul 15, 2022
@lefterisALEX (Author)

/assign @jqmichael

@pierluigilenoci (Contributor)

@jqmichael @wongma7 any news about this?

@pierluigilenoci (Contributor)

@jonathanrainer please take a look

@pierluigilenoci (Contributor)

@Ashley-wenyizha @RomanBednar can you please take a look?

@lefterisALEX (Author)

@wongma7 @jqmichael any news about it?

@pierluigilenoci (Contributor)

@Ashley-wenyizha @RomanBednar @jonathanrainer @makomatic could you please take a look?

@@ -66,6 +66,11 @@ controller:
# cpu: 100m
# memory: 128Mi
nodeSelector: {}
updateStrategy: {}
# type: RollingUpdate
@RomanBednar (Contributor):

Why would we add a comment with a strategy that is the default in Kubernetes? We should probably add a comment with a good explanation, something like: "override default strategy (RollingUpdate) to speed up large deployments", and below it add a comment/example of the strategy that gives better results at scale. Same for the node strategy below.

@lefterisALEX (Author):

@RomanBednar I added the comment you suggested. I'm not sure what to add as an example for large-scale deployments, since that might depend on how many worker nodes you have and also on the node type (spot or on-demand).

@@ -110,6 +115,11 @@ node:
# cpu: 100m
# memory: 128Mi
nodeSelector: {}
updateStrategy:
type: RollingUpdate
@RomanBednar (Contributor):

We should not change the default strategy for the node daemonset, only add the possibility to change it, since that is exactly what this PR says. So this should be updateStrategy: {} with the rest commented out.

@lefterisALEX (Author):

done

@@ -66,6 +66,11 @@ controller:
# cpu: 100m
# memory: 128Mi
nodeSelector: {}
updateStrategy: {}
# type: RollingUpdate
# rollingUpdate:
@RomanBednar (Contributor) Sep 6, 2022:

@lefterisALEX What strategy did you use to speed up your deployments? Did you test both and compare the results? Could you maybe share it in the PR description, please?

@lefterisALEX (Author):

I haven't done any tests comparing speed between different strategies.
With the default strategy the rollout takes too long and Helm times out after 300 seconds.
We would like to use OnDelete; we are using only spot instances and nodes are replaced frequently.
(I will update the description as well.)
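A sketch of the node values described here, assuming the keys added by this PR:

```yaml
node:
  updateStrategy:
    # With spot instances, nodes are replaced often, so OnDelete lets the
    # daemonset pods update as nodes churn instead of forcing a full rollout
    # that can exceed Helm's default 300s timeout.
    type: OnDelete
```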

@RomanBednar (Contributor):

Now I understand it better: you tested the DaemonSet OnDelete strategy only to get past the Helm timeout. And the problem does not necessarily come from having a huge number of nodes, but could be caused by the instance type of the worker nodes, as you observed.

Then I think leaving the controller deployment without any comment is OK, because at this point we don't know what alternative to suggest (and the controller pods were not causing the issue anyway); instead it should go under the node values - see my suggestions.

@lefterisALEX (Author):

OK, then I removed the comments on those lines and left only updateStrategy: {}.

@pierluigilenoci (Contributor)

@lefterisALEX could you please integrate/review @RomanBednar's suggestions for the PR? And maybe rebase your branch.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 6, 2022
@pierluigilenoci (Contributor)

@RomanBednar your reviews have been integrated, can you take a look again?

@@ -66,6 +66,8 @@ controller:
# cpu: 100m
# memory: 128Mi
nodeSelector: {}
updateStrategy: {}
# override default strategy (RollingUpdate) to speed up large deployments"
@RomanBednar (Contributor):

Suggested change (remove the line):
- # override default strategy (RollingUpdate) to speed up large deployments"

@@ -110,6 +112,7 @@ node:
# cpu: 100m
# memory: 128Mi
nodeSelector: {}
updateStrategy: {}
@RomanBednar (Contributor):

Suggested change:
- updateStrategy: {}
+ updateStrategy: {}
+ # Override default strategy (RollingUpdate) to speed up deployment.
+ # This can be useful if helm timeouts are observed.
+ # type: OnDelete

@lefterisALEX (Author):

updated

@RomanBednar (Contributor):

here lgtm


@lefterisALEX (Author)

@RomanBednar can you please take a look again to the latest commit?

@RomanBednar (Contributor)

/lgtm

@wongma7 Can you please review and approve?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 7, 2022
@RomanBednar (Contributor)

> @RomanBednar can you please take a look again to the latest commit?

Thank you for the patch, I added lgtm label but we still need someone who can approve/merge changes.

@dawid-remitly
> Not sure why order is backwards

I believe it's because it's sorted lexically.
Is there anything else this PR is waiting on? I miss this feature so much... and it seems to be a simple change...

@lefterisALEX (Author)

I'm also not sure why the order is not kept, but since the indentation does not change, is that a blocking issue?

@pierluigilenoci (Contributor)

@mskanth972 @Ashley-wenyizha @markapruett could someone of you help this PR to reach the finish line?

@@ -11,6 +11,10 @@ spec:
app: efs-csi-node
app.kubernetes.io/name: {{ include "aws-efs-csi-driver.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- with .Values.node.updateStrategy }}
strategy:
@617m4rc:

This is incorrect. The field name for DaemonSets is updateStrategy, not strategy

@lefterisALEX (Author):

Good catch. Indeed, strategy is for Deployments; for a DaemonSet it should be updateStrategy. Will fix it soon.
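The fix being discussed, sketched against the template hunk shown above. This uses Helm's common with/toYaml/nindent idiom; the exact indentation in the chart may differ.

```yaml
# templates/node-daemonset.yaml (sketch)
spec:
  {{- with .Values.node.updateStrategy }}
  updateStrategy:
    {{- toYaml . | nindent 4 }}
  {{- end }}
```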

@lefterisALEX (Author):

@617m4rc fixed, can you please review and resolve if you agree?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2023
@lefterisALEX (Author)

Pushed a new commit to fix the issue @617m4rc spotted. I rebased as well.

#740 (comment)

@lefterisALEX lefterisALEX requested review from RomanBednar and 617m4rc and removed request for wongma7, jqmichael, RomanBednar and 617m4rc February 22, 2023 14:02
@lefterisALEX (Author)

/assign @jqmichael

@lefterisALEX (Author)

/assign @wongma7

@lefterisALEX (Author)

/assign @mskanth972

@mskanth972 (Contributor)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 22, 2023
@RyanStan (Contributor)

/retest

@mskanth972 (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 23, 2023
@mskanth972 (Contributor)

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lefterisALEX, mskanth972, pierluigilenoci

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 23, 2023
@k8s-ci-robot k8s-ci-robot merged commit 0d5059d into kubernetes-sigs:master Feb 23, 2023