feat(node/controller): allow to set updateStrategy #740
Conversation
Welcome @lefterisALEX!
Hi @lefterisALEX. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @jqmichael
@jqmichael @wongma7 any news about this?
@jonathanrainer please take a look
@Ashley-wenyizha @RomanBednar can you please take a look?
@wongma7 @jqmichael any news about it?
@Ashley-wenyizha @RomanBednar @jonathanrainer @makomatic could you please take a look?
```diff
@@ -66,6 +66,11 @@ controller:
   # cpu: 100m
   # memory: 128Mi
   nodeSelector: {}
+  updateStrategy: {}
+  # type: RollingUpdate
```
Why would we add a comment with a strategy that is the default in Kubernetes? We should probably add a comment with a good explanation instead. Something like: "override default strategy (RollingUpdate) to speed up large deployments", and below that add a comment/example of a strategy that gives better results at scale. Same for the node strategy below.
@RomanBednar I added the comment you suggested. I'm not sure what to add as an example for large-scale deployments, since that depends on how many worker nodes you have and also on what type of nodes (Spot or On-Demand).
```diff
@@ -110,6 +115,11 @@ node:
   # cpu: 100m
   # memory: 128Mi
   nodeSelector: {}
+  updateStrategy:
+    type: RollingUpdate
```
We should not change the strategy for the node DaemonSet, only add the possibility to change it, since that is exactly what this PR says. So this should be `updateStrategy: {}` and the rest commented out.
done
```diff
@@ -66,6 +66,11 @@ controller:
   # cpu: 100m
   # memory: 128Mi
   nodeSelector: {}
+  updateStrategy: {}
+  # type: RollingUpdate
+  # rollingUpdate:
```
@lefterisALEX What strategy did you use to speed up your deployments? Did you test both and compare the results? Could you maybe share it in the PR description, please?
I haven't done any tests to compare speed between different strategies.
With the default strategy the rollout takes too long and Helm times out after 300 seconds.
We would like to use `OnDelete`. We are using only Spot instances and nodes are replaced frequently.
(I will update the description as well.)
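Based on this discussion, a values override like the following (a sketch using the `node.updateStrategy` key this PR introduces; `OnDelete` is the strategy mentioned above) would stop the rollout from racing the Helm timeout:

```yaml
# values.yaml override (sketch) for clusters where nodes are recycled frequently
node:
  updateStrategy:
    # With OnDelete, the DaemonSet controller only creates a new pod on a node
    # after the old pod there is deleted, e.g. when a Spot node is replaced,
    # so `helm upgrade` does not wait for a full rolling restart.
    type: OnDelete
```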
Now I understand it better: you tested the DaemonSet `OnDelete` strategy only to get past the Helm timeout. And the problem does not necessarily come from having a huge number of nodes; it could be caused by the instance type of the worker nodes, as you observed.
Then I think leaving the controller deployment without any comment is OK, because at this point we don't know what alternative to suggest (and the controller pods were not causing the issue anyway); instead it should go under the node values, see my suggestions.
OK, I removed the comments on those lines and left only `updateStrategy: {}`.
@lefterisALEX could you please integrate/review @RomanBednar's suggestions for the PR? And maybe rebase your branch.
Force-pushed from 2e0cbc8 to aa4ea33.
@RomanBednar your reviews have been integrated, can you take a look again?
```diff
@@ -66,6 +66,8 @@ controller:
   # cpu: 100m
   # memory: 128Mi
   nodeSelector: {}
+  updateStrategy: {}
+  # override default strategy (RollingUpdate) to speed up large deployments"
```
Suggested change (remove this comment from the controller values):

```diff
-  # override default strategy (RollingUpdate) to speed up large deployments"
```
```diff
@@ -110,6 +112,7 @@ node:
   # cpu: 100m
   # memory: 128Mi
   nodeSelector: {}
+  updateStrategy: {}
```
Suggested change:

```diff
-updateStrategy: {}
+updateStrategy:
+  {}
+  # Override default strategy (RollingUpdate) to speed up deployment.
+  # This can be useful if helm timeouts are observed.
+  # type: OnDelete
```
updated
lgtm here
@RomanBednar can you please take another look at the latest commit?
/lgtm @wongma7 Can you please review and approve?
Thank you for the patch, I added the lgtm label but we still need someone who can approve/merge changes.
I believe it's because it's sorted lexically.
Also not sure why the order is not kept, but since the indentation does not change, is that a blocking issue?
@mskanth972 @Ashley-wenyizha @markapruett could one of you help this PR reach the finish line?
```diff
@@ -11,6 +11,10 @@ spec:
       app: efs-csi-node
       app.kubernetes.io/name: {{ include "aws-efs-csi-driver.name" . }}
       app.kubernetes.io/instance: {{ .Release.Name }}
+  {{- with .Values.node.updateStrategy }}
+  strategy:
```
This is incorrect. The field name for DaemonSets is `updateStrategy`, not `strategy`.
Good catch, indeed `strategy` is for Deployments and for a DaemonSet it should be `updateStrategy`. Will fix it soon.
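For context, a minimal sketch of the two differently named fields in the Kubernetes `apps/v1` API (the field values shown here are illustrative, not taken from this chart):

```yaml
# Deployment: the rollout behaviour lives under spec.strategy
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
---
# DaemonSet: the equivalent field is spec.updateStrategy
apiVersion: apps/v1
kind: DaemonSet
spec:
  updateStrategy:
    type: RollingUpdate   # or OnDelete
    rollingUpdate:
      maxUnavailable: 1
```

So a Helm template for the node DaemonSet cannot reuse the `strategy:` key from the controller Deployment template; it has to emit `updateStrategy:` instead.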
@617m4rc fixed, can you please review and resolve if you agree?
Force-pushed from 6da0678 to 4e18c0b.
Pushed a new commit fixing the issue @617m4rc spotted. I rebased as well.
/assign @jqmichael
/assign @wongma7
/assign @mskanth972
/ok-to-test
/retest
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: lefterisALEX, mskanth972, pierluigilenoci. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
Is this a bug fix or adding new feature?
New feature.
This PR allows defining `updateStrategy` for the controller Deployment and the node DaemonSet.
What is this PR about? / Why do we need it?
With the default update strategy it takes too long to update all the pods of the DaemonSet if you have many worker nodes. This results in a Helm timeout (after 300 seconds).
We would like to have the option to tune the `updateStrategy` if needed, either by selecting a different `updateStrategy` type or by tuning the default `RollingUpdate`.
PS: I see there is already an open PR about this, but since it is a bit old I am not sure if someone is still working on it. If that one can be merged I can close mine.
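As a usage sketch, the new keys could be set in a custom values file and passed via Helm's `-f`/`--values` flag (the key names are taken from this PR's values.yaml changes; the file name and comments are illustrative):

```yaml
# my-values.yaml (hypothetical override file for the aws-efs-csi-driver chart)
controller:
  updateStrategy: {}   # empty: keep the Kubernetes default (RollingUpdate)
node:
  updateStrategy:
    type: OnDelete     # avoid the 300s Helm timeout on clusters with many nodes
```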