short-circuiting-backoff #673
Conversation
/wip
Signed-off-by: mshitrit <mshitrit@redhat.com>
Force-pushed from a66a371 to cc01360
added a couple comments, and i have a couple questions.
i see you've added the note about the failed state being paired with the missing noderef and providerid, but i wonder if this feature should be broader in scope? for example, a cluster admin might want to check all mhc remediations, and want to have a blanket timeout for any machine that is going to be remediated. i'm curious if this is something you considered?
the name of this is "short circuiting backoff", but i didn't see any mention of the short-circuit mechanism. i imagine there will be cases where the machines held in the failed state cause the mhc to go above its max unhealthy limit, is this the backoff being addressed? assuming so, i think it should be mentioned more specifically.
Hmm, I could be wrong, but I think what you are referring to is NodeStartUpTime, which already exists.
Good point - Thanks!
looking good to me, thanks for the update. @mshitrit is this still wip or are you ready for labels?
Hi @elmiko |
hey @mshitrit , this generally looks good to me. i added some suggestions that i think help to clarify and a few grammar nits.
Signed-off-by: mshitrit <mshitrit@redhat.com>
Force-pushed from 314eafd to 87495d3
this is looking good to me. we might remove some of the unused sections, but i'm ok as is.
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: elmiko. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Please take a look at the comments I've added. There are some problems I think we need to explore further before we can merge, around defaulting/disabling the new field, and also around what happens to existing users when they upgrade: will this change the behaviour for them, and is that desirable?
If no value for `FailedNodeStartupTimeout` is defined for the MachineHealthCheck CR, the existing remediation flow is preserved.
In the implementation you've set a default, which is incompatible with this statement: you won't be able to remove the value. Having no default would actually be preferable, as that would preserve existing behaviour for users who upgrade.
You are right.
Once we decide on the best way to proceed regarding default/non-default, I'll make the proper adjustments.
What did we decide on this one?
The default was removed in the implementation, so I guess that was the decision
openshift/machine-api-operator@e3d7784
Indeed.
Here is a link to the correspondence.
### Risks and Mitigations

No known risks.
Currently you're setting a default in the implementation, and this would affect existing users. It may be worth discussing the pros/cons of a default here so we know whether to have one or not.
That's a good point, here is my take on that:
non-default (pro) - naive users aren't surprised by a new behavior.
default (pro) - naive users do benefit from the new behavior.
I guess the real question here is whether this new behavior benefits all users or not.
Let me know what you think.
I think for now let's keep no default and maintain the existing behaviour, but can we get these pros/cons fleshed out in the risks/mitigations section within the doc?
Co-authored-by: Michael McCune <msm@opbstudios.com>
New changes are detected. LGTM label has been removed.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting. If this issue is safe to close now please do so.

/lifecycle stale
No real objections to this; I would like to see some extra details fleshed out, and I've highlighted where these are in the document.
Did we ever start a conversation upstream about this? If so, it would be good to link that into this document too.
### Risks and Mitigations

No known risks.
I think for now let's keep no default and maintain the existing behaviour, but can we get these pros/cons fleshed out in the risks/mitigations section within the doc?
Hi @JoelSpeed,
We will have to revert, as the markdownlint check is required. I'll pass this on as feedback to the archs though; it seems weird to enforce these titles for all enhancements.
Force-pushed from ba99f6c to 16d55ac
Force-pushed from fe61274 to 1657c1f
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting. If this issue is safe to close now please do so.

/lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting.

/close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
We still want this. /reopen
@slintes: Reopened this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @JoelSpeed and @elmiko, we want to revive this task. I understand there was already an lgtm on this, but the linter prevented it from being merged. The linter is green today; is anything else left before merging? Thanks in advance 🙂
Did we propose this feature upstream at all?
If no value for `FailedNodeStartupTimeout` is defined for the MachineHealthCheck CR, the existing remediation flow is preserved.
What did we decide on this one?
...
// +optional
FailedNodeStartupTimeout metav1.Duration `json:"failedNodeStartupTimeout,omitempty"`
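For readers without the full diff open, here is a minimal sketch of how the quoted field could sit in the MachineHealthCheck spec. The package name, import, struct name, and the elided surrounding fields are assumptions for illustration and are not taken from this PR.

```go
package v1beta1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MachineHealthCheckSpec is sketched here only to give the quoted field some
// context; the struct name and the elided fields are assumptions.
type MachineHealthCheckSpec struct {
	// ... existing fields (selector, unhealthyConditions, maxUnhealthy, nodeStartupTimeout, ...)

	// FailedNodeStartupTimeout allows remediating a machine that entered the
	// Failed phase before a node (nodeRef/providerID) ever appeared.
	// When left unset, the existing remediation flow is preserved.
	// +optional
	FailedNodeStartupTimeout metav1.Duration `json:"failedNodeStartupTimeout,omitempty"`
}
```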
Coming back to this, is `Startup` really involved here? Won't this `FailedNode` timeout apply to all failed nodes? Should this just be a `FailedNodeTimeout`?
Good question.
From reading this enhancement I'd say the same.
From reading the implementation I'd say `Startup` makes sense, because the timeout is applied to the machine's creation timestamp. Not sure if that implementation is correct though. Maybe the timeout should be applied to the time when the machine reached the failed phase? (Not sure if that information is available...)
@beekhof @mshitrit WDYT?
Per the implementation, `failedNodeStartupTimeout` kicks in for machines whose nodes presumably failed to start, so basically I agree with Marc that the name does make sense, assuming the implementation is correct.
Here is the relevant implementation code.
Andrew pointed to something: we only apply that timeout when there is no nodeRef or providerId, isn't that implicitly the same as "during startup"? 🤔 @JoelSpeed
Ack, yeah, let's make sure that's clear in the proposal, because I haven't reviewed the implementation in a while and it wasn't clear to me that this only affects startup, hence the comment. Ok with it staying as is.
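To make the behaviour agreed on above concrete, here is a hedged sketch of the short-circuit check as described in this thread. The type, function, and field names are illustrative stand-ins, not the actual machine-api-operator code; only the condition itself (Failed phase, no nodeRef, no providerID, timeout measured from the machine's creation timestamp, unset timeout preserving existing behaviour) is taken from the discussion.

```go
package main

import (
	"fmt"
	"time"
)

// machineInfo is a stand-in for the handful of Machine fields the discussion
// above refers to; it is not the real machine-api type.
type machineInfo struct {
	phase             string    // e.g. "Failed"
	hasNodeRef        bool      // a node object was linked to the machine
	hasProviderID     bool      // the provider assigned an instance ID
	creationTimestamp time.Time // when the machine was created
}

// needsFailedStartupRemediation sketches the agreed condition: the timeout
// only applies to machines that entered the Failed phase before a node ever
// registered (no nodeRef and no providerID), measured from the machine's
// creation timestamp. A zero timeout means the field was left unset, so the
// existing remediation flow is preserved.
func needsFailedStartupRemediation(m machineInfo, failedNodeStartupTimeout time.Duration, now time.Time) bool {
	if failedNodeStartupTimeout == 0 {
		return false // no FailedNodeStartupTimeout configured: keep existing behaviour
	}
	if m.phase != "Failed" || m.hasNodeRef || m.hasProviderID {
		return false // only short-circuit machines that failed before startup completed
	}
	return now.Sub(m.creationTimestamp) >= failedNodeStartupTimeout
}

func main() {
	m := machineInfo{phase: "Failed", creationTimestamp: time.Now().Add(-time.Hour)}
	fmt.Println(needsFailedStartupRemediation(m, 30*time.Minute, time.Now())) // prints: true
}
```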
Not as far as I'm aware.
Co-authored-by: Andrew Beekhof <andrew@beekhof.net>
re-read this again today, it's looking mostly good to me but i found one small error in the text.
also, what are we doing about the sections marked "TBD"?
My folks didn't, but this idea came from the cloud team so maybe one of your people did?
I would expect it to be supported when shipped, so nothing is needed for the TBD sections.
Co-authored-by: Michael McCune <msm@opbstudios.com>
Originally we wanted to remove them, but it caused the build to fail, so we decided to keep them as is.
thanks for updating the text @mshitrit, and for answering the TBD question. i'm good with this.
/lgtm
/approve

Would like to make sure we have an approval from the dragonfly team on this as well. We should also pursue pushing the same design upstream (kubernetes-sigs/cluster-api#3106).
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: elmiko, JoelSpeed. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Thanks for the approval and the pointer to the upstream issue, will have a look. /hold cancel
Signed-off-by: mshitrit <mshitrit@redhat.com>