Track stabilization --experimental-wait-cluster-ready-timeout #13785
Comments
This is a new flag that was added recently in 3.6. Normally we should graduate a flag no earlier than the next release (>=3.7). But since this flag is low risk, I am OK with graduating it in 3.6 if there are no objections. cc @serathius @spzala @ptabor
I have bulk-created issues to graduate all experimental flags. For this flag, which was added in v3.6, it should be reasonable to wait for v3.7.
Hi, not sure if this is the correct place to ask, but I noticed one slight issue with this new flag. We don't notify systemd that we are ready to go, which means that etcd ends up in a restart loop in this scenario. I've fixed this in a local install. I'm OK with putting up an MR for this, but wanted to check that this is a desired change in the first place, and also to ask where tests for something like that should be added.
I am not sure I get your point. Do you mean systemd restarted etcd because etcd blocked on serve.go#L105? The PR isn't cherry-picked to 3.5. Did you build etcd on
I tested this with a patch on 3.3.11, because that is the old version I am using, but the code in question does not look to have changed. Specifically, etcd blocks on etcd.go#L208, which means that it never sends the message back to systemd saying it should be considered started, so systemd eventually times it out and restarts it.
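For context, this is the kind of unit configuration where the restart loop shows up. This is only an illustrative sketch (the ExecStart path, config file, and timeout values are made up); with Type=notify, if etcd never sends READY=1, systemd considers the start failed once TimeoutStartSec elapses and restarts the process:

```ini
[Unit]
Description=etcd key-value store

[Service]
Type=notify
# etcd must send READY=1 (sd_notify) before this timeout,
# otherwise systemd marks the start as failed and restarts the unit.
TimeoutStartSec=90
Restart=on-failure
RestartSec=5
ExecStart=/usr/local/bin/etcd --config-file /etc/etcd/etcd.conf.yml

[Install]
WantedBy=multi-user.target
```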
Got it. It is a real issue to me. If other members start too late or too slowly for whatever reason, the running member may be restarted by systemd, which makes the situation even worse. Thanks for raising this. We should add code something like the sketch below. Please feel free to deliver a PR for this, but please add an e2e test case.
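A minimal sketch of the idea (not the actual patch), assuming the embed package and the go-systemd daemon helpers; the hard-coded timeout and logging are placeholders:

```go
// Sketch only: notify systemd even when the cluster does not reach quorum
// in time, so systemd does not time the unit out and restart the member.
package main

import (
	"log"
	"time"

	"github.com/coreos/go-systemd/v22/daemon"
	"go.etcd.io/etcd/server/v3/embed"
)

func main() {
	cfg := embed.NewConfig()
	cfg.Dir = "default.etcd"

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer e.Close()

	notifyReady := func() {
		// SdNotify is a no-op when not running under systemd (NOTIFY_SOCKET unset).
		if _, err := daemon.SdNotify(false, daemon.SdNotifyReady); err != nil {
			log.Printf("failed to notify systemd: %v", err)
		}
	}

	select {
	case <-e.Server.ReadyNotify():
		// Cluster reached quorum; report readiness as usual.
		notifyReady()
	case <-time.After(60 * time.Second): // placeholder for the wait-cluster-ready timeout
		// Quorum not reached yet. Still send READY=1 so systemd does not
		// kill and restart a member that is merely waiting for its peers.
		notifyReady()
		log.Println("server is still waiting for cluster readiness")
	}

	<-e.Err()
}
```

In the real server the timeout would come from the --experimental-wait-cluster-ready-timeout flag rather than a hard-coded value, and the notification would happen inside the existing startup path.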
When we can't reach quorum, we wait forever and never send the systemd notify message. As a result, systemd would eventually time out and restart the etcd process, likely leaving the unhealthy cluster in an even worse state. Improves etcd-io#13785. Signed-off-by: Nicolai Moore <niconorsk@gmail.com>
I think that MR is good to go but needs someone to allow CI to run.
This is a tracking issue for the graduation of this feature. Let's work towards stabilization of the feature by discussing what steps should be taken for graduation.
Context #13775