
Backport pull/13525 to 3.5 #14048

Closed
ahrtr opened this issue May 16, 2022 · 8 comments

Comments

@ahrtr
Member

ahrtr commented May 16, 2022

Previously we backported pull/13399 to 3.5, but we did not backport pull/13525. We probably need to backport pull/13525 to 3.5 as well.

The key point is that etcd can't finish the bootstrap/startup process if the quorum isn't satisfied, so it can't serve any client requests, not even serializable ones. This is exactly what pull/13525 fixed.

Once etcd has finished the bootstrap/startup process, however, it can continue to serve serializable requests even if the quorum isn't satisfied.

Normally a new feature or enhancement should only go into the main branch, and only bug fixes can be backported to previously released stable versions. But we have already backported pull/13399 to 3.5 from a pragmatic perspective, so personally I'd suggest backporting pull/13525 to 3.5 as well; otherwise etcd still can't serve serializable requests if the quorum isn't satisfied on startup.

cc @ptabor @serathius @spzala for opinions. cc @neolit123

FYI, kubeadm is running into issues without this backport.

@ptabor
Contributor

ptabor commented May 23, 2022

This is a significant behavioural change to backport into v3.5 IMHO, especially if the change would be enabled by default.

I think many services treat the moment the server starts accepting connections as the signal that the deployment is up and running. Even in etcd's own test framework (framework/e2e/etcd_process.go) we wait for the log line "ready to serve client requests" to assess whether the server booted up successfully, and so far that line has implied the quorum was validated.
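The log-line readiness check described above can be sketched as a small Go helper. This is an illustration, not the actual framework/e2e/etcd_process.go code; the function name is made up, and scanning a string stands in for tailing the real process output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// readyToServe reports whether the captured server log contains the
// line etcd prints once it starts accepting client requests.
func readyToServe(logs string) bool {
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		if strings.Contains(sc.Text(), "ready to serve client requests") {
			return true
		}
	}
	return false
}

func main() {
	logs := "starting etcd\nestablished quorum\nready to serve client requests\n"
	fmt.Println(readyToServe(logs)) // prints "true"
}
```

The point of the thread is exactly which guarantees this line carries: whether "ready to serve client requests" should continue to imply the quorum was reached.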

@ahrtr
Member Author

ahrtr commented May 23, 2022

Thanks @ptabor for the feedback. To be clear, there are two cases, and we don't support the first one:

  1. Only one member (in a cluster with 3 or more members) is started; that member can't serve serializable requests such as /health?serializable=true or etcdctl get k1 --consistency="s", because it hasn't started accepting client connections yet.
  2. Only one member is left running (all other members are down for whatever reason); that member can still serve serializable requests, because it has already finished the startup process.
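A minimal sketch of the quorum arithmetic behind these cases: quorum is a strict majority, so one live member out of three cannot satisfy it.

```go
package main

import "fmt"

// quorum returns the minimum number of live members a cluster of n
// members needs to make progress: a strict majority.
func quorum(n int) int { return n/2 + 1 }

func main() {
	fmt.Println(quorum(3)) // prints "2": a single surviving member is below quorum
	fmt.Println(quorum(5)) // prints "3"
}
```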

Currently etcd supports both cases on the main branch, but only case 2 in 3.5.3+. Whether we backport pull/13525 depends on whether we should support case 1 in 3.5.

It is indeed a behavior change and a minor feature, so from that perspective we shouldn't backport it. But it doesn't make sense that a member can't even serve serializable requests because the quorum isn't satisfied, so it also looks like a bug to me; from that perspective we should backport it.

Of course, it's open to discussion. Before we reject or approve it, I'd like to get more feedback from others. cc @serathius @spzala @neolit123

@neolit123

Looking at the diff for #13525, it seems to be a bug fix and a feature change at the same time. If only the bug fix part can be extracted for a backport, the backport would comply with the "no feature backport" rules. E.g. the config change could be omitted and the behaviour applied by default with some hardcoded timeout.

That of course depends on whether the behavior is desired by default and how disruptive it would be for users who are opted in without consent. If it's not considered disruptive and is desired by all users, then perhaps there is a middle ground for backporting the bug fix part.

@ishan16696
Contributor

Hi @ahrtr,
what about using etcdctl get k1 --consistency="s"? Doesn't --consistency="s" kind of mimic /health?serializable=true?

@ahrtr
Member Author

ahrtr commented Jun 1, 2022

Hi @ahrtr, what about using etcdctl get k1 --consistency="s"? Doesn't --consistency="s" kind of mimic /health?serializable=true?

Yes, both --consistency="s" and /health?serializable=true issue serializable requests. But --consistency="s" has the same issue as /health?serializable=true in 3.5 without backporting pull/13525; that issue is exactly what this ticket is discussing.

@ahrtr
Member Author

ahrtr commented Jun 17, 2022

Actually I don't insist on cherry-picking the PR to 3.5; I am OK with any approach here.

No matter whether we cherry-pick the PR or not, the proper way to configure probes for a Pod is something like the following:

  1. Configure a startup probe using the linearizable health endpoint /health (or /health?serializable=false), so that the liveness probe is disabled until the etcd cluster is healthy and working. This covers the startup phase of the etcd cluster. We need to ensure consistent behavior in 3.5, because it is a stable release.
  2. Configure a liveness probe using the serializable health endpoint /health?serializable=true, so that K8s only restarts a Pod when the local etcd member isn't healthy.
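A sketch of what those two probes could look like in a Pod spec. Only the endpoint paths come from the comment above; the port and timing values here are illustrative assumptions, not from this thread.

```yaml
# Illustrative fragment; port 2381 and the timing values are assumptions.
startupProbe:
  httpGet:
    path: /health?serializable=false   # linearizable: requires quorum
    port: 2381
  periodSeconds: 10
  failureThreshold: 24
livenessProbe:
  httpGet:
    path: /health?serializable=true    # serializable: local member only
    port: 2381
  periodSeconds: 10
  failureThreshold: 8
```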

@ptabor @serathius @spzala Could you please vote on which approach (see below) you prefer?

  1. Cherry pick pull/13525 to 3.5;
  2. Do not cherry pick pull/13525 to 3.5;

@neolit123 Is it OK for you to update the test case or setup/configuration in Kubeadm if we decide not to cherry pick this PR?

@neolit123

@neolit123 Is it OK for you to update the test case or setup/configuration in Kubeadm if we decide not to cherry pick this PR?

I can update kubeadm if this is not cherry-picked, yes. Currently it uses the serializable check for both the startup and liveness probes.

@ahrtr
Member Author

ahrtr commented Jun 21, 2022

There is no interest in cherry-picking the PR to 3.5, so closing this ticket.
