ETCD-236: etcd: scaling etcd with raft learners #920
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
Looking good! 🎉
- e2e with
1. add dangling etcd learner member which has not been started and not promoted.
Curious how will we prevent a learner from being promoted? Otherwise sounds great!
One idea is to use iptables to block peer traffic on TCP port 2380 so the leader can't update the learner.
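The iptables idea above might be sketched as follows for the e2e test; the exact chain, match options, and cleanup strategy are assumptions, not a finished test harness (these commands require root on the learner's node):

```shell
# Drop inbound peer traffic on the etcd peer port so the leader cannot
# replicate to the learner, keeping it behind and therefore not promotable.
iptables -A INPUT -p tcp --dport 2380 -j DROP

# ... run the e2e assertions against the dangling learner here ...

# Remove the rule afterwards so the learner can catch up again.
iptables -D INPUT -p tcp --dport 2380 -j DROP
```

Blocking only the peer port leaves the client port (2379) reachable, so the test can still query the member's status while it is isolated.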
/uncc
We will add the following critical alerts:

- alert about a learner member which has not been promoted or started for > 30s
- alert if the number of etcd members is not equal to the number of quorum-guard pods for > 30s
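The alerts above could be expressed as Prometheus rules roughly like this. `etcd_server_is_learner` and `etcd_server_has_leader` are real etcd gauges and `kube_deployment_status_replicas_available` comes from kube-state-metrics, but the rule names, expressions, deployment name, and thresholds here are sketches, not the final rules:

```yaml
groups:
- name: etcd-membership-scaling
  rules:
  # Sketch: a member has stayed a learner for more than 30s.
  - alert: EtcdLearnerNotPromoted
    expr: max by (pod) (etcd_server_is_learner) == 1
    for: 30s
    labels:
      severity: critical
  # Sketch: member count disagrees with quorum-guard replicas for > 30s.
  - alert: EtcdMemberQuorumGuardMismatch
    expr: |
      count(etcd_server_has_leader)
        != kube_deployment_status_replicas_available{deployment="etcd-quorum-guard"}
    for: 30s
    labels:
      severity: critical
```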
Is there a defined term (perhaps we can create one) for talking about all members vs only those which are voting members?
Does etcd quorum guard need to cover the learner? Or does that only need to cover the member once it is promotable?
Is there a defined term (perhaps we can create one) for talking about all members vs only those which are voting members?
members of quorum vs non voting?
Does etcd quorum guard need to cover the learner? Or does that only need to cover the member once it is promotable?
The idea is to ensure all voting members are sane and that the expected number of voting members is present. So I believe in this case PDB should be degraded while the learner scales up. Because if quorum-guard was scheduled the expectation is for etcd to scale up as well. A failure to scale would be directly observed in a failure of PDB tolerations. I.e. "don't drain a node if the learner has not yet been promoted" feels sane.
Cool, that makes sense, might be worth adding that context into the enhancement as it wasn't obvious to me why learners still need quorum guard
This still might need more details but I did add more context. Will leave this open for now.
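For context, the quorum-guard PDB discussed here is a standard PodDisruptionBudget. A minimal sketch for a 3-member cluster; the name, namespace, selector, and value are assumptions, not the shipped manifest:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-quorum-guard
  namespace: openshift-etcd
spec:
  # At most one guard pod (and so one control-plane node) may be disrupted
  # at a time; drains beyond that are blocked until etcd scales back up.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: etcd-quorum-guard
```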
- **Split brain**: The cluster gates starting of the etcd process on a verification process based on the cluster member id. This ensures that each revision has an explicit expected membership. Because quorum guard replicas are managed by the operator the cluster topology will remain in a safe configuration (odd number of members).
Does `odd number of members` hold true? There was a comment above about ensuring that num etcd replicas == num quorum guard, but this may not always be odd when there's a learner "learning"
Since we surge up to 4 there will be a window where we have an even number of members for a short duration. I will call that out and the reasons for this.
Still outstanding.
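The membership gating described above can be sketched as a simple pre-start check; the function and field names below are hypothetical illustrations, not the operator's actual code:

```python
# Hypothetical sketch of gating etcd start on revision-scoped membership:
# a static pod only starts the etcd process if its local member ID appears
# in the expected membership recorded for the current revision.
def may_start_etcd(local_member_id, expected_member_ids):
    """Refuse to start members that are not part of the revision's membership."""
    return local_member_id in expected_member_ids

expected = {"13c8677970c567e2", "4f98c3545405a0b1"}  # example member IDs
print(may_start_etcd("13c8677970c567e2", expected))  # -> True
print(may_start_etcd("deadbeef00000000", expected))  # -> False
```

A stale or unknown member refusing to start is what prevents two disjoint sets of members from each believing they form the cluster.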
### Goals

- provide safe scale up of etcd cluster using raft learners.
Vertical scaling is a capability we need for managed platforms. Is this scale up in support of changing instance types for the underlying master? And can be followed by a graceful and clean scale down of older masters?
Yes, learner support allows etcd to safely scale up new members. It does this by:
- Allowing a new etcd member to join the cluster as a non-voting member without impacting quorum.
- The new member cannot be promoted to a voting member until its log is in sync with the leader. This reduces exposure to data loss.
Both of those steps are critical for vertical scaling, protecting quorum and protecting against data loss. The scale-down process will be a separate enhancement but this enhancement is a dependency.
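The two steps above map onto `etcdctl` (v3.4+) roughly as follows; the member name, peer URL, and member ID are placeholders:

```shell
# Join the new member as a non-voting learner; it receives the log but
# does not count toward quorum.
etcdctl member add etcd-new --peer-urls=https://10.0.0.4:2380 --learner

# Later, promote the learner to a voting member. etcd itself rejects the
# promotion if the learner's log is not yet in sync with the leader.
etcdctl member promote 8e9e05c52164694d
```

The built-in promotion check is what makes the flow safe: a premature promote fails rather than seating a lagging voter.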
@hexfusion: The following test failed:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Proposed format of endpoints `<etcd member ID>: InternalIP`
```yaml
data:
  13c8677970c567e2: 10.0.185.161,
```
Are the commas intentional here? It's not clear from L115
no that is a typo thanks
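A sketch (not the operator's actual code) of how a controller might turn the proposed ConfigMap data, `<etcd member ID>: InternalIP` with no trailing comma, into client endpoint URLs; the function name and default port are illustrative:

```python
# Build client endpoint URLs from the proposed etcd-endpoints ConfigMap data.
def endpoints_from_configmap(data, port=2379):
    """Map member-id -> InternalIP entries to client URLs, sorted by member id
    so that rendered revisions are deterministic."""
    return ["https://%s:%d" % (ip, port) for _, ip in sorted(data.items())]

print(endpoints_from_configmap({"13c8677970c567e2": "10.0.185.161"}))
# -> ['https://10.0.185.161:2379']
```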
1. As a cluster-admin I want to scale up etcd without fear of quorum loss.

2. As a cluster-admin I want to be able to replace failed control-place nodes without a deep understanding of etcd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested change:
2. As a cluster-admin I want to be able to replace failed control-place nodes without a deep understanding of etcd.
2. As a cluster-admin I want to be able to replace failed control-plane nodes without a deep understanding of etcd.
We add new metric figures to the etcd dashboard in the console:

1. include membership status over time `is_leader`, `is_learner` and `has_leader`.
What is `is_leader` vs `has_leader`? Does `has_leader` mean that it's a follower (voting member)? Would `is_follower` be better?
These are currently metrics exposed by etcd. I will add a reference link to the doc.
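For reference, the figures correspond to gauges that etcd already exposes; the glosses in the comments are my reading of their semantics, not quoted documentation:

```
etcd_server_is_leader    # 1 on the member that is currently the leader
etcd_server_is_learner   # 1 on non-voting (learner) members
etcd_server_has_leader   # 1 when the member currently sees a leader
```

So `has_leader` is about cluster health as seen from any member (leader or follower), not about role.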
via `MemberList`. This allows for only pending or existing etcd members to be reported as endpoints. While the endpoint at the time of publishing could be a learner or unstarted the client balancer will ensure service.

A small change to the format of the key is also proposed to provide the cluster with more details on the member
Is this a change that can be made transparently without breaking anything during upgrades? This is an internal API right?
Correct, the key itself is not used by OpenShift today.
6. Scale CVO and `cluster-etcd-operator` back up.
7. Force a new revision.

### Test 1: Single Learner Scale UP 5GB State
5GB feels like a lot of data, do we have a rough estimate of the average OCP etcd data size? Trying to work out if this is a mean case or worst case kind of estimate
I am going to dig that up through telemetry but the goal was to show a larger production cluster use case. My guess is that most clusters in the fleet are 3GB or less.
SGTM so far.
Just have a question on how the desired membership is determined.
`cluster-etcd-operator`.

To populate this value the controller will read etcd-endpoints ConfigMap. This aligns scaling across the controllers.
new revisions of the static-pod controller will also use this configmap as the source of truth for scaling.
Nit, suggested change:
new revisions of the static-pod controller will also use this configmap as the source of truth for scaling.
New revisions of the static-pod controller will also use this configmap as the source of truth for scaling.
the desired cluster membership is known as soon as possible. For example if we observe 6 Nodes, three of which are part of the existing quorum intent becomes clear that we will add the new Nodes to the cluster while removing the old.
Suggested change:
the desired cluster membership is known as soon as possible. For example if we observe 6 Nodes, three of which are part of the existing quorum intent becomes clear that we will add the new Nodes to the cluster while removing the old.
the desired cluster membership is known as soon as possible. For example if we observe 6 Nodes, three of which are part of the existing quorum, the intent becomes clear that we will add the new Nodes to the cluster while removing the old.
I'm probably missing something but just to understand better, who is setting the desired cluster membership in this scenario? Or where is that being set?
I'm trying to understand how we know whether to vertically or horizontally scale when the cluster member controller sees new nodes. Right now the clustermember controller will just add static pods from all new nodes as members, but with this design, are we going to infer from the node count of 6 that the desired cluster membership comprises the pods on the 3 new nodes?
Great question. Today vertical scaling is not supported as the desired state of the control-plane is considered to be immutable. So in the context of the scale-down feature, there is an assumption that new nodes will replace old nodes and we will reconcile to the desired controlPlane replica count defined in the install-config.
So for a 3 node cluster, if you added 3 new nodes etcd will eventually migrate membership to the new nodes. But something missing from this doc is what happens if you add 2 nodes and then later add another; this is a great test scenario.
I believe we will need to use a FIFO approach with scale down and replace the oldest node first. I will make sure to describe this more completely in that enhancement. For this enhancement, I think it's out of scope.
Note: if in the future horizontal scaling is supported we will need an API to describe the desired control plane replicas. In this design we would use that state as the desired cluster size.
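The FIFO idea mentioned above can be sketched as a selection function: always replace the oldest voting member first, and never consider learners as removal candidates. The field names here are hypothetical, not an existing API:

```python
# Sketch of FIFO scale-down candidate selection for control-plane replacement.
def next_member_to_remove(members):
    """Pick the longest-standing voting member; learners are never candidates."""
    voting = [m for m in members if not m["is_learner"]]
    return min(voting, key=lambda m: m["created"])["name"]

members = [
    {"name": "master-0", "created": 100, "is_learner": False},
    {"name": "master-1", "created": 200, "is_learner": False},
    {"name": "master-3", "created": 900, "is_learner": True},  # still syncing
]
print(next_member_to_remove(members))  # -> master-0
```

Ordering by creation time also answers the "add 2 nodes, then 1 more" scenario: each new voter displaces exactly the oldest remaining original member.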
FTR the install-config we read from is embedded into the `kube-system/cluster-config-v1` configmap.
Thanks for clarifying.
And I guess this is more of a question for the scaling down proposal but this makes me think we're keeping a history of nodes that were previously part of the cluster (not sure if etcd already keeps track of membership history). Otherwise we could end up adding a previously removed node back into the cluster if it becomes visible to the clustermember controller.
### Quorum Guard Controller

The `Quorum Guard Controller` ensures that the PDB deployment reflects the desired state of the cluster. To do that it must understand the desired control plane replicas which is consumes from the install-config. Today as soon as
Nit, suggested change:
it must understand the desired control plane replicas which is consumes from the install-config. Today as soon as
it must understand the desired control plane replicas which is consumed from the install-config. Today as soon as
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`. /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle rotten`. If this proposal is safe to close now please do so with `/close`. /lifecycle rotten
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting `/reopen`. /close
@openshift-bot: Closed this PR.
This PR outlines an enhancement to change scale-up operations of the etcd cluster to use raft learners. This work will assist with Control Plane Scaling and Recovery [1].

POC: openshift/cluster-etcd-operator#682

[1] https://issues.redhat.com/browse/OCPPLAN-5712