ETCD-236: etcd: scaling etcd with raft learners #920

Closed
wants to merge 2 commits

Conversation

hexfusion (Contributor)

This PR outlines an enhancement to change scale-up operations of the etcd cluster to use raft learners. This work will assist with Control Plane Scaling and Recovery[1].

POC: openshift/cluster-etcd-operator#682

[1] https://issues.redhat.com/browse/OCPPLAN-5712

@openshift-ci openshift-ci bot requested a review from aravindhp October 5, 2021 13:44
openshift-ci bot commented Oct 5, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign ashcrow after the PR has been reviewed.
You can assign the PR to them by writing /assign @ashcrow in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
@lilic (Contributor) left a comment


Looking good! 🎉


- e2e with

1. add dangling etcd learner member which has not been started and not promoted.
Contributor

Curious how will we prevent a learner from being promoted? Otherwise sounds great!

@hexfusion (Contributor Author) Oct 5, 2021

One idea is to use iptables to block peer traffic on TCP port 2380 so the leader can't update the learner.
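
For illustration, a minimal sketch of such a test helper, assuming the e2e suite can run privileged commands on the learner's node (the helper names and exact rule are ours, not from the POC):

```go
// Hypothetical e2e helper: block inbound etcd peer traffic so the leader
// cannot replicate to the learner. With its raft log stuck behind the
// leader, etcd will refuse to promote the learner.
package e2e

import (
	"fmt"
	"os/exec"
)

// blockPeerTraffic drops inbound etcd peer traffic (TCP 2380).
func blockPeerTraffic() error {
	cmd := exec.Command("iptables", "-A", "INPUT", "-p", "tcp", "--dport", "2380", "-j", "DROP")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("iptables failed: %v: %s", err, out)
	}
	return nil
}

// unblockPeerTraffic removes the rule so the learner can catch up again.
func unblockPeerTraffic() error {
	cmd := exec.Command("iptables", "-D", "INPUT", "-p", "tcp", "--dport", "2380", "-j", "DROP")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("iptables failed: %v: %s", err, out)
	}
	return nil
}
```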

@aravindhp (Contributor)

/uncc

@openshift-ci openshift-ci bot removed the request for review from aravindhp October 5, 2021 17:04
@hexfusion hexfusion changed the title etcd: scaling etcd with raft learners ETCD-236: etcd: scaling etcd with raft learners Oct 7, 2021
We will add the following critical alerts:

- alert about learner member which has not been promoted or started > 30s
- alert if the number of etcd members is not equal to the number of quorum-guard pods > 30s
Contributor

Is there a defined term (perhaps we can create one) for talking about all members vs only those which are voting members?

Does etcd quorum guard need to cover the learner? Or does that only need to cover the member once it is promotable?

Contributor Author

Is there a defined term (perhaps we can create one) for talking about all members vs only those which are voting members?

members of quorum vs non voting?

Contributor Author

Does etcd quorum guard need to cover the learner? Or does that only need to cover the member once it is promotable?

The idea is to ensure all voting members are healthy and that the expected number of voting members is present. So I believe in this case the PDB should be degraded while the learner scales up, because if quorum-guard was scheduled the expectation is for etcd to scale up as well. A failure to scale would be directly observed as a failure of PDB tolerations, i.e. not draining a node while the learner has not yet been promoted feels sane.

Contributor

Cool, that makes sense. Might be worth adding that context into the enhancement, as it wasn't obvious to me why learners still need quorum guard.

Contributor Author

This still might need more details but I did add more context. Will leave this open for now.


- **Split brain**: The cluster gates starting of the etcd process on a verification process based on the cluster
member id. This ensures that each revision has an explicit expected membership. Because quorum guard replicas are
managed by the operator the cluster topology will remain in a safe configuration (odd number of members).
Contributor

Does odd number of members hold true? There was a comment above about ensuring that num etcd replicas == num quorum guard, but this may not always be odd when there's a learner "learning"

Contributor Author

Since we surge up to 4, there will be a window where we have an even number of nodes for a short duration. I will call that out along with the reasons for it.

Contributor Author

Still outstanding.


### Goals

- provide safe scale up of etcd cluster using raft learners.
Member

Vertical scaling is a capability we need for managed platforms. Is this scale up in support of changing instance types for the underlying master? And can it be followed by a graceful and clean scale down of older masters?

Contributor Author

Yes, learner support allows etcd to safely scale up new members. It does this by:

- allowing a new etcd member to join the cluster as a non-voting member, without impacting quorum.
- not allowing the new member to be promoted to a voting member until its log is in sync with the leader, which reduces exposure to data loss.

Both of those steps are critical for vertical scaling, protecting quorum and protecting against data loss. The scale-down process will be a separate enhancement, but this enhancement is a dependency.
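
To make those two steps concrete, here is a minimal sketch using the etcd v3 Go client (endpoints and peer URLs are placeholders; error handling is abbreviated):

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://10.0.185.161:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Step 1: join as a non-voting learner; quorum is unaffected.
	resp, err := cli.MemberAddAsLearner(ctx, []string{"https://10.0.185.162:2380"})
	if err != nil {
		log.Fatal(err)
	}

	// Step 2: promote to a voting member. etcd rejects the promotion
	// until the learner's log has caught up with the leader.
	if _, err := cli.MemberPromote(ctx, resp.Member.ID); err != nil {
		log.Printf("not yet promotable: %v", err)
	}
}
```

The promotion guard in step 2 is exactly the data-loss protection described above: a lagging learner can never become a voter.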

Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
openshift-ci bot commented Oct 26, 2021

@hexfusion: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | 22ad8ce | link | true | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Proposed format of endpoints `<etcd member ID>: InternalIP`
```yaml
data:
  13c8677970c567e2: 10.0.185.161,
```
Member

Are the commas intentional here? It's not clear from L115

Contributor Author

No, that is a typo, thanks.
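
For illustration, the corrected `<member ID>: InternalIP` data could be derived from `MemberList` roughly like this (a sketch; the helper name is hypothetical):

```go
// Sketch: build the etcd-endpoints ConfigMap data from MemberList.
// Keys are the member IDs in hex, matching the proposed format above;
// values are bare InternalIPs with no trailing comma.
package main

import (
	"context"
	"fmt"
	"net/url"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func memberEndpoints(ctx context.Context, cli *clientv3.Client) (map[string]string, error) {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return nil, err
	}
	data := make(map[string]string, len(resp.Members))
	for _, m := range resp.Members {
		if len(m.PeerURLs) == 0 {
			continue // defensive: skip members without peer URLs
		}
		u, err := url.Parse(m.PeerURLs[0])
		if err != nil {
			return nil, err
		}
		data[fmt.Sprintf("%x", m.ID)] = u.Hostname()
	}
	return data, nil
}
```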


1. As a cluster-admin I want to scale up etcd without fear of quorum loss.

2. As a cluster-admin I want to be able to replace failed control-place nodes without a deep understanding of etcd.
Contributor

Suggested change
2. As a cluster-admin I want to be able to replace failed control-place nodes without a deep understanding of etcd.
2. As a cluster-admin I want to be able to replace failed control-plane nodes without a deep understanding of etcd.


We add new metric figures to the etcd dashboard in the console:

1. include membership status over time `is_leader`, `is_learner` and `has_leader`.
Contributor

What is is_leader vs has_leader? Does has_leader mean that it's a follower (voting member)? Would is_follower be better?

Contributor Author

These are currently metrics exposed by etcd. I will add a reference link to the doc.

https://github.com/etcd-io/etcd/blob/1e46145b29764f9e65f4164bb777fbf275a63d8a/server/etcdserver/metrics.go#L35
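
For reference, the linked gauges are defined in etcd roughly as follows (abbreviated; note that `has_leader` reports whether the member currently observes a leader at all, independent of its own role):

```go
// Abbreviated sketch of etcd's server/etcdserver/metrics.go: plain gauges
// the server sets, exported as etcd_server_has_leader,
// etcd_server_is_leader, and etcd_server_is_learner.
package etcdserver

import "github.com/prometheus/client_golang/prometheus"

var (
	hasLeader = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "etcd",
		Subsystem: "server",
		Name:      "has_leader",
		Help:      "Whether or not a leader exists. 1 is existence, 0 is not.",
	})
	isLeader = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "etcd",
		Subsystem: "server",
		Name:      "is_leader",
		Help:      "Whether or not this member is a leader. 1 if is, 0 otherwise.",
	})
	isLearner = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "etcd",
		Subsystem: "server",
		Name:      "is_learner",
		Help:      "Whether or not this member is a learner. 1 if is, 0 otherwise.",
	})
)
```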

via `MemberList`. This allows for only pending or existing etcd members to be reported as endpoints. While the
endpoint at the time of publishing could be a learner or unstarted the client balancer will ensure service.

A small change to the format of the key is also proposed to provide the cluster with more details on the member
Contributor

Is this a change that can be made transparently without breaking anything during upgrades? This is an internal API right?

@hexfusion (Contributor Author) Oct 27, 2021

Correct, the key itself is not used by OpenShift today.

6. Scale CVO and `cluster-etcd-operator` back up.
7. Force a new revision.

### Test 1: Single Learner Scale UP 5GB State
Contributor

5GB feels like a lot of data, do we have a rough estimate of the average OCP etcd data size? Trying to work out if this is a mean case or worst case kind of estimate

Contributor Author

I am going to dig that up through telemetry, but the goal was to show a larger production cluster use case. My guess is that most clusters in the fleet are 3GB or less.

@hasbro17 (Contributor) left a comment

SGTM so far.
Just have a question on how the desired membership is determined.

`cluster-etcd-operator`.

To populate this value the controller will read etcd-endpoints ConfigMap. This aligns scaling across the controllers.
new revisions of the static-pod controller will also use this configmap as the source of truth for scaling.
Contributor

Nit

Suggested change
new revisions of the static-pod controller will also use this configmap as the source of truth for scaling.
New revisions of the static-pod controller will also use this configmap as the source of truth for scaling.

Comment on lines +246 to +247
the desired cluster membership is known as soon as possible. For example if we observe 6 Nodes, three of which are
part of the existing quorum intent becomes clear that we will add the new Nodes to the cluster while removing the old.
Contributor

Suggested change
the desired cluster membership is known as soon as possible. For example if we observe 6 Nodes, three of which are
part of the existing quorum intent becomes clear that we will add the new Nodes to the cluster while removing the old.
the desired cluster membership is known as soon as possible. For example if we observe 6 Nodes, three of which are
part of the existing quorum, the intent becomes clear that we will add the new Nodes to the cluster while removing the old.

I'm probably missing something but just to understand better, who is setting the desired cluster membership in this scenario? Or where is that being set?

I'm trying to understand how we know whether to vertically or horizontally scale when the cluster member controller sees new nodes. Right now the clustermember controller will just add static pods from all new nodes as members, but with this design, are we going to infer from the node count of 6 that the desired cluster membership comprises the pods on the 3 new nodes?

Contributor Author

Great question. Today vertical scaling is not supported, as the desired state of the control plane is considered to be immutable. So in the context of the scale-down feature, there is an assumption that new nodes will replace old nodes and that we will reconcile to the desired controlPlane replica count defined in the install-config.

So for a 3-node cluster, if you added 3 new nodes, etcd would eventually migrate membership to the new nodes. But something missing from this doc is what happens if you add 2 nodes and then later add another; this is a great test scenario.

I believe we will need to use a FIFO approach with scale down and replace the oldest node first. I will make sure to describe this more completely in that enhancement. For this enhancement, I think it's out of scope.

Note: if in the future horizontal scaling is supported we will need an API to describe the desired control plane replicas. In this design we would use that state as the desired cluster size.

Contributor Author

FTR, the install-config we read from is embedded in the kube-system/cluster-config-v1 configmap.
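
As an illustration, reading that value with client-go might look like the following sketch (the struct covers only the field used here, and the helper name is ours; the real operator code may differ):

```go
// Sketch: read the desired control-plane replica count from the
// install-config embedded in kube-system/cluster-config-v1.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"sigs.k8s.io/yaml"
)

// installConfig models only the piece of install-config we need.
type installConfig struct {
	ControlPlane struct {
		Replicas int `json:"replicas"`
	} `json:"controlPlane"`
}

func controlPlaneReplicas(ctx context.Context, client kubernetes.Interface) (int, error) {
	cm, err := client.CoreV1().ConfigMaps("kube-system").Get(ctx, "cluster-config-v1", metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	var ic installConfig
	if err := yaml.Unmarshal([]byte(cm.Data["install-config"]), &ic); err != nil {
		return 0, err
	}
	if ic.ControlPlane.Replicas == 0 {
		return 0, fmt.Errorf("no controlPlane.replicas in install-config")
	}
	return ic.ControlPlane.Replicas, nil
}
```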

Contributor

Thanks for clarifying.

And I guess this is more of a question for the scaling down proposal but this makes me think we're keeping a history of nodes that were previously part of the cluster (not sure if etcd already keeps track of membership history). Otherwise we could end up adding a previously removed node back into the cluster if it becomes visible to the clustermember controller.

### Quorum Guard Controller

The `Quorum Guard Controller` ensures that the PDB deployment reflects the desired state of the cluster. To do that
it must understand the desired control plane replicas which is consumes from the install-config. Today as soon as
Contributor

Nit

Suggested change
it must understand the desired control plane replicas which is consumes from the install-config. Today as soon as
it must understand the desired control plane replicas which is consumed from the install-config. Today as soon as

@openshift-bot
Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 26, 2021
@openshift-bot
Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 3, 2021
@openshift-bot
Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Dec 10, 2021
openshift-ci bot commented Dec 10, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
