---
title: Scaling etcd with Raft Learners
authors:
- "@hexfusion"
reviewers:
- "@deads2k"
- "@lilic"
- "@hasbro17"
- "@joelspeed"
- "@jeremyeder"
approvers:
- "@mfojtik"
creation-date: 2021-10-04
last-updated: 2021-10-26
status: implementable
see-also: []
replaces: []
superseded-by: []
---

# Scaling etcd with Raft Learners

## Release Signoff Checklist

- [X] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in
  [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

Over time, as clusters live longer and workloads grow, the ability to scale the control-plane and replace failed nodes
becomes a critical part of the admin's maintenance overhead. Today the `cluster-etcd-operator` manages scaling up of
the etcd cluster. To provide the foundation for initiatives such as scale down and vertical control-plane scaling[1],
the `cluster-etcd-operator` must ensure proper safety mechanisms exist to adjust membership of the etcd cluster.

Introduced in etcd v3.4, the raft learner[2] provides mitigations that reduce quorum and
stability issues during scaling. A learner is essentially a non-voting etcd member: it cannot impact
quorum but, like other members, receives log replication from the leader. A minimal client-level sketch of the
add-then-promote flow appears after the proposal list below.

This enhancement proposes:
- Replacing the default scale up performed by the cluster-etcd-operator with one based on raft learners.
- Deprecating and removing the current `discover-etcd-initial-cluster`[3] command and replacing its
functionality with the existing `etcdenvvar` controller.
- Adding a node selector to the PDB deployment, allowing the etcd-operator to prescribe a 1:1 relationship between
etcd members and quorum-guard pods.
- Adding a flag to etcd (`--max-learners`) which allows runtime configuration of the maximum number of learners in a
cluster. Today the maximum is hardcoded to 1.
- Adding a new method to the library-go static pod controllers, `WithNodeFilter()`, which gives the
etcd-operator more granular control over static-pod scaling.
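For orientation, the add-then-promote flow referenced in the Summary maps onto two etcd v3 client calls,
`MemberAddAsLearner` and `MemberPromote`. The sketch below is illustrative only: the endpoint, peer URL, and timeouts
are placeholders, and this is not code taken from the operator.

```go
// Minimal sketch: add a new member as a raft learner, then attempt promotion.
// Endpoints and the peer URL are illustrative placeholders.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://10.0.185.161:2379"}, // placeholder endpoint
		DialTimeout: 10 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Add the new member as a non-voting learner; quorum is unaffected.
	resp, err := cli.MemberAddAsLearner(ctx, []string{"https://10.0.200.10:2380"}) // placeholder peer URL
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("added learner %x", resp.Member.ID)

	// etcd rejects the promotion until the learner's log is in sync with the
	// leader, so callers retry until this call succeeds.
	if _, err := cli.MemberPromote(ctx, resp.Member.ID); err != nil {
		log.Printf("learner not ready for promotion yet: %v", err)
	}
}
```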
POC: Functional proof of concept: https://github.com/openshift/cluster-etcd-operator/pull/682

[1] https://issues.redhat.com/browse/OCPPLAN-5712

[2] https://etcd.io/docs/v3.4/learning/design-learner/

[3] https://github.com/openshift/etcd/blob/openshift-4.10/openshift-tools/pkg/discover-etcd-initial-cluster/initial-cluster.go

## Motivation

Great demand exists to provide flexible, automated scaling of the control-plane in the same way worker
nodes scale today.

### Goals

- Provide safe scale up of the etcd cluster using raft learners.
- Add observability to allow an admin to clearly diagnose scaling failures.
- Reduce divergence from upstream etcd by removing `discover-etcd-initial-cluster`.

### Non-Goals

- Implement scale down logic for the `cluster-etcd-operator`.

## Proposal

### User Stories

1. As a cluster-admin I want to easily and safely vertically scale the control-plane without fear of quorum loss.

2. As a cluster-admin I want to be able to replace failed control-plane nodes without a deep understanding of etcd.

### Monitoring Dashboard

We add new metric figures to the etcd dashboard in the console:

1. Include membership status over time: `is_leader`, `is_learner` and `has_leader`.

### etcd Endpoints Controller

The `etcd Endpoints Controller` populates the etcd-endpoints ConfigMap, which is consumed by the etcd endpoints
config observer[1] and used to populate the apiserver `storageConfig`. Today this controller loops over control-plane
nodes and blindly adds the InternalIP of each node to the list.

In order for the operator to have full control over membership, the controller will move to asking etcd directly
via `MemberList`. This allows only pending or existing etcd members to be reported as endpoints. While an
endpoint at the time of publishing could be a learner or an unstarted member, the client balancer will ensure service.

A small change to the format of the key is also proposed to provide the cluster with more details on the member
without asking etcd directly. Today the key is the base64-encoded InternalIP; this value is not currently used at all.
The proposal replaces the existing key format with the hex-string etcd member ID. The member ID is unique per
cluster, unlike the peerURL. This information can be useful in understanding "was this etcd member replaced/removed?".

Current format of the endpoints ConfigMap (`<base64-encoded IP>: InternalIP`):
```yaml
data:
  MTAuMC4xODUuMTYx: 10.0.185.161
  MTAuMC4xODUuMjEy: 10.0.185.212
  MTAuMC4xOTkuMjM1: 10.0.199.235
```

Proposed format of the endpoints ConfigMap (`<hex member ID>: InternalIP`):
```yaml
data:
  13c8677970c567e2: 10.0.185.161
  85d0b011227abdc2: 10.0.185.212
  867988777c0cacfc: 10.0.199.235
```

[1] https://github.com/openshift/library-go/blob/release-4.10/pkg/operator/configobserver/etcd/observe_etcd_endpoints.go

### library-go Static Pod: Node Filter

`InstallerController` manages scheduling of installer pods and ensures the state for those pods exists for each
new revision. Today every control-plane node in the cluster is scheduled for every revision. The etcd-operator needs
more control over the etcd pods. As described in the changes to the `QuorumGuardController`, the etcd-operator will
explicitly describe the nodes which should be used by the installer controller. To facilitate this functionality, a
new configuration method `WithNodeFilter(nodeFilterFn func(ctx context.Context) (map[string]bool, error))
*InstallerController` is added.
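A minimal sketch of the kind of `nodeFilterFn` the etcd-operator could supply, assuming the filter is derived from the
`etcd-endpoints` ConfigMap described above. The `openshift-etcd` namespace, the master node label selector, and the
helper layout are assumptions for illustration; the real wiring lives in the POC pull requests.

```go
// Sketch of a nodeFilterFn: include a control-plane node in the next revision
// only if one of its InternalIPs maps to a pending or existing etcd member
// recorded in the etcd-endpoints ConfigMap.
package nodefilter

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// NodeFilterFn builds the filter function from a Kubernetes client.
func NodeFilterFn(client kubernetes.Interface) func(ctx context.Context) (map[string]bool, error) {
	return func(ctx context.Context) (map[string]bool, error) {
		// ConfigMap values are member InternalIPs keyed by hex member ID.
		cm, err := client.CoreV1().ConfigMaps("openshift-etcd").Get(ctx, "etcd-endpoints", metav1.GetOptions{})
		if err != nil {
			return nil, err
		}
		memberIPs := map[string]bool{}
		for _, ip := range cm.Data {
			memberIPs[ip] = true
		}

		nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{
			LabelSelector: "node-role.kubernetes.io/master", // assumed selector for control-plane nodes
		})
		if err != nil {
			return nil, err
		}

		included := map[string]bool{}
		for _, node := range nodes.Items {
			included[node.Name] = nodeHasMemberIP(node, memberIPs)
		}
		return included, nil
	}
}

func nodeHasMemberIP(node corev1.Node, memberIPs map[string]bool) bool {
	for _, addr := range node.Status.Addresses {
		if addr.Type == corev1.NodeInternalIP && memberIPs[addr.Address] {
			return true
		}
	}
	return false
}
```

The operator would then register the filter with the proposed option, e.g.
`installerController.WithNodeFilter(nodefilter.NodeFilterFn(client))`.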
The `nodeFilterFn` returns a map of all control-plane node hostnames with a boolean value indicating whether each
node should be included in the revision.

The nodeFilterFn map for the `cluster-etcd-operator` will be populated from the etcd-endpoints ConfigMap.

`NodeController` populates the revision status of the nodes. This controller will also gain access to `nodeFilterFn`
to ensure proper status.

POC: Functional proof of concept: https://github.com/openshift/library-go/pull/1231

##### Current NodeController Statuses
| Name |
| ----------- |
| MasterNodeRemoved |
| MasterNodeObserved |
| MasterNodesReady |

##### Proposed New NodeController Status
| Name |
| ----------- |
| MasterNodesOmitted |

### etcd ENV Var Controller

The etcd ENV Var Controller populates etcd runtime configuration which is later consumed by the `target config
controller`.

The `ETCD_INITIAL_CLUSTER` ENV variable is a critical part of the etcd scaling process. Before a member has joined
the cluster and received a snapshot from the leader, including the cluster membership details from the member bucket,
the new etcd member must be able to communicate with its peers. This proposal moves population of
`ETCD_INITIAL_CLUSTER` from etcd itself via `discover-etcd-initial-cluster`[1] to the `etcdenvvar` controller in the
`cluster-etcd-operator`.

To populate this value the controller will read the etcd-endpoints ConfigMap. This aligns scaling across the
controllers: new revisions of the static-pod controller will also use this ConfigMap as the source of truth for
scaling.

[1] https://github.com/openshift/etcd/tree/openshift-4.10/openshift-tools/pkg/discover-etcd-initial-cluster

### Max Learner

etcd needs the ability to define the maximum number of learners that are allowed in a cluster. Today maxLearners is
hardcoded with a value of 1. During bootstrap `--max-learners` will be set to the desired control-plane replica count.
By using 3 learners to bootstrap an HA cluster we can improve bootstrap completion times by tearing down the bootstrap
etcd much sooner. A learner member is promotable when the learner's raft progress is greater than 90%.

## Learner Performance

Performance tests were done which show the MTTR and cost of a single learner being added for scaling. In the case
where multiple learners exist they will be started serially, so performance should be in line with the results below.

Test steps:

1. Populate etcd state to a predetermined size.
2. Scale down CVO and `cluster-etcd-operator`.
3. Choose a control-plane node and scale down the etcd member on that node.
4. Stop the static pod by moving it out of `/etc/kubernetes/manifests`.
5. Remove the etcd data-dir `/var/lib/etcd`.
6. Scale CVO and `cluster-etcd-operator` back up.
7. Force a new revision.

### Test 1: Single Learner Scale Up, 5GB State

**Memory**
```shell
process_resident_memory_bytes{job="etcd"}
```
Memory was generally stable during the tests; the peer being scaled up would of course see an increase in
RSS as the state file increased in size. The returning peer saw a ~25-35% increase in memory usage while it was
serving before promotion. The increase in memory lasted about 4 minutes and began to decrease steadily after about
2 minutes.
![etcd learner RSS](./etcd-learner-rss.png)

**CPU iowait**
```shell
(sum(irate(node_cpu_seconds_total {mode="iowait"} [2m])) without (cpu)) / count(node_cpu_seconds_total) without (cpu) * 100
AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
```

iowait spiked to ~15% on the peer being scaled up and took about 4 minutes to stabilize.

![etcd learner iowait](./etcd-learner-iowait.png)

**Disk I/O**
```shell
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))
```

The fsync duration in the test did not reflect high latency because the cluster was not under load.

![etcd learner backend commit](./etcd-learner-backend-commit.png)

```shell
histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))
```

Backend commit duration for the learner being scaled up peaked at its highest bucket (8s) for over a 10 minute period.

**Promotion Duration**

The interval between the member being added and being promoted is about 60 seconds.

| RPC | Timestamp |
|----|-----|
| MemberAdd | 13:26:24 |
| MemberPromote | 13:27:22 |

### Cluster Member Controller

The cluster member controller manages scaling up etcd during member replacement and bootstrap. This proposal makes
changes to the way this controller manages scale up. Today the controller loops over etcd pods that are not `Ready`
and will attempt to scale up if the member is not already part of the cluster membership.

This proposal moves from observation of etcd Pods to control-plane Nodes. The goal of this change is to ensure that
the desired cluster membership is known as soon as possible. For example, if we observe six Nodes, three of which are
part of the existing quorum, the intent becomes clear: the new Nodes will be added to the cluster while the old ones
are removed.

This will ensure that the `etcd-endpoints` ConfigMap includes all members in the cluster so that
`ETCD_INITIAL_CLUSTER` can be correctly populated for all future members in a single revision. By having the next
revision contain all future members we have more flexibility in which static pod will be scaled up next.

`ensureEtcdLearnerPromotion`: This new method will provide the logic necessary to ask the cluster for the list of
members and attempt to promote any learner members whose log is in sync with the leader.
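A sketch of the promotion half of this logic, assuming the controller already holds an etcd v3 client. The package
placement and function shape are illustrative rather than the final implementation; etcd itself rejects promotion of
a learner that is not yet in sync with the leader, so the loop simply retries on the next sync.

```go
// Sketch: list the membership and try to promote each learner, relying on
// etcd to reject promotion of any learner whose log has not caught up.
package clustermembercontroller

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
	"k8s.io/klog/v2"
)

func ensureEtcdLearnerPromotion(ctx context.Context, cli *clientv3.Client) error {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	for _, member := range resp.Members {
		if !member.IsLearner {
			continue
		}
		// An out-of-sync learner returns an error here and is retried on the
		// next sync loop.
		if _, err := cli.MemberPromote(ctx, member.ID); err != nil {
			klog.V(2).Infof("learner %x not ready for promotion: %v", member.ID, err)
			continue
		}
		klog.Infof("promoted learner %x to voting member", member.ID)
	}
	return nil
}
```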
**Scaling Matrix: Manual Scale Down**

The matrix below shows the cluster node topology and the expected response from scale up.

|Existing Nodes | Action | Max PDB Replicas| Max Learners | Expected Result | Desired Result | Notes |
|---|---|---|---|---|---|---|
|Bootstrap| Add 3 Nodes | 3 | 3| 3 Member Cluster | 3 Member Cluster| |
|3| Add 1 Node | 4| 1| 4 Member Cluster | 3 Member Cluster | Manual etcd Scale Down Required |
|3| Add 2 Nodes | 4 | 1 | 4 Member Cluster| 3 Member Cluster| *Manual etcd Scale Down Required |
|3| Add 3 Nodes | 4 | 1 | 4 Member Cluster | 3 Member Cluster| *Manual etcd Scale Down Required |

`*` Remaining nodes could be rolled out after manual removal.

------------------------

**Scaling Matrix: Automated Future Scale Down**

The matrix below assumes the addition of the automated scale-down feature.

|Existing Nodes | Action | Max PDB Replicas| Max Learners | Expected Result | Desired Result | Notes |
|---|---|---|---|---|---|---|
|Bootstrap| Add 3 Nodes | 3 | 3| 3 Member Cluster | 3 Member Cluster| |
|3| Add 1 Node | 4| 1| 3 Member Cluster | 3 Member Cluster | Automated Scale Down |
|3| Add 2 Nodes | 4 | 2 | 3 Member Cluster| 3 Member Cluster| Rolling Automated Scale Up/Down |
|3| Add 3 Nodes | 4 | 3 | 3 Member Cluster | 3 Member Cluster| Rolling Automated Scale Up/Down |

### Quorum Guard Controller

The `Quorum Guard Controller` ensures that the PDB deployment reflects the desired state of the cluster. To do that
it must understand the desired control-plane replica count, which it consumes from the install-config. Today, as soon
as this controller starts, quorum-guard pods are scheduled to each control-plane node matching the install-config
replica count.

This proposal allows for dynamic assignment of quorum-guard pods using a `nodeSelector` so that each etcd member
has a quorum-guard pod assigned to it. To understand this intent the controller will read the
`etcd-endpoints` ConfigMap and adjust the `nodeSelector` to include only those nodes mapped to etcd membership. In
the case that a member is removed, quorum-guard will adjust the `nodeSelector` and replicas to reflect this change.
During a rolling replacement of nodes this ensures each new node has been scaled up before the old nodes are removed,
while also ensuring each learner added to the cluster becomes healthy before we scale down.

NOTE: Until the automated scale-down feature is implemented the controller will tolerate N+1 replicas, where N is the
value from the install-config. Once manual scale down is complete the next member will be started.

### Risks and Mitigations

The largest risks to scaling are quorum loss, data loss, and a split-brain scenario.

- **Quorum loss**: New members are added as non-voting members and cannot be promoted to voting members unless the
etcd process has started and the log has been completely replicated and is in sync with the leader, so adding a
member cannot reduce quorum.


- **Split brain**: The cluster gates starting of the etcd process on a verification process based on the cluster
member ID. This ensures that each revision has an explicit expected membership. Because quorum-guard replicas are
managed by the operator the cluster topology will remain in a safe configuration (an odd number of members).


- **Data loss**: A concern with rolling scaling of etcd with large data files is possible data loss. The MTTR of log
replication is directly tied to the size of the state. If members of the cluster were replaced too quickly it could
be possible, although very unlikely, that no member has a complete log. Raft learners ensure this cannot happen by
waiting for the log to be replicated from the leader before promotion to voting member can take place.

## Design Details

### Open Questions [optional]

This is where to call out areas of the design that require closure before deciding to implement the
design. For instance,
> 1. This requires exposing previously private resources which contain sensitive information. Can we do this?

### Test Plan

- e2e with

  1. Add a dangling etcd learner member which has not been started and not promoted.
     - verify alert fires
     - ensure degraded cluster
  2. Scale up and scale down the etcd cluster.
     - ensure stability of the etcd cluster during bootstrap
     - verify `ETCD_INITIAL_CLUSTER` remains valid through scaling and no split brain occurs (leader x2)
     - inject an invalid member into `ETCD_MEMBER_IDS` to ensure `verify membership` blocks rollout
     - replace a member on a node and ensure `verify membership` properly archives the previous data-dir
  3. Verify the DR workflow.

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal should keep
this high-level with a focus on what signals will be looked at to determine graduation.

Consider the following in developing the graduation criteria for this enhancement:

- Maturity levels
  - [`alpha`, `beta`, `stable` in upstream Kubernetes][maturity-levels]
  - `Dev Preview`, `Tech Preview`, `GA` in OpenShift
- [Deprecation policy][deprecation-policy]

Clearly define what graduation means by either linking to the [API doc
definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning), or by
redefining what graduation means.

In general, we try to use the same stages (alpha, beta, GA), regardless how the functionality is
accessed.

[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/

**Examples**: These are generalized examples to consider, in addition to the aforementioned
[maturity levels][maturity-levels].

#### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers
- Enumerate service level indicators (SLIs), expose SLIs as metrics
- Write symptoms-based alerts for the component(s)

#### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default
- Backhaul SLI telemetry
- Document SLOs for the component
- Conduct load testing

**For non-optional features moving to GA, the graduation criteria must include end to end tests.**

#### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

If applicable, how will the component be upgraded and downgraded? Make sure this is in the test
plan.

Consider the following in developing an upgrade/downgrade strategy for this enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to
  make on upgrade in order to keep previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to
  make on upgrade in order to make use of the enhancement?

Upgrade expectations:
- Each component should remain available for user requests and workloads during upgrades. Ensure
  the components leverage best practices in handling [voluntary
  disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to
  this should be identified and discussed here.
- Micro version upgrades - users should be able to skip forward versions within a minor release
  stream without being required to pass through intermediate versions - i.e. `x.y.N->x.y.N+2` should
  work without requiring `x.y.N->x.y.N+1` as an intermediate step.
- Minor version upgrades - you only need to support `x.N->x.N+1` upgrade steps. So, for example, it
  is acceptable to require a user running 4.3 to upgrade to 4.5 with a `4.3->4.4` step followed by a
  `4.4->4.5` step.
- While an upgrade is in progress, new component versions should continue to operate correctly in
  concert with older component versions (aka "version skew"). For example, if a node is down, and an
  operator is rolling out a daemonset, the old and new daemonset pods must continue to work
  correctly even while the cluster remains in this partially upgraded state for some time.

Downgrade expectations:
- If an `N->N+1` upgrade fails mid-way through, or if the `N+1` cluster is misbehaving, it should be
  possible for the user to rollback to `N`. It is acceptable to require some documented manual steps
  in order to fully restore the downgraded cluster to its previous state. Examples of acceptable
  steps include:
  - Deleting any CVO-managed resources added by the new version. The CVO does not currently delete
    resources that no longer exist in the target version.

### Version Skew Strategy

How will the component handle version skew with other components? What are the guarantees? Make
sure this is in the test plan.

Consider the following in developing a version skew strategy for this enhancement:
- During an upgrade, we will always have skew among components, how will this impact your work?
- Does this enhancement involve coordinating behavior in the control plane and in the kubelet? How
  does an n-2 kubelet without this feature available behave when this feature is used?
- Will any other components on the node change? For example, changes to CSI, CRI or CNI may require
  updating that component before the kubelet.

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation History`.

## Drawbacks

The idea is to find the best form of an argument why this enhancement should _not_ be implemented.

## Alternatives

Similar to the `Drawbacks` section, the `Alternatives` section is used to highlight and record other
possible approaches to delivering the value proposed by an enhancement.

## Infrastructure Needed [optional]

Use this section if you need things from the project. Examples include a new subproject, repos
requested, GitHub details, and/or testing infrastructure.

Listing these here allows the community to get the process for these resources started right away.