ETCD-236: etcd: scaling etcd with raft learners #920
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
Looking good! 🎉
- e2e with
1. add dangling etcd learner member which has not been started and not promoted.
Curious how will we prevent a learner from being promoted? Otherwise sounds great!
One idea is to use iptables to block peer traffic on TCP port 2380 so the leader can't update the learner.
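The iptables idea above might be sketched as follows for the e2e test; the exact chain, match options, and cleanup strategy are assumptions, not a finished test harness (these commands require root on the learner's node):

```shell
# Drop inbound peer traffic on the etcd peer port so the leader cannot
# replicate to the learner, keeping it behind and therefore not promotable.
iptables -A INPUT -p tcp --dport 2380 -j DROP

# ... run the e2e assertions against the dangling learner here ...

# Remove the rule afterwards so the learner can catch up again.
iptables -D INPUT -p tcp --dport 2380 -j DROP
```

Blocking only the peer port leaves the client port (2379) reachable, so the test can still query the member's status while it is isolated.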
/uncc
We will add the following critical alerts:

- alert about a learner member which has not been promoted or started for > 30s
- alert if the number of etcd members is not equal to the number of quorum-guard pods for > 30s
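The alerts above could be expressed as Prometheus rules roughly like this. `etcd_server_is_learner` and `etcd_server_has_leader` are real etcd gauges and `kube_deployment_status_replicas_available` comes from kube-state-metrics, but the rule names, expressions, deployment name, and thresholds here are sketches, not the final rules:

```yaml
groups:
- name: etcd-membership-scaling
  rules:
  # Sketch: a member has stayed a learner for more than 30s.
  - alert: EtcdLearnerNotPromoted
    expr: max by (pod) (etcd_server_is_learner) == 1
    for: 30s
    labels:
      severity: critical
  # Sketch: member count disagrees with quorum-guard replicas for > 30s.
  - alert: EtcdMemberQuorumGuardMismatch
    expr: |
      count(etcd_server_has_leader)
        != kube_deployment_status_replicas_available{deployment="etcd-quorum-guard"}
    for: 30s
    labels:
      severity: critical
```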
Is there a defined term (perhaps we can create one) for talking about all members vs only those which are voting members?
Does etcd quorum guard need to cover the learner? Or does that only need to cover the member once it is promotable?
Is there a defined term (perhaps we can create one) for talking about all members vs only those which are voting members?
members of quorum vs non voting?
Does etcd quorum guard need to cover the learner? Or does that only need to cover the member once it is promotable?
The idea is to ensure all voting members are sane and that the expected number of voting members is present. So I believe in this case PDB should be degraded while the learner scales up. Because if quorum-guard was scheduled the expectation is for etcd to scale up as well. A failure to scale would be directly observed in a failure of PDB tolerations. I.e. "don't drain a node if the learner has not yet been promoted" feels sane.
Cool, that makes sense, might be worth adding that context into the enhancement as it wasn't obvious to me why learners still need quorum guard
This still might need more details but I did add more context. Will leave this open for now.
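For context, the quorum-guard PDB discussed here is a standard PodDisruptionBudget. A minimal sketch for a 3-member cluster; the name, namespace, selector, and value are assumptions, not the shipped manifest:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-quorum-guard
  namespace: openshift-etcd
spec:
  # At most one guard pod (and so one control-plane node) may be disrupted
  # at a time; drains beyond that are blocked until etcd scales back up.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: etcd-quorum-guard
```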
- **Split brain**: The cluster gates starting of the etcd process on a verification process based on the cluster member id. This ensures that each revision has an explicit expected membership. Because quorum guard replicas are managed by the operator the cluster topology will remain in a safe configuration (odd number of members).
Does `odd number of members` hold true? There was a comment above about ensuring that num etcd replicas == num quorum guard, but this may not always be odd when there's a learner "learning"
Since we surge up to 4 there will be a window where we have an even number of members for a short duration. I will call that out and the reasons for this.
Still outstanding.
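The membership gating described above can be sketched as a simple pre-start check; the function and field names below are hypothetical illustrations, not the operator's actual code:

```python
# Hypothetical sketch of gating etcd start on revision-scoped membership:
# a static pod only starts the etcd process if its local member ID appears
# in the expected membership recorded for the current revision.
def may_start_etcd(local_member_id, expected_member_ids):
    """Refuse to start members that are not part of the revision's membership."""
    return local_member_id in expected_member_ids

expected = {"13c8677970c567e2", "4f98c3545405a0b1"}  # example member IDs
print(may_start_etcd("13c8677970c567e2", expected))  # -> True
print(may_start_etcd("deadbeef00000000", expected))  # -> False
```

A stale or unknown member refusing to start is what prevents two disjoint sets of members from each believing they form the cluster.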
### Goals

- provide safe scale up of etcd cluster using raft learners.
Vertical scaling is a capability we need for managed platforms. Is this scale up in support of changing instance types for the underlying master? And can be followed by a graceful and clean scale down of older masters?
Yes, learner support allows etcd to safely scale up new members. It does this by:
- Allowing a new etcd member to join the cluster as a non-voting member without impacting quorum.
- The new member cannot be promoted to a voting member until its log is in sync with the leader. This reduces exposure to data loss.
Both of those steps are critical for vertical scaling, protecting quorum and protecting against data loss. The scale-down process will be a separate enhancement but this enhancement is a dependency.
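The two steps above map onto `etcdctl` (v3.4+) roughly as follows; the member name, peer URL, and member ID are placeholders:

```shell
# Join the new member as a non-voting learner; it receives the log but
# does not count toward quorum.
etcdctl member add etcd-new --peer-urls=https://10.0.0.4:2380 --learner

# Later, promote the learner to a voting member. etcd itself rejects the
# promotion if the learner's log is not yet in sync with the leader.
etcdctl member promote 8e9e05c52164694d
```

The built-in promotion check is what makes the flow safe: a premature promote fails rather than seating a lagging voter.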
@hexfusion: The following test failed:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Proposed format of endpoints `<etcd member ID>: InternalIP`
```yaml
data:
  13c8677970c567e2: 10.0.185.161,
```
Are the commas intentional here? It's not clear from L115
no that is a typo thanks
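A sketch (not the operator's actual code) of how a controller might turn the proposed ConfigMap data, `<etcd member ID>: InternalIP` with no trailing comma, into client endpoint URLs; the function name and default port are illustrative:

```python
# Build client endpoint URLs from the proposed etcd-endpoints ConfigMap data.
def endpoints_from_configmap(data, port=2379):
    """Map member-id -> InternalIP entries to client URLs, sorted by member id
    so that rendered revisions are deterministic."""
    return ["https://%s:%d" % (ip, port) for _, ip in sorted(data.items())]

print(endpoints_from_configmap({"13c8677970c567e2": "10.0.185.161"}))
# -> ['https://10.0.185.161:2379']
```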
1. As a cluster-admin I want to scale up etcd without fear of quorum loss.

2. As a cluster-admin I want to be able to replace failed control-place nodes without a deep understanding of etcd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested change:
2. As a cluster-admin I want to be able to replace failed control-place nodes without a deep understanding of etcd.
2. As a cluster-admin I want to be able to replace failed control-plane nodes without a deep understanding of etcd.
We add new metric figures to the etcd dashboard in the console:

1. include membership status over time `is_leader`, `is_learner` and `has_leader`.
What is `is_leader` vs `has_leader`? Does `has_leader` mean that it's a follower (voting member)? Would `is_follower` be better?
These are currently metrics exposed by etcd. I will add a reference link to the doc.
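For reference, the figures correspond to gauges that etcd already exposes; the glosses in the comments are my reading of their semantics, not quoted documentation:

```
etcd_server_is_leader    # 1 on the member that is currently the leader
etcd_server_is_learner   # 1 on non-voting (learner) members
etcd_server_has_leader   # 1 when the member currently sees a leader
```

So `has_leader` is about cluster health as seen from any member (leader or follower), not about role.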
via `MemberList`. This allows for only pending or existing etcd members to be reported as endpoints. While the endpoint at the time of publishing could be a learner or unstarted the client balancer will ensure service.

A small change to the format of the key is also proposed to provide the cluster with more details on the member
Is this a change that can be made transparently without breaking anything during upgrades? This is an internal API right?
Correct, the key itself is not used by OpenShift today.
6. Scale CVO and `cluster-etcd-operator` back up.
7. Force a new revision.

### Test 1: Single Learner Scale UP 5GB State
5GB feels like a lot of data, do we have a rough estimate of the average OCP etcd data size? Trying to work out if this is a mean case or worst case kind of estimate
I am going to dig that up through telemetry but the goal was to show a larger production cluster use case. My guess is that most clusters in the fleet are 3GB or less.
SGTM so far.
Just have a question on how the desired membership is determined.
`cluster-etcd-operator`.

To populate this value the controller will read etcd-endpoints ConfigMap. This aligns scaling across the controllers.
new revisions of the static-pod controller will also use this configmap as the source of truth for scaling.
Nit, suggested change:
new revisions of the static-pod controller will also use this configmap as the source of truth for scaling.
New revisions of the static-pod controller will also use this configmap as the source of truth for scaling.
the desired cluster membership is known as soon as possible. For example if we observe 6 Nodes, three of which are part of the existing quorum intent becomes clear that we will add the new Nodes to the cluster while removing the old.
Suggested change:
the desired cluster membership is known as soon as possible. For example if we observe 6 Nodes, three of which are part of the existing quorum intent becomes clear that we will add the new Nodes to the cluster while removing the old.
the desired cluster membership is known as soon as possible. For example if we observe 6 Nodes, three of which are part of the existing quorum, the intent becomes clear that we will add the new Nodes to the cluster while removing the old.
I'm probably missing something but just to understand better, who is setting the desired cluster membership in this scenario? Or where is that being set?
I'm trying to understand how we know whether to vertically or horizontally scale when the cluster member controller sees new nodes. Right now the clustermember controller will just add static pods from all new nodes as members, but with this design, are we going to infer from the node count of 6 that the desired cluster membership comprises the pods on the 3 new nodes?
Great question. Today vertical scaling is not supported as the desired state of the control-plane is considered to be immutable. So in the context of the scale-down feature, there is an assumption that new nodes will replace old nodes and we will reconcile to the desired controlPlane replica count defined in the install-config.
So for a 3 node cluster, if you added 3 new nodes etcd will eventually migrate membership to the new nodes. But something missing from this doc is what happens if you add 2 nodes and then later add another; this is a great test scenario.
I believe we will need to use a FIFO approach with scale down and replace the oldest node first. I will make sure to describe this more completely in that enhancement. For this enhancement, I think it's out of scope.
Note: if in the future horizontal scaling is supported we will need an API to describe the desired control plane replicas. In this design we would use that state as the desired cluster size.
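The FIFO idea mentioned above can be sketched as a selection function: always replace the oldest voting member first, and never consider learners as removal candidates. The field names here are hypothetical, not an existing API:

```python
# Sketch of FIFO scale-down candidate selection for control-plane replacement.
def next_member_to_remove(members):
    """Pick the longest-standing voting member; learners are never candidates."""
    voting = [m for m in members if not m["is_learner"]]
    return min(voting, key=lambda m: m["created"])["name"]

members = [
    {"name": "master-0", "created": 100, "is_learner": False},
    {"name": "master-1", "created": 200, "is_learner": False},
    {"name": "master-3", "created": 900, "is_learner": True},  # still syncing
]
print(next_member_to_remove(members))  # -> master-0
```

Ordering by creation time also answers the "add 2 nodes, then 1 more" scenario: each new voter displaces exactly the oldest remaining original member.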
FTR the install-config we read from is embedded into the `kube-system/cluster-config-v1` configmap.
Thanks for clarifying.
And I guess this is more of a question for the scaling down proposal but this makes me think we're keeping a history of nodes that were previously part of the cluster (not sure if etcd already keeps track of membership history). Otherwise we could end up adding a previously removed node back into the cluster if it becomes visible to the clustermember controller.
### Quorum Guard Controller

The `Quorum Guard Controller` ensures that the PDB deployment reflects the desired state of the cluster. To do that it must understand the desired control plane replicas which is consumes from the install-config. Today as soon as
Nit, suggested change:
it must understand the desired control plane replicas which is consumes from the install-config. Today as soon as
it must understand the desired control plane replicas which is consumed from the install-config. Today as soon as
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`. /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle rotten`. If this proposal is safe to close now please do so with `/close`. /lifecycle rotten
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting `/reopen`. /close
@openshift-bot: Closed this PR.
This PR outlines an enhancement to change scale-up operations of the etcd cluster to use raft learners. This work will assist with Control Plane Scaling and Recovery [1].

POC: openshift/cluster-etcd-operator#682

[1] https://issues.redhat.com/browse/OCPPLAN-5712