Improve membership reconfiguration test coverage #9150
@jpbetz You might be interested in improving https://github.com/coreos/etcd/blob/master/integration/cluster_test.go. We DO have scaling up/down membership tests, but they still use the v2 API. A good first step would be to use the v3 API and make sure the current test suites still pass.
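For reference, a minimal sketch of driving a membership change through the v3 client rather than v2 (the endpoints and peer URL are placeholders, and the import path assumes the coreos/etcd clientv3 package of that era):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Scale up: register a new member by its peer URL via the v3 API.
	addResp, err := cli.MemberAdd(ctx, []string{"http://127.0.0.1:12380"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("added member %x", addResp.Member.ID)

	// Scale down: remove the same member by ID.
	if _, err := cli.MemberRemove(ctx, addResp.Member.ID); err != nil {
		log.Fatal(err)
	}
}
```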
This looks like the right place to start. Thanks @gyuho. We had a discussion about a potential 1->2 scale-up issue that is purely theoretical, but that I'd like to review: if a newly added member can be briefly unavailable right after a cluster is scaled up to include it (perhaps due to replication of a large dataset), then the 1->2 scale-up case would be problematic, because for that particular size change the newly added member instantly becomes an essential member of the cluster, and if it is unavailable, the cluster is unavailable. In theory this applies only to 1->2 scale-ups, because for 2->3, 3->4, and so on, even if the newly added member is briefly unavailable, the cluster can still make progress on the Raft log as long as the network and all other nodes remain healthy.
@jpbetz I think your thought is on the right track. The idea behind Raft is that the leader needs to make sure an entry is committed/agreed on by a majority (⌊n/2⌋ + 1) of nodes. As long as a majority of nodes are up and running, etcd is operational. So in the 1 -> 2 case, the majority of a 2-node cluster is 2 nodes (⌊2/2⌋ + 1 = 2), so the etcd cluster requires 2 nodes to be "running" in order to be operational. However, from 2 -> 3, the majority is 2 (⌊3/2⌋ + 1 = 2), so the etcd cluster will be operational regardless of whether the third member joins, since there are already 2 nodes "running". Hence, the new third member can be briefly unavailable without making the etcd cluster unavailable. The same holds for 3->4 and so on. When I say "running", I mean that the etcd nodes can talk to each other and are able to elect a leader.
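The quorum arithmetic above can be checked with a tiny sketch (Go integer division makes n/2 equal to ⌊n/2⌋):

```go
package main

import "fmt"

// quorum returns the majority size for an n-member cluster: floor(n/2) + 1.
func quorum(n int) int { return n/2 + 1 }

func main() {
	for n := 1; n <= 5; n++ {
		fmt.Printf("cluster size %d -> quorum %d\n", n, quorum(n))
	}
	// size 2 -> quorum 2: both members must be up (the risky 1->2 case)
	// size 3 -> quorum 2: the new third member may briefly lag safely
}
```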
Will be addressed with the learner feature.
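A hedged sketch of the learner flow that later shipped in etcd v3.4 (the helper name is hypothetical; `MemberAddAsLearner` and `MemberPromote` are the v3.4 clientv3 calls):

```go
package main

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// addLearnerThenPromote is a hypothetical helper illustrating the learner
// flow: the new member joins as a non-voting learner (it does not count
// toward quorum), and is promoted to a voting member only after it has
// caught up, which removes the 1->2 availability gap discussed above.
func addLearnerThenPromote(ctx context.Context, cli *clientv3.Client, peerURL string) error {
	resp, err := cli.MemberAddAsLearner(ctx, []string{peerURL})
	if err != nil {
		return err
	}
	// A real test would poll the member's status until the learner has
	// caught up on the Raft log; omitted here for brevity.
	_, err = cli.MemberPromote(ctx, resp.Member.ID)
	return err
}
```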
Membership reconfiguration is critical when operating an etcd cluster, and etcd-operator depends on it heavily. Although we already have a fair amount of tests around the cluster APIs (member add/remove), we do not test every possible configuration. I doubt the current member APIs have any serious bugs, but it is always good to have edge cases covered and to proactively prevent new bugs in future development.
Some of the missing test scenarios (a sketch of such a test follows the list):
- integration
  - cluster API tests with the v3 API (e2e: improve Member TLS test coverage. #9184)
  - protect membership change RPCs with auth (#6903)
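A minimal sketch of what a v3 cluster-API integration test might look like, modeled on the existing integration package (the test name and peer URL are placeholders, and the import path assumes coreos/etcd):

```go
package integration_test

import (
	"context"
	"testing"
	"time"

	"github.com/coreos/etcd/integration"
)

// TestMemberAddRemoveV3 is a hypothetical test exercising member add/remove
// through the v3 Cluster API instead of the v2 client.
func TestMemberAddRemoveV3(t *testing.T) {
	clus := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 3})
	defer clus.Terminate(t)

	cli := clus.RandClient()
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Scale up 3 -> 4 through the v3 API; the member is registered but
	// not started, which is sufficient to exercise the reconfiguration path.
	resp, err := cli.MemberAdd(ctx, []string{"http://127.0.0.1:12345"})
	if err != nil {
		t.Fatal(err)
	}

	// Scale back down 4 -> 3.
	if _, err := cli.MemberRemove(ctx, resp.Member.ID); err != nil {
		t.Fatal(err)
	}
}
```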