Improve membership reconfiguration test coverage #9150
@jpbetz You might be interested in improving https://github.com/coreos/etcd/blob/master/integration/cluster_test.go. We DO have scaling up/down membership tests, but they still use the v2 API. A good first step would be to use the v3 API and make sure the current test suites still pass.
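For reference, a minimal sketch of driving a membership change through the v3 client rather than v2 (the endpoints and peer URL are placeholders, and the import path assumes the coreos/etcd clientv3 package of that era):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Scale up: register a new member by its peer URL via the v3 API.
	addResp, err := cli.MemberAdd(ctx, []string{"http://127.0.0.1:12380"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("added member %x", addResp.Member.ID)

	// Scale down: remove the same member by ID.
	if _, err := cli.MemberRemove(ctx, addResp.Member.ID); err != nil {
		log.Fatal(err)
	}
}
```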
This looks like the right place to start. Thanks @gyuho. We had a discussion about a potential 1->2 scale-up issue that is purely theoretical, but that I'd like to review: if a newly added member can be briefly unavailable right after a cluster is scaled up to include it (perhaps due to replication of a large dataset), then the 1->2 scale-up case would be problematic, because for that particular size change the newly added member instantly becomes an essential member of the cluster, and if it is unavailable, the cluster is unavailable. In theory this applies only to 1->2 scale-ups, because for 2->3, 3->4, and so on, even if the newly added member is briefly unavailable, the cluster can still make progress on the Raft log as long as the network and all other nodes remain healthy.
@jpbetz I think your thought is on the right track. The idea behind Raft is that the leader needs to make sure an entry is committed/agreed on by a majority (⌊n/2⌋ + 1) of nodes. As long as a majority of nodes are up and running, etcd is operational. So in the 1 -> 2 case, the majority of a 2-node cluster is 2 nodes (⌊2/2⌋ + 1 = 2), so the etcd cluster requires 2 nodes to be "running" in order to be operational. However, from 2 -> 3, the majority is 2 (⌊3/2⌋ + 1 = 2), so the etcd cluster will be operational regardless of whether the third member joins, since there are already 2 nodes "running". Hence, the new third member can be briefly unavailable without making the etcd cluster unavailable. The same holds for 3->4 and so on. When I say "running", I mean that the etcd nodes can talk to each other and are able to elect a leader.
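The quorum arithmetic above can be checked with a tiny sketch (Go integer division makes n/2 equal to ⌊n/2⌋):

```go
package main

import "fmt"

// quorum returns the majority size for an n-member cluster: floor(n/2) + 1.
func quorum(n int) int { return n/2 + 1 }

func main() {
	for n := 1; n <= 5; n++ {
		fmt.Printf("cluster size %d -> quorum %d\n", n, quorum(n))
	}
	// size 2 -> quorum 2: both members must be up (the risky 1->2 case)
	// size 3 -> quorum 2: the new third member may briefly lag safely
}
```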
Will be addressed with the learner feature.
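A hedged sketch of the learner flow that later shipped in etcd v3.4 (the helper name is hypothetical; `MemberAddAsLearner` and `MemberPromote` are the v3.4 clientv3 calls):

```go
package main

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// addLearnerThenPromote is a hypothetical helper illustrating the learner
// flow: the new member joins as a non-voting learner (it does not count
// toward quorum), and is promoted to a voting member only after it has
// caught up, which removes the 1->2 availability gap discussed above.
func addLearnerThenPromote(ctx context.Context, cli *clientv3.Client, peerURL string) error {
	resp, err := cli.MemberAddAsLearner(ctx, []string{peerURL})
	if err != nil {
		return err
	}
	// A real test would poll the member's status until the learner has
	// caught up on the Raft log; omitted here for brevity.
	_, err = cli.MemberPromote(ctx, resp.Member.ID)
	return err
}
```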
Membership reconfiguration is critical when operating an etcd cluster, and etcd-operator depends on it heavily. Although we already have a fair amount of tests around the cluster APIs (member add/remove), we do not test every possible configuration. I doubt the current member APIs have any serious bugs, but it is always good to have edge cases covered and to proactively prevent new bugs in future development.
Some of the missing test scenarios (a sketch of such a test follows the list):
- integration
  - cluster API tests with the v3 API (e2e: improve Member TLS test coverage. #9184)
  - protect membership change RPCs with auth (#6903)
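A minimal sketch of what a v3 cluster-API integration test might look like, modeled on the existing integration package (the test name and peer URL are placeholders, and the import path assumes coreos/etcd):

```go
package integration_test

import (
	"context"
	"testing"
	"time"

	"github.com/coreos/etcd/integration"
)

// TestMemberAddRemoveV3 is a hypothetical test exercising member add/remove
// through the v3 Cluster API instead of the v2 client.
func TestMemberAddRemoveV3(t *testing.T) {
	clus := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 3})
	defer clus.Terminate(t)

	cli := clus.RandClient()
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Scale up 3 -> 4 through the v3 API; the member is registered but
	// not started, which is sufficient to exercise the reconfiguration path.
	resp, err := cli.MemberAdd(ctx, []string{"http://127.0.0.1:12345"})
	if err != nil {
		t.Fatal(err)
	}

	// Scale back down 4 -> 3.
	if _, err := cli.MemberRemove(ctx, resp.Member.ID); err != nil {
		t.Fatal(err)
	}
}
```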