
Flake TestMaxLearnerInCluster #16078

Closed
tjungblu opened this issue Jun 14, 2023 · 6 comments

tjungblu (Contributor) commented Jun 14, 2023

Which GitHub workflows are flaking?

test (linux-amd64-integration-2-cpu)

Which tests are flaking?

TestMaxLearnerInCluster

GitHub Action link

https://github.com/etcd-io/etcd/actions/runs/5264795327/jobs/9516550972

Reason for failure (if possible)

The 15m test timeout was reached; the client test code appears stuck on MemberAdd.

2023-06-14T08:41:08.8695469Z 	/home/runner/work/etcd/etcd/client/v3/retry_interceptor.go:61 +0x611
2023-06-14T08:41:08.8695943Z google.golang.org/grpc.(*ClientConn).Invoke(0xc000155c00, {0x1970ae8, 0xc00004e1c8}, {0x170d291, 0x1f}, {0x169b260, 0xc00043d780}, {0x16a2720, 0xc000d2c690}, {0x2135260, ...})
2023-06-14T08:41:08.8696408Z 	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.51.0/call.go:35 +0x26e
2023-06-14T08:41:08.8696847Z go.etcd.io/etcd/api/v3/etcdserverpb.(*clusterClient).MemberAdd(0xc000d90420, {0x1970ae8, 0xc00004e1c8}, 0x0?, {0x2135260, 0x3, 0x3})
2023-06-14T08:41:08.8697296Z 	/home/runner/work/etcd/etcd/api/etcdserverpb/rpc.pb.go:7129 +0xf5
2023-06-14T08:41:08.8697745Z go.etcd.io/etcd/client/v3.(*retryClusterClient).MemberAdd(0xc000ac4de0, {0x1970ae8, 0xc00004e1c8}, 0xc000d908e0?, {0x2135260, 0x3, 0x3})
2023-06-14T08:41:08.8698169Z 	/home/runner/work/etcd/etcd/client/v3/retry.go:171 +0x96
2023-06-14T08:41:08.8698575Z go.etcd.io/etcd/client/v3.(*cluster).memberAdd(0xc00345aa80, {0x1970ae8, 0xc00004e1c8}, {0xc000795520?, 0x1, 0x1}, 0x0)
2023-06-14T08:41:08.8698981Z 	/home/runner/work/etcd/etcd/client/v3/cluster.go:94 +0x1c2
2023-06-14T08:41:08.8699363Z go.etcd.io/etcd/client/v3.(*cluster).MemberAdd(0x1724b8c?, {0x1970ae8, 0xc00004e1c8}, {0xc000795520, 0x1, 0x1})
2023-06-14T08:41:08.8699758Z 	/home/runner/work/etcd/etcd/client/v3/cluster.go:77 +0x78
2023-06-14T08:41:08.8700162Z go.etcd.io/etcd/tests/v3/integration/clientv3.TestMaxLearnerInCluster(0xc0006d8b60)
2023-06-14T08:41:08.8700613Z 	/home/runner/work/etcd/etcd/tests/integration/clientv3/cluster_test.go:424 +0x9b8
2023-06-14T08:41:08.8700957Z testing.tRunner(0xc0006d8b60, 0x176a2e0)
2023-06-14T08:41:08.8701318Z 	/opt/hostedtoolcache/go/1.19.10/x64/src/testing/testing.go:1446 +0x217
2023-06-14T08:41:08.8701620Z created by testing.(*T).Run
2023-06-14T08:41:08.8701949Z 	/opt/hostedtoolcache/go/1.19.10/x64/src/testing/testing.go:1493 +0x75e

The server side appears stuck here:

2023-06-14T08:41:08.8815491Z goroutine 5734 [select, 14 minutes]:
2023-06-14T08:41:08.8815803Z go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).configure(0xc0001f3500, {0x1970b58, 0xc0030417d0?}, {0x0, 0xd09043b47b5cbaf2, {0xc00300fc00, 0x40, 0x40}, 0xade588b903121c09})
2023-06-14T08:41:08.8816138Z 	/home/runner/work/etcd/etcd/server/etcdserver/server.go:1662 +0x2d3
2023-06-14T08:41:08.8816431Z go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).AddMember(0x483734?, {0x1970b58, 0xc0030417d0}, {0xd09043b47b5cbaf2, {{0xc000795870, 0x1, 0x1}, 0x0}, {{0x0, 0x0}, ...}})
2023-06-14T08:41:08.8816634Z 	/home/runner/work/etcd/etcd/server/etcdserver/server.go:1332 +0x1f5
2023-06-14T08:41:08.8816892Z go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*ClusterServer).MemberAdd(0xc000013908, {0x1970b58, 0xc0030417d0}, 0xc000e260c0)
2023-06-14T08:41:08.8817116Z 	/home/runner/work/etcd/etcd/server/etcdserver/api/v3rpc/member.go:54 +0x1f0

with log message:

2023-06-14T08:41:08.9401977Z     cluster.go:481: waitMembersForLeader found leader. Member: 0 lead: 37871f39d0b1ade5
2023-06-14T08:41:08.9402264Z     cluster.go:426: WaitMembersForLeader succeeded. Cluster leader index: 0
2023-06-14T08:41:08.9403049Z     logger.go:130: 2023-06-14T08:26:06.355Z	INFO	m0.raft	37871f39d0b1ade5 switched to configuration voters=(4001201127116877285 4994919547234222604 11292275827355198234) learners=(10526154868529034699 11206544113628367593)	{"member": "m0"}
2023-06-14T08:41:08.9404057Z     logger.go:130: 2023-06-14T08:26:06.355Z	INFO	m0	applied a configuration change through raft	{"member": "m0", "local-member-id": "37871f39d0b1ade5", "raft-conf-change": "ConfChangeAddLearnerNode", "raft-conf-change-node-id": "7ff05fd7e9c464b5"}
2023-06-14T08:41:08.9405202Z     logger.go:130: 2023-06-14T08:26:06.356Z	WARN	m0.client	retrying of unary invoker failed	{"member": "m0", "target": "etcd-endpoints://0xc000fc4b40/localhost:m0", "method": "/etcdserverpb.Cluster/MemberAdd", "attempt": 0, "error": "rpc error: code = FailedPrecondition desc = etcdserver: too many learner members in cluster"}
2023-06-14T08:41:08.9406839Z     logger.go:130: 2023-06-14T08:26:06.356Z	INFO	m0.raft	37871f39d0b1ade5 ignoring conf change {ConfChangeAddNode 15028586398978194162 [123 34 105 100 34 58 49 53 48 50 56 53 56 54 51 57 56 57 55 56 49 57 52 49 54 50 44 34 112 101 101 114 85 82 76 115 34 58 91 34 104 116 116 112 58 47 47 49 50 55 46 48 46 48 46 49 58 51 52 53 51 34 93 125] 12530571866430118921} at config voters=(4001201127116877285 4994919547234222604 11292275827355198234) learners=(10526154868529034699 11206544113628367593): possible unapplied conf change at index 11 (applied to 10)	{"member": "m0"}
2023-06-14T08:41:08.9407623Z     logger.go:130: 2023-06-14T08:26:06.357Z	INFO	m2.raft	9cb63909333ca31a switched to configuration voters=(4001201127116877285 4994919547234222604 11292275827355198234) learners=(10526154868529034699 11206544113628367593)	{"member": "m2"}
2023-06-14T08:41:08.9408385Z     logger.go:130: 2023-06-14T08:26:06.358Z	INFO	m1.raft	455184dd2e8d020c switched to configuration voters=(4001201127116877285 4994919547234222604 11292275827355198234) learners=(10526154868529034699 11206544113628367593)	{"member": "m1"}
2023-06-14T08:41:08.9409368Z     logger.go:130: 2023-06-14T08:26:06.359Z	WARN	m0	failed to reach the peer URL	{"member": "m0", "address": "http://127.0.0.1:1230/version", "remote-member-id": "92146a153d6bf5cb", "error": "Get \"http://127.0.0.1:1230/version\": dial tcp 127.0.0.1:1230: connect: connection refused"}
tjungblu (Contributor, Author):

I'm not sure whether this is an idempotency issue (AddMember is called twice and the second invocation fails even though the first one succeeds) or a more general problem with the config-change validation.

Almost certainly, it shouldn't get stuck on the select in
https://github.com/etcd-io/etcd/blob/main/server/etcdserver/server.go#L1662-L1681

It's also not easy to reproduce locally, even when limiting the test to a single core like this:

taskset -c 1 go test -v -failfast -count=1000 -run TestMaxLearnerInCluster ./integration/clientv3/

tjungblu (Contributor, Author):

A similar case appears in https://github.com/etcd-io/etcd/actions/runs/5140910673/jobs/9252990421#step:5:7330:

2023-06-01T06:32:14.7273549Z     logger.go:130: 2023-06-01T06:17:06.203Z	INFO	grpc	[[balancer] base.baseBalancer: handle SubConn state change: 0xc0018c02b8, READY]
2023-06-01T06:32:14.7274473Z     logger.go:130: 2023-06-01T06:17:06.204Z	INFO	m1.raft	3c690db580d6988b switched to configuration voters=(4353025588032084107 15899738151267271259 16060256720361897868) learners=(10797274271818113879 16456074501519138409)	{"member": "m1"}
2023-06-01T06:32:14.7275397Z     logger.go:130: 2023-06-01T06:17:06.205Z	INFO	m0.raft	dca7379bb003025b switched to configuration voters=(4353025588032084107 15899738151267271259 16060256720361897868) learners=(10797274271818113879 16456074501519138409)	{"member": "m0"}
2023-06-01T06:32:14.7276687Z     logger.go:130: 2023-06-01T06:17:06.206Z	INFO	m2.raft	dee17e6572c03f8c switched to configuration voters=(4353025588032084107 15899738151267271259 16060256720361897868) learners=(10797274271818113879 16456074501519138409)	{"member": "m2"}
2023-06-01T06:32:14.7277762Z     logger.go:130: 2023-06-01T06:17:06.205Z	INFO	m0	applied a configuration change through raft	{"member": "m0", "local-member-id": "dca7379bb003025b", "raft-conf-change": "ConfChangeAddLearnerNode", "raft-conf-change-node-id": "d05e18f0bdaa1c1e"}
2023-06-01T06:32:14.7279241Z     logger.go:130: 2023-06-01T06:17:06.206Z	WARN	m0.client	retrying of unary invoker failed	{"member": "m0", "target": "etcd-endpoints://0xc001b1a5a0/localhost:m0", "method": "/etcdserverpb.Cluster/MemberAdd", "attempt": 0, "error": "rpc error: code = FailedPrecondition desc = etcdserver: too many learner members in cluster"}
2023-06-01T06:32:14.7281406Z     logger.go:130: 2023-06-01T06:17:06.207Z	INFO	m1.raft	3c690db580d6988b ignoring conf change {ConfChangeAddNode 10890373823051611623 [123 34 105 100 34 58 49 48 56 57 48 51 55 51 56 50 51 48 53 49 54 49 49 54 50 51 44 34 112 101 101 114 85 82 76 115 34 58 91 34 104 116 116 112 58 47 47 49 50 55 46 48 46 48 46 49 58 51 52 53 51 34 93 125] 169879449637678600} at config voters=(4353025588032084107 15899738151267271259 16060256720361897868) learners=(10797274271818113879 16456074501519138409): possible unapplied conf change at index 11 (applied to 10)	{"member": "m1"}

m1 seems to lock up completely here, similar to the case above where it's stuck on the select.

chaochn47 (Member) commented Jun 14, 2023

Hi @tjungblu, I think the test failure symptom is similar to #15528 (comment).

Could you please try manually injecting a sleep(1s) before r.Advance() locally to reproduce the issue? If it does reproduce, #15708 should fix this flaky test.

I will push the fix forward to be merged.

tjungblu (Contributor, Author) commented Jun 14, 2023

Thanks for the pointer @chaochn47; I've added a sleep statement right before this line:

r.Advance()

I hope this is what you mean?

If so, it's not reproducing either, in about 100 runs so far. I can see the sleep taking effect, though; there are plenty of heartbeat delays in the logs. But it really sounds like it could be the culprit. I'll try to reproduce this a bit further; otherwise, all eyes on your PR :)

EDIT: spoke too early, it just happened on one run. With 10ms it's much more reproducible than with 1s. Great, one thing less to worry about!

I reckon we'll close here in favor of #15528?

chaochn47 (Member):

I think we can leave the issue open until the fix is merged. No strong preference though.

chaochn47 (Member):

Fixed by #15708
