mcs: potential data race in scheduling server when pd leader switch #8538

Closed
lhy1024 opened this issue Aug 15, 2024 · 2 comments · Fixed by #8539
Labels
affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. affects-8.3 severity/major type/bug The issue is confirmed as a bug.

lhy1024 commented Aug 15, 2024

Bug Report

What did you do?

Ran CI: https://github.com/tikv/pd/actions/runs/10399701689/job/28799000468

What did you expect to see?

no data race

What did you see instead?

2024-08-15T06:21:45.4937823Z ==================
2024-08-15T06:21:45.4938600Z WARNING: DATA RACE
2024-08-15T06:21:45.4939412Z Write at 0x00c001716b10 by goroutine 14621:
2024-08-15T06:21:45.4940612Z   github.com/tikv/pd/pkg/mcs/scheduling/server.(*Server).startCluster()
2024-08-15T06:21:45.4941820Z       /home/runner/work/pd/pd/pkg/mcs/scheduling/server/server.go:491 +0xb26
2024-08-15T06:21:45.4942887Z   github.com/tikv/pd/pkg/mcs/scheduling/server.(*Server).startCluster-fm()
2024-08-15T06:21:45.4943537Z       <autogenerated>:1 +0x47
2024-08-15T06:21:45.4944189Z   github.com/tikv/pd/pkg/mcs/scheduling/server.(*Server).campaignLeader()
2024-08-15T06:21:45.4945189Z       /home/runner/work/pd/pd/pkg/mcs/scheduling/server/server.go:298 +0xd0e
2024-08-15T06:21:45.4946072Z   github.com/tikv/pd/pkg/mcs/scheduling/server.(*Server).primaryElectionLoop()
2024-08-15T06:21:45.4946979Z       /home/runner/work/pd/pd/pkg/mcs/scheduling/server/server.go:265 +0xb7a
2024-08-15T06:21:45.4947912Z   github.com/tikv/pd/pkg/mcs/scheduling/server.(*Server).startServerLoop.func1()
2024-08-15T06:21:45.4948876Z       /home/runner/work/pd/pd/pkg/mcs/scheduling/server/server.go:168 +0x33
2024-08-15T06:21:45.4949322Z 
2024-08-15T06:21:45.4949525Z Previous read at 0x00c001716b10 by goroutine 14622:
2024-08-15T06:21:45.4950422Z   github.com/tikv/pd/pkg/mcs/scheduling/server.(*Server).updateAPIServerMemberLoop()
2024-08-15T06:21:45.4951631Z       /home/runner/work/pd/pd/pkg/mcs/scheduling/server/server.go:216 +0x12fc
2024-08-15T06:21:45.4952498Z   github.com/tikv/pd/pkg/mcs/scheduling/server.(*Server).startServerLoop.func2()
2024-08-15T06:21:45.4953528Z       /home/runner/work/pd/pd/pkg/mcs/scheduling/server/server.go:169 +0x33
2024-08-15T06:21:45.4954032Z 
2024-08-15T06:21:45.4954190Z Goroutine 14621 (running) created at:
2024-08-15T06:21:45.4954847Z   github.com/tikv/pd/pkg/mcs/scheduling/server.(*Server).startServerLoop()
2024-08-15T06:21:45.4955784Z       /home/runner/work/pd/pd/pkg/mcs/scheduling/server/server.go:168 +0x204
2024-08-15T06:21:45.4956668Z   github.com/tikv/pd/pkg/mcs/scheduling/server.(*Server).startServer()
2024-08-15T06:21:45.4957652Z       /home/runner/work/pd/pd/pkg/mcs/scheduling/server/server.go:467 +0x13b4
2024-08-15T06:21:45.4958459Z   github.com/tikv/pd/pkg/mcs/scheduling/server.(*Server).Run()
2024-08-15T06:21:45.4959317Z       /home/runner/work/pd/pd/pkg/mcs/scheduling/server/server.go:162 +0x1d4
2024-08-15T06:21:45.4960030Z   github.com/tikv/pd/tests.NewSchedulingTestServer()
2024-08-15T06:21:45.4960758Z       /home/runner/work/pd/pd/tests/testutil.go:186 +0x8e
2024-08-15T06:21:45.4961427Z   github.com/tikv/pd/tests.(*TestSchedulingCluster).AddServer()
2024-08-15T06:21:45.4962650Z       /home/runner/work/pd/pd/tests/scheduling_cluster.go:70 +0x4d2
2024-08-15T06:21:45.4963928Z   github.com/tikv/pd/tests.NewTestSchedulingCluster()
2024-08-15T06:21:45.4965195Z       /home/runner/work/pd/pd/tests/scheduling_cluster.go:48 +0x274
2024-08-15T06:21:45.4966528Z   github.com/tikv/pd/tests.(*SchedulingTestEnvironment).startCluster()
2024-08-15T06:21:45.4968032Z       /home/runner/work/pd/pd/tests/testutil.go:418 +0xa52
2024-08-15T06:21:45.4969148Z   github.com/tikv/pd/tests.(*SchedulingTestEnvironment).RunTestInAPIMode()
2024-08-15T06:21:45.4970547Z       /home/runner/work/pd/pd/tests/testutil.go:369 +0x2b9
2024-08-15T06:21:45.4971434Z   github.com/pingcap/failpoint.parseTerm()
2024-08-15T06:21:45.4972942Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/terms.go:149 +0x364
2024-08-15T06:21:45.4974062Z   github.com/pingcap/failpoint.parse()
2024-08-15T06:21:45.4975413Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/terms.go:126 +0xa5
2024-08-15T06:21:45.4976194Z   github.com/pingcap/failpoint.newTerms()
2024-08-15T06:21:45.4977216Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/terms.go:98 +0x3e
2024-08-15T06:21:45.4978394Z   github.com/pingcap/failpoint.(*Failpoint).Enable()
2024-08-15T06:21:45.4979503Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/failpoint.go:54 +0x3e
2024-08-15T06:21:45.4980490Z   github.com/pingcap/failpoint.(*Failpoints).Enable()
2024-08-15T06:21:45.4981732Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/failpoints.go:105 +0x296
2024-08-15T06:21:45.4982814Z   github.com/pingcap/failpoint.Enable()
2024-08-15T06:21:45.4983942Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/failpoints.go:222 +0x134
2024-08-15T06:21:45.4985127Z   github.com/tikv/pd/tests.(*SchedulingTestEnvironment).RunTestInAPIMode()
2024-08-15T06:21:45.4985982Z       /home/runner/work/pd/pd/tests/testutil.go:362 +0x135
2024-08-15T06:21:45.4986608Z   github.com/pingcap/failpoint.parseTerm()
2024-08-15T06:21:45.4987605Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/terms.go:149 +0x364
2024-08-15T06:21:45.4988405Z   github.com/pingcap/failpoint.parse()
2024-08-15T06:21:45.4989461Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/terms.go:126 +0xa5
2024-08-15T06:21:45.4990220Z   github.com/pingcap/failpoint.newTerms()
2024-08-15T06:21:45.4991686Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/terms.go:98 +0x3e
2024-08-15T06:21:45.4993170Z   github.com/pingcap/failpoint.(*Failpoint).Enable()
2024-08-15T06:21:45.4994999Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/failpoint.go:54 +0x3e
2024-08-15T06:21:45.4996515Z   github.com/pingcap/failpoint.(*Failpoints).Enable()
2024-08-15T06:21:45.4998648Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/failpoints.go:105 +0x296
2024-08-15T06:21:45.5000135Z   github.com/pingcap/failpoint.Enable()
2024-08-15T06:21:45.5002111Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/failpoints.go:222 +0xf6
2024-08-15T06:21:45.5004024Z   github.com/tikv/pd/tests.(*SchedulingTestEnvironment).RunTestInAPIMode()
2024-08-15T06:21:45.5005457Z       /home/runner/work/pd/pd/tests/testutil.go:361 +0xf7
2024-08-15T06:21:45.5006865Z   github.com/tikv/pd/tests/integrations/mcs/scheduling_test.(*apiTestSuite).TestAPIForward()
2024-08-15T06:21:45.5008918Z       /home/runner/work/pd/pd/tests/integrations/mcs/scheduling/api_test.go:102 +0x6f
2024-08-15T06:21:45.5010166Z   runtime.call16()
2024-08-15T06:21:45.5011179Z       /opt/hostedtoolcache/go/1.21.13/x64/src/runtime/asm_amd64.s:747 +0x42
2024-08-15T06:21:45.5012354Z   reflect.Value.Call()
2024-08-15T06:21:45.5013768Z       /opt/hostedtoolcache/go/1.21.13/x64/src/reflect/value.go:380 +0xb5
2024-08-15T06:21:45.5014901Z   github.com/stretchr/testify/suite.Run.func1()
2024-08-15T06:21:45.5015890Z       /home/runner/go/pkg/mod/github.com/stretchr/testify@v1.8.4/suite/suite.go:197 +0x766
2024-08-15T06:21:45.5016631Z   testing.tRunner()
2024-08-15T06:21:45.5017273Z       /opt/hostedtoolcache/go/1.21.13/x64/src/testing/testing.go:1595 +0x261
2024-08-15T06:21:45.5017915Z   testing.(*T).Run.func1()
2024-08-15T06:21:45.5018951Z       /opt/hostedtoolcache/go/1.21.13/x64/src/testing/testing.go:1648 +0x44
2024-08-15T06:21:45.5019375Z 
2024-08-15T06:21:45.5019587Z Goroutine 14622 (running) created at:
2024-08-15T06:21:45.5020318Z   github.com/tikv/pd/pkg/mcs/scheduling/server.(*Server).startServerLoop()
2024-08-15T06:21:45.5021232Z       /home/runner/work/pd/pd/pkg/mcs/scheduling/server/server.go:169 +0x26d
2024-08-15T06:21:45.5022055Z   github.com/tikv/pd/pkg/mcs/scheduling/server.(*Server).startServer()
2024-08-15T06:21:45.5022974Z       /home/runner/work/pd/pd/pkg/mcs/scheduling/server/server.go:467 +0x13b4
2024-08-15T06:21:45.5023799Z   github.com/tikv/pd/pkg/mcs/scheduling/server.(*Server).Run()
2024-08-15T06:21:45.5024886Z       /home/runner/work/pd/pd/pkg/mcs/scheduling/server/server.go:162 +0x1d4
2024-08-15T06:21:45.5025991Z   github.com/tikv/pd/tests.NewSchedulingTestServer()
2024-08-15T06:21:45.5027211Z       /home/runner/work/pd/pd/tests/testutil.go:186 +0x8e
2024-08-15T06:21:45.5028370Z   github.com/tikv/pd/tests.(*TestSchedulingCluster).AddServer()
2024-08-15T06:21:45.5029827Z       /home/runner/work/pd/pd/tests/scheduling_cluster.go:70 +0x4d2
2024-08-15T06:21:45.5031263Z   github.com/tikv/pd/tests.NewTestSchedulingCluster()
2024-08-15T06:21:45.5032509Z       /home/runner/work/pd/pd/tests/scheduling_cluster.go:48 +0x274
2024-08-15T06:21:45.5033692Z   github.com/tikv/pd/tests.(*SchedulingTestEnvironment).startCluster()
2024-08-15T06:21:45.5034884Z       /home/runner/work/pd/pd/tests/testutil.go:418 +0xa52
2024-08-15T06:21:45.5035806Z   github.com/tikv/pd/tests.(*SchedulingTestEnvironment).RunTestInAPIMode()
2024-08-15T06:21:45.5036866Z       /home/runner/work/pd/pd/tests/testutil.go:369 +0x2b9
2024-08-15T06:21:45.5037422Z   github.com/pingcap/failpoint.parseTerm()
2024-08-15T06:21:45.5038519Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/terms.go:149 +0x364
2024-08-15T06:21:45.5039486Z   github.com/pingcap/failpoint.parse()
2024-08-15T06:21:45.5040477Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/terms.go:126 +0xa5
2024-08-15T06:21:45.5041240Z   github.com/pingcap/failpoint.newTerms()
2024-08-15T06:21:45.5042312Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/terms.go:98 +0x3e
2024-08-15T06:21:45.5043144Z   github.com/pingcap/failpoint.(*Failpoint).Enable()
2024-08-15T06:21:45.5044163Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/failpoint.go:54 +0x3e
2024-08-15T06:21:45.5045099Z   github.com/pingcap/failpoint.(*Failpoints).Enable()
2024-08-15T06:21:45.5046462Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/failpoints.go:105 +0x296
2024-08-15T06:21:45.5047541Z   github.com/pingcap/failpoint.Enable()
2024-08-15T06:21:45.5049048Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/failpoints.go:222 +0x134
2024-08-15T06:21:45.5050801Z   github.com/tikv/pd/tests.(*SchedulingTestEnvironment).RunTestInAPIMode()
2024-08-15T06:21:45.5052181Z       /home/runner/work/pd/pd/tests/testutil.go:362 +0x135
2024-08-15T06:21:45.5053333Z   github.com/pingcap/failpoint.parseTerm()
2024-08-15T06:21:45.5055226Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/terms.go:149 +0x364
2024-08-15T06:21:45.5056674Z   github.com/pingcap/failpoint.parse()
2024-08-15T06:21:45.5059122Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/terms.go:126 +0xa5
2024-08-15T06:21:45.5060619Z   github.com/pingcap/failpoint.newTerms()
2024-08-15T06:21:45.5062522Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/terms.go:98 +0x3e
2024-08-15T06:21:45.5064152Z   github.com/pingcap/failpoint.(*Failpoint).Enable()
2024-08-15T06:21:45.5066060Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/failpoint.go:54 +0x3e
2024-08-15T06:21:45.5067583Z   github.com/pingcap/failpoint.(*Failpoints).Enable()
2024-08-15T06:21:45.5069709Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/failpoints.go:105 +0x296
2024-08-15T06:21:45.5070013Z   github.com/pingcap/failpoint.Enable()
2024-08-15T06:21:45.5070784Z       /home/runner/go/pkg/mod/github.com/pingcap/failpoint@v0.0.0-20220801062533-2eaa32854a6c/failpoints.go:222 +0xf6
2024-08-15T06:21:45.5071184Z   github.com/tikv/pd/tests.(*SchedulingTestEnvironment).RunTestInAPIMode()
2024-08-15T06:21:45.5071555Z       /home/runner/work/pd/pd/tests/testutil.go:361 +0xf7
2024-08-15T06:21:45.5072150Z   github.com/tikv/pd/tests/integrations/mcs/scheduling_test.(*apiTestSuite).TestAPIForward()
2024-08-15T06:21:45.5072645Z       /home/runner/work/pd/pd/tests/integrations/mcs/scheduling/api_test.go:102 +0x6f
2024-08-15T06:21:45.5072783Z   runtime.call16()
2024-08-15T06:21:45.5073212Z       /opt/hostedtoolcache/go/1.21.13/x64/src/runtime/asm_amd64.s:747 +0x42
2024-08-15T06:21:45.5073360Z   reflect.Value.Call()
2024-08-15T06:21:45.5073730Z       /opt/hostedtoolcache/go/1.21.13/x64/src/reflect/value.go:380 +0xb5
2024-08-15T06:21:45.5074259Z   github.com/stretchr/testify/suite.Run.func1()
2024-08-15T06:21:45.5074804Z       /home/runner/go/pkg/mod/github.com/stretchr/testify@v1.8.4/suite/suite.go:197 +0x766
2024-08-15T06:21:45.5074943Z   testing.tRunner()
2024-08-15T06:21:45.5075374Z       /opt/hostedtoolcache/go/1.21.13/x64/src/testing/testing.go:1595 +0x261
2024-08-15T06:21:45.5075529Z   testing.(*T).Run.func1()
2024-08-15T06:21:45.5076011Z       /opt/hostedtoolcache/go/1.21.13/x64/src/testing/testing.go:1648 +0x44
2024-08-15T06:21:45.5076130Z ==================

What version of PD are you using (pd-server -V)?

@lhy1024 lhy1024 added the type/bug The issue is confirmed as a bug. label Aug 15, 2024

lhy1024 commented Aug 15, 2024

There is a data race between updateAPIServerMemberLoop and primaryElectionLoop.

In updateAPIServerMemberLoop, the cluster is still healthy when the code reaches s.IsServing().

But when the code reaches https://github.com/tikv/pd/blob/1b8fc6a950dbe658a0051c7f8228367b85bc9180/pkg/mcs/scheduling/server/server.go#L196-215, the PD cluster happens to switch leader because of TestFollowerForward, so one of the requests in updateAPIServerMemberLoop hangs without timing out.

In primaryElectionLoop, there is no leader in etcd, so the scheduling primary loses its lease and tries to campaign:
2024-08-15T06:21:45.1786669Z [2024/08/15 06:20:56.934 +00:00] [INFO] [server.go:329] ["no longer a primary/leader because lease has expired, the scheduling primary/leader will step down"]

When primaryElectionLoop campaigns for primary, it has to start the cluster again, so s.cluster is written again:

s.cluster, err = NewCluster(s.Context(), s.persistConfig, s.storage, s.basicCluster, s.hbStreams, s.clusterID, s.checkMembershipCh)

At the same time, updateAPIServerMemberLoop calls s.cluster.SwitchAPIServerLeader, which reads s.cluster. This unsynchronized read and write is the data race.
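The access pattern described above can be sketched in isolation. The following is a minimal, self-contained illustration and not the change made in #8539: the `server` and `cluster` types, `setCluster`, `getCluster`, and `switchLeader` are all hypothetical stand-ins. It assumes one goroutine replaces the cluster pointer (as primaryElectionLoop does through startCluster) while another reads it (as updateAPIServerMemberLoop does), and guards the field with a sync.RWMutex so the race detector has nothing to report.

```go
package main

import (
	"fmt"
	"sync"
)

// cluster is a hypothetical stand-in for the scheduling server's Cluster.
type cluster struct {
	id int
}

// switchLeader stands in for SwitchAPIServerLeader; it only reports
// which cluster instance it was called on.
func (c *cluster) switchLeader() int { return c.id }

// server is a hypothetical stand-in for the scheduling Server.
// mu protects the cluster pointer, which is replaced on every campaign.
type server struct {
	mu      sync.RWMutex
	cluster *cluster
}

// setCluster replaces the cluster pointer under the write lock.
func (s *server) setCluster(c *cluster) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cluster = c
}

// getCluster returns the current cluster pointer under the read lock.
func (s *server) getCluster() *cluster {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.cluster
}

func main() {
	s := &server{}
	s.setCluster(&cluster{id: 1})

	var wg sync.WaitGroup
	wg.Add(2)
	// Writer: plays the role of primaryElectionLoop restarting the cluster.
	go func() {
		defer wg.Done()
		for i := 2; i <= 100; i++ {
			s.setCluster(&cluster{id: i})
		}
	}()
	// Reader: plays the role of updateAPIServerMemberLoop.
	go func() {
		defer wg.Done()
		for i := 0; i < 100; i++ {
			if c := s.getCluster(); c != nil {
				_ = c.switchLeader()
			}
		}
	}()
	wg.Wait()
	fmt.Println("final cluster id:", s.getCluster().id)
}
```

Replacing the two accessors with direct reads and writes of the `cluster` field and running the program with `go run -race` should produce a report of the same shape as the one in this issue.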


lhy1024 commented Aug 15, 2024

| Time | updateAPIServerMemberLoop | primaryElectionLoop | PD server status |
| --- | --- | --- | --- |
| t0 | pass s.IsServing() | isPrimary | normal |
| t1 | | | switch leader |
| t2 | suspended, no timeout | lease timeout | |
| t3 | suspended, no timeout | campaign | |
| t4 | | | normal again |
| t5 | read s.cluster | write s.cluster | |

ti-chi-bot bot added a commit that referenced this issue Dec 12, 2024
close #8538

Signed-off-by: lhy1024 <admin@liudos.us>

Co-authored-by: lhy1024 <admin@liudos.us>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>