Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data race caused by watchServiceAddrs #6351

Closed
rleungx opened this issue Apr 20, 2023 · 1 comment · Fixed by #6358
Closed

Data race caused by watchServiceAddrs #6351

rleungx opened this issue Apr 20, 2023 · 1 comment · Fixed by #6358
Assignees
Labels
type/ci The issue is related to CI.

Comments

@rleungx
Copy link
Member

rleungx commented Apr 20, 2023

Flaky Test

Which jobs are failing

==================
WARNING: DATA RACE
Read at 0x00c00364e8d0 by goroutine 73881:
  runtime.mapaccess2_faststr()
      /opt/hostedtoolcache/go/1.20.1/x64/src/runtime/map_faststr.go:108 +0x0
  github.com/tikv/pd/pkg/keyspace.(*GroupManager).watchServiceAddrs()
      /home/runner/work/pd/pd/pkg/keyspace/tso_keyspace_group.go:238 +0xd92
  github.com/tikv/pd/pkg/keyspace.(*GroupManager).startWatchLoop()
      /home/runner/work/pd/pd/pkg/keyspace/tso_keyspace_group.go:190 +0x85d
  github.com/tikv/pd/pkg/keyspace.(*GroupManager).Bootstrap.func2()
      /home/runner/work/pd/pd/pkg/keyspace/tso_keyspace_group.go:138 +0x39

Previous write at 0x00c00364e8d0 by goroutine 67981:
  runtime.mapdelete_faststr()
      /opt/hostedtoolcache/go/1.20.1/x64/src/runtime/map_faststr.go:301 +0x0
  github.com/tikv/pd/pkg/keyspace.(*GroupManager).watchServiceAddrs()
      /home/runner/work/pd/pd/pkg/keyspace/tso_keyspace_group.go:239 +0xe1b
  github.com/tikv/pd/pkg/keyspace.(*GroupManager).startWatchLoop()
      /home/runner/work/pd/pd/pkg/keyspace/tso_keyspace_group.go:190 +0x85d
  github.com/tikv/pd/pkg/keyspace.(*GroupManager).Bootstrap.func2()
      /home/runner/work/pd/pd/pkg/keyspace/tso_keyspace_group.go:138 +0x39

Goroutine 73881 (running) created at:
  github.com/tikv/pd/pkg/keyspace.(*GroupManager).Bootstrap()
      /home/runner/work/pd/pd/pkg/keyspace/tso_keyspace_group.go:138 +0xa39
  github.com/tikv/pd/server/cluster.(*RaftCluster).Start()
      /home/runner/work/pd/pd/server/cluster/cluster.go:278 +0x384
  github.com/tikv/pd/server.(*Server).createRaftCluster()
      /home/runner/work/pd/pd/server/server.go:706 +0x153
  github.com/tikv/pd/server.(*Server).campaignLeader()
      /home/runner/work/pd/pd/server/server.go:1519 +0x1d86
  github.com/tikv/pd/server.(*Server).leaderLoop()
      /home/runner/work/pd/pd/server/server.go:1445 +0x1184
  github.com/tikv/pd/server.(*Server).startServerLoop.func1()
      /home/runner/work/pd/pd/server/server.go:565 +0x39

Goroutine 67981 (running) created at:
  github.com/tikv/pd/pkg/keyspace.(*GroupManager).Bootstrap()
      /home/runner/work/pd/pd/pkg/keyspace/tso_keyspace_group.go:138 +0xa39
  github.com/tikv/pd/server/cluster.(*RaftCluster).Start()
      /home/runner/work/pd/pd/server/cluster/cluster.go:278 +0x384
  github.com/tikv/pd/server.(*Server).bootstrapCluster()
      /home/runner/work/pd/pd/server/server.go:688 +0x1ba4
  github.com/tikv/pd/server.(*GrpcServer).Bootstrap()
      /home/runner/work/pd/pd/server/grpc_service.go:291 +0x4be
  github.com/tikv/pd/tests.(*TestServer).BootstrapCluster()
      /home/runner/work/pd/pd/tests/cluster.go:396 +0x644
  github.com/tikv/pd/tests/integrations/mcs/tso.(*CommonTestSuite).SetupSuite()
      /home/runner/work/pd/pd/tests/integrations/mcs/tso/server_test.go:350 +0x484
  github.com/stretchr/testify/suite.Run()
      /home/runner/go/pkg/mod/github.com/stretchr/testify@v1.8.2/suite/suite.go:154 +0x5d7
  github.com/tikv/pd/tests/integrations/mcs/tso.TestCommonTestSuite()
      /home/runner/work/pd/pd/tests/integrations/mcs/tso/server_test.go:334 +0x44
  testing.tRunner()
      /opt/hostedtoolcache/go/1.20.1/x64/src/testing/testing.go:1576 +0x216
  testing.(*T).Run.func1()
      /opt/hostedtoolcache/go/1.20.1/x64/src/testing/testing.go:1629 +0x47
==================

CI link

https://github.com/tikv/pd/actions/runs/4751210932/jobs/8440122477

Reason for failure (if possible)

Anything else

@rleungx rleungx added the type/ci The issue is related to CI. label Apr 20, 2023
@binshi-bing binshi-bing self-assigned this Apr 22, 2023
@binshi-bing
Copy link
Contributor

binshi-bing commented Apr 22, 2023

@rleungx the root cause isn't "Data race caused by watchServiceAddrs".

The real root cause is that the multi-thread locking issue in RaftCluster.Start. RaftCluster.Start can be invoked concurrently in different routines -- one is from becoming new leader; another is Server.BootStrap RPC called by test or other callers. Both routines check that RaftCluster isn't running, so both proceed to c.Lock and only one enters the critical section and complete all the start work including GroupManager.Bootstrap in which watchServiceAddrs could dynamically update the serverRegistryMap; after the first one exits the critical section, the second one enters and calls GroupManager.Bootstrap() again and it will concurrently update serverRegistryMap with the previously created loop. The fundermal issue is that the second one should use double-checked locking to check again to see if RaftCluster is running. If yes, then exit the critical section without doing anything.

Thanks for the serverRegistryMap we added which reveals this long running bug to which not sure how much impact it brought before.

// Start starts a cluster.
func (c *RaftCluster) Start(s Server) error {
if c.IsRunning() {
log.Warn("raft cluster has already been started")
return nil
}

c.Lock()
defer c.Unlock()

     // we should check if c. IsRunning() if it's running return here.

Because of serverRegistryMap we added, it reveals this bug existing for long time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/ci The issue is related to CI.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants