@rleungx the root cause isn't "Data race caused by watchServiceAddrs".
The real root cause is a multi-thread locking issue in RaftCluster.Start. RaftCluster.Start can be invoked concurrently from different goroutines -- one from becoming the new leader, another from the Server.BootStrap RPC called by tests or other callers. Both goroutines check that the RaftCluster isn't running, so both proceed to c.Lock; only one enters the critical section and completes all the start work, including GroupManager.Bootstrap, in which watchServiceAddrs can dynamically update serverRegistryMap. After the first goroutine exits the critical section, the second one enters, calls GroupManager.Bootstrap() again, and concurrently updates serverRegistryMap together with the previously created watch loop. The fundamental issue is that the second goroutine should use double-checked locking: after acquiring the lock, check again whether the RaftCluster is already running and, if so, exit the critical section without doing anything.
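For illustration, here is a minimal, self-contained sketch of the racy shape (the cluster type, its fields, and the counter below are hypothetical stand-ins for illustration, not PD's actual code):

package main

import (
    "fmt"
    "sync"
)

// cluster is a simplified, hypothetical stand-in for RaftCluster,
// only meant to illustrate the racy pattern described above.
type cluster struct {
    sync.RWMutex
    running        bool
    bootstrapCalls int
}

func (c *cluster) IsRunning() bool {
    c.RLock()
    defer c.RUnlock()
    return c.running
}

// Start mirrors the buggy shape: two goroutines can both pass the
// pre-lock check, and both then do the bootstrap work one after another.
func (c *cluster) Start() {
    if c.IsRunning() {
        return
    }
    c.Lock()
    defer c.Unlock()
    c.bootstrapCalls++ // stands in for GroupManager.Bootstrap
    c.running = true
}

func main() {
    c := &cluster{}
    var wg sync.WaitGroup
    for i := 0; i < 2; i++ { // e.g. leader campaign and the BootStrap RPC
        wg.Add(1)
        go func() {
            defer wg.Done()
            c.Start()
        }()
    }
    wg.Wait()
    // Depending on scheduling this can print 2, i.e. the bootstrap
    // work ran twice, which is the double-Bootstrap described above.
    fmt.Println("bootstrap calls:", c.bootstrapCalls)
}

Run it a few times and the counter can report 2; the second Bootstrap is what races with the watch loop in the real code.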
Thanks to the serverRegistryMap we added, which revealed this long-running bug; it's not clear how much impact it has had before.
// Start starts a cluster.
func (c *RaftCluster) Start(s Server) error {
    if c.IsRunning() {
        log.Warn("raft cluster has already been started")
        return nil
    }
    c.Lock()
    defer c.Unlock()
    // We should check the running state again here and return if the cluster is already running.
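For reference, a minimal sketch of what the double-checked version could look like. This assumes the running state is stored in a running field guarded by the same lock (the actual field name in RaftCluster may differ); the field is read directly instead of calling IsRunning() again so that the read lock is not re-taken while the write lock is held.

// Start starts a cluster. Sketch only; field names are assumptions.
func (c *RaftCluster) Start(s Server) error {
    if c.IsRunning() {
        log.Warn("raft cluster has already been started")
        return nil
    }
    c.Lock()
    defer c.Unlock()
    // Double check under the write lock: another goroutine may have
    // finished Start while we were waiting on c.Lock().
    if c.running {
        log.Warn("raft cluster has already been started")
        return nil
    }
    // ... the rest of the start work, including GroupManager.Bootstrap ...
    c.running = true
    return nil
}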
The serverRegistryMap we added is what revealed this bug, which has existed for a long time.
Flaky Test
Which jobs are failing
CI link
https://github.com/tikv/pd/actions/runs/4751210932/jobs/8440122477
Reason for failure (if possible)
Anything else