@rleungx the root cause isn't "Data race caused by watchServiceAddrs".
The real root cause is a multi-thread locking issue in RaftCluster.Start. RaftCluster.Start can be invoked concurrently from different goroutines -- one from becoming the new leader, another from the Server.BootStrap RPC called by tests or other callers. Both goroutines check that the RaftCluster isn't running, so both proceed to c.Lock; only one enters the critical section and completes all the start work, including GroupManager.Bootstrap, in which watchServiceAddrs can dynamically update serverRegistryMap. After the first goroutine exits the critical section, the second one enters, calls GroupManager.Bootstrap() again, and concurrently updates serverRegistryMap together with the previously created watch loop. The fundamental issue is that the second goroutine should use double-checked locking: after acquiring the lock, check again whether the RaftCluster is already running and, if so, exit the critical section without doing anything.
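For illustration, here is a minimal, self-contained sketch of the racy shape (the cluster type, its fields, and the counter below are hypothetical stand-ins for illustration, not PD's actual code):

package main

import (
    "fmt"
    "sync"
)

// cluster is a simplified, hypothetical stand-in for RaftCluster,
// only meant to illustrate the racy pattern described above.
type cluster struct {
    sync.RWMutex
    running        bool
    bootstrapCalls int
}

func (c *cluster) IsRunning() bool {
    c.RLock()
    defer c.RUnlock()
    return c.running
}

// Start mirrors the buggy shape: two goroutines can both pass the
// pre-lock check, and both then do the bootstrap work one after another.
func (c *cluster) Start() {
    if c.IsRunning() {
        return
    }
    c.Lock()
    defer c.Unlock()
    c.bootstrapCalls++ // stands in for GroupManager.Bootstrap
    c.running = true
}

func main() {
    c := &cluster{}
    var wg sync.WaitGroup
    for i := 0; i < 2; i++ { // e.g. leader campaign and the BootStrap RPC
        wg.Add(1)
        go func() {
            defer wg.Done()
            c.Start()
        }()
    }
    wg.Wait()
    // Depending on scheduling this can print 2, i.e. the bootstrap
    // work ran twice, which is the double-Bootstrap described above.
    fmt.Println("bootstrap calls:", c.bootstrapCalls)
}

Run it a few times and the counter can report 2; the second Bootstrap is what races with the watch loop in the real code.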
Thanks to the serverRegistryMap we added, which revealed this long-running bug; it's not clear how much impact it has had before.
// Start starts a cluster.
func (c *RaftCluster) Start(s Server) error {
    if c.IsRunning() {
        log.Warn("raft cluster has already been started")
        return nil
    }
    c.Lock()
    defer c.Unlock()
    // We should check the running state again here and return if the cluster is already running.
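For reference, a minimal sketch of what the double-checked version could look like. This assumes the running state is stored in a running field guarded by the same lock (the actual field name in RaftCluster may differ); the field is read directly instead of calling IsRunning() again so that the read lock is not re-taken while the write lock is held.

// Start starts a cluster. Sketch only; field names are assumptions.
func (c *RaftCluster) Start(s Server) error {
    if c.IsRunning() {
        log.Warn("raft cluster has already been started")
        return nil
    }
    c.Lock()
    defer c.Unlock()
    // Double check under the write lock: another goroutine may have
    // finished Start while we were waiting on c.Lock().
    if c.running {
        log.Warn("raft cluster has already been started")
        return nil
    }
    // ... the rest of the start work, including GroupManager.Bootstrap ...
    c.running = true
    return nil
}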
The serverRegistryMap we added is what revealed this bug, which has existed for a long time.
Flaky Test
Which jobs are failing
CI link
https://github.com/tikv/pd/actions/runs/4751210932/jobs/8440122477
Reason for failure (if possible)
Anything else