Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data race in dispatcher - dispatcher.mu locking vs dispatcher.rpcRW locking #2577

Closed
cyli opened this issue Mar 28, 2018 · 7 comments
Closed

Comments

@cyli
Copy link
Contributor

cyli commented Mar 28, 2018

Just saw the following data race while running tests:

=== RUN   TestManager
time="2018-03-28T21:34:46Z" level=warning msg="no certificate expiration specified, using default" cluster.id=7pkm17saqzx191nef496p4pdj method="(*Server).UpdateRootCA" module=ca testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="Listening for connections" addr="127.0.0.1:52039" proto=tcp testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="Listening for local connections" addr=/tmp/test-socket139892905 proto=unix testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="4477f5b8ab5dd668 became follower at term 0" module=raft testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="newRaft 4477f5b8ab5dd668 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]" module=raft testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="4477f5b8ab5dd668 became follower at term 1" module=raft testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="4477f5b8ab5dd668 is starting a new election at term 1" module=raft testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="4477f5b8ab5dd668 became candidate at term 2" module=raft testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="4477f5b8ab5dd668 received MsgVoteResp from 4477f5b8ab5dd668 at term 2" module=raft testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="4477f5b8ab5dd668 became leader at term 2" module=raft testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="raft.node: 4477f5b8ab5dd668 elected leader 4477f5b8ab5dd668 at term 2" module=raft testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="Creating default ingress network" testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="dispatcher starting" module=dispatcher testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="leadership changed from not yet part of a raft cluster to uhs3vt3cw07lsukag7ex1uq21" testHasExternalCA=false testname=TestManager
==================
WARNING: DATA RACE
Write at 0x00c420389cc0 by goroutine 111:
  github.com/docker/swarmkit/manager/dispatcher.(*Dispatcher).Run()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/dispatcher/dispatcher.go:268 +0x6ef
  github.com/docker/swarmkit/manager.(*Manager).becomeLeader.func3()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/manager.go:997 +0x1ea

Previous read at 0x00c420389cc0 by goroutine 106:
  github.com/docker/swarmkit/manager/dispatcher.(*Dispatcher).isRunning()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/dispatcher/dispatcher.go:443 +0x4b
  github.com/docker/swarmkit/manager/dispatcher.(*Dispatcher).Heartbeat()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/dispatcher/dispatcher.go:1152 +0xc3
  github.com/docker/swarmkit/api.(*authenticatedWrapperDispatcherServer).Heartbeat()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/api/dispatcher.pb.go:472 +0x17d
  github.com/docker/swarmkit/api.(*raftProxyDispatcherServer).Heartbeat()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/api/dispatcher.pb.go:1728 +0x272
  github.com/docker/swarmkit/api._Dispatcher_Heartbeat_Handler.func1()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/api/dispatcher.pb.go:1009 +0xa1
  github.com/docker/swarmkit/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.UnaryServerInterceptor()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server.go:29 +0xfa
  github.com/docker/swarmkit/manager.New.func1()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/manager.go:244 +0x91
  github.com/docker/swarmkit/api._Dispatcher_Heartbeat_Handler()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/api/dispatcher.pb.go:1011 +0x1ec
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*Server).processUnaryRPC()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/server.go:753 +0x10e4
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*Server).handleStream()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/server.go:955 +0x14a0
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/server.go:517 +0xac

Goroutine 111 (running) created at:
  github.com/docker/swarmkit/manager.(*Manager).becomeLeader()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/manager.go:994 +0x1c80
  github.com/docker/swarmkit/manager.(*Manager).handleLeadershipEvents()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/manager.go:851 +0x449

Goroutine 106 (finished) created at:
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*Server).serveStreams.func1()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/server.go:515 +0xb8
  github.com/docker/swarmkit/vendor/google.golang.org/grpc/transport.(*http2Server).operateHeaders()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/transport/http2_server.go:323 +0x1331
  github.com/docker/swarmkit/vendor/google.golang.org/grpc/transport.(*http2Server).HandleStreams()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/transport/http2_server.go:387 +0xa28
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*Server).serveStreams()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/server.go:513 +0x1d1
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*Server).serveHTTP2Transport()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/server.go:506 +0x61c
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*Server).handleRawConn()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/server.go:475 +0x5c0
==================
==================
WARNING: DATA RACE
Write at 0x00c420d72090 by goroutine 158:
  sync/atomic.CompareAndSwapInt32()
      /usr/local/go/src/runtime/race_amd64.s:293 +0xb
  sync.(*Mutex).Lock()
      /usr/local/go/src/sync/mutex.go:74 +0x4d
  context.(*cancelCtx).Done()
      /usr/local/go/src/context/context.go:326 +0x47
  github.com/docker/swarmkit/manager/dispatcher.(*Dispatcher).isRunning()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/dispatcher/dispatcher.go:447 +0x87
  github.com/docker/swarmkit/manager/dispatcher.(*Dispatcher).Heartbeat()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/dispatcher/dispatcher.go:1152 +0xc3
  github.com/docker/swarmkit/api.(*authenticatedWrapperDispatcherServer).Heartbeat()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/api/dispatcher.pb.go:472 +0x17d
  github.com/docker/swarmkit/api.(*raftProxyDispatcherServer).Heartbeat()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/api/dispatcher.pb.go:1728 +0x272
  github.com/docker/swarmkit/api._Dispatcher_Heartbeat_Handler.func1()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/api/dispatcher.pb.go:1009 +0xa1
  github.com/docker/swarmkit/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.UnaryServerInterceptor()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server.go:29 +0xfa
  github.com/docker/swarmkit/manager.New.func1()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/manager.go:244 +0x91
  github.com/docker/swarmkit/api._Dispatcher_Heartbeat_Handler()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/api/dispatcher.pb.go:1011 +0x1ec
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*Server).processUnaryRPC()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/server.go:753 +0x10e4
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*Server).handleStream()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/server.go:955 +0x14a0
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/server.go:517 +0xac

Previous write at 0x00c420d72090 by goroutine 111:
  context.WithCancel()
      /usr/local/go/src/context/context.go:231 +0xfc
  github.com/docker/swarmkit/vendor/golang.org/x/net/context.WithCancel()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/golang.org/x/net/context/go17.go:33 +0x6a
  github.com/docker/swarmkit/manager/dispatcher.(*Dispatcher).Run()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/dispatcher/dispatcher.go:268 +0x6a5
  github.com/docker/swarmkit/manager.(*Manager).becomeLeader.func3()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/manager.go:997 +0x1ea

Goroutine 158 (running) created at:
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*Server).serveStreams.func1()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/server.go:515 +0xb8
  github.com/docker/swarmkit/vendor/google.golang.org/grpc/transport.(*http2Server).operateHeaders()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/transport/http2_server.go:323 +0x1331
  github.com/docker/swarmkit/vendor/google.golang.org/grpc/transport.(*http2Server).HandleStreams()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/transport/http2_server.go:387 +0xa28
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*Server).serveStreams()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/server.go:513 +0x1d1
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*Server).serveHTTP2Transport()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/server.go:506 +0x61c
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*Server).handleRawConn()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/server.go:475 +0x5c0

Goroutine 111 (running) created at:
  github.com/docker/swarmkit/manager.(*Manager).becomeLeader()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/manager.go:994 +0x1c80
  github.com/docker/swarmkit/manager.(*Manager).handleLeadershipEvents()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/manager.go:851 +0x449
==================
time="2018-03-28T21:34:46Z" level=info msg="Stopping manager" testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="dispatcher stopping" method="(*Dispatcher).Stop" module=dispatcher testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 127.0.0.1:52039: getsockopt: connection refused\"; Reconnecting to {127.0.0.1:52039 <nil>}" module=grpc
time="2018-03-28T21:34:46Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 127.0.0.1:52039: getsockopt: connection refused\"; Reconnecting to {127.0.0.1:52039 <nil>}" module=grpc
time="2018-03-28T21:34:46Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 127.0.0.1:52039: getsockopt: connection refused\"; Reconnecting to {127.0.0.1:52039 <nil>}" module=grpc
time="2018-03-28T21:34:46Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 127.0.0.1:52039: getsockopt: connection refused\"; Reconnecting to {127.0.0.1:52039 <nil>}" module=grpc
time="2018-03-28T21:34:46Z" level=info msg="Manager shut down" testHasExternalCA=false testname=TestManager
time="2018-03-28T21:34:46Z" level=info msg="Failed to dial 127.0.0.1:52039: context canceled; please retry." module=grpc
time="2018-03-28T21:34:46Z" level=info msg="Failed to dial 127.0.0.1:52039: context canceled; please retry." module=grpc
time="2018-03-28T21:34:46Z" level=info msg="Failed to dial 127.0.0.1:52039: context canceled; please retry." module=grpc
time="2018-03-28T21:34:46Z" level=info msg="Failed to dial 127.0.0.1:52039: context canceled; please retry." module=grpc
--- FAIL: TestManager (0.19s)
	testing.go:699: race detected during execution of test

The data race seems to be coming from this call of isRunning instead of isRunningLocked: https://github.com/docker/swarmkit/blob/master/manager/dispatcher/dispatcher.go#L1152

Not sure if the thing to do is to call isRunningLocked instead, or to merge mu and rpcRW.

@anshulpundir
Copy link
Contributor

The data race seems to be coming from this call of isRunning instead of isRunningLocked: https://github.com/docker/swarmkit/blob/master/manager/dispatcher/dispatcher.go#L1152

I think this can be ignored since isRunning() is a read-only op and if it reads stale value the only down side is an extra heartbeat processing.

Not sure if the thing to do is to call isRunningLocked instead, or to merge mu and rpcRW.

Merging mu and rpcRW is neither easy nor worth the effort. See #2525

@cyli
Copy link
Contributor Author

cyli commented Mar 28, 2018

I think this can be ignored since isRunning() is a read-only op and if it reads stale value the only down side is an extra heartbeat processing.

I agree that stale reads would be ok, but in general I'd prefer not to ignore data race warnings from the compiler since (1) we have to keep track of which ones are ok, and which ones aren't, and it'd be simpler if we should always just try to make CI pass. And (2) I think some go authors have specifically said that a data race is undefined behavior in golang, and the compiled code isn't guaranteed to behave correctly:

https://groups.google.com/forum/#!topic/golang-nuts/EHHMCdcenc8

@anshulpundir
Copy link
Contributor

Fair point. I'd definitely be interested in hearing ideas on how to address this without calling isRunningLocked() on every heartbeat.

@cyli
Copy link
Contributor Author

cyli commented Mar 29, 2018

Hmm... can we move the isRunningLocked check to require rpcRW.RLock() only, rather than d.mu.Lock? Since we only care about locking during the beginning of Run, during Stop, and to check if the dispatcher is running on RPC calls. The d.mu.Lock is for other attributes that will change as we receive cluster changes.

@anshulpundir
Copy link
Contributor

I think this could work. Thx for the suggestion! @cyli

#2580

@anshulpundir
Copy link
Contributor

Actually, thought about this more and I don't think it will not work. The streaming RPCs grab the read lock at the start of the function and hold on to the read lock until the context is cancelled. If the context is cancelled inside the write lock, then it'll never happen because the read lock will locked by the streaming RPC. @cyli

@cyli
Copy link
Contributor Author

cyli commented Jun 27, 2018

closed by #2664

@cyli cyli closed this as completed Jun 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants