This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

deadlock in route calculation #643

Closed
rade opened this issue May 10, 2015 · 7 comments · Fixed by #644

@rade
Member

rade commented May 10, 2015

I came across this when running NUM_WEAVES=50 bin/multiweave launch -connlimit 100...

The pertinent part of the stack trace is

goroutine 6 [semacquire, 4 minutes]:
sync.(*RWMutex).RLock(0xc2092dcaf0)
    /usr/local/go/src/sync/rwmutex.go:36 +0x5f
github.com/weaveworks/weave/router.(*Peer).Routes(0xc208010b60, 0x0, 0xc209253100, 0x0, 0x0)
    /home/matthias/go/src/github.com/weaveworks/weave/router/peer.go:167 +0x427
github.com/weaveworks/weave/router.(*Peers).garbageCollect(0xc2090cf500, 0x0, 0x0, 0x0)
    /home/matthias/go/src/github.com/weaveworks/weave/router/peers.go:178 +0x9b
github.com/weaveworks/weave/router.(*Peers).GarbageCollect(0xc2090cf500, 0x0, 0x0, 0x0)
    /home/matthias/go/src/github.com/weaveworks/weave/router/peers.go:147 +0x82
github.com/weaveworks/weave/router.(*LocalPeer).handleDeleteConnection(0xc2090d2fc0, 0x7fe5b7ca7760, 0xc2093001c0)
    /home/matthias/go/src/github.com/weaveworks/weave/router/local_peer.go:227 +0x3d3
github.com/weaveworks/weave/router.(*LocalPeer).handleAddConnection(0xc2090d2fc0, 0x7fe5b7ca7760, 0xc2092667e0, 0x0, 0x0)
    /home/matthias/go/src/github.com/weaveworks/weave/router/local_peer.go:174 +0x58e
github.com/weaveworks/weave/router.func·019()
    /home/matthias/go/src/github.com/weaveworks/weave/router/local_peer.go:121 +0x70
github.com/weaveworks/weave/router.(*LocalPeer).actorLoop(0xc2090d2fc0, 0xc20805c1e0)
    /home/matthias/go/src/github.com/weaveworks/weave/router/local_peer.go:150 +0xac
created by github.com/weaveworks/weave/router.(*LocalPeer).Start
    /home/matthias/go/src/github.com/weaveworks/weave/router/local_peer.go:25 +0x7e

goroutine 8 [semacquire, 4 minutes]:
sync.(*RWMutex).RLock(0xc2092dca80)
    /usr/local/go/src/sync/rwmutex.go:36 +0x5f
github.com/weaveworks/weave/router.(*Peer).Routes(0xc208010b60, 0x0, 0x1, 0xc20801b700, 0x0)
    /home/matthias/go/src/github.com/weaveworks/weave/router/peer.go:167 +0x427
github.com/weaveworks/weave/router.(*Routes).calculateUnicast(0xc20805c180, 0x1, 0x2)
    /home/matthias/go/src/github.com/weaveworks/weave/router/routes.go:193 +0x3e
github.com/weaveworks/weave/router.(*Routes).calculate(0xc20805c180)
    /home/matthias/go/src/github.com/weaveworks/weave/router/routes.go:170 +0x2d
github.com/weaveworks/weave/router.(*Routes).run(0xc20805c180, 0xc20805c240, 0xc20805c2a0)
    /home/matthias/go/src/github.com/weaveworks/weave/router/routes.go:156 +0x95
created by github.com/weaveworks/weave/router.(*Routes).Start
    /home/matthias/go/src/github.com/weaveworks/weave/router/routes.go:44 +0xd4

goroutine 35 [semacquire, 4 minutes]:
sync.(*RWMutex).Lock(0xc2092dcaf0)
    /usr/local/go/src/sync/rwmutex.go:87 +0xa4
github.com/weaveworks/weave/router.(*Peer).DecrementLocalRefCount(0xc2092dcaf0)
    /home/matthias/go/src/github.com/weaveworks/weave/router/peer.go:58 +0x2f
github.com/weaveworks/weave/router.(*LocalConnection).shutdown(0xc2093001c0, 0x7fe5b8edc4c0, 0xc2093e7ef0)
    /home/matthias/go/src/github.com/weaveworks/weave/router/connection.go:415 +0x1d5
github.com/weaveworks/weave/router.func·005()
    /home/matthias/go/src/github.com/weaveworks/weave/router/connection.go:288 +0x37
github.com/weaveworks/weave/router.(*LocalConnection).run(0xc2093001c0, 0xc20922a780, 0xc20922a7e0, 0x8ec700)
    /home/matthias/go/src/github.com/weaveworks/weave/router/connection.go:316 +0x3f4
created by github.com/weaveworks/weave/router.(*LocalConnection).Start
    /home/matthias/go/src/github.com/weaveworks/weave/router/connection.go:120 +0xdd

goroutine 33 [semacquire, 4 minutes]:
sync.(*RWMutex).Lock(0xc2092dca80)
    /usr/local/go/src/sync/rwmutex.go:87 +0xa4
github.com/weaveworks/weave/router.(*Peer).DecrementLocalRefCount(0xc2092dca80)
    /home/matthias/go/src/github.com/weaveworks/weave/router/peer.go:58 +0x2f
github.com/weaveworks/weave/router.(*LocalConnection).shutdown(0xc209300000, 0x7fe5b7ca7998, 0xc2093e8680)
    /home/matthias/go/src/github.com/weaveworks/weave/router/connection.go:415 +0x1d5
github.com/weaveworks/weave/router.func·005()
    /home/matthias/go/src/github.com/weaveworks/weave/router/connection.go:288 +0x37
github.com/weaveworks/weave/router.(*LocalConnection).run(0xc209300000, 0xc20922a0c0, 0xc20922a120, 0x8ec700)
    /home/matthias/go/src/github.com/weaveworks/weave/router/connection.go:316 +0x3f4
created by github.com/weaveworks/weave/router.(*LocalConnection).Start
    /home/matthias/go/src/github.com/weaveworks/weave/router/connection.go:120 +0xdd

The first goroutine is the LocalPeer actor, which performs a route calculation in order to figure out which peers to gc after a connection was deleted. The second goroutine is the Routes actor, which is performing one of its several route calculations. Both goroutines are stuck in the same place, attempting to acquire a read lock on a (different) Peer. The reason they can't get the lock is that there are pending write lock acquisitions on the two peers, by the last two goroutines, which are terminating connection actors attempting to decrement the reference count.

Why don't the write lock acquisitions succeed? The reason must be that some other goroutine(s) hold a read lock. We cannot see that from the stack traces, but by far the most likely candidates are the first two goroutines holding a read lock on the peer the other is attempting to acquire.

The issue here then is that the two route calculations attempt to acquire the locks in a different order. Partially by accident - due to golang's random map traversal (of connection maps, in this case) - and partially by design (since different route calculations perform different traversals of the topology).

The easiest way to fix this is probably to prevent concurrent route calculations, especially since there are only two goroutines - the ones we are seeing here - that ever perform these. It is also worth observing that the route calculation performed by the peer gc is identical to the unicastAll route calculation performed by the Routes actor.
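The serialisation idea can be sketched as follows; this is a minimal, hypothetical illustration (`peer`, `routeCalc`, and `calcMutex` are invented names, not weave's actual API). A single mutex around route calculations ensures the per-peer read locks are never acquired concurrently in conflicting orders, so two traversals in opposite orders complete instead of deadlocking:

```go
package main

import (
	"fmt"
	"sync"
)

type peer struct {
	sync.RWMutex
	name string
}

// calcMutex serialises all route calculations, so no two traversals can
// interleave their per-peer lock acquisitions.
var calcMutex sync.Mutex

// routeCalc takes the read lock on each peer in the given traversal order,
// standing in for reading each peer's connection map. It returns the number
// of peers visited.
func routeCalc(order []*peer) int {
	calcMutex.Lock()
	defer calcMutex.Unlock()
	for _, p := range order {
		p.RLock()
	}
	for _, p := range order {
		p.RUnlock()
	}
	return len(order)
}

func main() {
	a, b := &peer{name: "a"}, &peer{name: "b"}
	var wg sync.WaitGroup
	wg.Add(2)
	// Two goroutines traverse the topology in opposite orders, as the
	// LocalPeer and Routes actors can; without calcMutex this pattern can
	// deadlock once writers queue for the same locks.
	go func() { defer wg.Done(); routeCalc([]*peer{a, b}) }()
	go func() { defer wg.Done(); routeCalc([]*peer{b, a}) }()
	wg.Wait()
	fmt.Println("both calculations completed")
}
```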

@rade rade added the bug label May 10, 2015
@dpw
Contributor

dpw commented May 10, 2015

The per-peer locks always seemed like overkill to me, and that kind of fine-grained locking is hard to get right. What about just having a router-wide lock covering all routing information?

@rade
Member Author

rade commented May 10, 2015

I did consider that but such a lock would be on the critical path for everything, and would have very high contention. I'm not ruling out such an approach, but atm I'm not giving up on the peer locks just yet.

@dpw
Contributor

dpw commented May 10, 2015

would have very high contention

I guess I'm missing something here - what's the reason to be concerned about contention?

I can imagine that read contention could be high. But with a RW lock, we shouldn't need to be too concerned about that. I had assumed that write operations are mostly in response to topology changes, and so are not terribly frequent. And if the lock in question just guards routing and peer information, write critical sections shouldn't be able to block on anything else. And I'd guess that even the main routing calculation takes microseconds for networks of the maximum size we are concerned with at present.

@rade
Member Author

rade commented May 10, 2015

Weave is meant to cope fairly well with a fairly high churn rate in the topology. With a single lock, all topology changes would stall the data path. No idea how much of a problem that is.

@dpw
Contributor

dpw commented May 10, 2015

What's a fairly high rate? I'm guessing that means thousands of changes a second, not millions. That's actually a modest level of contention if the critical sections are short. I might well be overlooking something, but the only area that springs to my mind as a critical section that wouldn't be short is the routing calculation.

If we are genuinely aiming for high-performance concurrency, it might be worth looking at how the Linux kernel handles this kind of read-mostly situation: not through conventional locking, but through RCU (read-copy-update). In RCU, readers don't need to take a lock, because they just read an immutable data structure. Writers update a copy of that data structure, and publish it with an atomic pointer update (this needs to be done with a little care, due to CPU memory ordering models; this is what atomic.StorePointer is for in go).

(The implementation of RCU in the Linux kernel is a lot more involved than that, because there is no garbage collection to clean up old copies. The details of this and many other low-level concurrency techniques are described in Paul McKenney's work-in-progress book at https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html )

The kernel does have RW locks, but RCU is often preferred for two reasons:

  • Although RW locks promise that concurrent readers can take the lock without contention, real-world RW lock implementations write to the lock's cache line even when a reader acquires or releases. The resulting cache line contention can mean that, in many cases, RW locks provide no more scalability than exclusive locks. RCU does not suffer from this, because RCU readers only read from the shared memory locations.
  • With RCU, while a writer is computing the updated data structure, readers can still proceed on the basis of the old copy.

That second point is particularly relevant to the issue you raised of avoiding stalling weave's data path while topology changes and routing calculations are occurring.

@rade
Member Author

rade commented May 10, 2015

Weave does not hold write locks during routing calculations.

Re alternative concurrency features... I have little appetite to completely revamp weave's concurrency control as part of this issue.

Anyway, I will see how hideous usage of a single lock gets. The lack of re-entrancy of golang's locks is one obstacle, which requires either a set of parallel APIs or breaking encapsulation. But perhaps we are lucky and few methods on Peer and Connection are called both with and without the single lock held.
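The "parallel APIs" workaround for the non-re-entrant locks looks roughly like this; a minimal sketch with invented names (`router`, `HasPeer`, `hasPeer`), not weave's actual code. Each exported method acquires the lock and delegates to an unexported variant that assumes the caller already holds it:

```go
package main

import (
	"fmt"
	"sync"
)

type router struct {
	sync.RWMutex
	peers map[string]bool
}

// HasPeer is the public entry point: it acquires the lock itself.
func (r *router) HasPeer(name string) bool {
	r.RLock()
	defer r.RUnlock()
	return r.hasPeer(name)
}

// hasPeer assumes the caller already holds the lock; calling HasPeer from
// a method that holds the lock would deadlock, since Go's locks are not
// re-entrant.
func (r *router) hasPeer(name string) bool {
	return r.peers[name]
}

// GarbageCollect holds the write lock and therefore must only call the
// unlocked variants internally.
func (r *router) GarbageCollect() int {
	r.Lock()
	defer r.Unlock()
	removed := 0
	for name := range r.peers {
		if r.hasPeer(name) && name == "stale" {
			delete(r.peers, name)
			removed++
		}
	}
	return removed
}

func main() {
	r := &router{peers: map[string]bool{"a": true, "stale": true}}
	fmt.Println(r.HasPeer("a"))     // true
	fmt.Println(r.GarbageCollect()) // 1
	fmt.Println(r.HasPeer("stale")) // false
}
```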

@rade
Member Author

rade commented May 11, 2015

@dpw btw, there is only one place in the code where we acquire locks on multiple peers - the route calculation.

@rade rade self-assigned this May 11, 2015
rade added a commit to rade/weave that referenced this issue May 11, 2015
- get rid of individual Peer locks
- hold Peers read (or write) lock and LocalPeer read lock while
  performing route calculations

This eliminates the deadlock resulting from the inconsistent lock
acquisition order in route calculation.

Fixes weaveworks#643.
@dpw dpw closed this as completed in #644 May 11, 2015
dpw added a commit that referenced this issue May 11, 2015
eliminate route calculation deadlock

Fixes #643.
@rade rade modified the milestone: 0.11.0 May 12, 2015