
messager: Fix a deadlock bug #2594

Merged: 1 commit into vitessio:master on Feb 27, 2017
Conversation

@sougou (Contributor) commented Feb 24, 2017

The root cause of the deadlock is that the message manager calls into tabletserver, which calls back into it. This can cause deadlocks, and Close can hang forever.

BUG=35763775

@mberlin-bot

Based on this description, it sounds like you're shutting down the tabletserver too early and therefore calls are hanging.

If you shut down the message manager first, wait for everything to finish, and only then shut down the tabletserver, there should be no deadlock, right? (I understand that just running things asynchronously may be a reasonable shortcut for this problem.)

I must admit that I don't know anything about the message functionality and also have a hard time understanding what postpone and discard are doing.

However, I don't see any harm in this change. So feel free to submit it.


Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 2 unresolved discussions.


go/vt/tabletserver/message_manager.go, line 110 at r1 (raw file):

> to made

to be made?


go/vt/tabletserver/message_manager.go, line 331 at r1 (raw file):

> // Postpone the messages for resend before discarding

Based on this comment, you should not call Discard until postpone has returned? The new change would contradict this :)


Comments from Reviewable

@sougou-bot

I do shut down tabletserver first. But this deadlock is a different use case where the message table is dropped while tabletserver is running. The call actually originates from SchemaEngine, which tries to close the service for that table, which obtains a lock. In the meantime, that same manager is trying to purge rows through tabletserver, which ends up waiting for the same lock. But the close method cannot complete until the purge thread returns. This causes the deadlock.
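
A minimal Go sketch of that lock cycle, with purely illustrative names (`tabletServer`, `messageManager`, `dropTable`, `purge`), not the actual Vitess types: the schema-change path holds the server's lock while closing the manager, and the manager's purge goroutine needs that same lock to get back in.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical reconstruction of the cycle described above; the type and
// method names are illustrative stand-ins, not the actual Vitess code.

type tabletServer struct {
	mu sync.Mutex
	mm *messageManager
}

type messageManager struct {
	ts *tabletServer
	wg sync.WaitGroup // goroutines Close must wait for
}

// purge is the bottom-to-top call: the manager reaches back into the
// tablet server, which needs ts.mu.
func (mm *messageManager) purge() {
	defer mm.wg.Done()
	mm.ts.mu.Lock() // blocks: dropTable below already holds ts.mu
	defer mm.ts.mu.Unlock()
	fmt.Println("purged acked rows")
}

// Close cannot return until every manager goroutine has finished.
func (mm *messageManager) Close() {
	mm.wg.Wait()
}

// dropTable models the top-down path: the schema change handler holds
// ts.mu and synchronously closes the manager for the dropped table.
func (ts *tabletServer) dropTable() {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	ts.mm.Close() // waits for purge, which waits for ts.mu -> deadlock
}

func main() {
	ts := &tabletServer{}
	mm := &messageManager{ts: ts}
	ts.mm = mm

	mm.wg.Add(1)
	go func() {
		time.Sleep(50 * time.Millisecond) // let dropTable grab ts.mu first
		mm.purge()
	}()
	ts.dropTable() // hangs forever
}
```

Running this sketch hangs until the Go runtime aborts with "fatal error: all goroutines are asleep - deadlock!", which mirrors the Close hanging forever described above.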


Review status: all files reviewed at latest revision, 2 unresolved discussions.


Comments from Reviewable

@michael-berlin
Contributor

Alternatively, the close could abort all pending purge threads first and wait for them to finish?

Just brainstorming alternatives here. If the async calls are good enough, I'm fine with that.

@sougou (Contributor, Author) commented Feb 25, 2017

Once that thread goes into the lock wait, we're stuck until the lock is released. I actually considered another alternative, which is to perform the close asynchronously. That would have worked.

But that would have only been a spot fix. The new approach is easier to reason about and verify: top-down calls are performed synchronously and obtain locks, while bottom-up calls (MessageManager->TabletServer) must be done asynchronously, as if the request originated from outside.
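
As a rough sketch of that split (again with illustrative types and names, not the actual message_manager.go change): the goroutine that Close waits for only hands the tabletserver call off to its own goroutine, as if the request came from an outside client, so nothing Close waits on can ever block on the server's lock.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical sketch of the "asynchronous bottom-to-top call" idea,
// reusing the shapes from the deadlock sketch above; not the real code.

type tabletServer struct {
	mu sync.Mutex
	mm *messageManager
}

type messageManager struct {
	ts *tabletServer
	wg sync.WaitGroup // goroutines Close must wait for
}

// purge now only *schedules* the call into the tablet server. The
// goroutine tracked by mm.wg returns immediately; the upward call runs
// on its own goroutine, as if it originated from outside.
func (mm *messageManager) purge() {
	defer mm.wg.Done()
	go func() {
		mm.ts.mu.Lock()
		defer mm.ts.mu.Unlock()
		fmt.Println("purged acked rows")
	}()
}

func (mm *messageManager) Close() {
	mm.wg.Wait() // no longer transitively blocked on ts.mu
}

func (ts *tabletServer) dropTable() {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	ts.mm.Close()
}

func main() {
	ts := &tabletServer{}
	mm := &messageManager{ts: ts}
	ts.mm = mm

	mm.wg.Add(1)
	go func() {
		time.Sleep(50 * time.Millisecond) // dropTable grabs ts.mu first, as before
		mm.purge()
	}()
	ts.dropTable()                     // returns: Close waits on nothing that needs ts.mu
	time.Sleep(100 * time.Millisecond) // let the detached purge run once the lock is free
}
```

With that split, dropTable returns as soon as Close's bookkeeping is done, and the purge simply runs once the lock is free, which is roughly the effect the change describes for the manager's postpone/purge calls.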

Commit 2cdce28:

The root cause of the deadlock is that the message manager calls into tabletserver, which calls back into it. This can cause deadlocks, and Close can hang forever.

BUG=35763775
@sougou (Contributor, Author) commented Feb 27, 2017

LGTM by proxy

Approved with PullApprove

@sougou merged commit 2cdce28 into vitessio:master on Feb 27, 2017