
messager: Fix a deadlock bug #2594

Merged: 1 commit into vitessio:master on Feb 27, 2017
Conversation

@sougou (Contributor) commented Feb 24, 2017

The root cause of the deadlock is that the message manager calls into tabletserver, which calls back into it. This can cause deadlocks, and Close can hang forever.

BUG=35763775

@mberlin-bot

Based on this description, it sounds like you're shutting down the tabletserver too early and therefore calls are hanging.

If you shut down the message manager first, wait for everything to finish, and only then shut down the tabletserver, there should be no deadlock, right? (I understand that just running things asynchronously may be a reasonable shortcut for this problem.)

I must admit that I don't know anything about the message functionality and also have a hard time understanding what postpone and discard are doing.

However, I don't see any harm in this change. So feel free to submit it.


Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 2 unresolved discussions.


go/vt/tabletserver/message_manager.go, line 110 at r1 (raw file):

> to made

to be made?


go/vt/tabletserver/message_manager.go, line 331 at r1 (raw file):

> // Postpone the messages for resend before discarding

Based on this comment, you should not call Discard until postpone has returned? The new change would contradict this :)


Comments from Reviewable

@sougou-bot

I do shut down tabletserver first. But this deadlock is a different use case where the message table is dropped while tabletserver is running. The call actually originates from SchemaEngine, which tries to close the service for that table, which obtains a lock. In the meantime, that same manager is trying to purge rows through tabletserver, which ends up waiting for the same lock. But the close method cannot complete until the purge thread returns. This causes the deadlock.
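
A minimal Go sketch of that lock cycle, with purely illustrative names (`tabletServer`, `messageManager`, `dropTable`, `purge`), not the actual Vitess types: the schema-change path holds the server's lock while closing the manager, and the manager's purge goroutine needs that same lock to get back in.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical reconstruction of the cycle described above; the type and
// method names are illustrative stand-ins, not the actual Vitess code.

type tabletServer struct {
	mu sync.Mutex
	mm *messageManager
}

type messageManager struct {
	ts *tabletServer
	wg sync.WaitGroup // goroutines Close must wait for
}

// purge is the bottom-to-top call: the manager reaches back into the
// tablet server, which needs ts.mu.
func (mm *messageManager) purge() {
	defer mm.wg.Done()
	mm.ts.mu.Lock() // blocks: dropTable below already holds ts.mu
	defer mm.ts.mu.Unlock()
	fmt.Println("purged acked rows")
}

// Close cannot return until every manager goroutine has finished.
func (mm *messageManager) Close() {
	mm.wg.Wait()
}

// dropTable models the top-down path: the schema change handler holds
// ts.mu and synchronously closes the manager for the dropped table.
func (ts *tabletServer) dropTable() {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	ts.mm.Close() // waits for purge, which waits for ts.mu -> deadlock
}

func main() {
	ts := &tabletServer{}
	mm := &messageManager{ts: ts}
	ts.mm = mm

	mm.wg.Add(1)
	go func() {
		time.Sleep(50 * time.Millisecond) // let dropTable grab ts.mu first
		mm.purge()
	}()
	ts.dropTable() // hangs forever
}
```

Running this sketch hangs until the Go runtime aborts with "fatal error: all goroutines are asleep - deadlock!", which mirrors the Close hanging forever described above.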


Review status: all files reviewed at latest revision, 2 unresolved discussions.


Comments from Reviewable

@michael-berlin
Contributor

Alternatively, the close could abort all pending purge threads first and wait for them to finish?

Just brainstorming alternatives here. If the async calls are good enough, I'm fine with that.

@sougou (Contributor, Author) commented Feb 25, 2017

Once that thread goes into the lock wait, we're stuck until the lock is released. I actually considered another alternative, which is to perform the close asynchronously. That would have worked.

But that would have only been a spot fix. The new approach is easier to reason about and verify: top-down calls are performed synchronously and obtain locks, while bottom-up calls (MessageManager->TabletServer) must be done asynchronously, as if the request originated from outside.
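
As a rough sketch of that split (again with illustrative types and names, not the actual message_manager.go change): the goroutine that Close waits for only hands the tabletserver call off to its own goroutine, as if the request came from an outside client, so nothing Close waits on can ever block on the server's lock.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical sketch of the "asynchronous bottom-to-top call" idea,
// reusing the shapes from the deadlock sketch above; not the real code.

type tabletServer struct {
	mu sync.Mutex
	mm *messageManager
}

type messageManager struct {
	ts *tabletServer
	wg sync.WaitGroup // goroutines Close must wait for
}

// purge now only *schedules* the call into the tablet server. The
// goroutine tracked by mm.wg returns immediately; the upward call runs
// on its own goroutine, as if it originated from outside.
func (mm *messageManager) purge() {
	defer mm.wg.Done()
	go func() {
		mm.ts.mu.Lock()
		defer mm.ts.mu.Unlock()
		fmt.Println("purged acked rows")
	}()
}

func (mm *messageManager) Close() {
	mm.wg.Wait() // no longer transitively blocked on ts.mu
}

func (ts *tabletServer) dropTable() {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	ts.mm.Close()
}

func main() {
	ts := &tabletServer{}
	mm := &messageManager{ts: ts}
	ts.mm = mm

	mm.wg.Add(1)
	go func() {
		time.Sleep(50 * time.Millisecond) // dropTable grabs ts.mu first, as before
		mm.purge()
	}()
	ts.dropTable()                     // returns: Close waits on nothing that needs ts.mu
	time.Sleep(100 * time.Millisecond) // let the detached purge run once the lock is free
}
```

With that split, dropTable returns as soon as Close's bookkeeping is done, and the purge simply runs once the lock is free, which is roughly the effect the change describes for the manager's postpone/purge calls.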

Commit 2cdce28:

The root cause of the deadlock is that the message manager calls into tabletserver, which calls back into it. This can cause deadlocks, and Close can hang forever.

BUG=35763775
@sougou (Contributor, Author) commented Feb 27, 2017

LGTM by proxy

Approved with PullApprove

@sougou merged commit 2cdce28 into vitessio:master on Feb 27, 2017