
messager: do not lock when testing for openness #9134

Merged 1 commit into vitessio:main on Nov 12, 2021

Conversation

@vmg (Collaborator) commented Nov 3, 2021

Description

When the Messager engine is shutting down and the Engine has been
closed, there may be other goroutines still attempting to access the
Engine to generate queries. Since the Engine takes its own internal
mutex while it's closing itself down, these other goroutines can
deadlock it by calling into any method in the Engine.

To prevent this deadlock, we're using an atomic check that fails any engine operation fast once the engine has started shutting down.

Signed-off-by: Vicent Marti <vmg@strn.cat>
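
Roughly, the idea looks like this (a minimal sketch with illustrative names such as `isOpen` and `errEngineClosed`; this is not the exact code in the PR):

```go
package messager

import (
	"errors"
	"sync"
	"sync/atomic"
)

var errEngineClosed = errors.New("messager engine is closed")

// Engine is a hypothetical stand-in for the messager engine, kept only to
// illustrate the fail-fast idea; the real Vitess type is different.
type Engine struct {
	mu     sync.Mutex
	isOpen int32 // read and written atomically, so callers never need mu just to test openness
}

// Close flips the atomic flag before taking the mutex, so concurrent
// callers fail fast instead of queueing up on mu while the engine is
// tearing itself down.
func (e *Engine) Close() {
	if !atomic.CompareAndSwapInt32(&e.isOpen, 1, 0) {
		return // already closed
	}
	e.mu.Lock()
	defer e.mu.Unlock()
	// ... release per-table resources under the lock ...
}

// GeneratePurgeQuery checks the atomic flag before touching the mutex,
// so it cannot pile up behind a Close that is already in progress.
func (e *Engine) GeneratePurgeQuery(table string) (string, error) {
	if atomic.LoadInt32(&e.isOpen) == 0 {
		return "", errEngineClosed
	}
	e.mu.Lock()
	defer e.mu.Unlock()
	// ... build the purge query for the named message table ...
	return "purge query for " + table, nil
}
```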


This is a potential fix for #8909, which is proving really hard to reproduce.

cc @derekperkins -- could you please try this in your system and see if the deadlocking stops?
cc @aquarapid @deepthi

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@derekperkins (Member) commented:

Thanks for looking into this! We don't have our own fork/build system, so I pushed a copy of your branch into this repo and triggered a build for it. Once that's done, I'll roll it out later today to see if it solves the deadlock.

https://hub.docker.com/repository/registry-1.docker.io/vitess/base/builds/9dd7c9db-4b66-4515-980b-9d19bff09c5c

@derekperkins (Member) commented:

I ended up cherry-picking on top of the v12 release branch and deploying that: https://github.com/vitessio/vitess/tree/release-12.0-messager-deadlock

That has been running for 20-30 minutes now, and I've performed 32 reparents so far without failure, which is far more than I've been able to do since upgrading past v6. I'll let it soak for a little bit, but it's looking promising.

@derekperkins (Member) commented:

After some more reparenting, I was able to trigger a similar but new failure mode. It appears to have still deadlocked in the me.Close phase. The work in this PR was on the engine, but the same locking pattern also appears in the message manager. Maybe it needs to be similarly refactored, and/or also check the parent engine status?

https://github.com/vitessio/vitess/blob/release-12.0-messager-deadlock/go/vt/vttablet/tabletserver/messager/message_manager.go#L325-L329
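
Roughly, the shape of that pattern (hypothetical names only, not the actual message_manager.go code) is:

```go
package messager

import (
	"errors"
	"sync"
)

var errManagerClosed = errors.New("message manager is closed")

// messageManager here is a hypothetical stand-in, not the real Vitess type.
// The point is the shape of the pattern: every caller must acquire mm.mu
// just to find out that the manager is closed, so once Close holds the lock
// (or is itself blocked), all other callers queue up behind it.
type messageManager struct {
	mu     sync.Mutex
	isOpen bool
}

func (mm *messageManager) someOperation() error {
	mm.mu.Lock() // contended with Close during shutdown
	defer mm.mu.Unlock()
	if !mm.isOpen {
		return errManagerClosed
	}
	// ... normal work performed under the lock ...
	return nil
}

func (mm *messageManager) Close() {
	mm.mu.Lock()
	defer mm.mu.Unlock()
	if !mm.isOpen {
		return
	}
	mm.isOpen = false
	// ... stop workers and release resources; anything that blocks here
	// keeps every other caller stuck waiting on mm.mu ...
}
```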

Full logs and goroutine dumps
https://gist.github.com/derekperkins/ac6681eb7c9fa633f234208f4e446faa

Here's the most relevant part of the logs. It's hard to tell whether the timing of the snapshot table lock caused the deadlock or whether it just happened to be the next thing to run after the deadlock had already occurred.

I1103 19:42:12.088392       1 rpc_replication.go:364] DemotePrimary
I1103 19:42:12.090068       1 rpc_replication.go:414] DemotePrimary disabling query service
I1103 19:42:12.090089       1 state_manager.go:214] Starting transition to PRIMARY Not Serving, timestamp: 2021-11-03 19:13:28.774398929 +0000 UTC
I1103 19:42:12.090226       1 tablegc.go:193] TableGC: closing
I1103 19:42:14.301091       1 snapshot_conn.go:79] Locking table searches__extractor__msgs for copying
I1103 19:42:14.301836       1 snapshot_conn.go:72] Tables unlocked: searches__extractor__msgs
E1103 19:42:28.819765       1 message_manager.go:804] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN

There's still a pile-up of GeneratePurgeQuery goroutines waiting to acquire the engine mutex, which is surprising, since that path should be returning an error from ensureOpen.

In any event, this is still behaving much better than before and it feels like we're on the right track.

@derekperkins (Member) commented:

I also see that the rogue goroutines in question appear to be using tabletenv.LocalContext(). Would it help if the engine stored a context derived from that when it starts, and cancelled it when Close() is called? I don't know anything about the scope of that local context, so I'm not sure whether that's a good place to hook in.
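
Something along these lines (purely illustrative; none of this is the actual Vitess code, and whether tabletenv.LocalContext() is the right parent is exactly the open question):

```go
package messager

import (
	"context"
	"sync"
	"time"
)

// Engine here is a hypothetical illustration of the suggestion: derive a
// context when the engine opens and cancel it in Close, so background work
// created from that context stops promptly on shutdown.
type Engine struct {
	mu     sync.Mutex
	cancel context.CancelFunc
}

func (e *Engine) Open(parent context.Context) {
	e.mu.Lock()
	defer e.mu.Unlock()
	ctx, cancel := context.WithCancel(parent)
	e.cancel = cancel
	go e.purgeLoop(ctx) // background workers receive the derived context
}

func (e *Engine) Close() {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.cancel != nil {
		e.cancel() // workers holding the derived context observe cancellation and exit
		e.cancel = nil
	}
}

func (e *Engine) purgeLoop(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // engine closed: stop generating purge queries
		case <-ticker.C:
			// ... generate and run purge queries ...
		}
	}
}
```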

@derekperkins (Member) commented:

After a day, things seem to have devolved back to roughly where they were before, with similar deadlock rates, though all of them are the new failure mode. I still believe this is heading in the right direction.

@vmg (Collaborator, Author) commented Nov 10, 2021

Back from my holiday. I'm looking at this again today and trying to come up with a more comprehensive fix.

@vmg (Collaborator, Author) commented Nov 10, 2021

@derekperkins OK, I've just pushed an alternative and more comprehensive fix that removes the cyclic inter-dependency between the two objects. Can you give it a test run?

@derekperkins (Member) commented:

Yep, I'll roll it out today. Thanks again for looking into it!

@derekperkins (Member) commented:

It's been running for almost 24 hours now with no issues in message processing and no deadlocks, including 16 reparent operations under peak load, so this appears very promising. The code changes are pretty simple and look great.

@vmg (Collaborator, Author) commented Nov 12, 2021

🎉 @deepthi I think we have a fix for @derekperkins' issue. Care to review?

@deepthi (Member) reviewed and left a comment:

Very clever. LGTM

@deepthi merged commit 77b9806 into vitessio:main on Nov 12, 2021
@deepthi (Member) commented Nov 12, 2021

@derekperkins should we backport to release-12.0?

@derekperkins (Member) commented:

Yeah. I already had a branch targeting v12, which is what I'm running in prod, so I just opened a backport PR: #9210
