
messager: do not lock when testing for openness #9134

Merged 1 commit into vitessio:main on Nov 12, 2021

Conversation

@vmg (Collaborator) commented Nov 3, 2021

Description

When the Messager engine is shutting down and the Engine has been
closed, there may be other goroutines still attempting to access the
Engine to generate queries. Since the Engine takes its own internal
mutex while it's closing itself down, these other goroutines can
deadlock it by calling into any method in the Engine.

To prevent this deadlock, we're using an atomic check that fails any engine operation fast once the engine has started shutting down.

Signed-off-by: Vicent Marti <vmg@strn.cat>
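
Roughly, the idea looks like this (a minimal sketch with illustrative names such as `isOpen` and `errEngineClosed`; this is not the exact code in the PR):

```go
package messager

import (
	"errors"
	"sync"
	"sync/atomic"
)

var errEngineClosed = errors.New("messager engine is closed")

// Engine is a hypothetical stand-in for the messager engine, kept only to
// illustrate the fail-fast idea; the real Vitess type is different.
type Engine struct {
	mu     sync.Mutex
	isOpen int32 // read and written atomically, so callers never need mu just to test openness
}

// Close flips the atomic flag before taking the mutex, so concurrent
// callers fail fast instead of queueing up on mu while the engine is
// tearing itself down.
func (e *Engine) Close() {
	if !atomic.CompareAndSwapInt32(&e.isOpen, 1, 0) {
		return // already closed
	}
	e.mu.Lock()
	defer e.mu.Unlock()
	// ... release per-table resources under the lock ...
}

// GeneratePurgeQuery checks the atomic flag before touching the mutex,
// so it cannot pile up behind a Close that is already in progress.
func (e *Engine) GeneratePurgeQuery(table string) (string, error) {
	if atomic.LoadInt32(&e.isOpen) == 0 {
		return "", errEngineClosed
	}
	e.mu.Lock()
	defer e.mu.Unlock()
	// ... build the purge query for the named message table ...
	return "purge query for " + table, nil
}
```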


This is a potential fix for #8909, which is proving really hard to reproduce.

cc @derekperkins -- could you please try this in your system and see if the deadlocking stops?
cc @aquarapid @deepthi

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@derekperkins (Member) commented:

Thanks for looking into this! We don't have our own fork/build system, so I pushed a copy of your branch into this repo and triggered a build for it. Once that's done, I'll roll it out later today to see if it solves the deadlock.

https://hub.docker.com/repository/registry-1.docker.io/vitess/base/builds/9dd7c9db-4b66-4515-980b-9d19bff09c5c

@derekperkins (Member) commented:

I ended up cherry-picking on top of the v12 release branch and deploying that: https://github.com/vitessio/vitess/tree/release-12.0-messager-deadlock

That has been running for 20-30 minutes now, and I've performed 32 reparents so far without failure, which is far more than I've been able to do since upgrading past v6. I'll let it soak for a little bit, but it's looking promising.

@derekperkins (Member) commented:

After some more reparenting, I was able to trigger a similar but new failure mode. It appears to have still deadlocked in the me.Close phase. The work in this PR was on the engine, but the same locking pattern also appears in the message manager. Maybe it needs to be similarly refactored, and/or also check the parent engine status?

https://github.com/vitessio/vitess/blob/release-12.0-messager-deadlock/go/vt/vttablet/tabletserver/messager/message_manager.go#L325-L329
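
Roughly, the shape of that pattern (hypothetical names only, not the actual message_manager.go code) is:

```go
package messager

import (
	"errors"
	"sync"
)

var errManagerClosed = errors.New("message manager is closed")

// messageManager here is a hypothetical stand-in, not the real Vitess type.
// The point is the shape of the pattern: every caller must acquire mm.mu
// just to find out that the manager is closed, so once Close holds the lock
// (or is itself blocked), all other callers queue up behind it.
type messageManager struct {
	mu     sync.Mutex
	isOpen bool
}

func (mm *messageManager) someOperation() error {
	mm.mu.Lock() // contended with Close during shutdown
	defer mm.mu.Unlock()
	if !mm.isOpen {
		return errManagerClosed
	}
	// ... normal work performed under the lock ...
	return nil
}

func (mm *messageManager) Close() {
	mm.mu.Lock()
	defer mm.mu.Unlock()
	if !mm.isOpen {
		return
	}
	mm.isOpen = false
	// ... stop workers and release resources; anything that blocks here
	// keeps every other caller stuck waiting on mm.mu ...
}
```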

Full logs and goroutine dumps
https://gist.github.com/derekperkins/ac6681eb7c9fa633f234208f4e446faa

Here's the most relevant part of the logs. It's hard to tell whether the timing of the snapshot table lock caused the deadlock or whether it just happened to be the next thing to run after the deadlock had already occurred.

I1103 19:42:12.088392       1 rpc_replication.go:364] DemotePrimary
I1103 19:42:12.090068       1 rpc_replication.go:414] DemotePrimary disabling query service
I1103 19:42:12.090089       1 state_manager.go:214] Starting transition to PRIMARY Not Serving, timestamp: 2021-11-03 19:13:28.774398929 +0000 UTC
I1103 19:42:12.090226       1 tablegc.go:193] TableGC: closing
I1103 19:42:14.301091       1 snapshot_conn.go:79] Locking table searches__extractor__msgs for copying
I1103 19:42:14.301836       1 snapshot_conn.go:72] Tables unlocked: searches__extractor__msgs
E1103 19:42:28.819765       1 message_manager.go:804] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN

There's still a pile-up of GeneratePurgeQuery goroutines waiting to acquire the engine mutex, which is surprising, since that path should be returning an error from ensureOpen.

In any event, this is still behaving much better than before and it feels like we're on the right track.

@derekperkins (Member) commented:

I also see that the rogue goroutines in question appear to be using tabletenv.LocalContext(). Would it help if the engine stored a context derived from that when it starts, and cancelled it when Close() is called? I don't know anything about the scope of that local context, so I'm not sure whether that's a good place to hook in.
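
Something along these lines (purely illustrative; none of this is the actual Vitess code, and whether tabletenv.LocalContext() is the right parent is exactly the open question):

```go
package messager

import (
	"context"
	"sync"
	"time"
)

// Engine here is a hypothetical illustration of the suggestion: derive a
// context when the engine opens and cancel it in Close, so background work
// created from that context stops promptly on shutdown.
type Engine struct {
	mu     sync.Mutex
	cancel context.CancelFunc
}

func (e *Engine) Open(parent context.Context) {
	e.mu.Lock()
	defer e.mu.Unlock()
	ctx, cancel := context.WithCancel(parent)
	e.cancel = cancel
	go e.purgeLoop(ctx) // background workers receive the derived context
}

func (e *Engine) Close() {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.cancel != nil {
		e.cancel() // workers holding the derived context observe cancellation and exit
		e.cancel = nil
	}
}

func (e *Engine) purgeLoop(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // engine closed: stop generating purge queries
		case <-ticker.C:
			// ... generate and run purge queries ...
		}
	}
}
```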

@derekperkins (Member) commented:

After a day, things seem to have devolved back to roughly where they were before, with similar deadlock rates, though all of them are the new failure mode. I still believe this is heading in the right direction.

@vmg (Collaborator, Author) commented Nov 10, 2021

Back from my holiday. I'm looking at this again today and trying to come up with a more comprehensive fix.

@vmg (Collaborator, Author) commented Nov 10, 2021

@derekperkins OK, I've just pushed an alternative and more comprehensive fix that removes the cyclic inter-dependency between the two objects. Can you give it a test run?

@derekperkins (Member) commented:

Yep, I'll roll it out today. Thanks again for looking into it!

@derekperkins (Member) commented:

It's been running for almost 24 hours now with no issues in message processing and no deadlocks, including 16 reparent operations under peak load, so this appears very promising. The code changes are pretty simple and look great.

@vmg (Collaborator, Author) commented Nov 12, 2021

🎉 @deepthi I think we have a fix for @derekperkins' issue. Care to review?

@deepthi (Member) reviewed and left a comment:

Very clever. LGTM

@deepthi merged commit 77b9806 into vitessio:main on Nov 12, 2021
@deepthi (Member) commented Nov 12, 2021

@derekperkins should we backport to release-12.0?

@derekperkins (Member) commented:

Yeah. I already had a branch targeting v12, which is what I'm running in prod, so I just opened a backport PR: #9210
