vttablet: apparent deadlock when running PlannedReparentShard #8909

Closed
derekperkins opened this issue Sep 29, 2021 · 21 comments · Fixed by #8965
@derekperkins
Member

Overview of the Issue

We had successfully executed thousands of PlannedReparentShard commands before upgrading to v11; now they are very unreliable (25-100% failure rate, with 0% recovery on successive attempts) on tablets with many concurrent vstreams. I can't prove that VStream is causing it, but my guess is that's where we differ from other Vitess clusters. We also have several Materialize workflows running cross-keyspace.

I have linked to full log output (https://gist.github.com/derekperkins/299519b0bd9da3618645de908a37ca0c), but this seems to be where things are getting locked:

I0929 03:47:08.552644 1897126 rpc_replication.go:404] DemotePrimary disabling query service
I0929 03:47:08.552668 1897126 state_manager.go:214] Starting transition to MASTER Not Serving, timestamp: 2021-09-28 23:56:35.082672802 +0000 UTC
I0929 03:47:08.552753 1897126 tablegc.go:193] TableGC: closing

When I do a reparent on another keyspace that doesn't have so many streams, this is what I see immediately following TableGC: closing. It looks like there must be some race condition in how the Messager is closing, leading to a deadlock.

I0929 23:50:08.869083   24777 engine.go:102] Messager: closed
I0929 23:50:08.869135   24777 tx_engine.go:333] No grace period specified: performing normal wait.
I0929 23:50:08.871506   24777 tx_engine.go:204] TxEngine: closed
I0929 23:50:08.871536   24777 state_manager.go:504] Killing all OLAP queries.
I0929 23:50:08.904262   24777 repltracker.go:86] Replication Tracker: going into master mode
I0929 23:50:08.904311   24777 state_manager.go:563] TabletServer transition: MASTER: Serving, Sep 29, 2021 at 04:20:33 (UTC) -> MASTER: Not Serving, Sep 29, 2021 at 04:20:33 (UTC) for tablet :companies/0
I0929 23:50:08.904330   24777 state_manager.go:663] State: exiting lameduck
I0929 23:50:08.926629   24777 query.go:81] exec SET GLOBAL read_only = ON

I turned off 3 Materialize workflows (16 shards for 48 vreplication rows) and things seem to have improved. I do need those running and have many others that I plan on starting, so I am really hoping to get this resolved in time for v12.

Operating system and Environment details

GKE 1.20.9
vttablet: 11.0.0
vtctld: 11.0.0

@derekperkins
Member Author

When this happens, we have an outage. The resolution is:

  1. Run PlannedReparentShard (this triggers the bad state), and the replica IS NOT promoted
  2. Force kill the tablet (kubectl delete --force)
  3. When it comes back up, it is in read-only mode. Run PlannedReparentShard again. This deadlocks again, but for some reason the replica IS promoted. Both vtctld and vtgate report both tablets as primary, but the replica has the more recent timestamp, so it is effectively treated as the primary
  4. Force kill the same tablet again
  5. This time when it comes back up, it recognizes the other tablet as the primary and loads correctly as a replica

While this results in an outage of a few minutes (longer for keyspaces with more shards), at least there is no data loss or errant GTIDs.

@sougou
Contributor

sougou commented Oct 1, 2021

My initial theory is that it's caused by those table locks. We could look at drilling down there, or we could use a different approach: we don't really need to lock tables any more.

Once you start a snapshotted transaction, the data in the table won't change within that transaction, so there's no need to lock the table. Option 2 would be to shut down all streams before demoting, but I think that's not necessary (if we solve the problem by avoiding locks).
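
As a rough illustration of that first option (generic database/sql code, not the vstreamer implementation; the DSN and table name are placeholders): a consistent-snapshot transaction gives the copy a stable view of the table without LOCK TABLES.

// Sketch: read a consistent view of a table without LOCK TABLES,
// relying on InnoDB's consistent snapshot instead.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/ks")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()

	// Use a single connection so all statements share the same session.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Open a consistent snapshot: every read in this transaction sees the
	// table as of this point, even while other sessions keep writing to it.
	if _, err := conn.ExecContext(ctx, "SET TRANSACTION ISOLATION LEVEL REPEATABLE READ"); err != nil {
		log.Fatal(err)
	}
	if _, err := conn.ExecContext(ctx, "START TRANSACTION WITH CONSISTENT SNAPSHOT"); err != nil {
		log.Fatal(err)
	}

	// Copy rows out of the snapshot; no table lock is held.
	rows, err := conn.QueryContext(ctx, "SELECT id FROM msg_tbl")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
	_, _ = conn.ExecContext(ctx, "COMMIT")
}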

@derekperkins
Member Author

Here's a full debugging dump from a tablet that got stuck
https://gist.github.com/derekperkins/e1d549e7107980e0f5625e0c4448257b

@derekperkins derekperkins reopened this Oct 15, 2021
@aquarapid
Contributor

Here's a full debugging dump from a tablet that got stuck https://gist.github.com/derekperkins/e1d549e7107980e0f5625e0c4448257b

There seems to be an awful lot of goroutines running here. How many vreplication flows were actually running at the time of this snapshot?

@derekperkins
Member Author

Only 2-3 vreplication workflows running, but there are 27 message tables open, each of which starts a VStream connection.

@aquarapid
Contributor

OK, I think I have tracked this one down by looking at
the logs and the goroutine dump.

It seems the situation for the vttablet is like this:

  • We are in state_manager:closeAll() calling state_manager:unserveCommon()
  • In unserveCommon() we call table_gc.Close(), which succeeds, as can be
    seen from the message:
I0929 03:47:08.552753 1897126 tablegc.go:193] TableGC: closing
  • We then call throttler.Close(); this seems to succeed (nothing in the
    goroutine dump)
  • Next, we call messager.Close() in messager/engine.go; this is where
    things go south:
    • messager.Close() takes and holds the me mutex.
    • It unregisters the schema engine notifiers the messager engine keeps.
      This seems to succeed.
    • It then iterates the list of message managers, calling Close() on
      each (message_manager.go).
    • Each of those Close() calls does various housekeeping, then sends a
      broadcast for the running runSend() goroutines to exit.
    • It then Wait()s on the message manager WaitGroup. In the goroutine
      dump, it seems to be stuck there.
    • The WaitGroup will only unblock once the runSend() goroutines exit,
      which does not seem to happen.
    • Various runSend() routines are blocking on the message_manager.cond
      condition variable.
    • Meanwhile, message_manager:purge(...) is still running, looping over
      tabletserver:PurgeMessages(...), which ends up calling message_manager's
      engine:GeneratePurgeQuery(...), which immediately tries to take the
      message manager engine mutex (me) that is already held by
      messager.Close().
    • Other runSend() routines are running a goroutine to
      message_manager:send(..) (adding to the message manager WaitGroup):
      • send(..) calls message_manager:postpone(...).
      • postpone(...) calls tabletserver:PostponeMessages(...).
      • This also ends up calling engine:GeneratePurgeQuery(...), which
        tries to take the message manager engine mutex (me) as above, again
        already held by messager.Close().

So it seems we have a deadlock: messager.Close() holds the me mutex and is
waiting on the WaitGroup for the runSend() and send() goroutines to finish,
while those goroutines are all blocked trying to acquire the me mutex in
me.GeneratePurgeQuery(...).
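
For illustration, here is a minimal standalone Go sketch of that pattern (placeholder code, not the actual Vitess messager): Close() takes a mutex and then waits on a WaitGroup, while the worker it waits for needs that same mutex to make progress. The 2-second timeout exists only so the program can report the hang instead of blocking forever.

// Minimal reproduction of the deadlock shape described above.
package main

import (
	"fmt"
	"sync"
	"time"
)

type engine struct {
	mu sync.Mutex     // stands in for the messager engine's "me" mutex
	wg sync.WaitGroup // stands in for the message manager WaitGroup
}

// generatePurgeQuery stands in for me.GeneratePurgeQuery: it needs the mutex.
func (e *engine) generatePurgeQuery() {
	e.mu.Lock()
	defer e.mu.Unlock()
}

// runSend stands in for the runSend/send/purge goroutines.
func (e *engine) runSend() {
	defer e.wg.Done()
	time.Sleep(100 * time.Millisecond) // let Close() grab the mutex first
	e.generatePurgeQuery()             // blocks forever: Close() holds e.mu
}

// Close mirrors the shape of messager.Close(): lock, then wait for workers.
func (e *engine) Close() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.wg.Wait() // never returns: the worker is stuck on e.mu above
}

func main() {
	e := &engine{}
	e.wg.Add(1)
	go e.runSend()

	done := make(chan struct{})
	go func() { e.Close(); close(done) }()

	select {
	case <-done:
		fmt.Println("closed cleanly")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlocked: Close() holds the mutex its workers need to exit")
	}
}

A typical way out of this shape of deadlock is to release the mutex before Wait()ing, or to signal the workers to exit through a path that does not require the engine lock.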

@derekperkins
Member Author

Thanks for tracking this down @aquarapid! That matches the behavior I see, where it continues to loop trying to purge, always failing because the tablet is in read-only mode.

@derekperkins
Member Author

I upgraded us to v12 today, and unsurprisingly, this minor PR did not fix the problem. #8965

@vmg
Collaborator

vmg commented Nov 17, 2021

Ostensibly fixed. :)

@vmg vmg closed this as completed Nov 17, 2021
@derekperkins
Member Author

derekperkins commented Feb 28, 2022

I think this is still happening, just less regularly. I'll try upgrading to v13 to see if I still see it.

on PlannedReparentShard

I0228 23:18:50.764419  486334 rpc_replication.go:364] DemotePrimary
I0228 23:18:50.778084  486334 rpc_replication.go:414] DemotePrimary disabling query service
I0228 23:18:50.778107  486334 state_manager.go:214] Starting transition to PRIMARY Not Serving, timestamp: 2022-02-23 22:48:23.345711755 +0000 UTC
I0228 23:18:50.778224  486334 tablegc.go:193] TableGC: closing
E0228 23:19:00.830227  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.830245  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852711  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852738  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852737  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852775  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852793  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852790  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852804  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852808  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852800  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852826  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852832  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852839  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852900  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852940  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852960  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.852965  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.854032  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.854063  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.854066  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.854084  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.854093  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.886179  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.910510  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.915671  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.915692  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:19:00.915710  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
I0228 23:19:08.657145  486334 snapshot_conn.go:79] Locking table searches__retriever_dataforseo__msgs for copying
I0228 23:19:08.657184  486334 snapshot_conn.go:79] Locking table searches__retriever_vertifire__msgs for copying
I0228 23:19:08.657852  486334 snapshot_conn.go:72] Tables unlocked: searches__retriever_vertifire__msgs
I0228 23:19:08.657926  486334 snapshot_conn.go:72] Tables unlocked: searches__retriever_dataforseo__msgs
I0228 23:19:09.220336  486334 snapshot_conn.go:79] Locking table rankings__searches_listener__msgs for copying
I0228 23:19:09.220958  486334 snapshot_conn.go:72] Tables unlocked: rankings__searches_listener__msgs

on container shutdown

operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.855362  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.855343  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.855370  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.855348  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.855383  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.855386  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.855391  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.855395  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.856599  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.856627  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.856637  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.856649  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.856661  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.889016  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.913037  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.918204  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.918230  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
E0228 23:21:00.918234  486334 message_manager.go:805] Unable to delete messages: Code: CLUSTER_EVENT
operation not allowed in state SHUTTING_DOWN
I0228 23:21:01.333630  486334 run.go:61] Entering lameduck mode for at least 50ms
I0228 23:21:01.333675  486334 run.go:62] Firing asynchronous OnTerm hooks
I0228 23:21:01.333691  486334 servenv.go:183] Firing OnTermSync hooks and waiting up to 10s for them
I0228 23:21:01.333718  486334 grpc_server.go:249] Initiated graceful stop of gRPC server
I0228 23:21:01.333852  486334 updatestreamctl.go:243] Update Stream Disabled
I0228 23:21:01.334131  486334 controller.go:147] stream 34: stopped
I0228 23:21:01.334251  486334 controller.go:147] stream 36: stopped
I0228 23:21:01.334363  486334 controller.go:147] stream 33: stopped
I0228 23:21:01.334537  486334 engine.go:308] VReplication Engine: closed
I0228 23:21:08.667857  486334 snapshot_conn.go:79] Locking table searches__retriever_dataforseo__msgs for copying
I0228 23:21:08.667885  486334 snapshot_conn.go:79] Locking table searches__retriever_vertifire__msgs for copying
I0228 23:21:08.668527  486334 snapshot_conn.go:72] Tables unlocked: searches__retriever_vertifire__msgs
I0228 23:21:08.668537  486334 snapshot_conn.go:72] Tables unlocked: searches__retriever_dataforseo__msgs
I0228 23:21:09.229972  486334 snapshot_conn.go:79] Locking table rankings__searches_listener__msgs for copying
I0228 23:21:09.230582  486334 snapshot_conn.go:72] Tables unlocked: rankings__searches_listener__msgs
I0228 23:21:11.334716  486334 servenv.go:199] OnTermSync hooks timed out
I0228 23:21:11.334745  486334 run.go:71] Shutting down gracefully
I0228 23:21:11.334750  486334 servenv.go:183] Firing OnClose hooks and waiting up to 1ns for them
I0228 23:21:11.334803  486334 servenv.go:199] OnClose hooks timed out
I0228 23:21:11.334826  486334 shard_watcher.go:56] Stopping shard watch...

@derekperkins
Member Author

Here's a new gist with updated goroutine/block/mutex pprof profile dumps
https://gist.github.com/derekperkins/b3526f41b179b06b3280d49d5b5d05c8

@vmg
Collaborator

vmg commented Mar 8, 2022

Looking at this again. 🥲 ✨

@vmg
Collaborator

vmg commented Mar 8, 2022

@derekperkins: is the block profile gist complete? These are the only possible culprits for a deadlock:

1158957548311 217811 @ 0x47130c 0x1154c7e 0x116bac5 0x116ac12 0x469081
#	0x47130b	sync.(*Cond).Wait+0x8b										sync/cond.go:56
#	0x1154c7d	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*relayLog).Fetch+0x13d		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/relaylog.go:98
#	0x116bac4	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyEvents+0x204		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:343
#	0x116ac11	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func2+0x31	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:155

1774172 41 @ 0x11548a5 0x115487e 0x116adae 0x12fea10 0x1153da9 0x116ad31 0x469081
#	0x11548a4	sync.(*Mutex).Lock+0x84											sync/mutex.go:81
#	0x115487d	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*relayLog).Send+0x5d			vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/relaylog.go:69
#	0x116adad	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func1.1+0x2d	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:149
#	0x12fea0f	vitess.io/vitess/go/vt/vttablet/grpctabletconn.(*gRPCQueryClient).VStream+0xef				vitess.io/vitess/go/vt/vttablet/grpctabletconn/conn.go:643
#	0x1153da8	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*tabletConnector).VStream+0x68		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/external_connector.go:174
#	0x116ad30	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func1+0xf0		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:148

1523003 11 @ 0x472ddb 0x47131e 0x1154c7e 0x116bac5 0x116ac12 0x469081
#	0x472dda	sync.(*Mutex).Lock+0x3a										sync/mutex.go:81
#	0x47131d	sync.(*Cond).Wait+0x9d										sync/cond.go:57
#	0x1154c7d	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*relayLog).Fetch+0x13d		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/relaylog.go:98
#	0x116bac4	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyEvents+0x204		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:343
#	0x116ac11	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func2+0x31	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:155

323686 7 @ 0x121108c 0x1211048 0x121fbf4 0x121fac8 0x11f2619 0x469081
#	0x121108b	sync.(*Mutex).Lock+0x8b										sync/mutex.go:81
#	0x1211047	vitess.io/vitess/go/vt/vttablet/tabletserver.(*stateManager).StartRequest+0x47			vitess.io/vitess/go/vt/vttablet/tabletserver/state_manager.go:348
#	0x121fbf3	vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).execDML+0x73			vitess.io/vitess/go/vt/vttablet/tabletserver/tabletserver.go:1078
#	0x121fac7	vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).PurgeMessages+0x67			vitess.io/vitess/go/vt/vttablet/tabletserver/tabletserver.go:1071
#	0x11f2618	vitess.io/vitess/go/vt/vttablet/tabletserver/messager.(*messageManager).runPurge.func1+0x178	vitess.io/vitess/go/vt/vttablet/tabletserver/messager/message_manager.go:802

Everything else looks like noise. tabletserver.(*stateManager).StartRequest sure looks fishy, but there's no counterpart for the Lock on that mutex. It seems like it shouldn't be blocking at all. Could it be some missing data from the dump?

@vmg
Collaborator

vmg commented Mar 8, 2022

FWIW this appears to be a bug unrelated to what we were seeing before, as there are no WaitGroups involved.

@derekperkins
Member Author

derekperkins commented Mar 9, 2022

Thanks for looking into this again, I really appreciate it.

is the block profile gist complete?

I'm not 100% sure how it gets populated. I enabled the blocking profile and then grabbed the dump, but I didn't wait very long since it was an outage, maybe 30 seconds tops, so if it takes longer than that to fully populate, it may have been a partial dump.
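
For context, this is roughly how block profiling works in any Go service (a generic sketch, not vttablet-specific flags or code): the runtime only starts recording contention events once a profile rate is set, so a dump grabbed shortly afterwards covers just that window of time.

// Generic sketch of enabling block/mutex profiling in a Go program.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
)

func main() {
	runtime.SetBlockProfileRate(1)     // record every blocking event
	runtime.SetMutexProfileFraction(1) // record every mutex contention event

	// Fetch with e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/block
	//   go tool pprof http://localhost:6060/debug/pprof/mutex
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}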

I have a sneaking suspicion that it was related to the lack of an index, causing the vstream to block somewhere. Over the weekend we had a 10x spike in our workload, causing all of our queues to back up into the millions of rows. A few weeks ago, the messager just stopped running entirely when the queue got too long, and after a lot of debugging, I ran EXPLAIN on the vstream query. I thought that it would use KEY next_idx (time_next, priority), but EXPLAIN showed it bypassing that index and doing a full table scan. After adding a new index, KEY priority__time_next (priority, time_next DESC), it started processing again and ran great.

Fast forward a few weeks: I hadn't yet added that same index to all our other message tables, and I had queues that were not processing and otherwise behaving just like the original deadlocking issue here, where >50% of shards would fail to reparent, stopping at the same place in the logs as before. After adding that index to all the tables, the "deadlocks"/"blocking" stopped happening, and now I can't seem to reproduce it.

Query setup
select priority, time_next, epoch, time_acked, other_cols from msg_tbl where time_next < NOW()*1e9 order by priority, time_next desc limit 10000
https://github.com/vitessio/vitess/blob/main/go/vt/vttablet/tabletserver/messager/message_manager.go#L254-L256

Vstream execution
https://github.com/vitessio/vitess/blob/main/go/vt/vttablet/tabletserver/messager/message_manager.go#L910-L921

CREATE TABLE `msg_tbl` (
  `id` bigint NOT NULL,
  `keyspace_id` bigint NOT NULL,
  `priority` tinyint NOT NULL DEFAULT '50',
  `epoch` smallint NOT NULL DEFAULT '0',
  `time_next` bigint DEFAULT NULL,
  `time_acked` bigint DEFAULT NULL,
  `time_scheduled` bigint NOT NULL,
  `time_created` bigint NOT NULL,
  `attributes` json DEFAULT NULL,
  `data` json NOT NULL,
  PRIMARY KEY (`id`),
  KEY `ack_idx` (`time_acked`),
  KEY `next_idx` (`time_next`,`priority`),
  KEY `priority__time_next` (`priority`,`time_next` DESC) # newly added index
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED COMMENT='vitess_message,vt_ack_wait=30,vt_min_backoff=30,vt_max_backoff=600,vt_purge_after=604800,vt_batch_size=10,vt_cache_size=10000,vt_poller_interval=30'

As I'm typing this, I'm remembering that at least some of the time while the large queues were blocked, they would still process streaming/ongoing messages. So maybe polling for backed-up messages wasn't working while messages still came through the live stream... I'm not super clear on exactly how the vstream mechanism works, so I'm not sure my mental model is correct.

@derekperkins
Member Author

I'm wondering if what was actually happening was the same as #9827. The query timed out because of the full table scan, then deadlocked. In your list of possible culprits, 2 of them are inside that applyEvents function. Hopefully that's the case, which means it's already fixed. :)

@vmg
Collaborator

vmg commented Mar 10, 2022

Ah yes, @rohit-nayak-ps's #9833 could indeed be a fix for this. Worth trying in production before we dig further.

@derekperkins
Member Author

I deployed yesterday from this branch, which has Rohit's commit from #9833, but today I still saw it deadlock and stop at the same log entry on 10 of 16 shards in one keyspace.

I0317 04:56:50.637023    9434 rpc_replication.go:388] DemotePrimary
I0317 04:56:50.638689    9434 rpc_replication.go:438] DemotePrimary disabling query service
I0317 04:56:50.638703    9434 state_manager.go:214] Starting transition to PRIMARY Not Serving, timestamp: 2022-03-17 04:25:38.205679188 +0000 UTC
I0317 04:56:50.638773    9434 tablegc.go:212] TableGC: closing

I dumped the debugging profiles again, and this time let it sit for longer, in case I missed any blocking profiles last time.
https://gist.github.com/derekperkins/dd6d54809a98b582c03909061e639766

Trying to eliminate the noise, these entries look very similar to the ones you called out earlier.

5015850765396 102619 @ 0x47158c 0x11fc23e 0x1212b1a 0x1211bb2 0x469301
#	0x47158b	sync.(*Cond).Wait+0x8b										sync/cond.go:56
#	0x11fc23d	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*relayLog).Fetch+0x13d		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/relaylog.go:98
#	0x1212b19	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyEvents+0x219		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:346
#	0x1211bb1	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func2+0x31	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:155

3113077 63 @ 0x11fbe65 0x11fbe3e 0x1211d4e 0x136b910 0x11fb369 0x1211cd1 0x469301
#	0x11fbe64	sync.(*Mutex).Lock+0x84											sync/mutex.go:81
#	0x11fbe3d	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*relayLog).Send+0x5d			vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/relaylog.go:69
#	0x1211d4d	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func1.1+0x2d	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:149
#	0x136b90f	vitess.io/vitess/go/vt/vttablet/grpctabletconn.(*gRPCQueryClient).VStream+0xef				vitess.io/vitess/go/vt/vttablet/grpctabletconn/conn.go:657
#	0x11fb368	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*tabletConnector).VStream+0x68		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/external_connector.go:174
#	0x1211cd0	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func1+0xf0		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:148

2170168 9 @ 0x47305b 0x47159e 0x11fc23e 0x1212b1a 0x1211bb2 0x469301
#	0x47305a	sync.(*Mutex).Lock+0x3a										sync/mutex.go:81
#	0x47159d	sync.(*Cond).Wait+0x9d										sync/cond.go:57
#	0x11fc23d	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*relayLog).Fetch+0x13d		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/relaylog.go:98
#	0x1212b19	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyEvents+0x219		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:346
#	0x1211bb1	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func2+0x31	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:155

518083 14 @ 0x127ae8c 0x127ae48 0x1289174 0x1289048 0x125bbb9 0x469301
#	0x127ae8b	sync.(*Mutex).Lock+0x8b										sync/mutex.go:81
#	0x127ae47	vitess.io/vitess/go/vt/vttablet/tabletserver.(*stateManager).StartRequest+0x47			vitess.io/vitess/go/vt/vttablet/tabletserver/state_manager.go:348
#	0x1289173	vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).execDML+0x73			vitess.io/vitess/go/vt/vttablet/tabletserver/tabletserver.go:1031
#	0x1289047	vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).PurgeMessages+0x67			vitess.io/vitess/go/vt/vttablet/tabletserver/tabletserver.go:1024
#	0x125bbb8	vitess.io/vitess/go/vt/vttablet/tabletserver/messager.(*messageManager).runPurge.func1+0x178	vitess.io/vitess/go/vt/vttablet/tabletserver/messager/message_manager.go:802

142694 3 @ 0xf9a818 0xf9a7f6 0x1217e25 0x1214351 0x121314b 0x1211bb2 0x469301
#	0xf9a817	sync.(*Mutex).Lock+0x77										sync/mutex.go:81
#	0xf9a7f5	vitess.io/vitess/go/streamlog.(*StreamLogger).Send+0x55						vitess.io/vitess/go/streamlog/streamlog.go:95
#	0x1217e24	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*VrLogStats).Send+0x164		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vrlog.go:62
#	0x1214350	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyEvent+0xe70		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:500
#	0x121314a	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyEvents+0x84a		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:399
#	0x1211bb1	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func2+0x31	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:155

89707 4 @ 0x11fc165 0x11fc144 0x1212b1a 0x1211bb2 0x469301
#	0x11fc164	sync.(*Mutex).Lock+0x64										sync/mutex.go:81
#	0x11fc143	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*relayLog).Fetch+0x43		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/relaylog.go:89
#	0x1212b19	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyEvents+0x219		vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:346
#	0x1211bb1	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func2+0x31	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:155

28755 2 @ 0x9a2932 0x9a2908 0x9b1cb5 0x9b1c78 0x9acc27 0x9c431f 0x4a8d7a 0x9c4245 0x9c421b 0x9eef0b 0x9ef6e6 0x9efd8e 0xa013f0 0xa007a5 0x9ff636 0xa004df 0xb28637 0x1352a8c 0x136b8b9 0x11fb369 0x1211cd1 0x469301
#	0x9a2931	sync.(*Mutex).Lock+0x51										sync/mutex.go:81
#	0x9a2907	google.golang.org/grpc/internal/transport.(*controlBuffer).executeAndPut+0x27			google.golang.org/grpc@v1.39.0/internal/transport/controlbuf.go:329
#	0x9b1cb4	google.golang.org/grpc/internal/transport.(*controlBuffer).put+0x74				google.golang.org/grpc@v1.39.0/internal/transport/controlbuf.go:323
#	0x9b1c77	google.golang.org/grpc/internal/transport.(*http2Client).updateWindow+0x37			google.golang.org/grpc@v1.39.0/internal/transport/http2_client.go:969
#	0x9acc26	google.golang.org/grpc/internal/transport.(*http2Client).newStream.func3+0x26			google.golang.org/grpc@v1.39.0/internal/transport/http2_client.go:442
#	0x9c431e	google.golang.org/grpc/internal/transport.(*transportReader).Read+0x7e				google.golang.org/grpc@v1.39.0/internal/transport/transport.go:488
#	0x4a8d79	io.ReadAtLeast+0x99										io/io.go:328
#	0x9c4244	io.ReadFull+0xa4										io/io.go:347
#	0x9c421a	google.golang.org/grpc/internal/transport.(*Stream).Read+0x7a					google.golang.org/grpc@v1.39.0/internal/transport/transport.go:467
#	0x9eef0a	google.golang.org/grpc.(*parser).recvMsg+0xca							google.golang.org/grpc@v1.39.0/rpc_util.go:578
#	0x9ef6e5	google.golang.org/grpc.recvAndDecompress+0x65							google.golang.org/grpc@v1.39.0/rpc_util.go:690
#	0x9efd8d	google.golang.org/grpc.recv+0x6d								google.golang.org/grpc@v1.39.0/rpc_util.go:758
#	0xa013ef	google.golang.org/grpc.(*csAttempt).recvMsg+0x2af						google.golang.org/grpc@v1.39.0/stream.go:957
#	0xa007a4	google.golang.org/grpc.(*clientStream).RecvMsg.func1+0x24					google.golang.org/grpc@v1.39.0/stream.go:808
#	0x9ff635	google.golang.org/grpc.(*clientStream).withRetry+0x2f5						google.golang.org/grpc@v1.39.0/stream.go:662
#	0xa004de	google.golang.org/grpc.(*clientStream).RecvMsg+0x11e						google.golang.org/grpc@v1.39.0/stream.go:807
#	0xb28636	github.com/grpc-ecosystem/go-grpc-prometheus.(*monitoredClientStream).RecvMsg+0x36		github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/client_metrics.go:160
#	0x1352a8b	vitess.io/vitess/go/vt/proto/queryservice.(*queryVStreamClient).Recv+0x4b			vitess.io/vitess/go/vt/proto/queryservice/queryservice_grpc.pb.go:461
#	0x136b8b8	vitess.io/vitess/go/vt/vttablet/grpctabletconn.(*gRPCQueryClient).VStream+0x98			vitess.io/vitess/go/vt/vttablet/grpctabletconn/conn.go:648
#	0x11fb368	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*tabletConnector).VStream+0x68	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/external_connector.go:174
#	0x1211cd0	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func1+0xf0	vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:148

@derekperkins
Member Author

I'm not sure what is happening, but this has been exponentially worse since rolling to v13. I can't keep our cluster up and running. Here's a new batch of dumps: this time I took them from a tablet where messages were no longer being delivered, but before calling PlannedReparentShard where the tablet gets stuck shutting down. I also included a full log dump
https://gist.github.com/derekperkins/22a33c2b6bd77c000cc3001c5dd6bd09

@vmg
Collaborator

vmg commented Mar 21, 2022

@deepthi @rohit-nayak-ps: we need to talk about this in the weekly sync. There's a regression in v13 and it's not related to the previous bug at all. It looks like we just introduced a different issue here.

@mattlord
Contributor

I'm re-closing this and have moved the ongoing discussion/work here: #9936

There's a lot of history and different info in this issue that is making it harder to follow.
