
Bug Report: race condition with PlannedReparentShard can leave a shard with no PRIMARY #9819

Closed
darenseagrave opened this issue Mar 3, 2022 · 6 comments · Fixed by #9859

Comments

@darenseagrave

darenseagrave commented Mar 3, 2022

Overview of the Issue

A race condition between the reparenting code and the replManager can cause a newly promoted PRIMARY to change its tablet type back to REPLICA and reconnect to the old PRIMARY, resulting in chained replication and no PRIMARY in the shard.

This is caused by repairReplication being called, which runs:

        // Change our type to REPLICA if we used to be PRIMARY.
        // Being sent SetReplicationSource means another PRIMARY has been successfully promoted,
        // so we convert to REPLICA first, since we want to do it even if other
        // steps fail below.
        // Note it is important to check for PRIMARY here so that we don't
        // unintentionally change the type of RDONLY tablets
        tablet := tm.Tablet()
        if tablet.Type == topodatapb.TabletType_PRIMARY {
                if err := tm.tmState.ChangeTabletType(ctx, topodatapb.TabletType_REPLICA, DBActionNone); err != nil {
                        return err
                }
        }

In the following code:

func (ts *tmState) ChangeTabletType(ctx context.Context, tabletType topodatapb.TabletType, action DBAction)

We change the tablet type to PRIMARY, and at the very last step call:

ts.updateLocked(ctx)

At this point it calls:

ts.tm.replManager.SetTabletType(ts.tablet.Type)

Which then runs:

        if !topo.IsReplicaType(tabletType) {
                rm.ticks.Stop()
                return
        }
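
For context on that last check, here is a minimal sketch of how a ticker-driven repair loop and SetTabletType could interact. The names are hypothetical and a bool stands in for the topo.IsReplicaType check; this is not the actual Vitess replication manager, just an illustration of the mechanism:

```go
package main

import (
	"fmt"
	"time"
)

// replManager is a simplified stand-in: a periodic tick calls
// repairReplication, and SetTabletType stops the ticks once the tablet is
// no longer a replica type.
type replManager struct {
	ticker *time.Ticker
	done   chan struct{}
}

// loop fires repairReplication on every tick until done is closed.
func (rm *replManager) loop() {
	for {
		select {
		case <-rm.ticker.C:
			rm.repairReplication()
		case <-rm.done:
			return
		}
	}
}

func (rm *replManager) repairReplication() {
	// In the real code this is where a tablet that still reports itself as
	// PRIMARY is demoted to REPLICA and re-pointed at the old primary.
	fmt.Println("tick: repairReplication ran")
}

// SetTabletType mirrors the snippet above: ticks only keep running for
// replica types, so a promotion to PRIMARY stops them here.
func (rm *replManager) SetTabletType(isReplicaType bool) {
	if !isReplicaType {
		rm.ticker.Stop()
	}
}

func main() {
	rm := &replManager{ticker: time.NewTicker(10 * time.Millisecond), done: make(chan struct{})}
	go rm.loop()
	// Simulate a promotion that takes longer than one tick interval: every
	// tick that fires before SetTabletType(false) is inside the race window.
	time.Sleep(50 * time.Millisecond)
	rm.SetTabletType(false)
	close(rm.done)
}
```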

If, however, you run a very low ticks interval, you risk triggering this logic during the PlannedReparentShard call.

Either replManager needs to be stopped pre-reparent on the newly elected PRIMARY, or notified sooner (via ts.tm.replManager.SetTabletType), or locking needs to be used to prevent replManager from doing anything until ChangeTabletType can call its SetTabletType function.
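
As an illustration of the third option only, here is a minimal sketch of serializing the repair tick and the type change on one mutex. The types and names are stand-ins (a string in place of topodatapb.TabletType), not the actual Vitess code:

```go
package main

import (
	"fmt"
	"sync"
)

// tabletState sketches the locking idea: the repair tick and the
// tablet-type change serialize on one mutex, so a tick can never observe a
// half-finished promotion.
type tabletState struct {
	mu         sync.Mutex
	tabletType string // "PRIMARY", "REPLICA", ...
	stopTicks  func() // stand-in for replManager.SetTabletType stopping the ticker
}

// changeTabletType holds the lock for the whole transition, up to and
// including the call that tells the replication manager to stop ticking.
func (ts *tabletState) changeTabletType(newType string) {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	ts.tabletType = newType
	// ... update the topo record, semi-sync, query service, etc. ...
	if newType == "PRIMARY" {
		ts.stopTicks()
	}
}

// repairTick runs on every replication-manager tick. Taking the same lock
// means it runs either entirely before or entirely after the promotion, so
// the PRIMARY check below reliably skips the repair.
func (ts *tabletState) repairTick() {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	if ts.tabletType == "PRIMARY" {
		return
	}
	fmt.Println("tick: repairing replication")
}

func main() {
	ts := &tabletState{
		tabletType: "REPLICA",
		stopTicks:  func() { fmt.Println("replication manager ticks stopped") },
	}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		ts.repairTick() // a tick racing with the promotion below
	}()
	ts.changeTabletType("PRIMARY")
	wg.Wait()
}
```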

Reproduction Steps

Run a low ticks interval on `replManager` and perform many reparents (50,000+) until you trigger the race condition; we see it roughly once every 2,000-3,000 reparent operations.

Binary Version

All versions

Operating System and Environment details

Ubuntu

Log Fragments

I0302 04:41:22.165654   31100 rpc_replication.go:500] SetMaster: parent: cell:"us_east_1b" uid:300070247   position: MySQL56/05f7f9b6-3ffa-11eb-8359-06e67fc85c0b:1-34553431,074ed07d-b7ea-11ea-855b-12b9ead9b8d5:1-94816103,106249bb-5202-11ec-aa69-0eb89e4a547b:1-95325967,181c54d0-b7ea-11ea-bd33-0adbcb2ea71d:1-150899991,51015fba-2425-11ec-a521-0a74fe16d485:1-65074818,6812ea1c-20a9-11ec-80cf-0686595f79e1:1-104961644,7367f738-a270-11eb-bf82-0a55b82c2c31:1-398918951,c2457e06-a20c-11eb-81f2-0e3d9df8a2cf:1-201451504,ccd2a388-b7e9-11ea-b5bb-0e1491fb22a3:1-495049663,df8e974a-2295-11ec-ae98-12053be46ec9:1-143090904,e3e08b56-6e90-11ec-aee1-0a7c54116979:1-273670185,f828a151-b7f3-11ea-aeb2-06ab19f93555:1-25727035 force: true
 I0302 04:41:22.167869   31100 replication.go:467] Setting semi-sync mode: master=false, replica=true
 I0302 04:41:22.168433   31100 query.go:81] exec SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 1
 I0302 04:41:22.179560   31100 rpc_server.go:84] TabletManager.SetMaster(parent:<cell:"us_east_1b" uid:300070247 > force_start_replication:true wait_position:"MySQL56/05f7f9b6-3ffa-11eb-8359-06e67fc85c0b:1-34553431,074ed07d-b7ea-11ea-855b-12b9ead9b8d5:1-94816103,106249bb-5202-11ec-aa69-0eb89e4a547b:1-95325967,181c54d0-b7ea-11ea-bd33-0adbcb2ea71d:1-150899991,51015fba-2425-11ec-a521-0a74fe16d485:1-65074818,6812ea1c-20a9-11ec-80cf-0686595f79e1:1-104961644,7367f738-a270-11eb-bf82-0a55b82c2c31:1-398918951,c2457e06-a20c-11eb-81f2-0e3d9df8a2cf:1-201451504,ccd2a388-b7e9-11ea-b5bb-0e1491fb22a3:1-495049663,df8e974a-2295-11ec-ae98-12053be46ec9:1-143090904,e3e08b56-6e90-11ec-aee1-0a7c54116979:1-273670185,f828a151-b7f3-11ea-aeb2-06ab19f93555:1-25727035" )(on us_east_1a-0300010191 from ): (*tabletmanagerdata.SetMasterResponse)(nil)
 I0302 04:41:23.171746   31100 rpc_replication.go:74] WaitForPosition: MySQL56/05f7f9b6-3ffa-11eb-8359-06e67fc85c0b:1-34553431,074ed07d-b7ea-11ea-855b-12b9ead9b8d5:1-94816103,106249bb-5202-11ec-aa69-0eb89e4a547b:1-95325967,181c54d0-b7ea-11ea-bd33-0adbcb2ea71d:1-150899991,51015fba-2425-11ec-a521-0a74fe16d485:1-65074818,6812ea1c-20a9-11ec-80cf-0686595f79e1:1-104961644,7367f738-a270-11eb-bf82-0a55b82c2c31:1-398918951,c2457e06-a20c-11eb-81f2-0e3d9df8a2cf:1-201451504,ccd2a388-b7e9-11ea-b5bb-0e1491fb22a3:1-495049663,df8e974a-2295-11ec-ae98-12053be46ec9:1-143090904,e3e08b56-6e90-11ec-aee1-0a7c54116979:1-273670186,f828a151-b7f3-11ea-aeb2-06ab19f93555:1-25727035
 I0302 04:41:23.173059   31100 rpc_replication.go:753] PromoteReplica
 I0302 04:41:23.173416   31100 query.go:81] exec STOP SLAVE
 I0302 04:41:23.187483   31100 query.go:81] exec RESET SLAVE ALL
 I0302 04:41:23.201981   31100 query.go:81] exec FLUSH BINARY LOGS
 I0302 04:41:23.216953   31100 replication.go:467] Setting semi-sync mode: master=true, replica=true
 I0302 04:41:23.217159   31100 query.go:81] exec SET GLOBAL rpl_semi_sync_master_enabled = 1, GLOBAL rpl_semi_sync_slave_enabled = 1
 I0302 04:41:23.217577   31100 tm_state.go:176] Changing Tablet Type: MASTER
 I0302 04:41:23.234885   31100 query.go:81] exec SET GLOBAL read_only = OFF
 I0302 04:41:23.237395   31100 engine.go:186] VReplication Engine: opening
 I0302 04:41:23.240404   31100 state_manager.go:212] Starting transition to MASTER Serving, timestamp: 2022-03-02 12:41:23.217844915 +0000 UTC
 I0302 04:41:23.241793   31100 repltracker.go:86] Replication Tracker: going into master mode
 I0302 04:41:23.242156   31100 tx_engine.go:156] TxEngine transition: AcceptReadWrite
 I0302 04:41:23.242384   31100 tx_engine.go:306] Immediate shutdown: rolling back now.
 I0302 04:41:23.242651   31100 stateful_connection_pool.go:79] Starting transaction id: {1644654709061614481}
 I0302 04:41:23.242964   31100 engine.go:82] Messager: opening
 I0302 04:41:23.243633   31100 tablegc.go:171] TableGC: opening
 I0302 04:41:23.246167   31100 state_manager.go:561] TabletServer transition: REPLICA: Serving -> MASTER: Serving, Mar 2, 2022 at 04:41:23 (PST) for tablet :keyspace/shard
 I0302 04:41:23.246361   31100 state_manager.go:661] State: exiting lameduck
 I0302 04:41:23.246618   31100 metadata_tables.go:183] Populating _vt.local_metadata table...
 I0302 04:41:23.380246   31100 tm_state.go:358] Publishing state: alias:<cell:"us_east_1a" uid:300010191 > hostname:"tablet-keyspace-shard-y7tj" port_map:<key:"grpc" value:15999 > port_map:<key:"vt" value:15000 > keyspace:"keyspace" shard:"shard" key_range:<start:"\021\020" end:"\021\021" > type:MASTER mysql_hostname:"tablet-keyspace-shard-y7tj" mysql_port:3306 master_term_start_time:<seconds:1646224883 nanoseconds:217844915 >
 I0302 04:41:23.389805   31100 rpc_server.go:84] TabletManager.PromoteReplica()(on us_east_1a-0300010191 from ): (*tabletmanagerdata.PromoteReplicaResponse)(nil)
 I0302 04:41:23.389897   31100 shard_sync.go:70] Change to tablet state
 I0302 04:41:23.391123   31100 rpc_replication.go:271] PopulateReparentJournal: action: PlannedReparentShard parent: cell:"us_east_1a" uid:300010191   position: MySQL56/05f7f9b6-3ffa-11eb-8359-06e67fc85c0b:1-34553431,074ed07d-b7ea-11ea-855b-12b9ead9b8d5:1-94816103,106249bb-5202-11ec-aa69-0eb89e4a547b:1-95325967,181c54d0-b7ea-11ea-bd33-0adbcb2ea71d:1-150899991,51015fba-2425-11ec-a521-0a74fe16d485:1-65074818,6812ea1c-20a9-11ec-80cf-0686595f79e1:1-104961644,7367f738-a270-11eb-bf82-0a55b82c2c31:1-398918951,c2457e06-a20c-11eb-81f2-0e3d9df8a2cf:1-201451504,ccd2a388-b7e9-11ea-b5bb-0e1491fb22a3:1-495049663,df8e974a-2295-11ec-ae98-12053be46ec9:1-143090904,e3e08b56-6e90-11ec-aee1-0a7c54116979:1-273670186,f828a151-b7f3-11ea-aeb2-06ab19f93555:1-25727035
 I0302 04:41:23.392767   31100 query.go:81] exec CREATE DATABASE IF NOT EXISTS _vt
 I0302 04:41:23.488014   31100 query.go:81] exec CREATE TABLE IF NOT EXISTS _vt.reparent_journal (
   time_created_ns BIGINT UNSIGNED NOT NULL,
   action_name VARBINARY(250) NOT NULL,
   master_alias VARBINARY(32) NOT NULL,
   replication_position VARBI
 I0302 04:41:23.496922   31100 query.go:81] exec INSERT INTO _vt.reparent_journal (time_created_ns, action_name, master_alias, replication_position) VALUES (1646224883390435233, 'PlannedReparentShard', 'us_east_1a-0300010191', 'MySQL56/05f7f9b6-3ffa
 I0302 04:41:24.212500   31100 tablegc.go:231] TableGC: transition into primary
 I0302 04:41:24.390756   31100 replmanager.go:118] Replication is stopped, reconnecting to master.
 I0302 04:41:25.001714   31100 shard_sync.go:175] Updating shard record: master_alias=us_east_1a-0300010191, master_term_start_time=2022-03-02 12:41:23.217844915 +0000 UTC
 I0302 04:41:25.009070   31100 shard_watcher.go:39] Starting shard watch of keyspace/shard
 I0302 04:41:25.019842   31100 locks.go:355] Locking shard keyspace/shard for action repairReplication to us_east_1b-0300070247 as parent)
 I0302 04:41:25.038288   31100 tm_state.go:176] Changing Tablet Type: REPLICA
 I0302 04:41:25.039026   31100 engine.go:300] VReplication Engine: closed
 I0302 04:41:25.039055   31100 state_manager.go:212] Starting transition to REPLICA Serving, timestamp: 0001-01-01 00:00:00 +0000 UTC
 I0302 04:41:25.039106   31100 tablegc.go:191] TableGC: closing
 I0302 04:41:25.039124   31100 engine.go:102] Messager: closed
 I0302 04:41:25.040088   31100 tx_engine.go:156] TxEngine transition: AcceptingReadOnly
 I0302 04:41:25.040149   31100 tx_engine.go:322] No grace period specified: performing normal wait.
 I0302 04:41:25.042703   31100 stateful_connection_pool.go:79] Starting transaction id: {1644654709061614778}
 I0302 04:41:25.042970   31100 repltracker.go:102] Replication Tracker: going into non-master mode a
 I0302 04:41:25.044453   31100 state_manager.go:561] TabletServer transition: MASTER: Serving, Dec 31, 0000 at 16:07:02 (LMT) -> REPLICA: Serving for tablet :keyspace/shard
 I0302 04:41:25.044488   31100 state_manager.go:661] State: exiting lameduck
 I0302 04:41:25.044520   31100 metadata_tables.go:217] Upserting _vt.local_metadata ...
 I0302 04:41:25.045484   31100 state_manager.go:626] Going unhealthy due to replication error: no replication status
 I0302 04:41:25.055213   31100 tm_state.go:358] Publishing state: alias:<cell:"us_east_1a" uid:300010191 > hostname:"tablet-keyspace-shard-y7tj" port_map:<key:"grpc" value:15999 > port_map:<key:"vt" value:15000 > keyspace:"keyspace" shard:"shard" key_range:<start:"\021\020" end:"\021\021" > type:REPLICA mysql_hostname:"tablet-keyspace-shard-y7tj" mysql_port:3306
 I0302 04:41:25.084628   31100 shard_sync.go:70] Change to tablet state
 I0302 04:41:25.084672   31100 shard_watcher.go:56] Stopping shard watch...
 I0302 04:41:25.085061   31100 shard_watcher.go:62] Shard watch stopped.
 I0302 04:41:25.085468   31100 replication.go:467] Setting semi-sync mode: master=false, replica=true
 I0302 04:41:25.085735   31100 query.go:81] exec SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 1
 I0302 04:41:26.070417   31100 query.go:81] exec CHANGE MASTER TO
   MASTER_HOST = 'tablet-keyspace-shard-355a',
   MASTER_PORT = 3306,
   MASTER_USER = 'repl_username',
   MASTER_PASSWORD = '************',
   MASTER_CONNECT_RETRY = 10,
   MASTER_AUTO_POSITI
 I0302 04:41:26.075087   31100 query.go:81] exec START SLAVE
 I0302 04:41:26.076268   31100 locks.go:395] Unlocking shard keyspace/shard for successful action repairReplication to us_east_1b-0300070247 as parent)
 I0302 04:41:26.095803   31100 replmanager.go:129] Successfully reconnected to master.
 I0302 04:41:27.047427   31100 state_manager.go:637] Replication is healthy
@darenseagrave darenseagrave added Needs Triage This issue needs to be correctly labelled and triaged Type: Bug labels Mar 3, 2022
@deepthi deepthi added Component: Cluster management and removed Needs Triage This issue needs to be correctly labelled and triaged labels Mar 7, 2022
@deepthi
Member

deepthi commented Mar 7, 2022

@darenseagrave can you give us the specific Vitess version / commit SHA you ran into this with?
Alternatively, can you confirm this is still an issue on main with the latest commit SHA? The code snippet you provided tells me that this is an older version.

@deepthi deepthi self-assigned this Mar 7, 2022
@darenseagrave
Author

Hi @deepthi, we've hit the bug in Vitess 10, but I've confirmed this is also an issue on main. The log snippet is from our branch, but the code snippets I referenced are from main.

@deepthi
Member

deepthi commented Mar 7, 2022

What is your ticks interval set to?
You've answered my other question: it seems that you ran into this during normal operation, and the 50k+ calls to PRS are merely to reproduce the issue. We will look into it.

/assign @vitessio/cluster-management

@derekperkins
Member

This happened to us over the weekend. I don't have logs to prove that it was this specifically, but after we had some flapping on a specific shard (this has resurfaced for us: #8909), we ended up with no primaries.

@GuptaManan100
Member

I checked the code, and the order of operations for PromoteReplica is:

  1. Fix MySQL by resetting replication and fixing the semi-sync settings
  2. Change the tablet type in the topo server (tablet record)
  3. Update the local state of the tablet manager to PRIMARY
  4. Stop the replication manager
  5. Start a shard sync loop, which will update the topo server (shard record)
  6. Start replication

So any replication manager tick that runs after the first step and before the fourth step will cause this issue. A possible fix is to stop the replication manager much sooner in the PromoteReplica sequence. WDYT @deepthi?

@GuptaManan100 GuptaManan100 self-assigned this Mar 8, 2022
@deepthi
Copy link
Member

deepthi commented Mar 9, 2022

I think the very first thing we should do in PromoteReplica is to stop the replication manager. We may need to add a defer call to restart it if we fail to promote and return an error.
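
A minimal sketch of that ordering, using stand-in types rather than the real TabletManager/replManager code, could look like this:

```go
package main

import (
	"errors"
	"fmt"
)

// tabletManager is a stand-in; this is not the actual Vitess PromoteReplica
// code. The replication manager is stopped before any other promotion work
// and restarted only if the promotion fails.
type tabletManager struct {
	replTicksRunning bool
}

func (tm *tabletManager) stopReplManager()  { tm.replTicksRunning = false }
func (tm *tabletManager) startReplManager() { tm.replTicksRunning = true }

func (tm *tabletManager) promoteReplica() (err error) {
	// Very first step: stop the repair ticks so none can fire mid-promotion.
	tm.stopReplManager()
	defer func() {
		if err != nil {
			// Promotion failed: the tablet is still a replica, so resume the
			// replication manager's repair ticks.
			tm.startReplManager()
		}
	}()

	// ... fix MySQL / semi-sync, change the tablet type in the topo server,
	// update local state, start shard sync, etc. ...
	return errors.New("simulated promotion failure")
}

func main() {
	tm := &tabletManager{replTicksRunning: true}
	if err := tm.promoteReplica(); err != nil {
		fmt.Println("promotion failed; repair ticks running again:", tm.replTicksRunning)
	}
}
```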
