
Bug Report: race condition with PlannedReparentShard can leave a shard with no PRIMARY #9819

Closed
darenseagrave opened this issue Mar 3, 2022 · 6 comments · Fixed by #9859

Comments

@darenseagrave

darenseagrave commented Mar 3, 2022

Overview of the Issue

A race condition between the reparenting code and the replManager can cause a newly promoted PRIMARY to change its tablet type back to REPLICA and reconnect to the old PRIMARY, resulting in chained replication and no PRIMARY in the shard.

This is caused by repairReplication being called, which runs:

        // Change our type to REPLICA if we used to be PRIMARY.
        // Being sent SetReplicationSource means another PRIMARY has been successfully promoted,
        // so we convert to REPLICA first, since we want to do it even if other
        // steps fail below.
        // Note it is important to check for PRIMARY here so that we don't
        // unintentionally change the type of RDONLY tablets
        tablet := tm.Tablet()
        if tablet.Type == topodatapb.TabletType_PRIMARY {
                if err := tm.tmState.ChangeTabletType(ctx, topodatapb.TabletType_REPLICA, DBActionNone); err != nil {
                        return err
                }
        }

In the following code:

func (ts *tmState) ChangeTabletType(ctx context.Context, tabletType topodatapb.TabletType, action DBAction)

We change the tablet type to PRIMARY, and at the very last step call:

ts.updateLocked(ctx)

At this point it calls:

ts.tm.replManager.SetTabletType(ts.tablet.Type)

Which then runs:

        if !topo.IsReplicaType(tabletType) {
                rm.ticks.Stop()
                return
        }
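
For context on that last check, here is a minimal sketch of how a ticker-driven repair loop and SetTabletType could interact. The names are hypothetical and a bool stands in for the topo.IsReplicaType check; this is not the actual Vitess replication manager, just an illustration of the mechanism:

```go
package main

import (
	"fmt"
	"time"
)

// replManager is a simplified stand-in: a periodic tick calls
// repairReplication, and SetTabletType stops the ticks once the tablet is
// no longer a replica type.
type replManager struct {
	ticker *time.Ticker
	done   chan struct{}
}

// loop fires repairReplication on every tick until done is closed.
func (rm *replManager) loop() {
	for {
		select {
		case <-rm.ticker.C:
			rm.repairReplication()
		case <-rm.done:
			return
		}
	}
}

func (rm *replManager) repairReplication() {
	// In the real code this is where a tablet that still reports itself as
	// PRIMARY is demoted to REPLICA and re-pointed at the old primary.
	fmt.Println("tick: repairReplication ran")
}

// SetTabletType mirrors the snippet above: ticks only keep running for
// replica types, so a promotion to PRIMARY stops them here.
func (rm *replManager) SetTabletType(isReplicaType bool) {
	if !isReplicaType {
		rm.ticker.Stop()
	}
}

func main() {
	rm := &replManager{ticker: time.NewTicker(10 * time.Millisecond), done: make(chan struct{})}
	go rm.loop()
	// Simulate a promotion that takes longer than one tick interval: every
	// tick that fires before SetTabletType(false) is inside the race window.
	time.Sleep(50 * time.Millisecond)
	rm.SetTabletType(false)
	close(rm.done)
}
```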

If, however, you run a very low ticks interval, you risk triggering this logic during the PlannedReparentShard call.

Either replManager needs to be stopped pre-reparent on the newly elected PRIMARY, or notified sooner (via ts.tm.replManager.SetTabletType), or locking needs to be used to prevent replManager from doing anything until ChangeTabletType can call its SetTabletType function.
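
As an illustration of the third option only, here is a minimal sketch of serializing the repair tick and the type change on one mutex. The types and names are stand-ins (a string in place of topodatapb.TabletType), not the actual Vitess code:

```go
package main

import (
	"fmt"
	"sync"
)

// tabletState sketches the locking idea: the repair tick and the
// tablet-type change serialize on one mutex, so a tick can never observe a
// half-finished promotion.
type tabletState struct {
	mu         sync.Mutex
	tabletType string // "PRIMARY", "REPLICA", ...
	stopTicks  func() // stand-in for replManager.SetTabletType stopping the ticker
}

// changeTabletType holds the lock for the whole transition, up to and
// including the call that tells the replication manager to stop ticking.
func (ts *tabletState) changeTabletType(newType string) {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	ts.tabletType = newType
	// ... update the topo record, semi-sync, query service, etc. ...
	if newType == "PRIMARY" {
		ts.stopTicks()
	}
}

// repairTick runs on every replication-manager tick. Taking the same lock
// means it runs either entirely before or entirely after the promotion, so
// the PRIMARY check below reliably skips the repair.
func (ts *tabletState) repairTick() {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	if ts.tabletType == "PRIMARY" {
		return
	}
	fmt.Println("tick: repairing replication")
}

func main() {
	ts := &tabletState{
		tabletType: "REPLICA",
		stopTicks:  func() { fmt.Println("replication manager ticks stopped") },
	}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		ts.repairTick() // a tick racing with the promotion below
	}()
	ts.changeTabletType("PRIMARY")
	wg.Wait()
}
```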

Reproduction Steps

Run a low ticks interval on `replManager` and perform many reparents (50,000+) until you trigger the race condition; we see it roughly once every 2,000-3,000 reparent operations.

Binary Version

All versions

Operating System and Environment details

Ubuntu

Log Fragments

I0302 04:41:22.165654   31100 rpc_replication.go:500] SetMaster: parent: cell:"us_east_1b" uid:300070247   position: MySQL56/05f7f9b6-3ffa-11eb-8359-06e67fc85c0b:1-34553431,074ed07d-b7ea-11ea-855b-12b9ead9b8d5:1-94816103,106249bb-5202-11ec-aa69-0eb89e4a547b:1-95325967,181c54d0-b7ea-11ea-bd33-0adbcb2ea71d:1-150899991,51015fba-2425-11ec-a521-0a74fe16d485:1-65074818,6812ea1c-20a9-11ec-80cf-0686595f79e1:1-104961644,7367f738-a270-11eb-bf82-0a55b82c2c31:1-398918951,c2457e06-a20c-11eb-81f2-0e3d9df8a2cf:1-201451504,ccd2a388-b7e9-11ea-b5bb-0e1491fb22a3:1-495049663,df8e974a-2295-11ec-ae98-12053be46ec9:1-143090904,e3e08b56-6e90-11ec-aee1-0a7c54116979:1-273670185,f828a151-b7f3-11ea-aeb2-06ab19f93555:1-25727035 force: true
 I0302 04:41:22.167869   31100 replication.go:467] Setting semi-sync mode: master=false, replica=true
 I0302 04:41:22.168433   31100 query.go:81] exec SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 1
 I0302 04:41:22.179560   31100 rpc_server.go:84] TabletManager.SetMaster(parent:<cell:"us_east_1b" uid:300070247 > force_start_replication:true wait_position:"MySQL56/05f7f9b6-3ffa-11eb-8359-06e67fc85c0b:1-34553431,074ed07d-b7ea-11ea-855b-12b9ead9b8d5:1-94816103,106249bb-5202-11ec-aa69-0eb89e4a547b:1-95325967,181c54d0-b7ea-11ea-bd33-0adbcb2ea71d:1-150899991,51015fba-2425-11ec-a521-0a74fe16d485:1-65074818,6812ea1c-20a9-11ec-80cf-0686595f79e1:1-104961644,7367f738-a270-11eb-bf82-0a55b82c2c31:1-398918951,c2457e06-a20c-11eb-81f2-0e3d9df8a2cf:1-201451504,ccd2a388-b7e9-11ea-b5bb-0e1491fb22a3:1-495049663,df8e974a-2295-11ec-ae98-12053be46ec9:1-143090904,e3e08b56-6e90-11ec-aee1-0a7c54116979:1-273670185,f828a151-b7f3-11ea-aeb2-06ab19f93555:1-25727035" )(on us_east_1a-0300010191 from ): (*tabletmanagerdata.SetMasterResponse)(nil)
 I0302 04:41:23.171746   31100 rpc_replication.go:74] WaitForPosition: MySQL56/05f7f9b6-3ffa-11eb-8359-06e67fc85c0b:1-34553431,074ed07d-b7ea-11ea-855b-12b9ead9b8d5:1-94816103,106249bb-5202-11ec-aa69-0eb89e4a547b:1-95325967,181c54d0-b7ea-11ea-bd33-0adbcb2ea71d:1-150899991,51015fba-2425-11ec-a521-0a74fe16d485:1-65074818,6812ea1c-20a9-11ec-80cf-0686595f79e1:1-104961644,7367f738-a270-11eb-bf82-0a55b82c2c31:1-398918951,c2457e06-a20c-11eb-81f2-0e3d9df8a2cf:1-201451504,ccd2a388-b7e9-11ea-b5bb-0e1491fb22a3:1-495049663,df8e974a-2295-11ec-ae98-12053be46ec9:1-143090904,e3e08b56-6e90-11ec-aee1-0a7c54116979:1-273670186,f828a151-b7f3-11ea-aeb2-06ab19f93555:1-25727035
 I0302 04:41:23.173059   31100 rpc_replication.go:753] PromoteReplica
 I0302 04:41:23.173416   31100 query.go:81] exec STOP SLAVE
 I0302 04:41:23.187483   31100 query.go:81] exec RESET SLAVE ALL
 I0302 04:41:23.201981   31100 query.go:81] exec FLUSH BINARY LOGS
 I0302 04:41:23.216953   31100 replication.go:467] Setting semi-sync mode: master=true, replica=true
 I0302 04:41:23.217159   31100 query.go:81] exec SET GLOBAL rpl_semi_sync_master_enabled = 1, GLOBAL rpl_semi_sync_slave_enabled = 1
 I0302 04:41:23.217577   31100 tm_state.go:176] Changing Tablet Type: MASTER
 I0302 04:41:23.234885   31100 query.go:81] exec SET GLOBAL read_only = OFF
 I0302 04:41:23.237395   31100 engine.go:186] VReplication Engine: opening
 I0302 04:41:23.240404   31100 state_manager.go:212] Starting transition to MASTER Serving, timestamp: 2022-03-02 12:41:23.217844915 +0000 UTC
 I0302 04:41:23.241793   31100 repltracker.go:86] Replication Tracker: going into master mode
 I0302 04:41:23.242156   31100 tx_engine.go:156] TxEngine transition: AcceptReadWrite
 I0302 04:41:23.242384   31100 tx_engine.go:306] Immediate shutdown: rolling back now.
 I0302 04:41:23.242651   31100 stateful_connection_pool.go:79] Starting transaction id: {1644654709061614481}
 I0302 04:41:23.242964   31100 engine.go:82] Messager: opening
 I0302 04:41:23.243633   31100 tablegc.go:171] TableGC: opening
 I0302 04:41:23.246167   31100 state_manager.go:561] TabletServer transition: REPLICA: Serving -> MASTER: Serving, Mar 2, 2022 at 04:41:23 (PST) for tablet :keyspace/shard
 I0302 04:41:23.246361   31100 state_manager.go:661] State: exiting lameduck
 I0302 04:41:23.246618   31100 metadata_tables.go:183] Populating _vt.local_metadata table...
 I0302 04:41:23.380246   31100 tm_state.go:358] Publishing state: alias:<cell:"us_east_1a" uid:300010191 > hostname:"tablet-keyspace-shard-y7tj" port_map:<key:"grpc" value:15999 > port_map:<key:"vt" value:15000 > keyspace:"keyspace" shard:"shard" key_range:<start:"\021\020" end:"\021\021" > type:MASTER mysql_hostname:"tablet-keyspace-shard-y7tj" mysql_port:3306 master_term_start_time:<seconds:1646224883 nanoseconds:217844915 >
 I0302 04:41:23.389805   31100 rpc_server.go:84] TabletManager.PromoteReplica()(on us_east_1a-0300010191 from ): (*tabletmanagerdata.PromoteReplicaResponse)(nil)
 I0302 04:41:23.389897   31100 shard_sync.go:70] Change to tablet state
 I0302 04:41:23.391123   31100 rpc_replication.go:271] PopulateReparentJournal: action: PlannedReparentShard parent: cell:"us_east_1a" uid:300010191   position: MySQL56/05f7f9b6-3ffa-11eb-8359-06e67fc85c0b:1-34553431,074ed07d-b7ea-11ea-855b-12b9ead9b8d5:1-94816103,106249bb-5202-11ec-aa69-0eb89e4a547b:1-95325967,181c54d0-b7ea-11ea-bd33-0adbcb2ea71d:1-150899991,51015fba-2425-11ec-a521-0a74fe16d485:1-65074818,6812ea1c-20a9-11ec-80cf-0686595f79e1:1-104961644,7367f738-a270-11eb-bf82-0a55b82c2c31:1-398918951,c2457e06-a20c-11eb-81f2-0e3d9df8a2cf:1-201451504,ccd2a388-b7e9-11ea-b5bb-0e1491fb22a3:1-495049663,df8e974a-2295-11ec-ae98-12053be46ec9:1-143090904,e3e08b56-6e90-11ec-aee1-0a7c54116979:1-273670186,f828a151-b7f3-11ea-aeb2-06ab19f93555:1-25727035
 I0302 04:41:23.392767   31100 query.go:81] exec CREATE DATABASE IF NOT EXISTS _vt
 I0302 04:41:23.488014   31100 query.go:81] exec CREATE TABLE IF NOT EXISTS _vt.reparent_journal (
   time_created_ns BIGINT UNSIGNED NOT NULL,
   action_name VARBINARY(250) NOT NULL,
   master_alias VARBINARY(32) NOT NULL,
   replication_position VARBI
 I0302 04:41:23.496922   31100 query.go:81] exec INSERT INTO _vt.reparent_journal (time_created_ns, action_name, master_alias, replication_position) VALUES (1646224883390435233, 'PlannedReparentShard', 'us_east_1a-0300010191', 'MySQL56/05f7f9b6-3ffa
 I0302 04:41:24.212500   31100 tablegc.go:231] TableGC: transition into primary
 I0302 04:41:24.390756   31100 replmanager.go:118] Replication is stopped, reconnecting to master.
 I0302 04:41:25.001714   31100 shard_sync.go:175] Updating shard record: master_alias=us_east_1a-0300010191, master_term_start_time=2022-03-02 12:41:23.217844915 +0000 UTC
 I0302 04:41:25.009070   31100 shard_watcher.go:39] Starting shard watch of keyspace/shard
 I0302 04:41:25.019842   31100 locks.go:355] Locking shard keyspace/shard for action repairReplication to us_east_1b-0300070247 as parent)
 I0302 04:41:25.038288   31100 tm_state.go:176] Changing Tablet Type: REPLICA
 I0302 04:41:25.039026   31100 engine.go:300] VReplication Engine: closed
 I0302 04:41:25.039055   31100 state_manager.go:212] Starting transition to REPLICA Serving, timestamp: 0001-01-01 00:00:00 +0000 UTC
 I0302 04:41:25.039106   31100 tablegc.go:191] TableGC: closing
 I0302 04:41:25.039124   31100 engine.go:102] Messager: closed
 I0302 04:41:25.040088   31100 tx_engine.go:156] TxEngine transition: AcceptingReadOnly
 I0302 04:41:25.040149   31100 tx_engine.go:322] No grace period specified: performing normal wait.
 I0302 04:41:25.042703   31100 stateful_connection_pool.go:79] Starting transaction id: {1644654709061614778}
 I0302 04:41:25.042970   31100 repltracker.go:102] Replication Tracker: going into non-master mode a
 I0302 04:41:25.044453   31100 state_manager.go:561] TabletServer transition: MASTER: Serving, Dec 31, 0000 at 16:07:02 (LMT) -> REPLICA: Serving for tablet :keyspace/shard
 I0302 04:41:25.044488   31100 state_manager.go:661] State: exiting lameduck
 I0302 04:41:25.044520   31100 metadata_tables.go:217] Upserting _vt.local_metadata ...
 I0302 04:41:25.045484   31100 state_manager.go:626] Going unhealthy due to replication error: no replication status
 I0302 04:41:25.055213   31100 tm_state.go:358] Publishing state: alias:<cell:"us_east_1a" uid:300010191 > hostname:"tablet-keyspace-shard-y7tj" port_map:<key:"grpc" value:15999 > port_map:<key:"vt" value:15000 > keyspace:"keyspace" shard:"shard" key_range:<start:"\021\020" end:"\021\021" > type:REPLICA mysql_hostname:"tablet-keyspace-shard-y7tj" mysql_port:3306
 I0302 04:41:25.084628   31100 shard_sync.go:70] Change to tablet state
 I0302 04:41:25.084672   31100 shard_watcher.go:56] Stopping shard watch...
 I0302 04:41:25.085061   31100 shard_watcher.go:62] Shard watch stopped.
 I0302 04:41:25.085468   31100 replication.go:467] Setting semi-sync mode: master=false, replica=true
 I0302 04:41:25.085735   31100 query.go:81] exec SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 1
 I0302 04:41:26.070417   31100 query.go:81] exec CHANGE MASTER TO
   MASTER_HOST = 'tablet-keyspace-shard-355a',
   MASTER_PORT = 3306,
   MASTER_USER = 'repl_username',
   MASTER_PASSWORD = '************',
   MASTER_CONNECT_RETRY = 10,
   MASTER_AUTO_POSITI
 I0302 04:41:26.075087   31100 query.go:81] exec START SLAVE
 I0302 04:41:26.076268   31100 locks.go:395] Unlocking shard keyspace/shard for successful action repairReplication to us_east_1b-0300070247 as parent)
 I0302 04:41:26.095803   31100 replmanager.go:129] Successfully reconnected to master.
 I0302 04:41:27.047427   31100 state_manager.go:637] Replication is healthy
@darenseagrave darenseagrave added Needs Triage This issue needs to be correctly labelled and triaged Type: Bug labels Mar 3, 2022
@deepthi deepthi added Component: Cluster management and removed Needs Triage This issue needs to be correctly labelled and triaged labels Mar 7, 2022
@deepthi
Member

deepthi commented Mar 7, 2022

@darenseagrave can you give us the specific Vitess version / commit SHA you ran into this with?
Alternatively, can you confirm this is still an issue on main with the latest commit SHA? The code snippet you provided tells me that this is an older version.

@deepthi deepthi self-assigned this Mar 7, 2022
@darenseagrave
Author

Hi @deepthi, we've hit the bug in Vitess 10, but I've confirmed this is also an issue on main. The log snippet is from our branch, but the code snippets I referenced are from main.

@deepthi
Member

deepthi commented Mar 7, 2022

What is your ticks interval set to?
You've answered my other question: it seems that you ran into this during normal operation, and the 50k+ calls to PRS are merely to reproduce the issue. We will look into it.

/assign @vitessio/cluster-management

@derekperkins
Member

This happened to us over the weekend. I don't have logs to prove that it was this specifically, but after we had some flapping on a specific shard (this has resurfaced for us: #8909), we ended up with no primaries.

@GuptaManan100
Member

I checked the code, and the order of operations for PromoteReplica is:

  1. Fix MySQL by resetting replication and fixing the semi-sync settings
  2. Change the tablet type in the topo server (tablet record)
  3. Update the local state of the tablet manager to PRIMARY
  4. Stop the replication manager
  5. Start a shard sync loop, which will update the topo server (shard record)
  6. Start replication

So any replication manager tick that runs after the first step and before the fourth step will cause this issue. A possible fix is to stop the replication manager much sooner in the PromoteReplica sequence. WDYT @deepthi?

@GuptaManan100 GuptaManan100 self-assigned this Mar 8, 2022
@deepthi
Copy link
Member

deepthi commented Mar 9, 2022

I think the very first thing we should do in PromoteReplica is to stop the replication manager. We may need to add a defer call to restart it if we fail to promote and return an error.
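
A minimal sketch of that ordering, using stand-in types rather than the real TabletManager/replManager code, could look like this:

```go
package main

import (
	"errors"
	"fmt"
)

// tabletManager is a stand-in; this is not the actual Vitess PromoteReplica
// code. The replication manager is stopped before any other promotion work
// and restarted only if the promotion fails.
type tabletManager struct {
	replTicksRunning bool
}

func (tm *tabletManager) stopReplManager()  { tm.replTicksRunning = false }
func (tm *tabletManager) startReplManager() { tm.replTicksRunning = true }

func (tm *tabletManager) promoteReplica() (err error) {
	// Very first step: stop the repair ticks so none can fire mid-promotion.
	tm.stopReplManager()
	defer func() {
		if err != nil {
			// Promotion failed: the tablet is still a replica, so resume the
			// replication manager's repair ticks.
			tm.startReplManager()
		}
	}()

	// ... fix MySQL / semi-sync, change the tablet type in the topo server,
	// update local state, start shard sync, etc. ...
	return errors.New("simulated promotion failure")
}

func main() {
	tm := &tabletManager{replTicksRunning: true}
	if err := tm.promoteReplica(); err != nil {
		fmt.Println("promotion failed; repair ticks running again:", tm.replTicksRunning)
	}
}
```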
