Bug Report: upgrade to v16 from v15 blocks on semi-sync #13426

Closed
deepthi opened this issue Jun 30, 2023 · 2 comments · Fixed by #13440

deepthi commented Jun 30, 2023

Overview of the Issue

We recently upgraded a large database to v16. During the rollout, errors were served to the app for ~30 seconds.
The root cause seems to be that the upgrade of the _vt schema during PlannedReparent was blocked by semi-sync.

Reproduction Steps

Upgrade a 3+ tablet cluster with semi-sync enabled from v15 to v16.

Binary Version

16.0.0+

Operating System and Environment details

Any

Log Fragments

I0626 20:33:46.338969       1 replication.go:586] Setting semi-sync mode: primary=true, replica=true
I0626 20:33:46.339255       1 query.go:81] exec SET GLOBAL rpl_semi_sync_master_enabled = 1, GLOBAL rpl_semi_sync_slave_enabled = 1
I0626 20:33:46.339689       1 tm_state.go:186] Changing Tablet Type: PRIMARY for cell:"redacted" uid:redacted
I0626 20:33:46.357886       1 syslogger.go:129] <redacted> [tablet] updated
I0626 20:33:46.371122       1 sidecardb.go:408] Applying DDL for table views:
CREATE TABLE IF NOT EXISTS `_vt`.`views` (
	`TABLE_SCHEMA` varchar(64) NOT NULL,
	`TABLE_NAME` varchar(64) NOT NULL,
	`CREATE_STATEMENT` longtext NOT NULL,
	`UPDATED_AT` timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(),
	PRIMARY KEY (`TABLE_SCHEMA`, `TABLE_NAME`)
) ENGINE InnoDB
I0626 20:33:48.215033       1 state_manager.go:682] Going unhealthy due to replication error: no replication status (errno 100) (sqlstate HY000)
I0626 20:34:16.356183       1 sidecardb.go:357] createSidecarDB: _vt
deepthi added the Type: Bug, Needs Triage, and Component: Cluster management labels Jun 30, 2023

deepthi commented Jun 30, 2023

The fix for this will need to be backported to v16 and v17, and we'll need to do patch releases.

deepthi commented Jun 30, 2023

I0626 20:33:46.371122       1 sidecardb.go:408] Applying DDL for table views:
I0626 20:34:16.356183       1 sidecardb.go:357] createSidecarDB: _vt

There is a 30-second gap between these two lines. What seems to have happened: because we enable semi-sync before transitioning the tablet to primary, the creation of the _vt schema is blocked by semi-sync. We point replicas at the new primary only after the transition to primary, so no tablet is available to ACK the write. In the meantime, vtorc detects that the replicas are pointing at the wrong (old) primary, but cannot do anything because PRS holds the shard lock. After 30 seconds the lock times out, vtorc fixes replication, and the DDL can proceed.
