Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use LOCK TABLES during SwitchWrites to prevent lost writes #9481

Merged
merged 6 commits into from
Jan 14, 2022

Conversation

mattlord
Copy link
Contributor

@mattlord mattlord commented Jan 6, 2022

Description

ℹ️ This is not an issue for Reshard workflows because open transactions are closed (either by waiting for a grace period or finally cancelled) as part of the primary tablet shutting down since the source shards stop serving in SwitchWrites.

Issuing a LOCK TABLES statement on all of the tables that we are doing a SwitchWrites operation for as part of a MoveTables workflow ensures that no lost writes occur due to long running transactions (including long running single statement auto-commit transactions) on the source tablets committing after writes have been switched. I say this because:

  1. During SwitchWrites we first we update the DenyList on the source tablets to prevent any new writes against the tables at the tablet level in the query planner/executor.
  2. We then wait on a LOCK TABLES call to return on the source mysqld's which will in turn wait for any open/running RW statements/transactions on those tables to complete or time out.
  3. Af this point, after steps 1 and 2, we know that any existing writes against the source tables have completed and no new ones can come in.
  4. We then collect the vreplication metadata snapshot, put the reverse vreplication streams in place, and update the routing rules.

Without doing this (before this PR is merged) you can not only break reverse vreplication -- which would have to be fixed manually -- but if you don't notice the errant writes and then manually reconcile the tables across keyspaces (old and new) then those writes are lost.

With this patch, when you go through the manual test here and get to the point where you attempt another insert or commit in the previously opened transaction started against the source before we switched writes, you get an error like this:

ERROR 1317 (70100): vttablet: rpc error: code = Aborted desc = transaction 1641661850058067114: ended at 2022-01-08 17:12:17.090 UTC (exceeded timeout: 30s) (CallerID: userData1)

Related Issue(s)

Fixes: #9400

Checklist

  • Should this PR be backported? At least to 12.0, I think...
  • Tests were added
  • Documentation: I don't think we need any

go/vt/wrangler/traffic_switcher.go Outdated Show resolved Hide resolved
go/vt/wrangler/traffic_switcher.go Outdated Show resolved Hide resolved
@mattlord mattlord force-pushed the switch_writes_lock_table branch 11 times, most recently from c1738e3 to 6e08248 Compare January 8, 2022 02:06
mattlord and others added 2 commits January 7, 2022 21:10
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord marked this pull request as ready for review January 8, 2022 02:37
@mattlord mattlord requested a review from deepthi as a code owner January 8, 2022 02:37
@mattlord mattlord removed the request for review from deepthi January 8, 2022 02:37
And update tests

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the switch_writes_lock_table branch 3 times, most recently from 2970d27 to b68b2a0 Compare January 9, 2022 19:33
@DeathBorn
Copy link
Contributor

Before this implementation we found this manual workaround
 LOCK TABLES <all_tables_that_move_to_other_shard> READ;

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the switch_writes_lock_table branch 2 times, most recently from 70b417f to dd39d1c Compare January 10, 2022 19:29
Signed-off-by: Matt Lord <mattalord@gmail.com>
Copy link
Member

@harshit-gangal harshit-gangal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good overall.
some nits.

go/test/endtoend/vreplication/cluster.go Show resolved Hide resolved
go/vt/vttablet/tabletmanager/rpc_lock_tables.go Outdated Show resolved Hide resolved
go/vt/wrangler/traffic_switcher.go Outdated Show resolved Hide resolved
go/vt/wrangler/traffic_switcher.go Outdated Show resolved Hide resolved
go/vt/wrangler/traffic_switcher.go Outdated Show resolved Hide resolved
go/vt/wrangler/traffic_switcher.go Outdated Show resolved Hide resolved
Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ltgm

@mattlord
Copy link
Contributor Author

Thank you for the review and comments, @rohit-nayak-ps and @harshit-gangal ! I just pushed another commit that addresses them. ❤️

Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Signed-off-by: Matt Lord <mattalord@gmail.com>
@harshit-gangal harshit-gangal merged commit 918b2cf into vitessio:main Jan 14, 2022
@harshit-gangal harshit-gangal deleted the switch_writes_lock_table branch January 14, 2022 08:04
aliulis pushed a commit to vinted/vitess that referenced this pull request Feb 28, 2022
…table

Use LOCK TABLES during SwitchWrites to prevent lost writes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Broken reverse vreplication after SwitchWrites in a middle of long transaction
5 participants