Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VReplication: Guard against unsafe _vt.vreplication writes #14797

Merged
merged 21 commits into from
Dec 27, 2023

Conversation

mattlord
Copy link
Contributor

@mattlord mattlord commented Dec 17, 2023

Description

There is no limit on the number of concurrent active workflows you can have in a keyspace. This means that any time we're updating the _vt.vreplication table we should uniquely identify a workflow using the id (unique stream for a unique workflow) or workflow name. If we do not, then in various situations we can perform errant steps that can lead to workflow failure or even data corruption.

This PR fixes the few places where we were NOT already doing this (actually, this part was extracted to #14826 for backports) — with the exception of the Resharder/StreamMigrator as this is an exception where we DO want to affect multiple workflows (as we move them from the active to inactive shards for a Reshard).

This PR also adds a guardrail for UPDATE and DELETE statements to help prevent Vitess users and developers from accidentally introducing related bugs in the future. For example:

❯ vtctlclient VReplicationExec zone1-101 "delete from _vt.vreplicati" 2>/dev/null

WARNING: VReplicationExec is deprecated and will be removed in a future release. Please use 'Workflow -- <keyspace.workflow> <action>' instead.

VReplicationExec Error: rpc error: code = Unknown desc = TabletManager.VReplicationExec on zone1-0000000101: invalid table name: vreplicati

❯ vtctlclient VReplicationExec zone1-101 "delete from _vt.vreplication where state != 'Running'" 2>/dev/null

WARNING: VReplicationExec is deprecated and will be removed in a future release. Please use 'Workflow -- <keyspace.workflow> <action>' instead.

VReplicationExec Error: rpc error: code = Unknown desc = TabletManager.VReplicationExec on zone1-0000000100: unsafe WHERE clause in delete without the /*vt+ ALLOW_UNSAFE_VREPLICATION_WRITE */ comment directive:  where state != 'Running'; should be using = or in with at least one of the following columns: id, workflow

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Signed-off-by: Matt Lord <mattalord@gmail.com>
Copy link
Contributor

vitess-bot bot commented Dec 17, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Dec 17, 2023
@mattlord mattlord changed the title VReplication: Remove _vt.vreplication updates w/o id or workflow name VReplication: Remove vreplication updates w/o id or workflow name Dec 17, 2023
@mattlord mattlord changed the title VReplication: Remove vreplication updates w/o id or workflow name VReplication: Remove _vt.vreplication updates w/o id or workflow name Dec 17, 2023
@github-actions github-actions bot added this to the v19.0.0 milestone Dec 17, 2023
@mattlord mattlord added Type: Bug Component: VReplication Backport to: release-16.0 and removed NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request labels Dec 17, 2023
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord removed NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work labels Dec 18, 2023
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the vrepl_uniq_update branch 2 times, most recently from 7b4d520 to 914f35b Compare December 18, 2023 21:24
Signed-off-by: Matt Lord <mattalord@gmail.com>
Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assumptions related to stream migrations are incorrect here, as commented around those changes.

The fix for reverse streams in traffic_switcher is good.

The change for resharder makes the code clearer , but it is a noop, since there are no other streams on the target at that time.

// everything in the test is shutting down and cleaning up. So we retry a few
// times to deal with that non-problematic and ephemeral issue.
var err error
retries := 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is an issue in CI is just a 3 second retry enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only been seen in the CI AFAIK, and so far 3 retries with a 1 second delay seems to have been enough (we've done essentially the same thing in several places now).

go/vt/vtctl/workflow/stream_migrator.go Outdated Show resolved Hide resolved
go/vt/vtctl/workflow/stream_migrator.go Outdated Show resolved Hide resolved
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord changed the title VReplication: Remove _vt.vreplication writes w/o id or workflow name VReplication: Guard against _vt.vreplication writes w/o id or workflow name Dec 20, 2023
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord changed the title VReplication: Guard against _vt.vreplication writes w/o id or workflow name VReplication: Guard against unsafe _vt.vreplication writes Dec 20, 2023
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments, otherwise looks great.

// statements if you want to bypass the safety checks that ensure you are
// being selective. The full comment directive looks like this:
// delete /*vt+ ALLOW_UNSAFE_VREPLICATION_WRITE */ from _vt.vreplication
const AllowUnsafeWriteCommentDirective = "ALLOW_UNSAFE_VREPLICATION_WRITE"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

go/vt/wrangler/vexec.go Outdated Show resolved Hide resolved
@@ -25,7 +25,7 @@ import (

type iswitcher interface {
lockKeyspace(ctx context.Context, keyspace, action string) (context.Context, func(*error), error)
cancelMigration(ctx context.Context, sm *StreamMigrator)
cancelStreamMigrations(ctx context.Context, sm *StreamMigrator)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cancelMigration is a better name: this is not just cancelling the stream migrations, but also reverting writes, cleaning up reverse replication etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Initially I was just going to rename the one that has a StreamMigrator receiver and got carried away... 😄 I'll change the ones using the switcher interface back as those are more generic.

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
This shows the user how to get around the guardrail if
they really know what they're doing.

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord merged commit ab37170 into vitessio:main Dec 27, 2023
104 checks passed
@mattlord mattlord deleted the vrepl_uniq_update branch December 27, 2023 01:11
systay pushed a commit that referenced this pull request Jul 22, 2024
…on writes (#4038)

* cherry pick of 14797

* Address merge conflict

Signed-off-by: Matt Lord <mattalord@gmail.com>

---------

Signed-off-by: Matt Lord <mattalord@gmail.com>
Co-authored-by: Matt Lord <mattalord@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Bug Report: VReplication does not properly handle multiple concurrent workflows in a keyspace
4 participants