Switchwrites: error if no tablets available on target for reverse replication #8142

Merged on Jun 9, 2021 (2 commits)

rohit-nayak-ps
Description

This PR addresses an edge case while switching writes.

  • A VReplication workflow is set up with -tablet_types set to RDONLY.
  • The source keyspace has RDONLY tablets, and the forward replication completes.
  • Reads are switched to the target keyspace.
  • SwitchWrites is called next to switch all traffic to the target, with the -reverse_replication true flag.
  • However, there are no RDONLY tablets on the target, so the reverse replication workflow never starts.
  • The user only finds out after the writes are switched. In the meantime, because of problems on the target,
    the user tries to roll back by reversing the writes.
  • This fails with a timeout while waiting for the source to catch up with the target.

This PR adds a check, performed when SwitchWrites is invoked, that tablets are available to source data from the target keyspace if reverse replication is enabled. Note that this only checks that tablets are available at the time SwitchWrites is called. If, in this use case, the RDONLY tablets disappear later, the workflow will stop replicating, resulting in the same problem as outlined above.
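
To illustrate the idea of the check, here is a minimal, self-contained Go sketch. It is not the code from this PR: the `Tablet` struct, `parseTabletTypes`, and `checkReverseReplicationSources` are hypothetical names invented for illustration, and the per-shard granularity and the `Healthy` field are assumptions. It only shows the general shape of "fail early if no tablet of an allowed type exists on the target".

```go
package main

import (
	"fmt"
	"strings"
)

// TabletType is a simplified stand-in for a tablet type (hypothetical,
// not the Vitess topodata enum).
type TabletType string

const (
	Replica TabletType = "REPLICA"
	Rdonly  TabletType = "RDONLY"
)

// Tablet is a hypothetical, minimal description of a tablet.
type Tablet struct {
	Keyspace string
	Shard    string
	Type     TabletType
	Healthy  bool
}

// parseTabletTypes splits a comma-separated value such as "rdonly,replica"
// into upper-cased tablet types, skipping empty entries.
func parseTabletTypes(s string) []TabletType {
	var types []TabletType
	for _, part := range strings.Split(s, ",") {
		if p := strings.TrimSpace(part); p != "" {
			types = append(types, TabletType(strings.ToUpper(p)))
		}
	}
	return types
}

// checkReverseReplicationSources returns an error if any target shard has no
// healthy tablet of an allowed type, i.e. nothing the reverse workflow could
// stream from. Checking per shard is an assumption of this sketch.
func checkReverseReplicationSources(tablets []Tablet, targetKeyspace string, targetShards []string, allowed []TabletType) error {
	allowedSet := make(map[TabletType]bool, len(allowed))
	for _, t := range allowed {
		allowedSet[t] = true
	}
	// Record which shards have at least one usable tablet.
	available := make(map[string]bool)
	for _, t := range tablets {
		if t.Keyspace == targetKeyspace && t.Healthy && allowedSet[t.Type] {
			available[t.Shard] = true
		}
	}
	var missing []string
	for _, shard := range targetShards {
		if !available[shard] {
			missing = append(missing, shard)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("no tablets of type %v available in keyspace %s, shard(s) %s: reverse replication cannot start",
			allowed, targetKeyspace, strings.Join(missing, ","))
	}
	return nil
}
```

Under these assumptions, a caller would run `checkReverseReplicationSources` before switching writes and abort on a non-nil error. As noted above, a point-in-time check of this kind cannot guard against tablets disappearing afterwards.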

Checklist

  • Tests were added or are not required
  • Documentation was added or is not required

@sougou sougou left a comment

This is looking good. I feel like we should do this check while creating any stream, for all workflows.

…e are no tablets available on the target to source data for the reverse flow

Signed-off-by: Rohit Nayak <rohit@planetscale.com>

@rohit-nayak-ps
This is looking good. I feel like we should do this check while creating any stream, for all workflows.

The case fixed here is the one that recently caused a user issue: they were unable to reverse after detecting a problem following cutover, which caused an unexpected outage.

I agree we should validate for available tablets while creating streams as well. For those use cases though, the workflow would just not start and the error would be visible in Workflow Show or by looking at the vreplication table.

Adding other validations will take some time, since tests need to be written for them, so I would like to move ahead with this for now and implement those later.

@rohit-nayak-ps rohit-nayak-ps requested a review from a team June 8, 2021 15:07
@rohit-nayak-ps rohit-nayak-ps marked this pull request as ready for review June 8, 2021 15:08
@rohit-nayak-ps rohit-nayak-ps merged commit 6c418fe into vitessio:main Jun 9, 2021
@rohit-nayak-ps rohit-nayak-ps deleted the rn-vr-improvements branch June 9, 2021 19:34