Bug Report: VReplication does not properly handle multiple concurrent workflows in a keyspace #14795

Closed
mattlord opened this issue Dec 15, 2023 · 0 comments · Fixed by #14797
mattlord commented Dec 15, 2023

Overview of the Issue

There are several places where we issue an update against _vt.vreplication without uniquely identifying the workflow, that is, without filtering on either the id field or the workflow name field:

go/vt/vtctl/workflow/resharder.go:              query := fmt.Sprintf("update _vt.vreplication set state='Running' where db_name=%s", encodeString(targetPrimary.DbName()))
go/vt/wrangler/resharder.go:            query := fmt.Sprintf("update _vt.vreplication set state='Running' where db_name=%s", encodeString(targetPrimary.DbName()))

go/vt/vtctl/workflow/stream_migrator.go:                query := fmt.Sprintf("update _vt.vreplication set state='Running', stop_pos=null, message='' where db_name=%s and workflow != %s", encodeString(source.GetPrimary().DbName()), encodeString(sm.ts.ReverseWorkflowName()))

go/vt/wrangler/traffic_switcher.go:             query := fmt.Sprintf("update _vt.vreplication set state='Running', message='' where db_name=%s", encodeString(source.GetPrimary().DbName()))
go/vt/vtctl/workflow/traffic_switcher.go:               query := fmt.Sprintf("update _vt.vreplication set state='Running', message='' where db_name=%s", encodeString(source.GetPrimary().DbName()))

A big one is in the traffic switcher. This leads to errant states when you have multiple active workflows in a keyspace that you are switching traffic back and forth for.
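To illustrate why filtering on db_name alone is not enough, here is a minimal sketch that models _vt.vreplication rows in memory. The stream type and both helper functions are hypothetical and only mimic the effect of the where clauses; they are not Vitess code. The workflow names and the vt_customer database name are taken from the reproduction steps below.

```go
package main

import "fmt"

// stream is a simplified, hypothetical model of one row in _vt.vreplication.
type stream struct {
	Workflow string
	DBName   string
	State    string
}

// startByDBName mimics "update ... set state='Running' where db_name=X".
// Every stream in the database matches, regardless of its workflow.
func startByDBName(rows []stream, dbName string) {
	for i := range rows {
		if rows[i].DBName == dbName {
			rows[i].State = "Running"
		}
	}
}

// startByWorkflow mimics the same update with "and workflow=Y" added,
// so only the streams of the named workflow are started.
func startByWorkflow(rows []stream, dbName, workflow string) {
	for i := range rows {
		if rows[i].DBName == dbName && rows[i].Workflow == workflow {
			rows[i].State = "Running"
		}
	}
}

func main() {
	newRows := func() []stream {
		return []stream{
			{Workflow: "commerce2customer_customer", DBName: "vt_customer", State: "Stopped"},
			{Workflow: "commerce2customer_corder", DBName: "vt_customer", State: "Stopped"},
		}
	}

	unscoped := newRows()
	startByDBName(unscoped, "vt_customer")
	fmt.Println("db_name only:", unscoped)

	scoped := newRows()
	startByWorkflow(scoped, "vt_customer", "commerce2customer_corder")
	fmt.Println("db_name and workflow:", scoped)
}
```

With the db_name-only filter both workflows end up running, while the workflow-scoped filter leaves commerce2customer_customer stopped.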

Reproduction Steps

git checkout main && make build

pushd examples/local

./101_initial_cluster.sh; mysql < ../common/insert_commerce_data.sql; ./201_customer_tablets.sh

vtctldclient MoveTables --workflow commerce2customer_customer --target-keyspace customer create --source-keyspace commerce --tables "customer"

vtctldclient MoveTables --workflow commerce2customer_corder --target-keyspace customer create --source-keyspace commerce --tables "corder"

command mysql -u root --socket=${VTDATAROOT}/vt_0000000201/mysql.sock vt_customer -e "select * from _vt.vreplication\G" --binary-as-hex=false

vtctldclient MoveTables --workflow commerce2customer_customer --target-keyspace customer SwitchTraffic

vtctldclient MoveTables --workflow commerce2customer_corder --target-keyspace customer SwitchTraffic

command mysql -u root --socket=${VTDATAROOT}/vt_0000000201/mysql.sock vt_customer -e "select * from _vt.vreplication\G" --binary-as-hex=false

vtctldclient MoveTables --workflow commerce2customer_corder --target-keyspace customer ReverseTraffic

command mysql -u root --socket=${VTDATAROOT}/vt_0000000201/mysql.sock vt_customer -e "select * from _vt.vreplication\G" --binary-as-hex=false

./401_teardown.sh

popd

The original commerce2customer_customer workflow should still be stopped and frozen, but it is also started when we reverse traffic for the commerce2customer_corder workflow. That is due to the query issued here without the workflow name (the same query appears in both the vtctlclient/wrangler client code and the vtctldclient/server code):

func (ts *trafficSwitcher) startReverseVReplication(ctx context.Context) error {
	return ts.ForAllSources(func(source *workflow.MigrationSource) error {
		query := fmt.Sprintf("update _vt.vreplication set state='Running', message='' where db_name=%s", encodeString(source.GetPrimary().DbName()))
		_, err := ts.VReplicationExec(ctx, source.GetPrimary().Alias, query)
		return err
	})
}

The test case fails because, when switching writes for a ReverseTraffic, the workflow being switched is the reverse one, and the unscoped update unfroze and started both workflows.

Binary Version

vtgate version Version: 19.0.0-SNAPSHOT (Git revision 50241809572f2f4b0fbd03251a502030e07a2c13 branch 'main') built on Fri Dec 15 18:04:48 EST 2023 by matt@pslord.local using go1.21.5 darwin/arm64

Operating System and Environment details

N/A

Log Fragments

No response
