Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimistic: support start task with inconsistent upstream table schema #3903

Merged
merged 56 commits into from
Feb 7, 2022

Conversation

GMHDBJD
Copy link
Contributor

@GMHDBJD GMHDBJD commented Dec 16, 2021

What problem does this PR solve?

Issue Number: close #3629
close #3786
close #3708

What is changed and how it works?

  • worker flush table info when entering sync stage
  • master get table info when receiving ddl info
  • remove info/operation/locks/drop columns when stop a optimistic task

For DM-master:

  • fetch table info for every table when creating a lock rather than use joined as init table
  • when master restart, no need to rebuild locks base on init schema, just handle info by order
  • when the task stops, remove all metadata for optimism, when restarting the task, we can fetch table info from downstream instead.

For DM-worker:

  • get and flush table info when entering sync stage

Check List

Tests

  • Unit test
  • Integration test

Release note

support start task with inconsistent upstream table schemas in optimistic shard mode.

@ti-chi-bot
Copy link
Member

ti-chi-bot commented Dec 16, 2021

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • Ehco1996
  • lance6716

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 16, 2021
@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Dec 16, 2021

/hold
/run-all-tests

@ti-chi-bot ti-chi-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 16, 2021
@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Dec 16, 2021

/run-dm-integration-test
/verify

@GMHDBJD GMHDBJD changed the base branch from optimistDDLDev to master December 16, 2021 05:41
@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Dec 16, 2021

/run-all-tests

@asddongmen asddongmen added the area/dm Issues or PRs related to DM. label Dec 16, 2021
@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Dec 17, 2021

/run-all-tests

@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 18, 2021
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 19, 2021
@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Dec 19, 2021

/run-dm-integration-test

@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Dec 19, 2021

/run-dm-integration-test

1 similar comment
@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Dec 19, 2021

/run-dm-integration-test

@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Dec 19, 2021

/run-dm-integration-test

@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Dec 20, 2021

/run-dm-integration-test

@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Dec 20, 2021

/run-all-tests

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jan 26, 2022
@lance6716
Copy link
Contributor

/hold

waiting @lichunzhu review

@ti-chi-bot ti-chi-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 26, 2022
Copy link
Contributor

@lichunzhu lichunzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will review tests later. Have two questions now:

  1. Should we delete obsolete init-schema for dm-clusters upgraded from lower versions?
  2. Should we flush table info for newly added tables?

// TODO: handle drop table
continue
}
if !o.tk.SourceTableExist(info.Task, info.Source, info.UpSchema, info.UpTable, info.DownSchema, info.DownTable) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this too arbitrary? If we add some new tables while the scheduler didn't get at optimism.GetAllSourceTables, will it cause a problem here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is same as origin logic

// filter info which doesn't have SourceTable
// SourceTable will be changed after user update block-allow-list
// But old infos still remain in etcd.
// TODO: add a mechanism to remove all outdated infos in etcd.
if !lock.TableExist(info.Source, info.UpSchema, info.UpTable) {
delete(ifm[task][source][schema], table)
continue
}

For create table while leader is transferred, there may be some problem. I will create a pr for create table with more tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sql2, arg := cp.genUpdateSQL(sourceSchema, sourceTable, location, nil, tiBytes, false)
sqls = append(sqls, sql2)
args = append(args, arg)
batch := 100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make 100 as a constant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Jan 26, 2022

Should we delete obsolete init-schema for dm-clusters upgraded from lower versions?

I leave a TODO in

// TODO: prune in etcd when upgrade
I think we should prune them after upgrading to a major version(e.g. 6.0)

Should we flush table info for newly added tables

Yes. We still have some problem with newly create/drop table(e.g. #3823), I will create a new pr to solve them.

@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Jan 26, 2022

/run-all-tests

1 similar comment
@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Jan 26, 2022

/run-all-tests

@GMHDBJD
Copy link
Contributor Author

GMHDBJD commented Feb 7, 2022

/run-all-tests

Copy link
Contributor

@lichunzhu lichunzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

Comment on lines 231 to 243
function random_restart() {
mod=$(($RANDOM % 4))
if [[ "$mod" == "0" ]]; then
restart_master
elif [[ "$mod" == "1" ]]; then
restart_worker1
elif [[ "$mod" == "2" ]]; then
restart_worker2
else
restart_task $cur/conf/double-source-optimistic.yaml
fi
}

Copy link
Contributor

@lichunzhu lichunzhu Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think it's better to test all four cases for integration tests. It will be easier for us to trace which PR introduces the bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to test different combinations of them. e.g.(restart master and then restart worker), but too many combinations, so use randome. I add some log for trace in 41cf696

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see

Comment on lines 244 to 350
run_sql_source1 "alter table ${shardddl1}.${tb2} add column b varchar(10);"
run_sql_source2 "alter table ${shardddl1}.${tb1} add column b varchar(10);"
run_sql_source2 "alter table ${shardddl1}.${tb2} add column b varchar(10);"

run_sql_source1 "insert into ${shardddl1}.${tb1} values(5,'aaa');"
run_sql_source1 "insert into ${shardddl1}.${tb2} values(6,'bbb');"
run_sql_source2 "insert into ${shardddl1}.${tb1} values(7,'ccc');"
run_sql_source2 "insert into ${shardddl1}.${tb2} values(8,'ddd');"

run_sql_source1 "alter table ${shardddl1}.${tb1} add column c text;"
run_sql_source1 "insert into ${shardddl1}.${tb1} values(9,'eee','eee');"
run_sql_source1 "alter table ${shardddl1}.${tb2} drop column b;"
run_sql_source1 "insert into ${shardddl1}.${tb2} values(10);"
run_sql_source2 "alter table ${shardddl1}.${tb1} add column c text;"
run_sql_source2 "insert into ${shardddl1}.${tb1} values(11,'fff','fff');"
run_sql_source2 "alter table ${shardddl1}.${tb2} drop column b;"
run_sql_source2 "insert into ${shardddl1}.${tb2} values(12);"

run_sql_tidb_with_retry "select count(1) from ${shardddl}.${tb}" "count(1): 12"

run_dm_ctl $WORK_DIR "127.0.0.1:$MASTER_PORT" \
"stop-task test -s mysql-replica-02" \
"\"result\": true" 2

run_sql_source1 "alter table ${shardddl1}.${tb1} drop column b;"
run_sql_source1 "insert into ${shardddl1}.${tb1} values(13,'ggg');"
run_sql_source1 "alter table ${shardddl1}.${tb2} add column c text;"
run_sql_source1 "insert into ${shardddl1}.${tb2} values(14,'hhh');"

run_sql_tidb_with_retry "select count(1) from ${shardddl}.${tb}" "count(1): 14"
run_sql_tidb_with_retry "select count(1) from INFORMATION_SCHEMA.COLUMNS where TABLE_SCHEMA='${shardddl}' AND TABLE_NAME='${tb}';" \
"count(1): 2"

run_dm_ctl $WORK_DIR "127.0.0.1:$MASTER_PORT" \
"start-task $cur/conf/double-source-optimistic.yaml -s mysql-replica-02" \
"\"result\": true" 2

run_sql_source1 "insert into ${shardddl1}.${tb1} values(15,'iii');"
run_sql_source1 "insert into ${shardddl1}.${tb1} values(16,'jjj');"
run_sql_source2 "alter table ${shardddl1}.${tb1} drop column b;"
run_sql_source2 "insert into ${shardddl1}.${tb1} values(17,'kkk');"
run_sql_source2 "alter table ${shardddl1}.${tb2} add column c text;"
run_sql_source2 "insert into ${shardddl1}.${tb2} values(18,'lll');"

check_sync_diff $WORK_DIR $cur/conf/diff_config.toml
}

function DM_STOP_TASK_FOR_A_SOURCE_CASE() {
run_sql_source1 "insert into ${shardddl1}.${tb1} values(1);"
run_sql_source1 "insert into ${shardddl1}.${tb2} values(2);"
run_sql_source2 "insert into ${shardddl1}.${tb1} values(3);"
run_sql_source2 "insert into ${shardddl1}.${tb2} values(4);"

run_sql_source1 "alter table ${shardddl1}.${tb1} add column b varchar(10);"
run_sql_source1 "alter table ${shardddl1}.${tb2} add column b varchar(10);"
run_sql_source2 "alter table ${shardddl1}.${tb1} add column b varchar(10);"
run_sql_source2 "alter table ${shardddl1}.${tb2} add column b varchar(10);"

run_sql_source1 "insert into ${shardddl1}.${tb1} values(5,'aaa');"
run_sql_source1 "insert into ${shardddl1}.${tb2} values(6,'bbb');"
run_sql_source2 "insert into ${shardddl1}.${tb1} values(7,'ccc');"
run_sql_source2 "insert into ${shardddl1}.${tb2} values(8,'ddd');"

run_sql_source1 "alter table ${shardddl1}.${tb1} add column c text;"
run_sql_source1 "insert into ${shardddl1}.${tb1} values(9,'eee','eee');"
run_sql_source1 "alter table ${shardddl1}.${tb2} drop column b;"
run_sql_source1 "insert into ${shardddl1}.${tb2} values(10);"
run_sql_source2 "alter table ${shardddl1}.${tb1} add column c text;"
run_sql_source2 "insert into ${shardddl1}.${tb1} values(11,'fff','fff');"
run_sql_source2 "alter table ${shardddl1}.${tb2} drop column b;"
run_sql_source2 "insert into ${shardddl1}.${tb2} values(12);"

run_sql_tidb_with_retry "select count(1) from ${shardddl}.${tb}" "count(1): 12"

run_dm_ctl $WORK_DIR "127.0.0.1:$MASTER_PORT" \
"stop-task test -s mysql-replica-02" \
"\"result\": true" 2

run_sql_source1 "alter table ${shardddl1}.${tb1} drop column b;"
run_sql_source1 "insert into ${shardddl1}.${tb1} values(13,'ggg');"
run_sql_source1 "alter table ${shardddl1}.${tb2} add column c text;"
run_sql_source1 "insert into ${shardddl1}.${tb2} values(14,'hhh');"

run_sql_tidb_with_retry "select count(1) from ${shardddl}.${tb}" "count(1): 14"
run_sql_tidb_with_retry "select count(1) from INFORMATION_SCHEMA.COLUMNS where TABLE_SCHEMA='${shardddl}' AND TABLE_NAME='${tb}';" \
"count(1): 2"

run_dm_ctl $WORK_DIR "127.0.0.1:$MASTER_PORT" \
"start-task $cur/conf/double-source-optimistic.yaml -s mysql-replica-02" \
"\"result\": true" 2

run_sql_source1 "insert into ${shardddl1}.${tb1} values(15,'iii');"
run_sql_source1 "insert into ${shardddl1}.${tb1} values(16,'jjj');"
run_sql_source2 "alter table ${shardddl1}.${tb1} drop column b;"
run_sql_source2 "insert into ${shardddl1}.${tb1} values(17,'kkk');"
run_sql_source2 "alter table ${shardddl1}.${tb2} add column c text;"
run_sql_source2 "insert into ${shardddl1}.${tb2} values(18,'lll');"

check_sync_diff $WORK_DIR $cur/conf/diff_config.toml
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two functions look exactly the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Copy link
Contributor

@lichunzhu lichunzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lichunzhu
Copy link
Contributor

/unhold

@ti-chi-bot ti-chi-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 7, 2022
@lichunzhu
Copy link
Contributor

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 41cf696

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Feb 7, 2022
@ti-chi-bot ti-chi-bot merged commit 5c7d238 into pingcap:master Feb 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/dm Issues or PRs related to DM. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. status/ptal Could you please take a look?
Projects
None yet
8 participants