after filtered some DDL event and manually fix downstream, tracker can't track table structure #5272

lance6716 · 2022-04-26T03:38:19Z

What did you do?

task

...
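block-allow-list:        # block-allow-list filter rules for tables matched in the upstream database instance; if the DM version is <= v2.0.0-beta.2, use black-white-list instead
  bw-rule-1:             # name of the block/allow list rule
    do-dbs: ["test"] # which databases to migrate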
...
filters:
  filter-rule-1:
    schema-pattern: "test"
    table-pattern: "test1"
    events: ["all ddl"]
    action: Ignore

1. Create test.test1 in upstream.
2. Run `start-task --remove-meta`.
3. Run `alter table test1 add column c4 int;` in upstream.
4. Run `create table test2 (c int primary key);` in upstream, or wait 30s, so that checkpoints are flushed.
5. Insert into test1. The task now reports an error because the downstream table doesn't have column c4.
6. Run `alter table test1 add column c4 int;` in downstream.
7. Run `resume-task`.

What did you expect to see?

task goes on

What did you see instead?

gen insert sqls failed, sourceTable: test.test1, targetTable: test.test1: Column count doesn't match value count: 3 (columns) vs 4 (values)
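
To make the message easier to read: it says the table structure used to build the INSERT has 3 columns, while the row from the binlog carries 4 values. Below is a minimal sketch of that kind of check. `checkColumnCount` and the column names are hypothetical and only illustrate the mismatch; this is not DM's actual code.

```go
package main

import "fmt"

// checkColumnCount is a hypothetical helper, not DM's real implementation.
// It mimics the validation that fails in this report: the known table
// structure has fewer columns than the row event has values.
func checkColumnCount(columns []string, values []interface{}) error {
	if len(columns) != len(values) {
		return fmt.Errorf("Column count doesn't match value count: %d (columns) vs %d (values)",
			len(columns), len(values))
	}
	return nil
}

func main() {
	// Table structure without c4 (column names are invented for illustration).
	tracked := []string{"c1", "c2", "c3"}
	// A row written upstream after `alter table test1 add column c4 int`.
	row := []interface{}{1, 2, 3, 4}

	if err := checkColumnCount(tracked, row); err != nil {
		fmt.Println(err)
	}
}
```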

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

at least v5.4.0

current status of DM cluster (execute `query-status <task-name>` in dmctl)

(paste current status of DM cluster here)


lance6716 · 2022-04-26T03:39:21Z

Root cause (I only read the source code and did not check the actual behaviour):

The first time the error happens, it is the downstream error "Error 1054: Unknown column...". In this round genSQL succeeds and the DML job is added to the queue, and the TableInfo in the in-memory table checkpoint is filled with the downstream table structure. After the error happens, checkpoint.Rollback rolls the in-memory checkpoint back to the flushed checkpoint, which has a nil TableInfo, and the schema tracker resets the table structure.

When the task is resumed (or auto-resumed), neither the table checkpoint nor the schema tracker contains the TableInfo, so the downstream table structure is used. This time, the first step is that the schema tracker loads the downstream table structure, and soon afterwards genSQL fails with the error "Column count doesn't match value count". Note that this time no TableInfo is saved to the in-memory table checkpoint, but the table checkpoint still exists, because it was created when the first error happened and was not dropped by a DROP TABLE. Then, in checkpoint.Rollback, because the in-memory table checkpoint has a nil TableInfo, the schema tracker is not reset, and in the logic that follows the schema tracker does not drop the table either, since the in-memory table checkpoint exists.

To me, this is caused by the TableInfo in the schema tracker being inconsistent with the in-memory table checkpoint. We can fix it when we refine the code.
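
The second half of the analysis can be sketched as a toy model. Everything here (`tableCheckpoint`, `tracker`, `rollback`, the column names) is hypothetical and only encodes the three rollback cases as described in this comment; it is not the real syncer code, and like the analysis itself it has not been checked against actual behaviour.

```go
package main

import "fmt"

// tableCheckpoint is a toy stand-in for an in-memory table checkpoint.
// columns == nil models "TableInfo is nil".
type tableCheckpoint struct {
	columns []string
}

// tracker is a toy stand-in for the schema tracker: table name -> columns.
type tracker map[string][]string

// rollback models the behaviour described above:
//   - no checkpoint for the table: drop the table from the tracker
//   - checkpoint with a TableInfo: reset the tracker to that structure
//   - checkpoint with a nil TableInfo: leave the tracker untouched
func rollback(tr tracker, cps map[string]*tableCheckpoint) {
	for table := range tr {
		cp, ok := cps[table]
		switch {
		case !ok:
			delete(tr, table)
		case cp.columns != nil:
			tr[table] = cp.columns
		}
	}
}

func main() {
	// State at the second failure: the tracker holds a 3-column structure,
	// and the table checkpoint created by the first error still exists
	// but carries no TableInfo.
	tr := tracker{"test.test1": {"c1", "c2", "c3"}}
	cps := map[string]*tableCheckpoint{"test.test1": {columns: nil}}

	rollback(tr, cps)

	// The tracker was neither reset nor dropped, so the old structure stays.
	fmt.Println(len(tr["test.test1"]), "columns still tracked")
}
```

In this model the third case is the problematic one: the tracker keeps whatever it loaded, so a later manual fix in downstream is never picked up.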

D3Hunter · 2022-04-26T09:14:52Z

Cannot reproduce on current master. There is also another bug: after the auto-resume that follows the first error, the DML is skipped too, and the global point is larger than the table point:

+------------+-----------+----------+------------------+------------+
| id         | cp_schema | cp_table | binlog_name      | binlog_pos |
+------------+-----------+----------+------------------+------------+
| mysql-3306 |           |          | mysql-bin.000001 |        766 |
| mysql-3306 | test      | test2    | mysql-bin.000001 |        505 |
| mysql-3306 | test      | test1    | mysql-bin.000001 |        735 |
+------------+-----------+----------+------------------+------------+

dm-worker1.log
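
For reference, a small sketch of how the rows above can be compared. The types and helpers (`binlogPos`, `compare`, `laggingTables`) are hypothetical, and comparing binlog file names as plain strings is a simplification that only holds while the numeric suffixes have the same width.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// binlogPos is a hypothetical (file name, offset) pair taken from the
// checkpoint table columns binlog_name and binlog_pos.
type binlogPos struct {
	name string
	pos  uint32
}

// compare returns -1, 0 or 1. File names are compared as strings, which is
// an assumption that works for zero-padded names such as mysql-bin.000001.
func compare(a, b binlogPos) int {
	if c := strings.Compare(a.name, b.name); c != 0 {
		return c
	}
	switch {
	case a.pos < b.pos:
		return -1
	case a.pos > b.pos:
		return 1
	}
	return 0
}

// laggingTables lists the tables whose checkpoint is behind the global one.
func laggingTables(global binlogPos, tables map[string]binlogPos) []string {
	var out []string
	for name, p := range tables {
		if compare(p, global) < 0 {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	global := binlogPos{"mysql-bin.000001", 766}
	tables := map[string]binlogPos{
		"test.test2": {"mysql-bin.000001", 505},
		"test.test1": {"mysql-bin.000001", 735},
	}
	fmt.Println(laggingTables(global, tables))
}
```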

lance6716 · 2022-04-26T09:18:05Z

> cannot reproduce in current master, and there is another bug: after auto-resume on first error, the dml is skipped too, and global point is larger than table point:
>
> +------------+-----------+----------+------------------+------------+
> | id         | cp_schema | cp_table | binlog_name      | binlog_pos |
> +------------+-----------+----------+------------------+------------+
> | mysql-3306 |           |          | mysql-bin.000001 |        766 |
> | mysql-3306 | test      | test2    | mysql-bin.000001 |        505 |
> | mysql-3306 | test      | test1    | mysql-bin.000001 |        735 |
> +------------+-----------+----------+------------------+------------+

If you check out the test part of my PR, it is expected to fail.

Also, please upload the log for the case above. If the table is skipped, its table checkpoint may not be updated, but the DML should not be lost.

D3Hunter · 2022-04-26T10:20:06Z

~~5.3 doesn't have this issue and can auto-recover~~ ---> the recovery was due to a keepalive failure, after which another syncer was recreated

niubell · 2022-04-27T00:53:27Z

/assign gmhdbjd

niubell · 2022-04-27T00:53:45Z

/unassign lance6716

lance6716 · 2022-05-12T03:31:20Z

fixed by #5273

lance6716 added type/bug The issue is confirmed as a bug. area/dm Issues or PRs related to DM. labels Apr 26, 2022

lance6716 added severity/moderate affects-5.4 may-affects-5.3 may-affects-6.0 labels Apr 26, 2022

lance6716 added affects-6.0 and removed may-affects-6.0 labels Apr 26, 2022

lance6716 mentioned this issue Apr 26, 2022

syncer(dm): save table checkpoint after a DDL is filtered #5273

Merged

lance6716 self-assigned this Apr 26, 2022

lance6716 added affects-5.3 and removed may-affects-5.3 labels Apr 26, 2022

D3Hunter mentioned this issue Apr 26, 2022

failed rows are skipped due to checkpoint flush #5279

Closed

D3Hunter removed the affects-5.3 label Apr 26, 2022

D3Hunter added affects-5.3 and removed affects-6.0 labels Apr 26, 2022

lance6716 added the affects-6.0 label Apr 26, 2022

ti-chi-bot assigned GMHDBJD Apr 27, 2022

ti-chi-bot unassigned lance6716 Apr 27, 2022

ti-chi-bot pushed a commit that referenced this issue Apr 27, 2022

syncer(dm): save table checkpoint after a DDL is filtered (#5273)

7744c05

ref #5272

This was referenced Apr 27, 2022

syncer(dm): save table checkpoint after a DDL is filtered (#5273) #5290

Merged

syncer(dm): save table checkpoint after a DDL is filtered (#5273) #5291

Closed

syncer(dm): save table checkpoint after a DDL is filtered (#5273) #5292

Merged

lance6716 assigned lance6716 and unassigned GMHDBJD Apr 27, 2022

ti-chi-bot added a commit that referenced this issue Apr 27, 2022

syncer(dm): save table checkpoint after a DDL is filtered (#5273) (#5290)

f393119

ref #5272

This was referenced May 7, 2022

releases: add tidb 5.4.1 release notes pingcap/docs#8436

Merged

release notes: add v5.4.1 release notes pingcap/docs-cn#9264

Merged

D3Hunter mentioned this issue May 9, 2022

tracker(dm): close and recreate tracker when pause and resume #5350

Merged

lance6716 closed this as completed May 12, 2022

ti-chi-bot added a commit that referenced this issue May 25, 2022

syncer(dm): save table checkpoint after a DDL is filtered (#5273) (#5292)

f934ba6

ref #5272

This was referenced Jun 23, 2022

releases: add tidb 5.3.2 release notes pingcap/docs#9029

Merged

releases: add v5.3.2 release notes pingcap/docs-cn#9914

Merged
