CDC cloud: data inconsistency after CDC owner switching #2230

Closed
Tammyxia opened this issue Jul 6, 2021 · 8 comments · Fixed by #2417
Assignees
ben1009

Labels
area/ticdc: Issues or PRs related to TiCDC.
severity/critical
subject/correctness: Denotes an issue or pull request is related to correctness.
type/bug: The issue is confirmed as a bug.

Comments

Tammyxia commented Jul 6, 2021

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
  • Pause one changefeed.

  • Load data to upstream tidb: $ bin/go-ycsb load mysql -P workloads/betting -p recordcount=9900000 -p mysql.host=xxx -p mysql.port=4000 --threads 200 -p dbnameprefix=testcc -p databaseproportions=1.0 -p unitnameprefix=unit1 -p unitscount=1 -p tablecount=200 -p loadbatchsize=500 -p mysql.password=12345678

  • Trigger CDC owner switches: $ kubectl delete pod db-ticdc-1 --kubeconfig=/etc/kubernetes/cluster1.conf -n tidbxxx, deleting ticdc-0, ticdc-1, and ticdc-2 one by one.

  • Load data to upstream again: $ bin/go-ycsb run mysql -P workloads/betting -p operationcount=5000000 -p mysql.host=xxx -p mysql.port=4000 --threads 200 -p dbnameprefix=testcc -p databaseproportions=1.0 -p unitnameprefix=unit1 -p unitscount=1 -p tablecount=200 -p mysql.password=12345678

  • Restart all PD pods: $ kubectl delete pod db-pd-0 db-pd-1 db-pd-2 --kubeconfig=/etc/kubernetes/cluster1.conf -n tidbxxx

  • Check the changefeed status; once the upstream has stopped writing and the sync has completed, check whether the data is consistent.

  2. What did you expect to see?
  • All data is consistent.
  3. What did you see instead?
  • 11 of the 200 tables have data inconsistencies: the row counts are the same, but some values differ.
    sync-diff log:
    [2021/07/06 09:54:56.487 +00:00] [WARN] [diff.go:551] ["checksum is not equal"] [table=testcc0.unit10_game_bets_game_tag_analysis_12] [where="((TRUE) AND TRUE)"] ["source checksum"=3733393904] ["target checksum"=2605381674] ["get source checksum cost"=250.960066ms] ["get target checksum cost"=206.317042ms]
    [2021/07/06 09:55:05.426 +00:00] [WARN] [diff.go:551] ["checksum is not equal"] [table=testcc0.unit10_game_bets_game_tag_analysis_18] [where="((TRUE) AND TRUE)"] ["source checksum"=3258024002] ["target checksum"=317619229] ["get source checksum cost"=183.408826ms] ["get target checksum cost"=87.105012ms]
    [2021/07/06 09:55:21.657 +00:00] [WARN] [diff.go:551] ["checksum is not equal"] [table=testcc0.unit10_game_bets_game_tag_analysis_11] [where="((TRUE) AND TRUE)"] ["source checksum"=446946590] ["target checksum"=3514961352] ["get source checksum cost"=214.137411ms] ["get target checksum cost"=146.654629ms]
    [2021/07/06 09:55:45.994 +00:00] [WARN] [diff.go:551] ["checksum is not equal"] [table=testcc0.unit10_game_bets_game_tag_analysis_14] [where="((TRUE) AND TRUE)"] ["source checksum"=4122421317] ["target checksum"=274903492] ["get source checksum cost"=209.464336ms] ["get target checksum cost"=109.468714ms]
    [2021/07/06 09:56:19.416 +00:00] [WARN] [diff.go:551] ["checksum is not equal"] [table=testcc0.unit10_game_bets_game_tag_analysis_79] [where="((TRUE) AND TRUE)"] ["source checksum"=1531020405] ["target checksum"=3060465391] ["get source checksum cost"=78.444038ms] ["get target checksum cost"=8.828986ms]
    [2021/07/06 09:56:31.858 +00:00] [WARN] [diff.go:551] ["checksum is not equal"] [table=testcc0.unit10_game_bets_game_tag_analysis_1] [where="((TRUE) AND TRUE)"] ["source checksum"=1478455333] ["target checksum"=1441817492] ["get source checksum cost"=538.675238ms] ["get target checksum cost"=434.188737ms]
    [2021/07/06 09:57:15.840 +00:00] [WARN] [diff.go:551] ["checksum is not equal"] [table=testcc0.unit10_statistics_agent_game_day] [where="((TRUE) AND TRUE)"] ["source checksum"=3721642887] ["target checksum"=4238525603] ["get source checksum cost"=82.975369ms] ["get target checksum cost"=19.861323ms]
    [2021/07/06 09:57:19.254 +00:00] [WARN] [diff.go:551] ["checksum is not equal"] [table=testcc0.unit10_game_bets_game_tag_analysis_21] [where="((TRUE) AND TRUE)"] ["source checksum"=4167676186] ["target checksum"=3207274853] ["get source checksum cost"=116.469711ms] ["get target checksum cost"=107.899528ms]
    [2021/07/06 09:57:25.812 +00:00] [WARN] [diff.go:551] ["checksum is not equal"] [table=testcc0.unit10_game_bets_game_tag_analysis_61] [where="((TRUE) AND TRUE)"] ["source checksum"=1741577261] ["target checksum"=3192701895] ["get source checksum cost"=76.139605ms] ["get target checksum cost"=15.00056ms]
    [2021/07/06 09:57:35.276 +00:00] [WARN] [diff.go:551] ["checksum is not equal"] [table=testcc0.unit10_statistics_user_game_day] [where="((TRUE) AND TRUE)"] ["source checksum"=3141552518] ["target checksum"=4014241092] ["get source checksum cost"=90.358791ms] ["get target checksum cost"=30.339078ms]
    [2021/07/06 09:57:50.224 +00:00] [WARN] [diff.go:551] ["checksum is not equal"] [table=testcc0.unit10_game_bets_game_tag_analysis_31] [where="((TRUE) AND TRUE)"] ["source checksum"=1131104107] ["target checksum"=2463968134] ["get source checksum cost"=139.692822ms] ["get target checksum cost"=86.867459ms]

For example, in table testcc0.unit10_statistics_agent_game_day, the row count is 4155 and the differing row is id: 946.

  4. Versions of the cluster

    • Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

      5.0.2
      
    • TiCDC version (execute cdc version):

      ["Welcome to Change Data Capture (CDC)"] [release-version=v5.0.0-dev] [git-hash=9300b05ceec2c4811198416a5d21e9f4910deddd] [git-branch=cloud-cdc-5.0] [utc-build-time="2021-06-26 14:16:57"] [go-version="go version go1.16.3 linux/amd64"] [failpoint-build=false]
      
Tammyxia added the type/bug and severity/critical labels on Jul 6, 2021

Tammyxia commented Jul 6, 2021

bet_sum decimal(20,4) NOT NULL DEFAULT '0.0000' COMMENT


Tammyxia commented Jul 6, 2021

bet_valid decimal(20,4) NOT NULL DEFAULT '0.0000' COMMENT

ben1009 self-assigned this on Jul 7, 2021

ben1009 commented Jul 9, 2021

Trying to reproduce the issue in the IDC environment; not reproduced yet.


ben1009 commented Jul 9, 2021

Ran the load several times with

bin/go-ycsb load mysql -P workloads/betting -p recordcount=99000 -p mysql.port=4000 --threads 20 -p dbnameprefix=cclh1 -p databaseproportions=1.0 -p unitnameprefix=unit1 -p unitscount=1 -p tablecount=3 -p loadbatchsize=50

while switching the owner by killing the cdc service, then ran sync-diff:
[2021/07/09 15:38:54.191 +08:00] [INFO] [main.go:69] ["check pass!!!"]

amyangfei commented:

[2021/07/06 08:16:37.058 +00:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=4803] [connInfo="id:4803, addr:10.240.213.151:58536 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql=commit] [txn_mode=OPTIMISTIC] [err="[kv:9007]Write conflict, txnStartTS=426130636266012692, conflictStartTS=426130636213583891, conflictCommitTS=426130641915215874, key=? [try again later]\nprevious statement: replace into `testcc0` . `unit10_xxx_yyy` ( `id` , `user_id` , `symoney` , `padding` ) values ( ... , (_charset) ? )"]
[2021/07/06 08:17:28.254 +00:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=4903] [connInfo="id:4903, addr:10.240.194.72:50136 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql=commit] [txn_mode=OPTIMISTIC] [err="[kv:9007]Write conflict, txnStartTS=426130649976143927, conflictStartTS=426130647380918276, conflictCommitTS=426130653724803075, key=? [try again later]\nprevious statement: replace into `testcc0` . `unit10_xxx_yyy` ( `id` , `user_id` , `symoney` , `padding` ) values ( ... , (_charset) ? )"]
[2021/07/06 08:17:51.538 +00:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=4797] [connInfo="id:4797, addr:10.240.213.151:58534 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql=commit] [txn_mode=OPTIMISTIC] [err="[kv:9007]Write conflict, txnStartTS=426130655913705473, conflictStartTS=426130651470364674, conflictCommitTS=426130657538998274, key=? [try again later]\nprevious statement: replace into `testcc0` . `unit10_xxx_yyy` ( `id` , `user_id` , `symoney` , `padding` ) values ( ... , (_charset) ? )"]
[2021/07/06 08:18:06.481 +00:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=4793] [connInfo="id:4793, addr:10.240.213.151:58538 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql=commit] [txn_mode=OPTIMISTIC] [err="[kv:9007]Write conflict, txnStartTS=426130661064835074, conflictStartTS=426130657145782273, conflictCommitTS=426130662690127874, key=? [try again later]\nprevious statement: replace into `testcc0` . `unit10_xxx_yyy` ( `id` , `user_id` , `symoney` , `padding` ) values ( ... , (_charset) ? )"]
[2021/07/06 08:18:20.820 +00:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=4795] [connInfo="id:4795, addr:10.240.213.151:58548 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql=commit] [txn_mode=OPTIMISTIC] [err="[kv:9007]Write conflict, txnStartTS=426130663463452673, conflictStartTS=426130659701686274, conflictCommitTS=426130665809641501, key=? [try again later]\nprevious statement: replace into `testcc0` . `unit10_xxx_yyy` ( `id` , `user_id` , `symoney` , `padding` ) values ( ... , (_charset) ? )"]
[2021/07/06 08:18:25.486 +00:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=4797] [connInfo="id:4797, addr:10.240.213.151:58534 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql=commit] [txn_mode=OPTIMISTIC] [err="[kv:9007]Write conflict, txnStartTS=426130666097999927, conflictStartTS=426130662493519873, conflictCommitTS=426130668575260674, key=? [try again later]\nprevious statement: replace into `testcc0` . `unit10_xxx_yyy` ( `id` , `user_id` , `symoney` , `padding` ) values ( ... , (_charset) ? )"]
[2021/07/06 08:18:45.121 +00:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=4805] [connInfo="id:4805, addr:10.240.213.151:58558 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql=commit] [txn_mode=OPTIMISTIC] [err="[kv:9007]Write conflict, txnStartTS=426130671170486273, conflictStartTS=426130669217513473, conflictCommitTS=426130674185142274, key=? [try again later]\nprevious statement: replace into `testcc0` . `unit10_xxx_yyy` ( `id` , `user_id` , `symoney` , `padding` ) values ( ... , (_charset) ? )"]
[2021/07/06 08:18:53.173 +00:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=4793] [connInfo="id:4793, addr:10.240.213.151:58538 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql=commit] [txn_mode=OPTIMISTIC] [err="[kv:9007]Write conflict, txnStartTS=426130672009347074, conflictStartTS=426130668784975873, conflictCommitTS=426130673778819075, key=? [try again later]\nprevious statement: replace into `testcc0` . `unit10_xxx_yyy` ( `id` , `user_id` , `symoney` , `padding` ) values ( ... , (_charset) ? )"]
[2021/07/06 08:19:13.810 +00:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=4905] [connInfo="id:4905, addr:10.240.194.72:50114 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql=commit] [txn_mode=OPTIMISTIC] [err="[kv:9007]Write conflict, txnStartTS=426130677396406273, conflictStartTS=426130677999337475, conflictCommitTS=426130677999337475, key=? [try again later]\nprevious statement: replace into `testcc0` . `unit10_xxx_yyy` ( `id` , `user_id` , `symoney` , `padding` ) values ( ... , (_charset) ? )"]
[2021/07/06 08:25:12.807 +00:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=5045] [connInfo="id:5045, addr:10.240.194.72:51774 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql=commit] [txn_mode=OPTIMISTIC] [err="[kv:9007]Write conflict, txnStartTS=426130772292534273, conflictStartTS=426130767573942274, conflictCommitTS=426130773524611073, key=? [try again later]\nprevious statement: replace into `testcc0` . `unit10_xxx_yyy` ( `id` , `user_id` , `symoney` , `padding` ) values ( ... , (_charset) ? )"]
[2021/07/06 08:32:34.846 +00:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=5049] [connInfo="id:5049, addr:10.240.194.72:51764 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql=commit] [txn_mode=OPTIMISTIC] [err="[kv:9007]Write conflict, txnStartTS=426130888710684678, conflictStartTS=426130887216463887, conflictCommitTS=426130893062275074, key=? [try again later]\nprevious statement: replace into `testcc0` . `unit10_xxx_yyy` ( `id` , `user_id` , `symoney` , `padding` ) values ( ... , (_charset) ? )"]

In an inconsistent scenario, we found multiple TiCDC nodes writing to the same table in TiDB.


amyangfei commented Jul 10, 2021

Cloud env log analysis

From the original log, there could be several reasons:

  1. A table pipeline is not closed cleanly; however, the new owner can detect this when the table count in processor memory does not match the table count in tableStatus in etcd (see the sketch after this list).
  2. The table sink sends extra data to the backend sink, which causes the extra rows.
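
As a rough illustration of the check in point 1 (a minimal sketch with hypothetical names, not the actual TiCDC code), the new owner could compare the processor's in-memory table set against the tableStatus recorded in etcd, either element by element or simply by size:

// findLeakedTables returns table pipelines the processor still runs even
// though the tableStatus in etcd no longer lists them.
func findLeakedTables(memTables, etcdTables map[int64]struct{}) []int64 {
	var leaked []int64
	for tableID := range memTables {
		if _, ok := etcdTables[tableID]; !ok {
			// Still alive in memory but no longer owned according to etcd:
			// the "not well closed" case from point 1.
			leaked = append(leaked, tableID)
		}
	}
	return leaked
}

A plain len(memTables) != len(etcdTables) comparison catches the same condition with less detail.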

Reproduce

I have reproduced multiple TiCDC nodes writing to the same table in a test environment. To make the bug easier to reproduce, I injected a delay before a table pipeline is closed; ref: amyangfei@8fc643f. (The branch is based on release-5.1.)
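
The referenced commit is not shown here; conceptually, the injection just pauses the stop path so the old node keeps writing after the owner has already rescheduled the table, roughly like this hypothetical helper (names are made up for illustration):

import "time"

// delayedStopTable sketches the fault injection: delay the table pipeline stop
// so the race window between the old capture and the new one is easy to hit.
func delayedStopTable(stop func()) {
	time.Sleep(10 * time.Second) // artificial delay, for reproduction only
	stop()                       // the normal pipeline stop logic
}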

The test procedure is simple

#!/bin/bash

while true; do
    mysql -h127.0.0.1 -uroot -P4000 -e "update test5.t1 set a=a+1 where id<20;update test5.t2 set a=a+1 where id<20;update test5.t3 set a=a+1 where id<20;update test5.t4 set a=a+1 where id<20;update test5.t5 set a=a+1 where id<20;update test5.t6 set a=a+1 where id<20;"
    sleep 0.3
done

  • Start another TiCDC node.
  • Observe that the two TiCDC nodes write to the same table at the same time; Write conflict errors can be found in the cdc/tidb logs too.

Take tableID=110, `test5`.`t3` as an example.

$ grep tableID=110 cdc*.log
cdc_8301.log:[2021/07/10 16:56:13.894 +08:00] [INFO] [puller.go:223] ["puller is initialized"] [duration=17.074112ms] [changefeed=test-cf] [tableID=110] [spans="[\"[7480000000000000ff6e5f720000000000fa, 7480000000000000ff6e5f730000000000fa)\"]"] [resolvedTs=426221850955874305]
cdc.log:[2021/07/10 16:55:04.608 +08:00] [INFO] [puller.go:223] ["puller is initialized"] [duration=5.880109ms] [changefeed=test-cf] [tableID=110] [spans="[\"[7480000000000000ff6e5f720000000000fa, 7480000000000000ff6e5f730000000000fa)\"]"] [resolvedTs=426221823312789505]
cdc.log:[2021/07/10 16:55:45.678 +08:00] [INFO] [table.go:107] ["send async stop signal to table"] [tableID=110] [targetTs=426221850955874305] []
$ grep t3 cdc.log | tail -n 1 | cut -c1-1000
[2021/07/10 16:56:30.183 +08:00] [INFO] [mysql.go:1043] ["prepare DMLs"] [rows="[{\"start-ts\":426221853354754054,\"commit-ts\":426221853354754055,\"row-id\":14,\"table\":{\"db-name\":\"test5\",\"tbl-name\":\"t3\",\"tbl-id\":110,\"is-partition\":false},\"table-info-version\":426221823312789505,\"replica-id\":0,\"columns\":[{\"name\":\"id\",\"type\":8,\"flag\":27,\"value\":13},{\"name\":\"a\",\"type\":3,\"flag\":65,\"value\":2608}],\"pre-columns\":[{\"name\":\"id\",\"type\":8,\"flag\":27,\"value\":13},{\"name\":\"a\",\"type\":3,\"flag\":65,\"value\":2607}]},{\"start-ts\":426221853354754054,\"commit-ts\":426221853354754055,\"row-id\":8,\"table\":{\"db-name\":\"test5\",\"tbl-name\":\"t3\",\"tbl-id\":110,\"is-partition\":false},\"table-info-version\":426221823312789505,\"replica-id\":0,\"columns\":[{\"name\":\"id\",\"type\":8,\"flag\":27,\"value\":7},{\"name\":\"a\",\"type\":3,\"flag\":65,\"value\":2608}],\"pre-columns\":[{\"name\":\"id\",\"type\":8,\"flag\":27,\"value\":7},{\"name\":\"a\"

2021/07/10 16:56:13.894 +08:00 is the time the table is initialized on the second node (TiCDC-8301).

[2021/07/10 16:56:30.183 +08:00] is the last time the first TiCDC node (TiCDC-8300) writes to table `test5`.`t3`.

The commit-ts of the row changed event is 426221853354754055, which is larger than the table stop-ts 426221850955874305.

From these two logs we can confirm that the first TiCDC node still writes to table t3 after t3 has been started on the second TiCDC node.
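
For reference, a TiDB TSO packs a physical timestamp in milliseconds into its high bits and an 18-bit logical counter into the low 18 bits, so the ordering above can be double-checked by decoding both values (a quick standalone sketch, not TiCDC code):

package main

import (
	"fmt"
	"time"
)

// tsoPhysical extracts the physical part of a TSO, i.e. milliseconds since the
// Unix epoch; the low 18 bits are the logical counter.
func tsoPhysical(ts uint64) time.Time {
	return time.UnixMilli(int64(ts >> 18))
}

func main() {
	commitTs := uint64(426221853354754055)
	stopTs := uint64(426221850955874305)
	fmt.Println("commit-ts:", tsoPhysical(commitTs))
	fmt.Println("stop-ts:  ", tsoPhysical(stopTs))
	fmt.Println("commit-ts > stop-ts:", commitTs > stopTs)
}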

cdc logs:
cloud-cdc-2230.tar.gz

Cause analysis

Remaining questions

  • Why is this scenario triggered so frequently in the cloud environment?


amyangfei commented Jul 10, 2021

From the cloud env log, this issue is the same scenario as #2244 (a new capture joins the TiCDC cluster and the TiCDC owner reschedules some tables to the new capture). IMHO we can close one of them. @Tammyxia

$ grep -iE "welcome|campaign|start re" cdc*.log
cdc0.log:[2021/07/06 08:10:05.756 +00:00] [INFO] [version.go:47] ["Welcome to Change Data Capture (CDC)"] [release-version=v5.0.0-dev] [git-hash=9300b05ceec2c4811198416a5d21e9f4910deddd] [git-branch=cloud-cdc-5.0] [utc-build-time="2021-06-26 14:16:57"] [go-version="go version go1.16.3 linux/amd64"] [failpoint-build=false]
cdc1.log:[2021/07/06 08:09:56.162 +00:00] [INFO] [version.go:47] ["Welcome to Change Data Capture (CDC)"] [release-version=v5.0.0-dev] [git-hash=9300b05ceec2c4811198416a5d21e9f4910deddd] [git-branch=cloud-cdc-5.0] [utc-build-time="2021-06-26 14:16:57"] [go-version="go version go1.16.3 linux/amd64"] [failpoint-build=false]
cdc1.log:[2021/07/06 08:09:56.234 +00:00] [INFO] [capture.go:237] ["campaign owner successfully"] [capture-id=d35515c2-e80e-4e9d-b436-383b9d753692]
cdc1.log:[2021/07/06 08:09:58.320 +00:00] [INFO] [scheduler.go:347] ["Start rebalancing"] [changefeed=16d1e27f-2d90-4dc4-8e4f-71801650ae6d] [table-num=202] [capture-num=1] [target-limit=202]
cdc1.log:[2021/07/06 08:10:05.834 +00:00] [INFO] [scheduler.go:347] ["Start rebalancing"] [changefeed=16d1e27f-2d90-4dc4-8e4f-71801650ae6d] [table-num=202] [capture-num=2] [target-limit=101]
cdc1.log:[2021/07/06 08:10:14.597 +00:00] [INFO] [scheduler.go:347] ["Start rebalancing"] [changefeed=16d1e27f-2d90-4dc4-8e4f-71801650ae6d] [table-num=202] [capture-num=3] [target-limit=68]
cdc2.log:[2021/07/06 08:10:14.509 +00:00] [INFO] [version.go:47] ["Welcome to Change Data Capture (CDC)"] [release-version=v5.0.0-dev] [git-hash=9300b05ceec2c4811198416a5d21e9f4910deddd] [git-branch=cloud-cdc-5.0] [utc-build-time="2021-06-26 14:16:57"] [go-version="go version go1.16.3 linux/amd64"] [failpoint-build=false]


amyangfei commented Jul 12, 2021

One possible hotfix:

  1. When processing a table-removed operation, don't cancel the table pipeline directly. Instead, the owner waits until the processor's checkpoint-ts reaches the global resolved-ts, and then continues the table cancel logic (a sketch of this wait follows the list). https://github.com/pingcap/ticdc/blob/bda60ed68503464403f61695bb1c24315627e65f/cdc/processor/processor.go#L390-L392
  2. Use the global resolved-ts as the start-ts when dispatching this table to another processor.
  3. When the pipeline sink node receives pipeline.CommandTypeStopAtTs, it sends an event to the MySQL sink indicating that events of the specific table should be dropped. (The event contains the current global resolved-ts, the schema, and the table name.) This has a risk when the same table is re-added to this processor: when a table sink starts, it sends a reset event to the MySQL sink, so the MySQL sink should maintain a disabled-table list, which is used for data recall. The drop-event filter may introduce some performance regression, but not too much; besides, we can record the latest commit-ts of each table in the processor and add it as the filter boundary.
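
A minimal sketch of the wait in step 1, using hypothetical accessor functions instead of the real processor fields:

import (
	"context"
	"time"
)

// waitTableDrained blocks until the processor's checkpoint-ts has caught up
// with the changefeed's global resolved-ts, so every row below the hand-over
// point is flushed before the table pipeline is cancelled.
func waitTableDrained(ctx context.Context, checkpointTs, globalResolvedTs func() uint64) error {
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	for checkpointTs() < globalResolvedTs() {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
	return nil
}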

Since step 3 is too ugly, an alternative to step 3 is to use the global resolved-ts when flushing row changed data to the buffer sink:

https://github.com/pingcap/ticdc/blob/548cf9c925accae7544f4c914875710bb3f1e140/cdc/processor/pipeline/sink.go#L131
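
A minimal sketch of that alternative (hypothetical types, not the actual sink node code): once the table sink knows the barrier ts, it flushes only rows whose commit-ts does not exceed it, so late events from a table that has already moved to another capture are never written downstream:

// rowEvent stands in for a row changed event; only the commit-ts matters here.
type rowEvent struct {
	CommitTs uint64
}

type tableSink struct {
	buffer []rowEvent
}

// flushUpTo emits every buffered row with CommitTs <= barrierTs. Rows beyond
// the barrier are kept for the next flush, unless the table is stopping, in
// which case they are dropped: they belong to whichever capture owns the
// table after the hand-over.
func (s *tableSink) flushUpTo(barrierTs uint64, stopping bool, emit func(rowEvent)) {
	remaining := s.buffer[:0]
	for _, row := range s.buffer {
		switch {
		case row.CommitTs <= barrierTs:
			emit(row)
		case !stopping:
			remaining = append(remaining, row)
		}
	}
	s.buffer = remaining
}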
