
TiCDC replication is interrupted when multiple TiKVs crash or restart ungracefully #3288

Closed
amyangfei opened this issue Nov 5, 2021 · 0 comments · Fixed by #3281
Labels: affects-4.0, affects-5.0, affects-5.1, affects-5.2, area/ticdc, component/kv-client, severity/major, type/bug
Milestone: v5.3.0

Comments

@amyangfei
Contributor

What did you do?

The bug can be reproduced with the following steps:

  1. Create a TiDB cluster with 5 TiKV nodes and create a table with 4 TB of data (around 100K regions).
  2. Set up a TiCDC cluster to replicate the table created in step 1.
  3. Use TiUP to restart all TiKV nodes without transferring region leaders, for example with `tiup cluster restart <cluster-name> -R tikv`.
  4. Observe that replication is stuck even though all regions are initialized (both cached regions and scanning regions are zero).

By comparing the regions initialized in TiCDC against the full region list obtained from TiDB with `select region_id from information_schema.tikv_region_status where db_name = 'xx' and table_name = 'yy'`, we can observe that some regions are lost.
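
A minimal sketch of this comparison, assuming a MySQL-compatible connection to TiDB and a list of region IDs that TiCDC reports as initialized (the DSN, the `xx`/`yy` filter, and the `initialized` set below are placeholders to adjust for your environment):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL-compatible driver, assumed available
)

func main() {
	// Hypothetical input: region IDs that TiCDC reports as initialized,
	// e.g. extracted from its logs or metrics.
	initialized := map[int64]bool{1122270: true, 1122271: true /* ... */}

	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Full region list for the replicated table, as described above.
	rows, err := db.Query(
		"SELECT region_id FROM information_schema.tikv_region_status WHERE db_name = ? AND table_name = ?",
		"xx", "yy")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var regionID int64
		if err := rows.Scan(&regionID); err != nil {
			log.Fatal(err)
		}
		if !initialized[regionID] {
			fmt.Println("lost region:", regionID)
		}
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```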

By searching the TiCDC log for a lost region ID, we found that the region was disconnected without ever reconnecting:

[2021/11/04 17:22:10.083 +08:00] [INFO] [client.go:777] ["start new request"] [request="{\"header\":{\"cluster_id\":7012827302444215878,\"ticdc_version\":\"5.2.0-master\"},\"region_id\":1122272,\"region_epoch\":{\"conf_ver\":980,\"version\":57145},\"checkpoint_ts\":428872226710487042,\"start_key\":\"dIAAAAAAAAD/L19ygAAAADL/CQwfAAAAAAD6\",\"end_key\":\"dIAAAAAAAAD/L19ygAAAADL/CpK/AAAAAAD6\",\"request_id\":383728,\"extra_op\":1,\"Request\":null}"] [addr=172.16.7.56:20161]
[2021/11/04 17:22:10.086 +08:00] [INFO] [region_worker.go:243] ["single region event feed disconnected"] [regionID=1122272] [requestID=383728] [span="[7480000000000000ff2f5f728000000032ff090c1f0000000000fa, 7480000000000000ff2f5f728000000032ff0a92bf0000000000fa)"] [checkpoint=428872226710487042] [error="[CDC:ErrEventFeedEventError]not_leader:<region_id:1122272 > : not_leader:<region_id:1122272 > "]
[2021/11/04 17:22:10.087 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=105] [regionID=1122272] [startKey=7480000000000000ff2f5f728000000032ff090c1f0000000000fa] [endKey=7480000000000000ff2f5f728000000032ff0a92bf0000000000fa] [checkpointTs=428872226710487042]

The root cause is that the kv client must recycle every failed region; `onRegionFail` should therefore be called with the root context of the kv client, so that recycling is not skipped just because a narrower context (one tied to a single stream or request) has already been canceled.
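
A minimal sketch of the failure pattern (not the actual TiCDC code; the type, field, and signature names are assumptions for illustration): if `onRegionFail` receives a context narrower than the kv client's lifetime, such as a per-stream context, the hand-off of the failed region can be skipped once that context is canceled.

```go
package kvclient

import "context"

// Hypothetical simplification of the kv client's region retry path.
type regionErrorInfo struct {
	regionID uint64
	err      error
}

type eventFeedSession struct {
	errCh chan regionErrorInfo // failed regions waiting to be rescheduled
}

// onRegionFail hands a failed region back for rescheduling. If ctx is a
// per-stream context that has already been canceled (for example because the
// TiKV serving that stream restarted), the ctx.Done() branch can win and the
// region is dropped without ever being retried. Passing the kv client's root
// context instead means recycling is only aborted when the whole kv client
// shuts down.
func (s *eventFeedSession) onRegionFail(ctx context.Context, errInfo regionErrorInfo) error {
	select {
	case s.errCh <- errInfo:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```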

This bug tends to occur when multiple TiKVs crash or are force-restarted; based on existing tests, a single TiKV crash or restart does not trigger it. The more regions the table has, the higher the probability.

What did you expect to see?

TiCDC replication runs normally.

What did you see instead?

Some regions are missing and replication is interrupted.

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

v5.2.1

TiCDC version (execute cdc version):

master@pingcap/ticdc@37bac66

@amyangfei amyangfei added type/bug The issue is confirmed as a bug. severity/major labels Nov 5, 2021
@amyangfei amyangfei added this to the v5.3.0 milestone Nov 5, 2021
overvenus added a commit to ti-chi-bot/tiflow that referenced this issue Jan 18, 2022
overvenus added a commit to ti-chi-bot/tiflow that referenced this issue Jan 19, 2022
zhaoxinyu pushed a commit to zhaoxinyu/ticdc that referenced this issue Jan 20, 2022
@AkiraXie AkiraXie added the area/ticdc Issues or PRs related to TiCDC. label Mar 9, 2022