Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/client: fix region loss in single region handler #3281

Merged
merged 4 commits into from
Nov 5, 2021

Conversation

amyangfei
Copy link
Contributor

@amyangfei amyangfei commented Nov 4, 2021

What problem does this PR solve?

Close #3288

Fix a region loss case, we can reproduce as following steps

  1. Create a TiDB cluster with 5 TiKV nodes, create a table with 4TB data(or 100K regions)
  2. Setup a TiCDC cluster to replicate table in step-1
  3. Use TiUP to restart all these TiKVs without transferring leader region, by tiup cluster restart <cluster-name> -R tikv command for example.
  4. Observe the replication stucks even all regions are initialized. (Both cached regions and scanning regions are zero)

By comparing the initialized regions in TiCDC and all regions by querying select region_id from information_schema.tikv_region_status where db_name = 'xx' and table_name = 'yy' from TiDB, we can observe some regions are lost.

By querying the lost region id in TiCDC log, we found the region disconnected without reconnect

[2021/11/04 17:22:10.083 +08:00] [INFO] [client.go:777] ["start new request"] [request="{\"header\":{\"cluster_id\":7012827302444215878,\"ticdc_version\":\"5.2.0-master\"},\"region_id\":1122272,\"region_epoch\":{\"conf_ver\":980,\"version\":57145},\"checkpoint_ts\":428872226710487042,\"start_key\":\"dIAAAAAAAAD/L19ygAAAADL/CQwfAAAAAAD6\",\"end_key\":\"dIAAAAAAAAD/L19ygAAAADL/CpK/AAAAAAD6\",\"request_id\":383728,\"extra_op\":1,\"Request\":null}"] [addr=172.16.7.56:20161]
[2021/11/04 17:22:10.086 +08:00] [INFO] [region_worker.go:243] ["single region event feed disconnected"] [regionID=1122272] [requestID=383728] [span="[7480000000000000ff2f5f728000000032ff090c1f0000000000fa, 7480000000000000ff2f5f728000000032ff0a92bf0000000000fa)"] [checkpoint=428872226710487042] [error="[CDC:ErrEventFeedEventError]not_leader:<region_id:1122272 > : not_leader:<region_id:1122272 > "]
[2021/11/04 17:22:10.087 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=105] [regionID=1122272] [startKey=7480000000000000ff2f5f728000000032ff090c1f0000000000fa] [endKey=7480000000000000ff2f5f728000000032ff0a92bf0000000000fa] [checkpointTs=428872226710487042]

The root cause is kv client must recycle all failed regions, so we should use the root context of a kv client to call onRegionFail

This bug tends to happen when multiple TiKVs crash or forcing restart, and based on existing test, one TiKV crashes or restarts doesn't trigger this bug. And the more regions, the higher probability.

What is changed and how it works?

Use the parent context of region worker to call onRegionFail when processing region failure.

Check List

Tests

  • Unit test
  • Manual test (add detailed scripts or steps below)

Release note

Fix a bug that TiCDC could meet replication interruption when multiple TiKVs crash or forcing restart.

@amyangfei amyangfei added component/kv-client TiKV kv log client component. needs-cherry-pick-release-4.0 Should cherry pick this PR to release-4.0 branch. needs-cherry-pick-release-5.0 Should cherry pick this PR to release-5.0 branch. needs-cherry-pick-release-5.1 Should cherry pick this PR to release-5.1 branch. needs-cherry-pick-release-5.2 Should cherry pick this PR to release-5.2 branch. needs-cherry-pick-release-5.3 Should cherry pick this PR to release-5.3 branch. labels Nov 4, 2021
@amyangfei amyangfei added this to the v5.3.0 milestone Nov 4, 2021
@ti-chi-bot
Copy link
Member

ti-chi-bot commented Nov 4, 2021

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • hi-rustin
  • overvenus

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 4, 2021
@amyangfei
Copy link
Contributor Author

/run-all-tests

@ti-chi-bot ti-chi-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 4, 2021
@amyangfei
Copy link
Contributor Author

/run-integration-tests

@amyangfei
Copy link
Contributor Author

/run-leak-tests

@codecov-commenter
Copy link

codecov-commenter commented Nov 4, 2021

Codecov Report

Merging #3281 (649cc62) into master (de2fca8) will increase coverage by 0.0119%.
The diff coverage is 81.5789%.

@@               Coverage Diff                @@
##             master      #3281        +/-   ##
================================================
+ Coverage   56.7226%   56.7345%   +0.0119%     
================================================
  Files           214        214                
  Lines         22915      22919         +4     
================================================
+ Hits          12998      13003         +5     
+ Misses         8604       8603         -1     
  Partials       1313       1313                

@amyangfei amyangfei added the area/ticdc Issues or PRs related to TiCDC. label Nov 5, 2021
@@ -273,7 +273,7 @@ func (w *regionWorker) handleSingleRegionError(ctx context.Context, err error, s
}

revokeToken := !state.initialized
err2 := w.session.onRegionFail(ctx, regionErrorInfo{
err2 := w.session.onRegionFail(w.parentCtx, regionErrorInfo{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add some comments here? I see another place where it is used added a comment why we use parentCtx instead of ctx.

See: https://github.com/pingcap/ticdc/blob/37bac66f0673103892c71e585ab3d9d4658c1f74/cdc/kv/region_worker.go#L792
It looks like the changes here are also for this purpose, sorry I'm not very familiar with this piece of code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the same reason

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Nov 5, 2021
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Nov 5, 2021
@amyangfei
Copy link
Contributor Author

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: dadd865

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Nov 5, 2021
@ti-chi-bot
Copy link
Member

@amyangfei: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot merged commit 2a60472 into pingcap:master Nov 5, 2021
@amyangfei amyangfei deleted the fix-region-loss branch November 5, 2021 03:53
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #3289.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #3290.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #3291.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #3292.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #3293.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ticdc Issues or PRs related to TiCDC. component/kv-client TiKV kv log client component. needs-cherry-pick-release-4.0 Should cherry pick this PR to release-4.0 branch. needs-cherry-pick-release-5.0 Should cherry pick this PR to release-5.0 branch. needs-cherry-pick-release-5.1 Should cherry pick this PR to release-5.1 branch. needs-cherry-pick-release-5.2 Should cherry pick this PR to release-5.2 branch. needs-cherry-pick-release-5.3 Should cherry pick this PR to release-5.3 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TiCDC met replication interruption when multiple TiKVs crash or ungraceful restart.
5 participants