Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/client: fix the error handling for io.EOF may miss region reconnect (#1641) #1648

Merged
merged 6 commits into from
Apr 13, 2021

Conversation

ti-srebot
Copy link
Contributor

@ti-srebot ti-srebot commented Apr 13, 2021

cherry-pick #1641 to release-5.0
You can switch your code base to this Pull Request by using git-extras:

# In ticdc repo:
git pr https://github.com/pingcap/ticdc/pull/1648

After apply modifications, you can push your change to this PR via:

git push git@github.com:ti-srebot/ticdc.git pr/1648:release-5.0-000eb0fb246c

What problem does this PR solve?

Fix bug-1.3 in #1633

This is also another bug that can trigger "received an event but neither pending region nor running region was found

➜  go test -count=1 -v --tags leak -check.f TestStreamRecvWithErrorIOEOF
=== RUN   Test
[2021/04/12 11:18:38.114 +08:00] [INFO] [client.go:355] ["get clusterID"] [id=1]
[2021/04/12 11:18:38.116 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.116 +08:00] [INFO] [client.go:776] ["creating new stream to store to send request"] [regionID=3] [requestID=2] [storeID=1] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.119 +08:00] [INFO] [client.go:822] ["start new request"] [request="{\"header\":{\"cluster_id\":1},\"region_id\":3,\"checkpoint_ts\":100,\"start_key\":\"YQ==\",\"end_key\":\"Yg==\",\"request_id\":2,\"Request\":null}"] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.129 +08:00] [INFO] [client.go:1099] ["stream to store closed"] [addr=127.0.0.1:36297] [storeID=1]
[2021/04/12 11:18:38.129 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.129 +08:00] [INFO] [region_cache.go:601] ["mark store's regions need be refill"] [store=127.0.0.1:36297]
[2021/04/12 11:18:38.129 +08:00] [INFO] [region_cache.go:620] ["switch region peer to next due to send request fail"] [current="region ID: 3, meta: id:3 peers:<id:4 store_id:1 > , peer: id:4 store_id:1 , addr: 127.0.0.1:36297, idx: 0, reqStoreType: TiKvOnly, runStoreType: tikv"] [needReload=false] [error="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting"] [errorVerbose="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting\ngithub.com/pingcap/errors.AddStack\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:156\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1106\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1138\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:812\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\t/home/apple/.gvm/gos/go1.15/src/runtime/asm_amd64.s:1374"]
[2021/04/12 11:18:38.130 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.130 +08:00] [INFO] [region_cache.go:426] ["invalidate current region, because others failed on same store"] [region=3] [store=127.0.0.1:36297]
[2021/04/12 11:18:38.130 +08:00] [INFO] [client.go:716] ["cannot get rpcCtx, retry span"] [regionID=3] [span="[61, 62)"]
[2021/04/12 11:18:38.130 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.131 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.131 +08:00] [INFO] [client.go:822] ["start new request"] [request="{\"header\":{\"cluster_id\":1},\"region_id\":3,\"checkpoint_ts\":100,\"start_key\":\"YQ==\",\"end_key\":\"Yg==\",\"request_id\":3,\"Request\":null}"] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.131 +08:00] [ERROR] [client.go:830] ["send request to stream failed"] [addr=127.0.0.1:36297] [storeID=1] [regionID=3] [requestID=3] [error=EOF] [stack="github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:830\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:568\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57"]
[2021/04/12 11:18:38.179 +08:00] [INFO] [client.go:776] ["creating new stream to store to send request"] [regionID=3] [requestID=4] [storeID=1] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.179 +08:00] [INFO] [client.go:822] ["start new request"] [request="{\"header\":{\"cluster_id\":1},\"region_id\":3,\"checkpoint_ts\":100,\"start_key\":\"YQ==\",\"end_key\":\"Yg==\",\"request_id\":4,\"Request\":null}"] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.179 +08:00] [ERROR] [client.go:1246] ["received an event but neither pending region nor running region was found"] [regionID=3] [requestID=3] [addr=127.0.0.1:36297] [stack="github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).sendRegionChangeEvent\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1246\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1192\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:812\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57"]
[2021/04/12 11:18:38.179 +08:00] [INFO] [client.go:1099] ["stream to store closed"] [addr=127.0.0.1:36297] [storeID=1]
[2021/04/12 11:18:38.179 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.179 +08:00] [INFO] [region_cache.go:601] ["mark store's regions need be refill"] [store=127.0.0.1:36297]
[2021/04/12 11:18:38.179 +08:00] [INFO] [region_cache.go:620] ["switch region peer to next due to send request fail"] [current="region ID: 3, meta: id:3 peers:<id:4 store_id:1 > , peer: id:4 store_id:1 , addr: 127.0.0.1:36297, idx: 0, reqStoreType: TiKvOnly, runStoreType: tikv"] [needReload=false] [error="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting"] [errorVerbose="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting\ngithub.com/pingcap/errors.AddStack\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:156\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1106\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1194\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:812\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\t/home/apple/.gvm/gos/go1.15/src/runtime/asm_amd64.s:1374"]
[2021/04/12 11:18:38.179 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]

What is changed and how it works?

Treat io.EOF as common error and re-establish all regions.

Check List

Tests

  • Unit test
  • Integration test

Related changes

  • Need to cherry-pick to the release branch

Release note

  • Fix the error handling for io.EOF may cause replication interuption.

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Copy link
Contributor Author

/run-all-tests

@ti-srebot ti-srebot added component/kv-client TiKV kv log client component. status/ptal Could you please take a look? size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/cherry-pick-for-release-5.0 This PR is cherry-picked to release-5.0 from a source PR. type/bugfix This PR fixes a bug. labels Apr 13, 2021
@ti-srebot ti-srebot added this to the v5.0.1 milestone Apr 13, 2021
@zhouqiang-cl zhouqiang-cl added the cherry-pick-approved Cherry pick PR approved by release team. label Apr 13, 2021
@amyangfei
Copy link
Contributor

/run-all-tests

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Apr 13, 2021
@zier-one
Copy link
Contributor

/lgtm

@ti-chi-bot
Copy link
Member

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • amyangfei
  • leoppro

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by writing /lgtm in a comment.
Reviewer can cancel approval by writing /lgtm cancel in a comment.

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Apr 13, 2021
@zier-one
Copy link
Contributor

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 2eb8d6c

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Apr 13, 2021
@ti-chi-bot ti-chi-bot merged commit e5e4e1e into pingcap:release-5.0 Apr 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick-approved Cherry pick PR approved by release team. component/kv-client TiKV kv log client component. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. status/ptal Could you please take a look? type/bugfix This PR fixes a bug. type/cherry-pick-for-release-5.0 This PR is cherry-picked to release-5.0 from a source PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants