Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/client: fix the error handling for io.EOF may miss region reconnect (#1641) #1647

Merged

Conversation

ti-srebot
Copy link
Contributor

@ti-srebot ti-srebot commented Apr 13, 2021

cherry-pick #1641 to release-4.0
You can switch your code base to this Pull Request by using git-extras:

# In ticdc repo:
git pr https://github.com/pingcap/ticdc/pull/1647

After apply modifications, you can push your change to this PR via:

git push git@github.com:ti-srebot/ticdc.git pr/1647:release-4.0-000eb0fb246c

What problem does this PR solve?

Fix bug-1.3 in #1633

This is also another bug that can trigger "received an event but neither pending region nor running region was found

➜  go test -count=1 -v --tags leak -check.f TestStreamRecvWithErrorIOEOF
=== RUN   Test
[2021/04/12 11:18:38.114 +08:00] [INFO] [client.go:355] ["get clusterID"] [id=1]
[2021/04/12 11:18:38.116 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.116 +08:00] [INFO] [client.go:776] ["creating new stream to store to send request"] [regionID=3] [requestID=2] [storeID=1] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.119 +08:00] [INFO] [client.go:822] ["start new request"] [request="{\"header\":{\"cluster_id\":1},\"region_id\":3,\"checkpoint_ts\":100,\"start_key\":\"YQ==\",\"end_key\":\"Yg==\",\"request_id\":2,\"Request\":null}"] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.129 +08:00] [INFO] [client.go:1099] ["stream to store closed"] [addr=127.0.0.1:36297] [storeID=1]
[2021/04/12 11:18:38.129 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.129 +08:00] [INFO] [region_cache.go:601] ["mark store's regions need be refill"] [store=127.0.0.1:36297]
[2021/04/12 11:18:38.129 +08:00] [INFO] [region_cache.go:620] ["switch region peer to next due to send request fail"] [current="region ID: 3, meta: id:3 peers:<id:4 store_id:1 > , peer: id:4 store_id:1 , addr: 127.0.0.1:36297, idx: 0, reqStoreType: TiKvOnly, runStoreType: tikv"] [needReload=false] [error="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting"] [errorVerbose="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting\ngithub.com/pingcap/errors.AddStack\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:156\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1106\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1138\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:812\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\t/home/apple/.gvm/gos/go1.15/src/runtime/asm_amd64.s:1374"]
[2021/04/12 11:18:38.130 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.130 +08:00] [INFO] [region_cache.go:426] ["invalidate current region, because others failed on same store"] [region=3] [store=127.0.0.1:36297]
[2021/04/12 11:18:38.130 +08:00] [INFO] [client.go:716] ["cannot get rpcCtx, retry span"] [regionID=3] [span="[61, 62)"]
[2021/04/12 11:18:38.130 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.131 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.131 +08:00] [INFO] [client.go:822] ["start new request"] [request="{\"header\":{\"cluster_id\":1},\"region_id\":3,\"checkpoint_ts\":100,\"start_key\":\"YQ==\",\"end_key\":\"Yg==\",\"request_id\":3,\"Request\":null}"] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.131 +08:00] [ERROR] [client.go:830] ["send request to stream failed"] [addr=127.0.0.1:36297] [storeID=1] [regionID=3] [requestID=3] [error=EOF] [stack="github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:830\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:568\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57"]
[2021/04/12 11:18:38.179 +08:00] [INFO] [client.go:776] ["creating new stream to store to send request"] [regionID=3] [requestID=4] [storeID=1] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.179 +08:00] [INFO] [client.go:822] ["start new request"] [request="{\"header\":{\"cluster_id\":1},\"region_id\":3,\"checkpoint_ts\":100,\"start_key\":\"YQ==\",\"end_key\":\"Yg==\",\"request_id\":4,\"Request\":null}"] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.179 +08:00] [ERROR] [client.go:1246] ["received an event but neither pending region nor running region was found"] [regionID=3] [requestID=3] [addr=127.0.0.1:36297] [stack="github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).sendRegionChangeEvent\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1246\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1192\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:812\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57"]
[2021/04/12 11:18:38.179 +08:00] [INFO] [client.go:1099] ["stream to store closed"] [addr=127.0.0.1:36297] [storeID=1]
[2021/04/12 11:18:38.179 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.179 +08:00] [INFO] [region_cache.go:601] ["mark store's regions need be refill"] [store=127.0.0.1:36297]
[2021/04/12 11:18:38.179 +08:00] [INFO] [region_cache.go:620] ["switch region peer to next due to send request fail"] [current="region ID: 3, meta: id:3 peers:<id:4 store_id:1 > , peer: id:4 store_id:1 , addr: 127.0.0.1:36297, idx: 0, reqStoreType: TiKvOnly, runStoreType: tikv"] [needReload=false] [error="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting"] [errorVerbose="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting\ngithub.com/pingcap/errors.AddStack\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:156\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1106\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1194\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:812\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\t/home/apple/.gvm/gos/go1.15/src/runtime/asm_amd64.s:1374"]
[2021/04/12 11:18:38.179 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]

What is changed and how it works?

Treat io.EOF as common error and re-establish all regions.

Check List

Tests

  • Unit test
  • Integration test

Related changes

  • Need to cherry-pick to the release branch

Release note

  • Fix the error handling for io.EOF may cause replication interuption.

@ti-srebot
Copy link
Contributor Author

/run-all-tests

@ti-srebot ti-srebot added component/kv-client TiKV kv log client component. status/ptal Could you please take a look? size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/4.0-cherry-pick type/bugfix This PR fixes a bug. labels Apr 13, 2021
@ti-srebot ti-srebot requested review from zier-one and overvenus April 13, 2021 02:14
@ti-chi-bot ti-chi-bot requested a review from amyangfei April 13, 2021 02:14
@ti-srebot ti-srebot added this to the v4.0.13 milestone Apr 13, 2021
@liuzix
Copy link
Contributor

liuzix commented Apr 15, 2021

/run-all-tests

@amyangfei amyangfei force-pushed the release-4.0-000eb0fb246c branch from a8f14a0 to ae54e29 Compare April 27, 2021 08:44
@amyangfei amyangfei changed the base branch from release-4.0 to release-4.0-pending April 27, 2021 08:44
@amyangfei
Copy link
Contributor

/run-all-tests

@ti-chi-bot ti-chi-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 27, 2021
@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Apr 27, 2021
@ti-chi-bot
Copy link
Member

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • amyangfei
  • leoppro

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by writing /lgtm in a comment.
Reviewer can cancel approval by writing /lgtm cancel in a comment.

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Apr 27, 2021
@ti-chi-bot ti-chi-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 27, 2021
@amyangfei
Copy link
Contributor

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: cf949d9

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Apr 27, 2021
@amyangfei amyangfei merged commit 75e0088 into pingcap:release-4.0-pending Apr 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/kv-client TiKV kv log client component. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. status/ptal Could you please take a look? type/bugfix This PR fixes a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants