Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/client: fix the error handling for io.EOF may miss region reconnect #1641

Merged
merged 2 commits into from
Apr 13, 2021

Conversation

amyangfei
Copy link
Contributor

@amyangfei amyangfei commented Apr 12, 2021

What problem does this PR solve?

Fix bug-1.3 in #1633. Also described in https://github.com/pingcap/ticdc/issues/1393

This is also another bug that can trigger "received an event but neither pending region nor running region was found

➜  go test -count=1 -v --tags leak -check.f TestStreamRecvWithErrorIOEOF
=== RUN   Test
[2021/04/12 11:18:38.114 +08:00] [INFO] [client.go:355] ["get clusterID"] [id=1]
[2021/04/12 11:18:38.116 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.116 +08:00] [INFO] [client.go:776] ["creating new stream to store to send request"] [regionID=3] [requestID=2] [storeID=1] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.119 +08:00] [INFO] [client.go:822] ["start new request"] [request="{\"header\":{\"cluster_id\":1},\"region_id\":3,\"checkpoint_ts\":100,\"start_key\":\"YQ==\",\"end_key\":\"Yg==\",\"request_id\":2,\"Request\":null}"] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.129 +08:00] [INFO] [client.go:1099] ["stream to store closed"] [addr=127.0.0.1:36297] [storeID=1]
[2021/04/12 11:18:38.129 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.129 +08:00] [INFO] [region_cache.go:601] ["mark store's regions need be refill"] [store=127.0.0.1:36297]
[2021/04/12 11:18:38.129 +08:00] [INFO] [region_cache.go:620] ["switch region peer to next due to send request fail"] [current="region ID: 3, meta: id:3 peers:<id:4 store_id:1 > , peer: id:4 store_id:1 , addr: 127.0.0.1:36297, idx: 0, reqStoreType: TiKvOnly, runStoreType: tikv"] [needReload=false] [error="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting"] [errorVerbose="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting\ngithub.com/pingcap/errors.AddStack\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:156\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1106\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1138\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:812\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\t/home/apple/.gvm/gos/go1.15/src/runtime/asm_amd64.s:1374"]
[2021/04/12 11:18:38.130 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.130 +08:00] [INFO] [region_cache.go:426] ["invalidate current region, because others failed on same store"] [region=3] [store=127.0.0.1:36297]
[2021/04/12 11:18:38.130 +08:00] [INFO] [client.go:716] ["cannot get rpcCtx, retry span"] [regionID=3] [span="[61, 62)"]
[2021/04/12 11:18:38.130 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.131 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.131 +08:00] [INFO] [client.go:822] ["start new request"] [request="{\"header\":{\"cluster_id\":1},\"region_id\":3,\"checkpoint_ts\":100,\"start_key\":\"YQ==\",\"end_key\":\"Yg==\",\"request_id\":3,\"Request\":null}"] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.131 +08:00] [ERROR] [client.go:830] ["send request to stream failed"] [addr=127.0.0.1:36297] [storeID=1] [regionID=3] [requestID=3] [error=EOF] [stack="github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:830\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:568\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57"]
[2021/04/12 11:18:38.179 +08:00] [INFO] [client.go:776] ["creating new stream to store to send request"] [regionID=3] [requestID=4] [storeID=1] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.179 +08:00] [INFO] [client.go:822] ["start new request"] [request="{\"header\":{\"cluster_id\":1},\"region_id\":3,\"checkpoint_ts\":100,\"start_key\":\"YQ==\",\"end_key\":\"Yg==\",\"request_id\":4,\"Request\":null}"] [addr=127.0.0.1:36297]
[2021/04/12 11:18:38.179 +08:00] [ERROR] [client.go:1246] ["received an event but neither pending region nor running region was found"] [regionID=3] [requestID=3] [addr=127.0.0.1:36297] [stack="github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).sendRegionChangeEvent\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1246\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1192\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:812\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57"]
[2021/04/12 11:18:38.179 +08:00] [INFO] [client.go:1099] ["stream to store closed"] [addr=127.0.0.1:36297] [storeID=1]
[2021/04/12 11:18:38.179 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]
[2021/04/12 11:18:38.179 +08:00] [INFO] [region_cache.go:601] ["mark store's regions need be refill"] [store=127.0.0.1:36297]
[2021/04/12 11:18:38.179 +08:00] [INFO] [region_cache.go:620] ["switch region peer to next due to send request fail"] [current="region ID: 3, meta: id:3 peers:<id:4 store_id:1 > , peer: id:4 store_id:1 , addr: 127.0.0.1:36297, idx: 0, reqStoreType: TiKvOnly, runStoreType: tikv"] [needReload=false] [error="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting"] [errorVerbose="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting\ngithub.com/pingcap/errors.AddStack\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:156\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1106\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).receiveFromStream\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:1194\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest.func1\n\t/home/apple/work/code/pingcap/ticdc/cdc/kv/client.go:812\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/apple/.gvm/pkgsets/go1.15/global/pkg/mod/golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\t/home/apple/.gvm/gos/go1.15/src/runtime/asm_amd64.s:1374"]
[2021/04/12 11:18:38.179 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=1] [regionID=3] [startKey=61] [endKey=62] [checkpointTs=100]

What is changed and how it works?

Treat io.EOF as common error and re-establish all regions.

Check List

Tests

  • Unit test
  • Integration test

Related changes

  • Need to cherry-pick to the release branch

Release note

  • Fix the error handling for io.EOF may cause replication interuption.

@amyangfei amyangfei added type/bugfix This PR fixes a bug. component/kv-client TiKV kv log client component. needs-cherry-pick-release-4.0 Should cherry pick this PR to release-4.0 branch. needs-cherry-pick-release-5.0 Should cherry pick this PR to release-5.0 branch. labels Apr 12, 2021
@amyangfei amyangfei added this to the v5.0.1 milestone Apr 12, 2021
@ti-chi-bot ti-chi-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 12, 2021
@amyangfei
Copy link
Contributor Author

/run-all-tests

@amyangfei amyangfei added the status/ptal Could you please take a look? label Apr 12, 2021
@amyangfei
Copy link
Contributor Author

/run-all-tests

@amyangfei
Copy link
Contributor Author

/run-integration-tests

@liuzix
Copy link
Contributor

liuzix commented Apr 12, 2021

/lgtm

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Apr 12, 2021
@zier-one
Copy link
Contributor

/lgtm

@ti-chi-bot
Copy link
Member

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • leoppro
  • liuzix

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by writing /lgtm in a comment.
Reviewer can cancel approval by writing /lgtm cancel in a comment.

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Apr 13, 2021
@zier-one
Copy link
Contributor

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: be64536

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Apr 13, 2021
@ti-chi-bot ti-chi-bot merged commit 000eb0f into pingcap:master Apr 13, 2021
ti-srebot pushed a commit to ti-srebot/ticdc that referenced this pull request Apr 13, 2021
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Copy link
Contributor

cherry pick to release-4.0 in PR #1647

ti-srebot pushed a commit to ti-srebot/ticdc that referenced this pull request Apr 13, 2021
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Copy link
Contributor

cherry pick to release-5.0 in PR #1648

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/kv-client TiKV kv log client component. needs-cherry-pick-release-4.0 Should cherry pick this PR to release-4.0 branch. needs-cherry-pick-release-5.0 Should cherry pick this PR to release-5.0 branch. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. status/ptal Could you please take a look? type/bugfix This PR fixes a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants