TiCDC may incorrectly cancel gRPC streams to TiKV, causing the resolvedTs to become stuck #10239
It's a duplicate of #10136.
$ grep deregister tikv-server-2024-01-17T23-03-35.694.log | grep conn | wc -l
9289

In a recent incident investigation, we found that the latency of CDC kept increasing to a level of ten minutes after a TiKV upgrade. The grep result above shows that in v6.5.6 TiCDC would still cancel gRPC connections incorrectly (the connection was cancelled 9289 times in 30 seconds, which is unexpected).
I successfully reproduced the user's phenomenon locally by using the modified kvClient in asddongmen@c78377f. I will file a pull request to fix this issue ASAP.
The root cause is described as follows:
1. After restarting TiKV, an error at line:699 leads to canceling and deleting the corresponding gRPC stream by store.addr at line:728 (delete-1).
2. The g-2 goroutine notices the cancellation and exits. Then, at line:672, the stream is deleted from the cache by store.addr again (delete-2).
3. If a new region for this store arrives from regionRouter between delete-1 and delete-2, a new stream to the store is created and cached as described above at [1]. The stream deleted from the cache at delete-2 is then this new stream.
4. This can lead to a continuous loop of stream creation and deletion, which only stops once g-2 manages to exit before a new stream is created.
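To make the interleaving easier to see, here is a minimal Go sketch, assuming a cache keyed only by the store address; streamCache, deleteByAddr, and createAndCache are hypothetical names, and this is not the actual kvClient code:

```go
// A minimal sketch of the race, assuming a stream cache keyed only by the
// store address. streamCache, deleteByAddr, and createAndCache are
// hypothetical names, not the real kvClient code.
package main

import (
	"fmt"
	"sync"
)

type stream struct{ id int }

type streamCache struct {
	mu      sync.Mutex
	streams map[string]*stream // keyed by store.addr
}

// deleteByAddr removes whatever stream is currently cached for addr,
// mirroring delete-1 (line:728) and delete-2 (line:672): both delete purely
// by address, without checking which stream instance they are removing.
func (c *streamCache) deleteByAddr(addr string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.streams, addr)
}

// createAndCache caches a new stream for addr, as happens when a new region
// for the store arrives from regionRouter.
func (c *streamCache) createAndCache(addr string, s *stream) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.streams[addr] = s
}

func main() {
	c := &streamCache{streams: map[string]*stream{"tikv-1:20160": {id: 1}}}

	c.deleteByAddr("tikv-1:20160")                   // delete-1: old stream torn down after the error
	c.createAndCache("tikv-1:20160", &stream{id: 2}) // a new region arrives, a new stream is cached
	c.deleteByAddr("tikv-1:20160")                   // delete-2: removes the new stream by mistake

	fmt.Println(c.streams) // map[]: the healthy stream (id=2) is gone, forcing another create/delete cycle
}
```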
Thanks @asddongmen for the detailed explanation. I have a few questions:
I am a bit skeptical about this root cause, as it seems pretty rare that the exact sequence (delete-1, create a new stream, delete-2) can happen; moreover, in the incident, the impact lasted for 15 minutes. Do you mean the exact sequence occurred in a loop for 15 minutes, and for ~200 regions at the same time?
The main idea of the fix is to prevent unexpected cancellation and creation of streams to the same TiKV store within a short period of time. After applying the fix, for a given table, only one gRPC stream is closed and created when TiKV restarts.
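Extending the hypothetical streamCache sketch above, one way to express that idea (an illustration only, not the actual pull request) is to make the second deletion conditional, so that tearing down an old stream can never evict a newer stream cached in the meantime:

```go
// deleteIfSame only removes the cache entry when it still points to the
// stream being torn down (hypothetical helper, continuing the sketch above).
func (c *streamCache) deleteIfSame(addr string, s *stream) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cur, ok := c.streams[addr]; ok && cur == s {
		delete(c.streams, addr) // still the stream we own, safe to remove
		return true
	}
	return false // a newer stream was cached for this store, so leave it alone
}
```

With a compare-and-delete like this, delete-2 becomes a no-op whenever a newer stream has already replaced the one being cancelled, which breaks the create/delete loop.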
@asddongmen Thanks for the explanation. I saw a lot of such warn logs. Another question is whether it is safe to remove delete-2. After the stream is successfully connected, what happens if g-1 later receives some errors from the stream and doesn't do the cleanup? Which code path will clean up the stream?
My understanding of the sequence of events in your theory is:
Is that correct? If g3 exits before s4 is created, then the loop stops.
Yes. When a TiKV node is just restarted, there is a huge number of region leader transfers, and this leads to a huge number of regions coming from regionRouter that need to send requests. So it is easier to get stuck in such a loop in this situation.
delete-1 at line:728 will do the cleanup.
/found customer |
What did you do?
We have found there could be many "stream to store closed" errors.
We think it's because of this logic in cdc/kv/region_worker.go.
What did you expect to see?
No response
What did you see instead?
Pending regions should be considered correctly in checkShouldExit.
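As a hedged sketch of the behaviour being asked for (regionWorker, pendingRegions, and cancelStream below are hypothetical stand-ins, not the real region_worker.go types), the worker would only cancel the stream when there are neither active nor pending regions attached to it:

```go
package sketch

import "errors"

// Hypothetical error returned when the worker decides to exit.
var errRegionWorkerExit = errors.New("region worker exit")

// pendingRegions stands in for the set of regions that have been requested
// on the stream but are not yet initialized.
type pendingRegions struct{ count int }

func (p *pendingRegions) len() int { return p.count }

// regionWorker is a trimmed-down stand-in for the worker that owns a gRPC
// stream to one TiKV store.
type regionWorker struct {
	activeRegions int
	pending       *pendingRegions
	cancelStream  func()
}

// checkShouldExit only tears down the stream when there are neither active
// nor pending regions attached to it, so regions that are still being set up
// do not lose their stream.
func (w *regionWorker) checkShouldExit() error {
	if w.activeRegions == 0 && w.pending.len() == 0 {
		w.cancelStream()
		return errRegionWorkerExit
	}
	return nil
}
```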
Versions of the cluster
6.5.x, 7.1.x.
Fix versions