TiCDC may incorrectly cancel gRPC streams to TiKV, causing the resolvedTs to become stuck #10239
It's a duplicate of #10136.
$ grep deregister tikv-server-2024-01-17T23-03-35.694.log | grep conn | wc -l
9289

In a recent incident investigation, we found that the latency of CDC kept increasing to a level of ten minutes after a TiKV upgrade. The grep result above shows that in v6.5.6 TiCDC would still cancel gRPC connections incorrectly (the connection was cancelled 9289 times in 30 seconds, which is unexpected).
I successfully reproduced the user's phenomenon locally by using the modified kvClient in asddongmen@c78377f. I will file a pull request to fix this issue ASAP.
The root cause is described as follows:
1. After restarting TiKV, an error at line:699 leads to canceling and deleting the corresponding gRPC stream by store.addr at line:728 (delete-1).
2. The g-2 goroutine notices the cancellation and exits. Then, at line:672, the stream is deleted from the cache by store.addr again (delete-2).
3. If a new region for this store arrives from regionRouter between delete-1 and delete-2, a new stream to the store is created and cached as described above at [1]. The stream deleted from the cache at delete-2 is then this new stream.
4. This can lead to a continuous loop of stream creation and deletion, which only stops once g-2 manages to exit before a new stream is created.
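To make the interleaving easier to see, here is a minimal Go sketch, assuming a cache keyed only by the store address; streamCache, deleteByAddr, and createAndCache are hypothetical names, and this is not the actual kvClient code:

```go
// A minimal sketch of the race, assuming a stream cache keyed only by the
// store address. streamCache, deleteByAddr, and createAndCache are
// hypothetical names, not the real kvClient code.
package main

import (
	"fmt"
	"sync"
)

type stream struct{ id int }

type streamCache struct {
	mu      sync.Mutex
	streams map[string]*stream // keyed by store.addr
}

// deleteByAddr removes whatever stream is currently cached for addr,
// mirroring delete-1 (line:728) and delete-2 (line:672): both delete purely
// by address, without checking which stream instance they are removing.
func (c *streamCache) deleteByAddr(addr string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.streams, addr)
}

// createAndCache caches a new stream for addr, as happens when a new region
// for the store arrives from regionRouter.
func (c *streamCache) createAndCache(addr string, s *stream) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.streams[addr] = s
}

func main() {
	c := &streamCache{streams: map[string]*stream{"tikv-1:20160": {id: 1}}}

	c.deleteByAddr("tikv-1:20160")                   // delete-1: old stream torn down after the error
	c.createAndCache("tikv-1:20160", &stream{id: 2}) // a new region arrives, a new stream is cached
	c.deleteByAddr("tikv-1:20160")                   // delete-2: removes the new stream by mistake

	fmt.Println(c.streams) // map[]: the healthy stream (id=2) is gone, forcing another create/delete cycle
}
```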
Thanks @asddongmen for the detailed explanation. I have a few questions:
I am a bit skeptical about this root cause, as it seems pretty rare that the exact sequence (delete-1, create a new stream, delete-2) can happen; moreover, in the incident, the impact lasted for 15 minutes. Do you mean the exact sequence occurred in a loop for 15 minutes, and for ~200 regions at the same time?
The main idea of the fix is to prevent unexpected cancellation and creation of streams to the same TiKV store within a short period of time. After applying the fix, for a given table, only one gRPC stream is closed and created when TiKV restarts.
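Extending the hypothetical streamCache sketch above, one way to express that idea (an illustration only, not the actual pull request) is to make the second deletion conditional, so that tearing down an old stream can never evict a newer stream cached in the meantime:

```go
// deleteIfSame only removes the cache entry when it still points to the
// stream being torn down (hypothetical helper, continuing the sketch above).
func (c *streamCache) deleteIfSame(addr string, s *stream) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cur, ok := c.streams[addr]; ok && cur == s {
		delete(c.streams, addr) // still the stream we own, safe to remove
		return true
	}
	return false // a newer stream was cached for this store, so leave it alone
}
```

With a compare-and-delete like this, delete-2 becomes a no-op whenever a newer stream has already replaced the one being cancelled, which breaks the create/delete loop.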
@asddongmen Thanks for the explanation. I saw a lot of such warn logs. Another question is whether it is safe to remove delete-2. After the stream is successfully connected, what happens if g-1 later receives some errors from the stream and doesn't do the cleanup? Which code path will clean up the stream?
My understanding of the sequence of events in your theory is:
Is that correct? If g3 exits before s4 is created, then the loop stops.
Yes. When a TiKV node is just restarted, there is a huge number of region leader transfers, and this leads to a huge number of regions coming from regionRouter that need to send requests. So it is easier to get stuck in such a loop in this situation.
delete-1 at line:728 will do the cleanup.
/found customer |
What did you do?
We have found there could be many "stream to store closed" errors.
We think it's because of this logic in cdc/kv/region_worker.go.
What did you expect to see?
No response
What did you see instead?
Pending regions should be considered correctly in checkShouldExit.
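As a hedged sketch of the behaviour being asked for (regionWorker, pendingRegions, and cancelStream below are hypothetical stand-ins, not the real region_worker.go types), the worker would only cancel the stream when there are neither active nor pending regions attached to it:

```go
package sketch

import "errors"

// Hypothetical error returned when the worker decides to exit.
var errRegionWorkerExit = errors.New("region worker exit")

// pendingRegions stands in for the set of regions that have been requested
// on the stream but are not yet initialized.
type pendingRegions struct{ count int }

func (p *pendingRegions) len() int { return p.count }

// regionWorker is a trimmed-down stand-in for the worker that owns a gRPC
// stream to one TiKV store.
type regionWorker struct {
	activeRegions int
	pending       *pendingRegions
	cancelStream  func()
}

// checkShouldExit only tears down the stream when there are neither active
// nor pending regions attached to it, so regions that are still being set up
// do not lose their stream.
func (w *regionWorker) checkShouldExit() error {
	if w.activeRegions == 0 && w.pending.len() == 0 {
		w.cancelStream()
		return errRegionWorkerExit
	}
	return nil
}
```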
Versions of the cluster
6.5.x, 7.1.x.
Fix versions