ticdc: update cdc gc service doc and failed changefeed faq (pingcap#1…
hfxsd authored Feb 23, 2024
1 parent 7e7b22d commit 936e9ab
Showing 2 changed files with 11 additions and 4 deletions.
5 changes: 3 additions & 2 deletions ticdc/ticdc-changefeed-overview.md
@@ -15,15 +15,16 @@ The state of a replication task represents the running status of the replication

The states in the preceding state transfer diagram are described as follows:

- `Normal`: The replication task runs normally and the checkpoint-ts proceeds normally.
- `Normal`: The replication task runs normally and the checkpoint-ts proceeds normally. A changefeed in this state blocks GC operations from advancing.
- `Stopped`: The replication task is stopped because the user has manually paused the changefeed. The changefeed in this state blocks GC operations.
- `Warning`: The replication task returns an error, and replication cannot continue due to some recoverable errors. The changefeed in this state keeps trying to resume until the state transfers to `Normal`. The maximum retry duration is 30 minutes. If the retries exceed this duration, the changefeed enters the `failed` state. The changefeed in this state blocks GC operations.
- `Finished`: The replication task is finished and has reached the preset `TargetTs`. The changefeed in this state does not block GC operations.
- `Failed`: The replication task fails. The changefeed in this state does not keep trying to resume. To give you enough time to handle the failure, the changefeed in this state blocks GC operations. The duration of the blockage is specified by the `gc-ttl` parameter, with a default value of 24 hours. If the underlying issue is resolved within this duration, you can manually resume the changefeed (see the sketch after the following note). Otherwise, if the changefeed remains in this state beyond the `gc-ttl` duration, the replication task cannot be resumed or recovered.

> **Note:**
>
> If the changefeed encounters errors with error codes `ErrGCTTLExceeded`, `ErrSnapshotLostByGC`, or `ErrStartTsBeforeGC`, it does not block GC operations.
> - If GC is blocked by a changefeed, the changefeed blocks GC advancement for at most the duration specified by `gc-ttl`. After that, the changefeed is set to the `failed` state with the `ErrGCTTLExceeded` error and no longer blocks GC advancement.
> - If the changefeed encounters errors with error codes `ErrGCTTLExceeded`, `ErrSnapshotLostByGC`, or `ErrStartTsBeforeGC`, it does not block GC operations.
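
For example, here is a minimal sketch of how you might check and then resume a changefeed that is blocking GC, assuming a TiCDC server at `http://127.0.0.1:8300` and a changefeed ID of `simple-replication-task` (both placeholders; the exact flags can vary between TiCDC versions):

```shell
# Check the current state and error information of the changefeed.
cdc cli changefeed query --server=http://127.0.0.1:8300 --changefeed-id=simple-replication-task

# After the underlying issue is fixed and before gc-ttl expires,
# resume the changefeed so that it no longer blocks GC.
cdc cli changefeed resume --server=http://127.0.0.1:8300 --changefeed-id=simple-replication-task
```
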
The numbers in the preceding state transfer diagram are described as follows.

10 changes: 8 additions & 2 deletions ticdc/ticdc-faq.md
@@ -64,7 +64,7 @@ When the replication task is unavailable or interrupted, this feature ensures th
When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint by configuring `gc-ttl`. You can also [use TiUP to modify](/ticdc/deploy-ticdc.md#modify-ticdc-cluster-configurations-using-tiup) `gc-ttl`. The default value is 24 hours. In TiCDC, this value means:

- The maximum time the GC safepoint is retained at the PD after the TiCDC service is stopped.
- The maximum time a replication task can be suspended after the task is interrupted or manually stopped. If the time for a suspended replication task is longer than the value set by `gc-ttl`, the replication task enters the `failed` status, cannot be resumed, and cannot continue to affect the progress of the GC safepoint.
- When TiKV's GC is blocked by TiCDC's GC safepoint, `gc-ttl` indicates the maximum replication delay of a TiCDC replication task. If the delay of a replication task exceeds the value set by `gc-ttl`, the replication task enters the `failed` state, reports the `ErrGCTTLExceeded` error, cannot be recovered, and no longer blocks the GC safepoint from advancing.

The second behavior above is introduced in TiCDC v4.0.13 and later versions. It prevents a replication task from being suspended for so long that the GC safepoint of the upstream TiKV cluster is held back for an extended period and too many outdated data versions are retained, which would affect the performance of the upstream cluster.
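
As an illustration, here is a minimal sketch of setting `gc-ttl` when starting a TiCDC server manually; the addresses are placeholders, and in a TiUP-managed cluster you would modify this setting through TiUP instead:

```shell
# Start a TiCDC server with gc-ttl set to 48 hours (the value is in seconds).
# --pd points to the PD endpoints of the upstream cluster; the addresses are placeholders.
cdc server --pd=http://10.0.10.25:2379 --addr=0.0.0.0:8300 --gc-ttl=172800
```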

@@ -78,7 +78,13 @@ If a replication task starts after the TiCDC service starts, the TiCDC owner upd

If the replication task is suspended longer than the time specified by `gc-ttl`, the replication task enters the `failed` status and cannot be resumed. The corresponding service GC safepoint in PD will continue to advance.

The Time-To-Live (TTL) that TiCDC sets for a service GC safepoint is 24 hours, which means that the GC mechanism does not delete any data if the TiCDC service can be recovered within 24 hours after it is interrupted.
The default Time-To-Live (TTL) that TiCDC sets for a service GC safepoint is 24 hours, which means that the GC mechanism does not delete the data required by TiCDC for continuing replication if the TiCDC service can be recovered within 24 hours after it is interrupted.
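
If you want to confirm whether TiCDC currently holds a service GC safepoint, one way (a sketch, assuming `pd-ctl` is run through TiUP, PD listens at `http://127.0.0.1:2379`, and `v7.5.0` is replaced with your cluster version) is to list the service GC safepoints registered in PD:

```shell
# List all service GC safepoints registered in PD; the entry whose
# service_id starts with "ticdc" (if present) is the safepoint held by TiCDC.
tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 service-gc-safepoint
```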

## How to recover a replication task after it fails?

1. Use `cdc cli changefeed query` to query the error information of the replication task and fix the error as soon as possible.
2. Increase the value of `gc-ttl` to allow more time to fix the error, so that the replication task does not enter the `failed` status because the replication delay exceeds `gc-ttl` after the error is fixed.
3. After evaluating the impact on the system, increase the value of [`tidb_gc_life_time`](/system-variables.md#tidb_gc_life_time-new-in-v50) in TiDB to block GC and retain data, so that the replication task does not enter the `failed` status because GC cleans data after the error is fixed. A sketch of steps 1 and 3 follows this list.
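
The following is a minimal sketch of steps 1 and 3, assuming a TiCDC server at `http://127.0.0.1:8300`, an upstream TiDB endpoint at `127.0.0.1:4000`, and a changefeed ID of `simple-replication-task` (all placeholders; the exact flags can differ between TiCDC versions):

```shell
# Step 1: query the changefeed to read its error information.
cdc cli changefeed query --server=http://127.0.0.1:8300 --changefeed-id=simple-replication-task

# Step 3: temporarily enlarge tidb_gc_life_time in the upstream TiDB
# so that GC keeps the data TiCDC still needs while you fix the error.
mysql -h 127.0.0.1 -P 4000 -u root -e "SET GLOBAL tidb_gc_life_time = '48h';"

# After the error is fixed, resume the changefeed, and then restore
# tidb_gc_life_time to its previous value (the default is 10m).
cdc cli changefeed resume --server=http://127.0.0.1:8300 --changefeed-id=simple-replication-task
mysql -h 127.0.0.1 -P 4000 -u root -e "SET GLOBAL tidb_gc_life_time = '10m';"
```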

## How to understand the relationship between the TiCDC time zone and the time zones of the upstream/downstream databases?

