Skip to content

Latest commit

 

History

History
199 lines (102 loc) · 4.51 KB

ticdc-alert-rules.md

File metadata and controls

199 lines (102 loc) · 4.51 KB
title summary
TiCDC Alert Rules
Learn about TiCDC alert rules and how to handle the alerts.

TiCDC Alert Rules

This document describes the TiCDC alert rules and the corresponding solutions. In descending order, the severity levels are: Critical, Warning.

Critical alerts

This section introduces critical alerts and solutions.

cdc_checkpoint_high_delay

For critical alerts, you need to pay close attention to abnormal monitoring metrics.

cdc_resolvedts_high_delay

ticdc_changefeed_failed

  • Alert rule:

    (max_over_time(ticdc_owner_status[1m]) == 2) > 0

  • Description:

    A replication task encounters an unrecoverable error and enters the failed state.

  • Solution:

    This alert is similar to replication interruption. See TiCDC Handles Replication Interruption.

ticdc_processor_exit_with_error_count

Warning alerts

Warning alerts are a reminder for an issue or error.

cdc_multiple_owners

  • Alert rule:

    sum(rate(ticdc_owner_ownership_counter[30s])) >= 2

  • Description:

    There are multiple owners in the TiCDC cluster.

  • Solution:

    Collect TiCDC logs to locate the root cause.

cdc_sink_flush_duration_time_more_than_10s

  • Alert rule:

    histogram_quantile(0.9, rate(ticdc_sink_txn_worker_flush_duration[1m])) > 10

  • Description:

    It takes a replication task more than 10 seconds to write data to the downstream database.

  • Solution:

    Check whether there are problems in the downstream database.

cdc_processor_checkpoint_tso_no_change_for_1m

ticdc_puller_entry_sorter_sort_bucket

  • Alert rule:

    histogram_quantile(0.9, rate(ticdc_puller_entry_sorter_sort_bucket{}[1m])) > 1

  • Description:

    The delay of TiCDC puller entry sorter is too high.

  • Solution:

    Collect TiCDC logs to locate the root cause.

ticdc_puller_entry_sorter_merge_bucket

  • Alert rule:

    histogram_quantile(0.9, rate(ticdc_puller_entry_sorter_merge_bucket{}[1m])) > 1

  • Description:

    The delay of TiCDC puller entry sorter merge is too high.

  • Solution:

    Collect TiCDC logs to locate the root cause.

tikv_cdc_min_resolved_ts_no_change_for_1m

  • Alert rule:

    changes(tikv_cdc_min_resolved_ts[1m]) < 1 and ON (instance) tikv_cdc_region_resolve_status{status="resolved"} > 0 and ON (instance) tikv_cdc_captured_region_total > 0

  • Description:

    The minimum Resolved TS 1 of TiKV CDC has not advanced for 1 minute.

  • Solution:

    Collect TiKV logs to locate the root cause.

tikv_cdc_scan_duration_seconds_more_than_10min

  • Alert rule:

    histogram_quantile(0.9, rate(tikv_cdc_scan_duration_seconds_bucket{}[1m])) > 600

  • Description:

    The TiKV CDC module has scanned for incremental replication for more than 10 minutes.

  • Solution:

    Collect TiCDC monitoring metrics and TiKV logs to locate the root cause.

ticdc_sink_mysql_execution_error

  • Alert rule:

    changes(ticdc_sink_mysql_execution_error[1m]) > 0

  • Description:

    An error occurs when a replication task writes data to the downstream MySQL.

  • Solution:

    There are many possible root causes. See Troubleshoot TiCDC.

ticdc_memory_abnormal

  • Alert rule:

    go_memstats_heap_alloc_bytes{job="ticdc"} > 1e+10

  • Description:

    The TiCDC heap memory usage exceeds 10 GiB.

  • Solution:

    Collect TiCDC logs to locate the root cause.