TiCDC cluster suffers a round robin owner election during rolling update #3529

Closed
Tracked by #4757
amyangfei opened this issue Nov 18, 2021 · 3 comments
Assignees
Labels
area/ticdc Issues or PRs related to TiCDC. severity/moderate subject/new-feature Denotes an issue or pull request adding a new feature. type/bug The issue is confirmed as a bug.

Comments

@amyangfei
Contributor

amyangfei commented Nov 18, 2021

What did you do?

  1. Create a TiCDC cluster with multiple nodes, such as 7 nodes.
  2. Perform a rolling update of the TiCDC cluster.

What did you expect to see?

Replication continues normally while the TiCDC cluster is being rolling updated.

What did you see instead?

Suppose the owner node is restarted first. The owner role is then elected to each following TiCDC node in turn, because etcd's election simply picks the campaign key with the smallest revision as the winner, and each newly elected owner is itself restarted shortly afterwards by the rolling update.
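To illustrate the election behavior above, here is a minimal sketch of the etcd campaign primitive that this kind of owner election is built on. This is not TiCDC's actual code; the endpoints, key prefix, and capture ID are placeholders.

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The session keeps a lease alive; when the process exits (e.g. during a
	// rolling restart) the lease expires and its election key is removed.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	// Every candidate creates a key under the election prefix; the candidate
	// whose key has the smallest create revision wins. Campaign blocks until
	// this candidate is that winner.
	elec := concurrency.NewElection(sess, "/ticdc/demo-owner")
	if err := elec.Campaign(context.Background(), "capture-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("elected as owner")
}
```

Because the winner is simply the oldest surviving campaign key, restarting the current owner hands the role to the next node in line, which is exactly the round-robin pattern seen during a rolling update.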

The initialization phase of a TiCDC owner can take a long time. It involves many procedures, including initializing every existing changefeed; initializing a changefeed creates its downstream sink, and for a Kafka sink, for example, the creation and verification work is heavy.

As a result, a lot of time is wasted on owner initialization in each TiCDC node. Worse, if no owner finishes initialization before it is restarted, the replication checkpoint can stall during the rolling update, and the longer the rolling update takes, the larger the replication lag may grow.

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

v5.3.0

TiCDC version (execute cdc version):

master@pingcap/ticdc@fe92b89

Brainstorming

  • When performing the rolling update, update the TiCDC owner last, so the owner is transferred only once and keeps working normally until the last node is restarted. (This may be simple to implement by changing the deploy tools, but it may not be reasonable: why should a deploy tool care about which node is the TiCDC owner? A sketch of this idea follows the list.)
  • Run dedicated owner nodes, as DM-master does in DM, and update the owner nodes first, then the processor nodes. (This changes the existing architecture of TiCDC.)
  • Add an API that notifies the TiCDC cluster that it is entering rolling update mode, in which special logic avoids excessive replication lag. (But this is too tricky.)
  • Introduce a more intelligent leader election strategy? (Then we have to deal with consensus, which is complex.)
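As a rough illustration of the first option, here is a minimal sketch of how a deploy tool could order restarts so the owner goes last. The Capture struct, the addresses, and the rollingUpdateOrder helper are hypothetical (e.g. the owner flag could be filled from the capture list the tool already has); this is not the actual TiUP implementation.

```go
package main

import "fmt"

// Capture is a hypothetical view of one TiCDC node as seen by a deploy tool.
type Capture struct {
	Address string
	IsOwner bool
}

// rollingUpdateOrder moves the current owner(s) to the end of the restart
// order, so the owner role is transferred at most once during the update.
func rollingUpdateOrder(captures []Capture) []Capture {
	ordered := make([]Capture, 0, len(captures))
	var owners []Capture
	for _, c := range captures {
		if c.IsOwner {
			owners = append(owners, c)
			continue
		}
		ordered = append(ordered, c)
	}
	return append(ordered, owners...)
}

func main() {
	captures := []Capture{
		{Address: "10.0.0.1:8300", IsOwner: false},
		{Address: "10.0.0.2:8300", IsOwner: true},
		{Address: "10.0.0.3:8300", IsOwner: false},
	}
	for _, c := range rollingUpdateOrder(captures) {
		fmt.Println("restart", c.Address)
	}
}
```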
@amyangfei amyangfei added type/bug The issue is confirmed as a bug. area/ticdc Issues or PRs related to TiCDC. labels Nov 18, 2021
@amyangfei
Contributor Author

After discussion, we decided to make this a feature, for two reasons:

  • The current behavior doesn't break the TiCDC boundary (in extreme scenarios, the checkpoint lag could reach 10 minutes).
  • In the short term, we prefer to update the rolling update strategy in TiUP or TiDB Operator to restart the TiCDC owner last.

@amyangfei amyangfei added the subject/new-feature Denotes an issue or pull request adding a new feature. label Nov 21, 2021
@overvenus
Member

Since this issue doesn't break the TiCDC boundary, changing to severity/moderate.

@3AceShowHand
Contributor

pingcap/tiup#1972

This is solved by the PR above, which supports an upgrade-the-owner-last strategy.
