ebs br: restore could hang if some tikv nodes are killed or restarted #45206

BornChanger · 2023-07-06T06:32:40Z

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Kill one or more TiKV nodes during the EBS BR restore phase.

2. What did you expect to see? (Required)

The EBS BR restore continues and succeeds.

3. What did you see instead (Required)

The EBS BR restore hangs.

4. What is your TiDB version? (Required)

TiDB 6.5 and above

BornChanger · 2023-07-06T06:32:55Z

/assign @YuJuncen

YuJuncen · 2023-07-11T07:53:12Z

This happens because, while we are in recovery mode, all elections are suspended until BR chooses the leaders. The problem arises when a store goes down AFTER BR has chosen the leaders: once it reboots, its leaders are dropped. However, we are still in recovery mode, so no new leaders can be elected.

YuJuncen · 2023-07-11T08:07:51Z

A possible solution is to extend the recovery mode so that it has 3 stages:

on: the initial stage, which stops Raft elections and optimizes for flashback.
for_flashback: once BR has finished the wait_apply RPC, it issues an RPC to PD that updates the recovery mode state to for_flashback, without rebooting the TiKVs. That means the config of most TiKVs will still be in the on stage, while rebooted stores can now start elections.
off or unset: the default, which uses the unchanged config.
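To make the three stages concrete, here is a minimal Go sketch of the idea. It is only a toy model under stated assumptions: the names `RecoveryStage`, `Store`, `Reboot` and `CanCampaign` are hypothetical and do not come from TiKV, PD or BR. It assumes a store only picks up the stage from PD when it boots, which is how I read "not reboot TiKVs" above.

```go
package main

import "fmt"

// RecoveryStage is a hypothetical enum for the three proposed stages.
type RecoveryStage int

const (
	StageOff          RecoveryStage = iota // default: unchanged config
	StageOn                                // elections suspended, tuned for flashback
	StageForFlashback                      // wait_apply finished; rebooted stores may campaign
)

// String returns the stage name used in the proposal.
func (s RecoveryStage) String() string {
	switch s {
	case StageOn:
		return "on"
	case StageForFlashback:
		return "for_flashback"
	default:
		return "off"
	}
}

// CanCampaign reports whether a store that booted with the given stage
// may start a Raft election. Only the "on" stage suspends elections.
func CanCampaign(loaded RecoveryStage) bool {
	return loaded != StageOn
}

// Store models a TiKV node that reads the stage from PD only at boot
// (an assumption of this sketch).
type Store struct {
	ID     uint64
	Loaded RecoveryStage
}

// Reboot simulates a restart: the store reloads the stage currently held by PD.
func (s *Store) Reboot(pdStage RecoveryStage) {
	s.Loaded = pdStage
}

func main() {
	pdStage := StageOn
	stores := []*Store{
		{ID: 1, Loaded: pdStage},
		{ID: 2, Loaded: pdStage},
		{ID: 3, Loaded: pdStage},
	}

	// BR finishes wait_apply and advances the stage in PD without rebooting anything.
	pdStage = StageForFlashback

	// Store 2 crashes and comes back, so it alone sees the new stage.
	stores[1].Reboot(pdStage)

	for _, s := range stores {
		fmt.Printf("store %d: stage=%s canCampaign=%t\n", s.ID, s.Loaded, CanCampaign(s.Loaded))
	}
}
```

In this model only the rebooted store is allowed to campaign, which matches the intent of for_flashback: stores that never went down keep the on behaviour.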

Once BR detects a TiKV outage (perhaps by creating a no-op TCP connection to the gRPC port of each TiKV), BR will:

If the current recovery mode state is on, retry the whole procedure. (This might be implemented by exiting and letting the operator restart it.)
If the current recovery mode state is for_flashback, retry from flashback. The operator should reboot all stores so that every store is able to elect new leaders. (If not all stores are rebooted, the remaining stores will reject votes because they believe the old leader's lease hasn't expired. Perhaps we also need to resume balance-leader-scheduler at this stage.)
If the current recovery mode state is off, do nothing and exit. (We have already succeeded!)
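The outage check and the per-stage reaction could look roughly like the Go sketch below. The names `StoreReachable`, `NextAction` and the `Action` values are hypothetical, not BR APIs. The probe uses only the standard library's `net.DialTimeout`, and a local listener stands in for a TiKV gRPC port.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// Action is a hypothetical label for what BR does after detecting an outage.
type Action string

const (
	ActionRetryAll       Action = "retry the whole procedure"
	ActionRetryFlashback Action = "retry from flashback"
	ActionExit           Action = "do nothing and exit"
)

// StoreReachable performs a no-op TCP probe: it opens a connection to addr
// and closes it immediately. A dial error is treated as an outage.
func StoreReachable(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	_ = conn.Close()
	return true
}

// NextAction maps the current recovery mode state to the proposed reaction.
// Any value other than "on" or "for_flashback" is treated as off or unset.
func NextAction(stage string) Action {
	switch stage {
	case "on":
		return ActionRetryAll
	case "for_flashback":
		return ActionRetryFlashback
	default:
		return ActionExit
	}
}

func main() {
	// A local listener stands in for the gRPC port of one TiKV store.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	addr := ln.Addr().String()

	fmt.Println("reachable while up:", StoreReachable(addr, time.Second))

	// Simulate the store being killed.
	_ = ln.Close()
	fmt.Println("reachable after kill:", StoreReachable(addr, time.Second))

	for _, stage := range []string{"on", "for_flashback", "off"} {
		fmt.Printf("%s -> %s\n", stage, NextAction(stage))
	}
}
```

A plain TCP dial only shows that the port accepts connections. It cannot tell a restarted store from one that never went down, so a real check would likely need more than this.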

YuJuncen · 2023-07-11T08:08:08Z

cc @hicqu, do you have any good ideas?

BornChanger added the type/bug label Jul 6, 2023

ti-chi-bot bot assigned YuJuncen Jul 6, 2023

BornChanger mentioned this issue Jul 6, 2023

ebs br: restore hangs when some tikv nodes restarted pingcap/tidb-operator#5151

Closed

jebter added severity/critical and component/br labels Jul 7, 2023

ti-chi-bot bot added may-affects-5.2, may-affects-5.3, may-affects-5.4, may-affects-6.1, may-affects-6.5 and may-affects-7.1 labels Jul 7, 2023

jebter added affects-6.5 and affects-7.1 and removed may-affects-5.2, may-affects-5.3, may-affects-5.4, may-affects-6.1, may-affects-6.5 and may-affects-7.1 labels Jul 7, 2023

YuJuncen mentioned this issue Jul 14, 2023

snap_restore: resend recover_region while there are TiKV restarts #45361

Merged

ti-chi-bot bot closed this as completed in #45361 Aug 1, 2023

ti-chi-bot bot pushed a commit that referenced this issue Aug 1, 2023

snap_restore: resend recover_region while there are TiKV restarts (#…

3f9f825

…45361) close #45206

This was referenced Aug 1, 2023

snap_restore: resend recover_region while there are TiKV restarts (#45361) #45721

Merged

snap_restore: resend recover_region while there are TiKV restarts (#45361) #45722

Merged

ti-chi-bot bot pushed a commit that referenced this issue Aug 2, 2023

snap_restore: resend recover_region while there are TiKV restarts (#…

503ab42

…45361) (#45721) close #45206

ti-chi-bot bot pushed a commit that referenced this issue Aug 11, 2023

snap_restore: resend recover_region while there are TiKV restarts (#…

0062953

…45361) (#45722) close #45206
