raftstore/*: gc stale peer out of region #1003
Conversation
@@ -158,6 +158,8 @@ pub struct Peer {
    // if we remove ourself in ChangePeer remove, we should set this flag, then
    // any following committed logs in same Ready should be applied failed.
    pending_remove: bool,
    pub leader_missing: bool,
You can use an Option instead of these two fields.
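A minimal sketch of that suggestion, assuming the second field in question is a timestamp recording when the leader went missing (the field name below is illustrative, not necessarily what the PR ends up with):

use std::time::Instant;

pub struct Peer {
    // ...other fields elided...
    pending_remove: bool,
    // A single Option replaces the pair of fields: None means "leader not
    // missing", Some(t) means "the leader has been missing since t".
    pub leader_missing_time: Option<Instant>,
}

The later revisions of the PR do converge on a leader_missing_time: Option<Instant> field, as the diff excerpts below show.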
if let Some(peer) = self.region_peers.get_mut(&region_id) {
    let duration = peer.since_leader_missing();
    if duration >= self.cfg.max_leader_missing_duration {
        info!("leader for peer {} missing for a long time, check with pd whether \
log tag first.
Do we need to reset the leader missing state?
If the peer can't connect to PD or the PD worker doesn't handle it, we may send this message in every Ready.
Any better idea?
There is another case where the corresponding region info is not available in PD for some time, so the PD worker should retry later (hence this message being resent). And raftstore doesn't know how the worker is progressing.
I think we don't need to care about PD problems here.
IMO, if we find a peer is stale but can't get the result from PD, we should check the peer again after the checking interval.
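One way to read that suggestion, sketched with the fields from the diffs in this PR (the reset-on-report behaviour is an assumption, not something confirmed here):

if duration >= self.cfg.max_leader_missing_duration {
    // Restart the timer when the stale-peer check is handed off, so the same
    // peer is re-checked only after another full interval if PD gives no
    // answer in the meantime.
    peer.leader_missing_time = Some(Instant::now());
    // ...send the validation task to the PD worker here...
}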
Force-pushed from d635d70 to 584c172
if let Some(peer) = self.region_peers.get_mut(&region_id) {
    let duration = peer.since_leader_missing();
    if duration >= self.cfg.max_leader_missing_duration {
        info!("peer {} leader missing for a long time, check with pd whether \
tag should be at the start of the log.
Address comment.
@BusyJay @siddontang
@@ -444,13 +447,36 @@ impl Peer {
        ready.hs.take();
    }

    if let Some(ref soft_state) = ready.ss {
When we receive a message, we will create a replicated peer, and if this peer doesn't receive any message later, the peer can be considered stale too.
But this peer can't be ready, so I think we should check this in the tick function.
Address comment.
@siddontang @BusyJay @ngaut PTAL
// for peer B in case 2 above
// directly destroy peer without data since it doesn't have region range,
// so that it doesn't have the correct region start_key to validate peer with PD
region_to_be_destroyed.push((region_id, peer.peer.clone()))
do we have any test to cover this?
Maybe there's no need to push region_id, because the peer already has it.
A test case is added for the gc of the uninitialized peer.
The region_id is cached for a direct call to destroy_peer() without a reference to the peer. I think we should keep it this way.
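For illustration, a rough sketch of how the collected pairs might be consumed after the tick loop; the destroy_peer signature shown is an assumption, only the (region_id, peer) tuple comes from the diff above:

// Once the iteration over self.region_peers has finished, the mutable borrow
// is released, so each collected peer can be destroyed directly by its cached
// region_id without looking the peer up again.
for (region_id, peer) in region_to_be_destroyed {
    self.destroy_peer(region_id, peer);
}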
@BusyJay @siddontang PTAL
      local_region.get_id(),
      peer.get_id(),
      local_region.get_start_key());
warn!("xxx local region: {:?}", local_region);
remove unnecessary logs.
Address comment.
LGTM
// before it could successfully receive snapshot from the leader and
// apply that snapshot, no raft ready event will be triggered,
// so that we could not detect the leader is missing for it at here.
if self.is_initialized() && self.leader_missing_time.is_some() {
It's unnecessary to check if self.leader_missing_time is some or not.
Address comment.
@@ -273,13 +273,65 @@ impl<T: Transport, C: PdClient> Store<T, C> {
    }

    fn on_raft_base_tick(&mut self, event_loop: &mut EventLoop<Self>) {
        let mut region_to_be_destroyed = vec![];
        for (&region_id, peer) in &mut self.region_peers {
            if !peer.get_store().is_applying_snap() {
                peer.raft_group.tick();
                self.pending_raft_groups.insert(region_id);
Should not insert into pending_raft_groups if it's about to be removed.
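One possible shape of the reordered tick loop (a sketch only; is_stale stands in for whatever condition the PR uses to mark a peer for destruction):

for (&region_id, peer) in &mut self.region_peers {
    if peer.get_store().is_applying_snap() {
        continue;
    }
    // Peers that are about to be destroyed are collected and skipped here,
    // so they are never inserted into pending_raft_groups.
    if is_stale(peer) {
        region_to_be_destroyed.push((region_id, peer.peer.clone()));
        continue;
    }
    peer.raft_group.tick();
    self.pending_raft_groups.insert(region_id);
}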
Address comment.
Force-pushed from 541c0d1 to cc697a1
// in the `leader missing` state. That is because if it's isolated from the leader
// before it could successfully receive snapshot from the leader and
// apply that snapshot, no raft ready event will be triggered,
// so that we could not detect the leader is missing for it at here.
If the uninitialised peer is isolated rather than removed, it may be destroyed before the network recovers. And a further heartbeat can't recreate the peer again because of the tombstone, so the region will lose a peer forever. If this happens while the second replica is being created, the region will never work.
Raft works at a time granularity of milliseconds. PD's distribution and rebalancing work at a granularity of minutes. And the peer gc triggered by leader missing works at a granularity of hours (or days). They complement each other rather than clash.
If a peer could not be initialized after hours (or days) but still counts toward quorum, there must be some fatal error in the system. Destroying such a peer doesn't matter that much.
I think it's the user's choice to configure max_leader_missing_duration. It doesn't have to be hours or days.
I don't see why the configuration is a problem. The default value of max_leader_missing_duration should be a guide to the user; without knowing enough about how to configure these parameters, a user should not touch the default value.
If the user wants to (or accidentally does) make a mess of the system, there are a million ways to do it, e.g. configuring the election timeout to the minimum server RTT, so that no leader gets elected for any raft group and the system fails.
The configuration is not the problem; the problem is that destroying an uninitialised peer can cause trouble. The configuration I mentioned above is to show that the complementation should not rely on assumptions about the configuration. If the configuration is valid (passes the validation check), then the group should work in some way (with good or bad performance).
Besides, even if the region doesn't work for hours, it won't cause any problem at all if the data in it is not frequently visited.
Maybe we can use the region_id or the peer_id to ask PD whether the uninitialised peer is still in the region before destroying it.
yes, the tombstone key will be destroyed too, see #978
I think we can even destroy the tombstone key when destroying a stale peer that is too old.
Maybe we can use the region_id or the peer_id to ask PD whether the uninitialised peer is still in the region before destroying it.
@BusyJay Currently PD only provides the interface to get region info by key.
@siddontang has put a comment above:
For an uninitialized peer, the region range is empty, so if you use the start key to search in PD, you will get the first region.
Rechecked the code: when an uninitialised peer is destroyed, the epoch won't be written, so the tombstone won't stop this peer from being recreated when it's just isolated. So it seems fine to destroy the peer here.
Force-pushed from cc697a1 to 5e5c239
    if self.leader_missing_time.is_none() {
        self.leader_missing_time = Some(Instant::now())
    }
} else if self.is_initialized() {
Even if the peer is uninitialized, if it receives a raft message from the leader, I think we should clear leader_missing_time too.
As the comment mentioned, if leader_missing_time is set to None even for an uninitialized peer, then when the peer is isolated before it is initialized, no ready event would be triggered, and this peer could not be gc-ed anymore.
It's OK to consider a peer in this situation as leader missing, since the leader missing timeout is triggered only after a long time. For uninitialized peers, the leader is "missing" because it fails to do the replication job within a specified long time, rather than because it fails to send heartbeats.
why not check this in raft tick?
Don't know what you mean. The leader missing time is checked in raft tick for every peer.
So why do we check here too?
In tick, we can check the leader too: if it has a leader, clear leader_missing_time; if there is no leader, set it.
The problem is that, for uninitialized peers:
- A ready event notifies that a leader has emerged, so the peer "has leader".
- This peer gets isolated, and no ready event will be triggered. This peer still keeps its state as "has leader".
Then we don't know when this peer could be gc-ed, even if we check it in raft tick.
This peer gets isolated, and no ready event will be triggered. This peer still keeps its state as "has leader".
In raft, if the follower doesn't receive any message from the leader within ElectionTimeout, it will begin to campaign and clear the leader.
In raft, if the follower doesn't receive any message from the leader within ElectionTimeout, it will begin to campaign and clear the leader.
That's true for initialized peers, which already have the info of all nodes in the cluster.
So we can use is_initialized() && has_leader to clear the leader_missing_time.
For an uninitialized peer, if it doesn't receive any message from the leader for a long time, we can consider it stale.
I suggest checking this in tick too.
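A small self-contained sketch of that rule, assuming leader_missing_time is an Option<Instant> as in the diffs above; the has_leader flag and the method shape are assumptions (the diff itself exposes since_leader_missing() without arguments):

use std::time::{Duration, Instant};

struct Peer {
    initialized: bool,
    leader_missing_time: Option<Instant>,
}

impl Peer {
    fn is_initialized(&self) -> bool {
        self.initialized
    }

    // Called from the raft tick. The timer is cleared only when the peer is
    // initialized and currently sees a leader; otherwise it is started (or
    // kept), so an uninitialized peer that never hears from the leader will
    // eventually exceed max_leader_missing_duration and become a gc candidate.
    fn check_leader_missing(&mut self, has_leader: bool) -> Duration {
        if self.is_initialized() && has_leader {
            self.leader_missing_time = None;
            return Duration::from_secs(0);
        }
        if self.leader_missing_time.is_none() {
            self.leader_missing_time = Some(Instant::now());
        }
        self.leader_missing_time.unwrap().elapsed()
    }
}

The tick loop can then compare the returned duration against max_leader_missing_duration, as in the earlier excerpts.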
Address comment.
LGTM
Please take a look again. @siddontang @BusyJay @ngaut
for (&region_id, peer) in &mut self.region_peers {
    if !peer.get_store().is_applying() {
        peer.raft_group.tick();
        self.pending_raft_groups.insert(region_id);

        // If this peer detects the leader is missing for a long long time,
@queenypingcap
Please review the comments.
LGTM
@@ -36,6 +36,9 @@ const DEFAULT_MGR_GC_TICK_INTERVAL_MS: u64 = 60000;
const DEFAULT_SNAP_GC_TIMEOUT_SECS: u64 = 60 * 10;
const DEFAULT_MESSAGES_PER_TICK: usize = 256;
const DEFAULT_MAX_PEER_DOWN_SECS: u64 = 300;
// If the leader missing time exceeds 2 hours,
+// If the leader is missing for over 2 hours,
Address comment.
Signed-off-by: Ping Yu <yuping@pingcap.com>
Fix issue #804