raftstore/*: gc stale peer out of region #1003
Conversation
@@ -158,6 +158,8 @@ pub struct Peer {
    // if we remove ourself in ChangePeer remove, we should set this flag, then
    // any following committed logs in same Ready should be applied failed.
    pending_remove: bool,
    pub leader_missing: bool,
You can use an Option instead of these two fields.
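A minimal sketch of that suggestion, assuming the second field in question is a timestamp recording when the leader went missing (the field name below is illustrative, not necessarily what the PR ends up with):

use std::time::Instant;

pub struct Peer {
    // ...other fields elided...
    pending_remove: bool,
    // A single Option replaces the pair of fields: None means "leader not
    // missing", Some(t) means "the leader has been missing since t".
    pub leader_missing_time: Option<Instant>,
}

The later revisions of the PR do converge on a leader_missing_time: Option<Instant> field, as the diff excerpts below show.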
if let Some(peer) = self.region_peers.get_mut(&region_id) {
    let duration = peer.since_leader_missing();
    if duration >= self.cfg.max_leader_missing_duration {
        info!("leader for peer {} missing for a long time, check with pd whether \
log tag first.
Do we need to reset the leader missing state?
If the peer can't connect to PD or the PD worker doesn't handle it, we may send this message in every Ready.
Any better idea?
There is another case where the corresponding region info is not available in PD for some time, so the PD worker should retry later (hence this message being resent). And raftstore doesn't know how the worker is progressing.
I think we don't need to care about PD problems here.
IMO, if we find a peer is stale but can't get the result from PD, we should check the peer again after the checking interval.
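One way to read that suggestion, sketched with the fields from the diffs in this PR (the reset-on-report behaviour is an assumption, not something confirmed here):

if duration >= self.cfg.max_leader_missing_duration {
    // Restart the timer when the stale-peer check is handed off, so the same
    // peer is re-checked only after another full interval if PD gives no
    // answer in the meantime.
    peer.leader_missing_time = Some(Instant::now());
    // ...send the validation task to the PD worker here...
}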
Force-pushed from d635d70 to 584c172
if let Some(peer) = self.region_peers.get_mut(&region_id) {
    let duration = peer.since_leader_missing();
    if duration >= self.cfg.max_leader_missing_duration {
        info!("peer {} leader missing for a long time, check with pd whether \
tag should be at the start of the log.
Address comment.
@BusyJay @siddontang
@@ -444,13 +447,36 @@ impl Peer {
        ready.hs.take();
    }

    if let Some(ref soft_state) = ready.ss {
When we receive a message, we will create a replicated peer, and if this peer doesn't receive any message later, the peer can be considered stale too.
But this peer can't be ready, so I think we should check this in the tick function.
Address comment.
@siddontang @BusyJay @ngaut PTAL
// for peer B in case 2 above
// directly destroy peer without data since it doesn't have region range,
// so that it doesn't have the correct region start_key to validate peer with PD
region_to_be_destroyed.push((region_id, peer.peer.clone()))
do we have any test to cover this?
Maybe there's no need to push region_id, because the peer already has it.
A test case is added for the gc of the uninitialized peer.
The region_id is cached for a direct call to destroy_peer() without a reference to the peer. I think we should keep it this way.
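For illustration, a rough sketch of how the collected pairs might be consumed after the tick loop; the destroy_peer signature shown is an assumption, only the (region_id, peer) tuple comes from the diff above:

// Once the iteration over self.region_peers has finished, the mutable borrow
// is released, so each collected peer can be destroyed directly by its cached
// region_id without looking the peer up again.
for (region_id, peer) in region_to_be_destroyed {
    self.destroy_peer(region_id, peer);
}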
@BusyJay @siddontang PTAL
      local_region.get_id(),
      peer.get_id(),
      local_region.get_start_key());
warn!("xxx local region: {:?}", local_region);
remove unnecessary logs.
Address comment.
LGTM
// before it could successfully receive snapshot from the leader and
// apply that snapshot, no raft ready event will be triggered,
// so that we could not detect the leader is missing for it at here.
if self.is_initialized() && self.leader_missing_time.is_some() {
It's unnecessary to check if self.leader_missing_time is some or not.
Address comment.
@@ -273,13 +273,65 @@ impl<T: Transport, C: PdClient> Store<T, C> {
    }

    fn on_raft_base_tick(&mut self, event_loop: &mut EventLoop<Self>) {
        let mut region_to_be_destroyed = vec![];
        for (&region_id, peer) in &mut self.region_peers {
            if !peer.get_store().is_applying_snap() {
                peer.raft_group.tick();
                self.pending_raft_groups.insert(region_id);
Should not insert into pending_raft_groups if it's about to be removed.
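One possible shape of the reordered tick loop (a sketch only; is_stale stands in for whatever condition the PR uses to mark a peer for destruction):

for (&region_id, peer) in &mut self.region_peers {
    if peer.get_store().is_applying_snap() {
        continue;
    }
    // Peers that are about to be destroyed are collected and skipped here,
    // so they are never inserted into pending_raft_groups.
    if is_stale(peer) {
        region_to_be_destroyed.push((region_id, peer.peer.clone()));
        continue;
    }
    peer.raft_group.tick();
    self.pending_raft_groups.insert(region_id);
}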
Address comment.
Force-pushed from 541c0d1 to cc697a1
// in the `leader missing` state. That is because if it's isolated from the leader
// before it could successfully receive snapshot from the leader and
// apply that snapshot, no raft ready event will be triggered,
// so that we could not detect the leader is missing for it at here.
If the uninitialised peer is isolated rather than removed, it may be destroyed before the network recovers. And a further heartbeat can't recreate the peer again because of the tombstone, so the region will lose a peer forever. If this happens while the second replica is being created, the region will never work.
Raft works at a time granularity of milliseconds. PD's distribution and rebalancing work at a granularity of minutes. And the peer gc triggered by leader missing works at a granularity of hours (or days). They complement each other rather than clash.
If a peer could not be initialized after hours (or days) but still counts toward quorum, there must be some fatal error in the system. Destroying such a peer doesn't matter that much.
I think it's the user's choice to configure max_leader_missing_duration. It doesn't have to be hours or days.
I don't see why the configuration is a problem. The default value of max_leader_missing_duration should be a guide to the user; without knowing enough about how to configure these parameters, a user should not touch the default value.
If the user wants to (or accidentally does) make a mess of the system, there are a million ways to do it, e.g. configuring the election timeout to the minimum server RTT, so that no leader gets elected for any raft group and the system fails.
The configuration is not the problem; the problem is that destroying an uninitialised peer can cause trouble. The configuration I mentioned above is to show that the complementation should not rely on assumptions about the configuration. If the configuration is valid (passes the validation check), then the group should work in some way (with good or bad performance).
Besides, even if the region doesn't work for hours, it won't cause any problem at all if the data in it is not frequently visited.
Maybe we can use the region_id or the peer_id to ask PD whether the uninitialised peer is still in the region before destroying it.
yes, the tombstone key will be destroyed too, see #978
I think we can even destroy the tombstone key when destroying a stale peer that is too old.
Maybe we can use the region_id or the peer_id to ask PD whether the uninitialised peer is still in the region before destroying it.
@BusyJay Currently PD only provides the interface to get region info by key.
@siddontang has put a comment above:
For an uninitialized peer, the region range is empty, so if you use the start key to search in PD, you will get the first region.
Rechecked the code: when an uninitialised peer is destroyed, the epoch won't be written, so the tombstone won't stop this peer from being recreated when it's just isolated. So it seems fine to destroy the peer here.
Force-pushed from cc697a1 to 5e5c239
    if self.leader_missing_time.is_none() {
        self.leader_missing_time = Some(Instant::now())
    }
} else if self.is_initialized() {
Even if the peer is uninitialized, if it receives a raft message from the leader, I think we should clear leader_missing_time too.
As the comment mentioned, if leader_missing_time is set to None even for an uninitialized peer, then when the peer is isolated before it is initialized, no ready event would be triggered, and this peer could not be gc-ed anymore.
It's OK to consider a peer in this situation as leader missing, since the leader missing timeout is triggered only after a long time. For uninitialized peers, the leader is "missing" because it fails to do the replication job within a specified long time, rather than because it fails to send heartbeats.
why not check this in raft tick?
Don't know what you mean. The leader missing time is checked in raft tick for every peer.
So why do we check here too?
In tick, we can check the leader too: if it has a leader, clear leader_missing_time; if there is no leader, set it.
The problem is that, for uninitialized peers:
- A ready event notifies that a leader has emerged, so the peer "has leader".
- This peer gets isolated, and no ready event will be triggered. This peer still keeps its state as "has leader".
Then we don't know when this peer could be gc-ed, even if we check it in raft tick.
This peer gets isolated, and no ready event will be triggered. This peer still keeps its state as "has leader".
In raft, if the follower doesn't receive any message from the leader within ElectionTimeout, it will begin to campaign and clear the leader.
In raft, if the follower doesn't receive any message from the leader within ElectionTimeout, it will begin to campaign and clear the leader.
That's true for initialized peers, which already have the info of all nodes in the cluster.
So we can use is_initialized() && has_leader to clear the leader_missing_time.
For an uninitialized peer, if it doesn't receive any message from the leader for a long time, we can consider it stale.
I suggest checking this in tick too.
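A small self-contained sketch of that rule, assuming leader_missing_time is an Option<Instant> as in the diffs above; the has_leader flag and the method shape are assumptions (the diff itself exposes since_leader_missing() without arguments):

use std::time::{Duration, Instant};

struct Peer {
    initialized: bool,
    leader_missing_time: Option<Instant>,
}

impl Peer {
    fn is_initialized(&self) -> bool {
        self.initialized
    }

    // Called from the raft tick. The timer is cleared only when the peer is
    // initialized and currently sees a leader; otherwise it is started (or
    // kept), so an uninitialized peer that never hears from the leader will
    // eventually exceed max_leader_missing_duration and become a gc candidate.
    fn check_leader_missing(&mut self, has_leader: bool) -> Duration {
        if self.is_initialized() && has_leader {
            self.leader_missing_time = None;
            return Duration::from_secs(0);
        }
        if self.leader_missing_time.is_none() {
            self.leader_missing_time = Some(Instant::now());
        }
        self.leader_missing_time.unwrap().elapsed()
    }
}

The tick loop can then compare the returned duration against max_leader_missing_duration, as in the earlier excerpts.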
Address comment.
LGTM
Please take a look again. @siddontang @BusyJay @ngaut
for (&region_id, peer) in &mut self.region_peers {
    if !peer.get_store().is_applying() {
        peer.raft_group.tick();
        self.pending_raft_groups.insert(region_id);

        // If this peer detects the leader is missing for a long long time,
@queenypingcap
Please review the comments.
LGTM
@@ -36,6 +36,9 @@ const DEFAULT_MGR_GC_TICK_INTERVAL_MS: u64 = 60000;
const DEFAULT_SNAP_GC_TIMEOUT_SECS: u64 = 60 * 10;
const DEFAULT_MESSAGES_PER_TICK: usize = 256;
const DEFAULT_MAX_PEER_DOWN_SECS: u64 = 300;
// If the leader missing time exceeds 2 hours,
+// If the leader is missing for over 2 hours,
Address comment.
Signed-off-by: Ping Yu <yuping@pingcap.com>
Fix issue #804