replace down-peer when peer is down for too long while the store is still connected #7742

AndreMouche · 2024-01-22T02:49:02Z

Enhancement Task

Down-peers are detected by TiKV and reported to PD through heartbeats. However, when PD checks the placement rules and finds a down-peer in a region, it first checks whether the peer's TiKV store is down. If the store is not down, PD takes no action and skips the peer without writing any log.

pd/pkg/schedule/checker/rule_checker.go

Lines 190 to 203 in da5a4e9

    
    for _, peer := range rf.Peers {
    	if c.isDownPeer(region, peer) {
    		if c.isStoreDownTimeHitMaxDownTime(peer.GetStoreId()) {
    			ruleCheckerReplaceDownCounter.Inc()
    			return c.replaceUnexpectRulePeer(region, rf, fit, peer, downStatus)
    		}
    		// When witness placement rule is enabled, promotes the witness to voter when region has down voter.
    		if c.isWitnessEnabled() && core.IsVoter(peer) {
    			if witness, ok := c.hasAvailableWitness(region, peer); ok {
    				ruleCheckerPromoteWitnessCounter.Inc()
    				return operator.CreateNonWitnessPeerOperator("promote-witness-for-down", c.cluster, region, witness)
    			}
    		}
    	}

However, in some cases a problem inside a region's raft group on TiKV can prevent some replicas from maintaining raft heartbeats, which makes them down-peers, while the TiKV store itself still reports heartbeats to PD normally. In this situation the down-peers are never replaced, so the affected regions run with an incomplete number of replicas for a long time.

I think that in this situation, if a peer has gone without a heartbeat for a certain period of time (that is, it is a down-peer), PD should try to recover it on the scheduling side regardless of whether the TiKV store is down, just as it does for replace-offline-peers.

AndreMouche added the type/enhancement and found/gs labels Jan 22, 2024
