replace down-peer when peer is down for too long while the store is still connected #7742

Open
AndreMouche opened this issue Jan 22, 2024 · 0 comments
Labels
found/gs type/enhancement The issue or PR belongs to an enhancement.

Comments

@AndreMouche
Member

Enhancement Task

A down peer is detected by TiKV and reported to PD through the heartbeat. However, when PD checks the placement rules and finds a down peer in a region, it first checks whether the peer's TiKV store is down (that is, whether the store's down time has reached the maximum down time). If it is not, PD takes no action and skips the peer without logging anything:

    for _, peer := range rf.Peers {
        if c.isDownPeer(region, peer) {
            // The down peer is replaced only when its store has also been down long enough.
            if c.isStoreDownTimeHitMaxDownTime(peer.GetStoreId()) {
                ruleCheckerReplaceDownCounter.Inc()
                return c.replaceUnexpectRulePeer(region, rf, fit, peer, downStatus)
            }
            // When witness placement rule is enabled, promotes the witness to voter when region has down voter.
            if c.isWitnessEnabled() && core.IsVoter(peer) {
                if witness, ok := c.hasAvailableWitness(region, peer); ok {
                    ruleCheckerPromoteWitnessCounter.Inc()
                    return operator.CreateNonWitnessPeerOperator("promote-witness-for-down", c.cluster, region, witness)
                }
            }
            // Otherwise the down peer is skipped silently.
        }
        // (rest of the loop body not shown in this excerpt)
    }

However, in some cases a problem inside a region's Raft group in TiKV can cause some replicas to fail to maintain Raft heartbeats and become down peers, while the TiKV store itself still reports heartbeats to PD normally. In this situation the down peers persist, and the affected regions are left with an incomplete number of replicas for a long time.

I think that in this situation, if a peer has been without a heartbeat for a certain period of time (that is, it is a down peer), PD scheduling should try to recover it regardless of whether the TiKV store is down, just as it does for replace-offline-peers.
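The difference between the current check and the proposed one can be sketched as below. This is a minimal, self-contained illustration, not PD code: `downPeerInfo`, `currentShouldReplace`, `proposedShouldReplace`, and the `maxPeerDownTime` threshold are all hypothetical names introduced here, and the 1-hour value is an arbitrary assumption. Only the idea of comparing the store's down time against a maximum comes from the snippet above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// Assumed to play the role of PD's max-store-down-time setting.
	maxStoreDownTime = 30 * time.Minute
	// Hypothetical new threshold: how long a peer may stay down
	// before PD replaces it even though its store is still connected.
	maxPeerDownTime = 1 * time.Hour
)

// downPeerInfo is a hypothetical, simplified view of a down peer.
type downPeerInfo struct {
	peerDownFor  time.Duration // how long the peer has been reported down
	storeDownFor time.Duration // how long the store has been down; 0 if still connected
}

// currentShouldReplace models the existing behavior:
// only the store's down time is considered.
func currentShouldReplace(p downPeerInfo) bool {
	return p.storeDownFor >= maxStoreDownTime
}

// proposedShouldReplace models the proposal: also replace the peer
// when the peer itself has been down for too long, whatever the store state.
func proposedShouldReplace(p downPeerInfo) bool {
	return currentShouldReplace(p) || p.peerDownFor >= maxPeerDownTime
}

func main() {
	// A peer that has been down for 2 hours on a store that is still connected.
	stuck := downPeerInfo{peerDownFor: 2 * time.Hour, storeDownFor: 0}
	fmt.Println("current: ", currentShouldReplace(stuck))
	fmt.Println("proposed:", proposedShouldReplace(stuck))
}
```

With a separate peer-level threshold, a peer that is only briefly down on a healthy store would still be left alone, so short Raft hiccups would not trigger unnecessary replica moves.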

AndreMouche added the type/enhancement and found/gs labels on Jan 22, 2024