server: try to replace the down replicas instead of removing directly #1222
Conversation
if down && sum < 1200 {
	// only need to print once
	down = false
	simutil.Logger.Error("making up replicas don't start immediately")
Can we abstract a statistic to evaluate different strategies instead of just print a log message? For example, we may count the number of regions with insufficient replicas in the scheduling process.
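One way to expose such a statistic is sketched below; Region is a hypothetical stand-in for core.RegionInfo, and a real checker would export the count through its metrics rather than print it:

```go
package main

import "fmt"

// Region is a simplified stand-in for core.RegionInfo; only the number
// of live peers matters for this statistic.
type Region struct {
	ID    uint64
	Peers int
}

// countInsufficient returns how many regions currently have fewer peers
// than the configured max-replicas. A checker could update a gauge with
// this value on every scheduling round.
func countInsufficient(regions []Region, maxReplicas int) int {
	n := 0
	for _, r := range regions {
		if r.Peers < maxReplicas {
			n++
		}
	}
	return n
}

func main() {
	regions := []Region{{ID: 1, Peers: 3}, {ID: 2, Peers: 2}, {ID: 3, Peers: 1}}
	fmt.Println(countInsufficient(regions, 3)) // 2: regions 2 and 3 are short
}
```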
server/schedule/replica_checker.go
Outdated
@@ -170,7 +172,8 @@ func (r *ReplicaChecker) checkDownPeer(region *core.RegionInfo) *Operator {
	if stats.GetDownSeconds() < uint64(r.cluster.GetMaxStoreDownTime().Seconds()) {
		continue
	}
-	return CreateRemovePeerOperator("removeDownReplica", r.cluster, OpReplica, region, peer.GetStoreId())
+	return r.handleReplica(region, peer, "Down")
Maybe replacePeer or fixPeer.
server/schedule/replica_checker.go
Outdated
@@ -251,3 +232,33 @@ func (r *ReplicaChecker) checkBestReplacement(region *core.RegionInfo) *Operator
	checkerCounter.WithLabelValues("replica_checker", "new_operator").Inc()
	return CreateMovePeerOperator("moveToBetterLocation", r.cluster, region, OpReplica, oldPeer.GetStoreId(), newPeer.GetStoreId(), newPeer.GetId())
}
+
+func (r *ReplicaChecker) handleReplica(region *core.RegionInfo, peer *metapb.Peer, status string) *Operator {
+	removeExtra := fmt.Sprintf("%s%s%s", "removeExtra", status, "Replica")
How about fmt.Sprintf("removeExtra%sReplica", status)?
@@ -251,3 +232,33 @@ func (r *ReplicaChecker) checkBestReplacement(region *core.RegionInfo) *Operator
	checkerCounter.WithLabelValues("replica_checker", "new_operator").Inc()
	return CreateMovePeerOperator("moveToBetterLocation", r.cluster, region, OpReplica, oldPeer.GetStoreId(), newPeer.GetStoreId(), newPeer.GetId())
}
+
+func (r *ReplicaChecker) fixPeer(region *core.RegionInfo, peer *metapb.Peer, status string) *Operator {
what does fixPeer mean?
It means this peer needs to be removed so that it can be made up on another store, or simply be replaced. Before addressing the comment, the name was handleReplica. There may be a better name for it.
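The decision described here can be sketched as follows; Operator, selectNewStore, and the operator descriptions are hypothetical stand-ins for illustration, not the real PD scheduling types:

```go
package main

import "fmt"

// Operator is a simplified stand-in for schedule.Operator.
type Operator struct{ Desc string }

// fixPeer first tries to pick a target store for the bad peer; if one is
// available it replaces (moves) the peer, otherwise it falls back to a
// plain removal, matching the behavior described above.
func fixPeer(status string, selectNewStore func() (uint64, bool)) *Operator {
	if storeID, ok := selectNewStore(); ok {
		return &Operator{Desc: fmt.Sprintf("replace%sReplica-to-store-%d", status, storeID)}
	}
	return &Operator{Desc: fmt.Sprintf("remove%sReplica", status)}
}

func main() {
	// A target store exists: the peer is replaced rather than removed.
	fmt.Println(fixPeer("Down", func() (uint64, bool) { return 4, true }).Desc)
	// No target store available: fall back to removal.
	fmt.Println(fixPeer("Down", func() (uint64, bool) { return 0, false }).Desc)
}
```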
Force-pushed from 9878633 to 7381274 (Compare)
PTAL @nolouch @siddontang
LGTM
What problem does this PR solve?
At present, if a store goes down, its peers are removed directly and the replicas are then made up elsewhere. This usually takes some time because it has to traverse all the regions.
What is changed and how it works?
This PR tries to move the peers to another store if the original store is down.
Besides, this PR adds a case to the simulator to cover this situation. Without these changes, it prints an error log: making up replicas don't start immediately.
Check List
Tests