Add support for recovery of async/semisync replicas of failed replication group members #1254
Conversation
please see inline comments
// failure of a group member with replicas is akin to failure of an intermediate master.
func checkAndRecoverDeadGroupMemberWithReplicas(analysisEntry inst.ReplicationAnalysis, candidateInstanceKey *inst.InstanceKey, forceInstanceRecovery bool, skipProcesses bool) (bool, *TopologyRecovery, error) {
	// Don't proceed with recovery unless it was forced or automatic intermediate source recovery is enabled.
	// We consider failed group members akin to failed intermediate masters, so we re-use the configuration for
"so we re-use the configuration"

but in analysis_dao.go it seems like you've changed that: intermediate master recovery only takes place under if !a.IsReplicationGroupMember {
What I mean here is that we re-use the analysisEntry.ClusterDetails.HasAutomatedIntermediateMasterRecovery configuration to decide whether we want to fail over group members, as opposed to having a separate configuration. As mentioned in the method's doc comment, we are operating under the assumption that group secondaries with replicas are akin to intermediate masters in the sense that they perform a very similar function in the replication chain: they get and apply changes from the primary (via Group Replication instead of the binlog) and distribute them to replicas (via the binlog). I hope this clarifies my intent.
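To make that gating concrete, here is a minimal, self-contained sketch of the decision, using stand-in types; shouldRecoverDeadGroupMemberWithReplicas and the struct definitions below are illustrative only and are not the actual orchestrator code.

```go
package main

import "fmt"

// Stand-in types for illustration only; not orchestrator's real definitions.
type clusterDetails struct {
	HasAutomatedIntermediateMasterRecovery bool
}

type replicationAnalysis struct {
	ClusterDetails clusterDetails
}

// shouldRecoverDeadGroupMemberWithReplicas mirrors the gating described above: recover a
// failed group member with replicas only when recovery is forced or automated intermediate
// master (intermediate source) recovery is enabled for the cluster.
func shouldRecoverDeadGroupMemberWithReplicas(a replicationAnalysis, forceInstanceRecovery bool) bool {
	return forceInstanceRecovery || a.ClusterDetails.HasAutomatedIntermediateMasterRecovery
}

func main() {
	a := replicationAnalysis{ClusterDetails: clusterDetails{HasAutomatedIntermediateMasterRecovery: true}}
	fmt.Println(shouldRecoverDeadGroupMemberWithReplicas(a, false)) // true: automated recovery is enabled
}
```

Under this approach, operators who have already enabled intermediate source recovery get group member recovery without any new configuration, which matches the intent stated above.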
topologyRecovery.SuccessorKey = &recoveredToGroupMember.Key
topologyRecovery.SuccessorAlias = recoveredToGroupMember.InstanceAlias
// For the same reasons that were mentioned above, we re-use the post intermediate master fail-over hooks
executeProcesses(config.Config.PostIntermediateMasterFailoverProcesses, "PostIntermediateMasterFailoverProcesses", topologyRecovery, false)
I don't run Group Replication myself, but I think it is debatable whether it is correct to run PostIntermediateMasterFailoverProcesses. For now, let's keep it at that, but I predict that someone in the future will argue against this.
For now, our use case does not seem to require different hooks for these. If the need arises (or someone comes knocking at your door asking for it), I'd be happy to change this to have different GR and intermediate source hooks.
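If separate hooks were ever wanted, one possible shape is sketched below. This is purely hypothetical: the PostReplicationGroupMemberFailoverProcesses field does not exist in orchestrator or in this PR; the sketch only illustrates a GR-specific process list that falls back to the intermediate source hooks when unset.

```go
package main

import "fmt"

// Hypothetical configuration stand-in, for illustration only.
type configuration struct {
	PostIntermediateMasterFailoverProcesses     []string
	PostReplicationGroupMemberFailoverProcesses []string
}

// failoverHooksForGroupMember prefers GR-specific hooks when configured and otherwise
// reuses the intermediate source hooks, which is effectively what this PR does today.
func failoverHooksForGroupMember(cfg configuration) []string {
	if len(cfg.PostReplicationGroupMemberFailoverProcesses) > 0 {
		return cfg.PostReplicationGroupMemberFailoverProcesses
	}
	return cfg.PostIntermediateMasterFailoverProcesses
}

func main() {
	cfg := configuration{
		PostIntermediateMasterFailoverProcesses: []string{"echo 'intermediate source failed over' >> /tmp/recovery.log"},
	}
	fmt.Println(failoverHooksForGroupMember(cfg)) // falls back to the intermediate source hooks
}
```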
Related issue: #1253
Description
This PR addresses the issue mentioned above. It does so by adding failure detection and recovery for replication group members that have traditional async/semi-sync replicas.
cc @sjmudd, @dveeden, @luisyonaldo.