From 362865abe36b5771d525310b449c07a57da7ef85 Mon Sep 17 00:00:00 2001 From: Fullstop000 Date: Wed, 13 Nov 2019 23:13:56 +0800 Subject: [PATCH 1/8] add RFC for Follower Replication Signed-off-by: Fullstop000 --- text/2019-11-13-follower-replication.md | 107 ++++++++++++++++++++++++ 1 file changed, 107 insertions(+) create mode 100644 text/2019-11-13-follower-replication.md diff --git a/text/2019-11-13-follower-replication.md b/text/2019-11-13-follower-replication.md new file mode 100644 index 00000000..6ed73c21 --- /dev/null +++ b/text/2019-11-13-follower-replication.md @@ -0,0 +1,107 @@ +# Follower Replication + +## Summary +This RFC introduces a new mechanism in Raft Protocol which allows a follower to send raft logs to other followers and learners. The target of this feature is to reduce network transmission costs between different data centers in Log Replication. + +## Motivation +In the origin raft protocol, a follower can only receive new raft logs and snapshots from the leader, which could be insufficient in some situations. For example, when a raft cluster is distributed in different data centers, log replication between a new node and the leader is expensive as they are located at different data centers. In this case, internal follower-to-follower transfer in one data center can be far more efficient than the traditionally stipulated leader-to-follower transfer. + +## Detailed design +There are two key concepts of design: +- The leader must know each node’s data center in the raft cluster. +- The leader is able to ask a follower to send raft logs or a snapshot to other followers. + +### The Group and Groups +We introduce a new concept *group* that represents all the raft nodes in a datacenter and any node which belongs to the group can be called a *group member*. Every raft node should have a new configuration called *Groups* which contains all the groups as the name is. 
For example, in a 2DC based raft cluster with 5 nodes ABCDE where A, B, and C are at DC1 and DE are at DC2, the `Groups` might be organized like: +``` +Group1: DC1 -> [A, B, C] +Group2: DC2 -> [D, E] +``` +The leader must be easy to know whether or not nodes belong to the same group or get all the group members. Since `Groups` is a configuration that will never be persistent in Storage and is volatile, a raft node exposes the ability to modify it on the fly with the benefit of flexibility on the upper layer. + +### The Delegate and Commission +As the `Groups` is configured, the leader is able to choose a group member as a *delegate* to send entries to one or the rest group members in Log Replication, which is called Follower Replication. To implement Follower Replication, we basically need three main steps: + +1. The leader picks a group member as a delegate of the group +2. The leader prepares and sends some *commission*s that indicate what entries the picked delegate will send to others by a new message type MsgBroadcast. +3. 
The delegate receives this message and executes the commissions + +Here is a diagram showing how Follower Replication works: +``` + +-----------+ + | | + +--------------------------------+ C | + | MsgAppendResp | | + | +-----^-----+ + | | + | | MsgAppend(e[4,5]) + | | ++------v-----+ +-----+-----+ +| | | | +| leader +--------------------------> B | delegate +| | MsgBroadcast{ | | ++------^-----+ e[2,5] +-----+-----+ + | Commission(C, [4,5]) | + | Commission(D, [3,5]) | MsgAppend(e[3,5]) + | } | + | +-----v-----+ + | | | + +--------------------------------+ D | + MsgAppendResp | | + +-----------+ + +``` +Note: +- `e[x, y]` stands for all the entries within an index range [x, y] (both inclusive) +- `Commission(target, [x, y])` stands for a job that the delegate should send `e[x, y]` to the target + +#### Choose a Delegate +Since a delegate is responsible to send entries to other group members, it must contain entries as much as possible to complete the commission. The leader follows a bunch of rules to choose a delegate of a group: +1. If all the members are requiring snapshots, choose the delegate randomly. +2. Among the members who are requiring new entries, choose the node satisfies conditions below : + + 1. Must be `recent_active` + 2. The progress state should be `Replicate` but not `paused` + 3. The progress has the smallest `match_index` + +3. If no delegate is picked, the leader does Log Replication itself. Especially, if a group contains the leader it self, no delegate will be set by default except in some cases such as massively large group, which is able to be controlled by upper layer. + +And let's talk about these rules carefully: +For rule 1, It's obvious that a node requiring a snapshot is not a proper choice because its progress is too far behind and its raft logs are highly possible to be stale. + +In most cases, what `recent_active` is true means the node keeps communicating with the leader if `check_quorum` is enabled so rule 2.1 is reasonable. 
+ +As for rule 2.2, the point is that in `raft-rs` the `Progress` of a node has a flow control mechanism and the leader shouldn’t send messages to a node with `paused` Progress. And `Replicate` state indicates a node is continuously receiving raft logs from the leader, which means this node is somewhat 'healthy' in the viewpoint of the leader. + +The rule 2.3 is a little bit subtle. If a node passes rule 2.1 and 2.2, we can say it’s an active node with a smooth network connection. Under these circumstances, the node with the smallest `match_index` may have the greatest chance of having enough entries to be sent to every other group member. + +If the delegate of a group is determined, it’ll be cached in memory for later usage until Groups configuration is changed or the delegate is evicted by rejection message. And the flow control should be valid for any cached delegate. + +#### MsgBroadcast and Commission +After the delegate is picked, the leader will prepare a `MsgBroadcast` message which is sent to the delegate. A MsgBroadcast just looks like a `MsgAppend` or `MsgSnapshot` to be sent to the delegate, which often only carries entries the delegate needs but also includes a collection of commissions. But if some members are requiring entries and others are requiring snapshots, the `MsgBroadcast` be generated with both entries and a snapshot. +A *commission* just describes the range of entries or a snapshot that should be sent to the target. It looks like: +``` +Commission { + type // Only two types: Append or Snapshot + target // The target group member + last_index + log_term +} +``` + +You can just treat a commission as the metadata of a MsgAppend or MsgSnapshot since the `last_index` and `log_term` are set by the leader according to its progress set. + +When the delegate receives a `MsgBroadcast`, it might meet any scenario below: +1. If the message declares that the delegate needs a snapshot, it means that all the group members are requiring snapshots. 
Every commission type should be Snapshot and the delegate just broadcasts to others. +2. If the message declares that the delegate needs entries, it first tried to append incoming entries to its raft log. As the origin raft protocol, if the appending fails, it sends a rejecting message to the leader and then the leader will try to pick a new delegate to re-send *commission*s again. +3. If entires appending in step 2 succeeds, the delegate will try to execute all the *commission*s. And some *commission* executions could be failed due to the stale progress state in the leader node. The delegate will collect all the failed *commission*s and send them back to the leader to trigger a new broadcast message so that Log Replication is always ongoing. + +And it’s significant to update the inflight of progress when the corresponding *commission* is generated and rollback the inflight once the leader receives failed *commission*s in step3 above, which also guarantees that the actual progress of group members (except the delegate) will not be stale. + +## Drawbacks +When there is a large group or many groups, the log committing speed could be slow as entries will be sent to delegate first. + +## Alternatives + +## Unresolved questions +The rule 2.3 for choosing a delegate might need some rethink. 
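To make the delegate-selection rules in the design above concrete, here is a minimal Rust sketch (raft-rs is written in Rust). The `Peer` struct and `pick_delegate` helper are illustrative assumptions rather than the actual raft-rs API; only the notions of `recent_active`, `paused`, the `Replicate` state, and `match_index` come from the RFC:

```rust
// Illustrative sketch of the delegate-selection rules; `Peer` and
// `pick_delegate` are hypothetical names, not the raft-rs API.

#[derive(Clone, Copy, PartialEq)]
pub enum ProgressState {
    Probe,
    Replicate,
    Snapshot,
}

#[derive(Clone, Copy)]
pub struct Peer {
    pub id: u64,
    pub state: ProgressState,
    pub recent_active: bool, // rule 2.1
    pub paused: bool,        // rule 2.2 (flow control)
    pub match_index: u64,    // rule 2.3
}

/// Returns the chosen delegate id, or `None` when the leader should
/// replicate to this group by itself (rule 3).
pub fn pick_delegate(group: &[Peer]) -> Option<u64> {
    // Rule 1: if every member is requiring a snapshot, any member will
    // do (a real implementation would choose randomly).
    if !group.is_empty() && group.iter().all(|p| p.state == ProgressState::Snapshot) {
        return group.first().map(|p| p.id);
    }
    // Rule 2: among members receiving new entries, pick an active,
    // unpaused, replicating peer with the smallest `match_index`.
    group
        .iter()
        .filter(|p| p.recent_active && !p.paused && p.state == ProgressState::Replicate)
        .min_by_key(|p| p.match_index)
        .map(|p| p.id)
}
```

Note that the sketch picks the *smallest* `match_index`, exactly as rule 2.3 states; the Unresolved questions section flags that rule for rethinking.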
From cb4970feb5237e4d42763a72090d3f45820e86cc Mon Sep 17 00:00:00 2001 From: Fullstop000 Date: Fri, 15 Nov 2019 11:08:02 +0800 Subject: [PATCH 2/8] format Signed-off-by: Fullstop000 --- text/2019-11-13-follower-replication.md | 40 ++++++++++++++++++------- 1 file changed, 29 insertions(+), 11 deletions(-) diff --git a/text/2019-11-13-follower-replication.md b/text/2019-11-13-follower-replication.md index 6ed73c21..1918b639 100644 --- a/text/2019-11-13-follower-replication.md +++ b/text/2019-11-13-follower-replication.md @@ -1,25 +1,33 @@ # Follower Replication ## Summary -This RFC introduces a new mechanism in Raft Protocol which allows a follower to send raft logs to other followers and learners. The target of this feature is to reduce network transmission costs between different data centers in Log Replication. + +This RFC introduces a new mechanism in Raft Protocol which allows a follower to send raft logs to other followers and learners. The target of this feature is to reduce network transmission costs between different data centers in Log Replication. ## Motivation + In the origin raft protocol, a follower can only receive new raft logs and snapshots from the leader, which could be insufficient in some situations. For example, when a raft cluster is distributed in different data centers, log replication between a new node and the leader is expensive as they are located at different data centers. In this case, internal follower-to-follower transfer in one data center can be far more efficient than the traditionally stipulated leader-to-follower transfer. ## Detailed design + There are two key concepts of design: + - The leader must know each node’s data center in the raft cluster. - The leader is able to ask a follower to send raft logs or a snapshot to other followers. ### The Group and Groups -We introduce a new concept *group* that represents all the raft nodes in a datacenter and any node which belongs to the group can be called a *group member*. 
Every raft node should have a new configuration called *Groups* which contains all the groups as the name is. For example, in a 2DC based raft cluster with 5 nodes ABCDE where A, B, and C are at DC1 and DE are at DC2, the `Groups` might be organized like: + +We introduce a new concept *group* that represents all the raft nodes in a datacenter and any node which belongs to the group can be called a *group member*. Every raft node should have a new configuration called *Groups* which contains all the groups as the name is. For example, in a 2DC based raft cluster with 5 nodes ABCDE where A, B, and C are at DC1 and DE are at DC2, the `Groups` might be organized like: + ``` Group1: DC1 -> [A, B, C] Group2: DC2 -> [D, E] ``` -The leader must be easy to know whether or not nodes belong to the same group or get all the group members. Since `Groups` is a configuration that will never be persistent in Storage and is volatile, a raft node exposes the ability to modify it on the fly with the benefit of flexibility on the upper layer. + +The leader must be easy to know whether or not nodes belong to the same group or get all the group members. Since `Groups` is a configuration that will never be persistent in Storage and is volatile, a raft node exposes the ability to modify it on the fly with the benefit of flexibility on the upper layer. ### The Delegate and Commission + As the `Groups` is configured, the leader is able to choose a group member as a *delegate* to send entries to one or the rest group members in Log Replication, which is called Follower Replication. To implement Follower Replication, we basically need three main steps: 1. The leader picks a group member as a delegate of the group @@ -27,6 +35,7 @@ As the `Groups` is configured, the leader is able to choose a group member as a 3. 
The delegate receives this message and executes the commissions Here is a diagram showing how Follower Replication works: + ``` +-----------+ | | @@ -51,12 +60,16 @@ Here is a diagram showing how Follower Replication works: +-----------+ ``` -Note: + +Note: + - `e[x, y]` stands for all the entries within an index range [x, y] (both inclusive) - `Commission(target, [x, y])` stands for a job that the delegate should send `e[x, y]` to the target #### Choose a Delegate + Since a delegate is responsible to send entries to other group members, it must contain entries as much as possible to complete the commission. The leader follows a bunch of rules to choose a delegate of a group: + 1. If all the members are requiring snapshots, choose the delegate randomly. 2. Among the members who are requiring new entries, choose the node satisfies conditions below : @@ -67,24 +80,26 @@ Since a delegate is responsible to send entries to other group members, it must 3. If no delegate is picked, the leader does Log Replication itself. Especially, if a group contains the leader it self, no delegate will be set by default except in some cases such as massively large group, which is able to be controlled by upper layer. And let's talk about these rules carefully: -For rule 1, It's obvious that a node requiring a snapshot is not a proper choice because its progress is too far behind and its raft logs are highly possible to be stale. +For rule 1, It's obvious that a node requiring a snapshot is not a proper choice because its progress is too far behind and its raft logs are highly possible to be stale. In most cases, what `recent_active` is true means the node keeps communicating with the leader if `check_quorum` is enabled so rule 2.1 is reasonable. As for rule 2.2, the point is that in `raft-rs` the `Progress` of a node has a flow control mechanism and the leader shouldn’t send messages to a node with `paused` Progress. 
And `Replicate` state indicates a node is continuously receiving raft logs from the leader, which means this node is somewhat 'healthy' in the viewpoint of the leader. -The rule 2.3 is a little bit subtle. If a node passes rule 2.1 and 2.2, we can say it’s an active node with a smooth network connection. Under these circumstances, the node with the smallest `match_index` may have the greatest chance of having enough entries to be sent to every other group member. +The rule 2.3 is a little bit subtle. If a node passes rule 2.1 and 2.2, we can say it’s an active node with a smooth network connection. Under these circumstances, the node with the smallest `match_index` may have the greatest chance of having enough entries to be sent to every other group member. If the delegate of a group is determined, it’ll be cached in memory for later usage until Groups configuration is changed or the delegate is evicted by rejection message. And the flow control should be valid for any cached delegate. #### MsgBroadcast and Commission + After the delegate is picked, the leader will prepare a `MsgBroadcast` message which is sent to the delegate. A MsgBroadcast just looks like a `MsgAppend` or `MsgSnapshot` to be sent to the delegate, which often only carries entries the delegate needs but also includes a collection of commissions. But if some members are requiring entries and others are requiring snapshots, the `MsgBroadcast` be generated with both entries and a snapshot. A *commission* just describes the range of entries or a snapshot that should be sent to the target. It looks like: + ``` Commission { type // Only two types: Append or Snapshot target // The target group member - last_index + last_index log_term } ``` @@ -92,16 +107,19 @@ Commission { You can just treat a commission as the metadata of a MsgAppend or MsgSnapshot since the `last_index` and `log_term` are set by the leader according to its progress set. 
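The commission pseudo-structure above maps onto a small Rust type. This is only a sketch mirroring the RFC's field names; the real message would live in raft-rs's protobuf definitions, and the constructor here is hypothetical:

```rust
// Sketch of the commission metadata; mirrors the pseudo-struct in the
// RFC, not the actual raft-rs protobuf definition.

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CommissionType {
    Append,   // the delegate should send a range of entries
    Snapshot, // the delegate should send a snapshot
}

#[derive(Debug, Clone, Copy)]
pub struct Commission {
    pub kind: CommissionType,
    pub target: u64,     // the target group member
    pub last_index: u64, // log position the leader believes `target` has
    pub log_term: u64,   // term of the entry at `last_index`
}

impl Commission {
    /// Hypothetical constructor: ask the delegate to send `target` the
    /// entries after `last_index` (whose term is `log_term`), set by
    /// the leader according to its progress set.
    pub fn append(target: u64, last_index: u64, log_term: u64) -> Self {
        Commission { kind: CommissionType::Append, target, last_index, log_term }
    }
}
```

For example, `Commission(C, [4, 5])` in the diagram could be built as `Commission::append(c_id, 3, term_of_entry_3)`: the leader believes C's log ends at index 3, so the delegate should send `e[4, 5]`, much like the `index`/`log_term` fields of a `MsgAppend`.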
When the delegate receives a `MsgBroadcast`, it might meet any scenario below: + 1. If the message declares that the delegate needs a snapshot, it means that all the group members are requiring snapshots. Every commission type should be Snapshot and the delegate just broadcasts to others. -2. If the message declares that the delegate needs entries, it first tried to append incoming entries to its raft log. As the origin raft protocol, if the appending fails, it sends a rejecting message to the leader and then the leader will try to pick a new delegate to re-send *commission*s again. -3. If entires appending in step 2 succeeds, the delegate will try to execute all the *commission*s. And some *commission* executions could be failed due to the stale progress state in the leader node. The delegate will collect all the failed *commission*s and send them back to the leader to trigger a new broadcast message so that Log Replication is always ongoing. +2. If the message declares that the delegate needs entries, it first tried to append incoming entries to its raft log. As the origin raft protocol. If the appending fails, it sends a rejecting message to the leader and then the leader will try to pick a new delegate to re-send commissions again. +3. If entires appending in step 2 succeeds, the delegate will try to execute all the commissions. And some of them could be failed due to the stale progress state in the leader node. The delegate will collect all the failed commissions and send them back to the leader to trigger a new broadcast message so that Log Replication is always ongoing. -And it’s significant to update the inflight of progress when the corresponding *commission* is generated and rollback the inflight once the leader receives failed *commission*s in step3 above, which also guarantees that the actual progress of group members (except the delegate) will not be stale. 
+And it’s significant to update the inflight of progress when the corresponding commission is generated and rollback the inflight once the leader receives failed commissions in step3 above, which also guarantees that the actual progress of group members (except the delegate) will not be stale. ## Drawbacks -When there is a large group or many groups, the log committing speed could be slow as entries will be sent to delegate first. + +When there is a large group or many groups, the log committing speed could be slow as entries will be sent to delegate first. ## Alternatives ## Unresolved questions + The rule 2.3 for choosing a delegate might need some rethink. From 58c83f7e5a290852af0010f9f945c611704d5d6f Mon Sep 17 00:00:00 2001 From: Fullstop000 Date: Fri, 15 Nov 2019 11:22:34 +0800 Subject: [PATCH 3/8] revise Motivation Signed-off-by: Fullstop000 --- text/2019-11-13-follower-replication.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/2019-11-13-follower-replication.md b/text/2019-11-13-follower-replication.md index 1918b639..d7f29777 100644 --- a/text/2019-11-13-follower-replication.md +++ b/text/2019-11-13-follower-replication.md @@ -6,7 +6,7 @@ This RFC introduces a new mechanism in Raft Protocol which allows a follower to ## Motivation -In the origin raft protocol, a follower can only receive new raft logs and snapshots from the leader, which could be insufficient in some situations. For example, when a raft cluster is distributed in different data centers, log replication between a new node and the leader is expensive as they are located at different data centers. In this case, internal follower-to-follower transfer in one data center can be far more efficient than the traditionally stipulated leader-to-follower transfer. +In the origin raft protocol, a follower can only receive new raft logs and snapshots from the leader, which could be insufficient in some situations. 
For example, when a raft cluster is distributed in different data centers, log replication between a new node and the leader is expensive as they are located at different data centers. In this case, internal follower-to-follower transfer in one data center can be far more efficient than the traditionally stipulated leader-to-follower transfer. And in a cluster with massive nodes, the leader can have lower system load due to the less operations in network transmission. ## Detailed design From 1e46be1ccb67f837a5952805c438010fefc4087a Mon Sep 17 00:00:00 2001 From: Fullstop000 Date: Fri, 15 Nov 2019 22:25:47 +0800 Subject: [PATCH 4/8] make things more clear Signed-off-by: Fullstop000 --- text/2019-11-13-follower-replication.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/text/2019-11-13-follower-replication.md b/text/2019-11-13-follower-replication.md index d7f29777..dfd87e4c 100644 --- a/text/2019-11-13-follower-replication.md +++ b/text/2019-11-13-follower-replication.md @@ -109,10 +109,12 @@ You can just treat a commission as the metadata of a MsgAppend or MsgSnapshot si When the delegate receives a `MsgBroadcast`, it might meet any scenario below: 1. If the message declares that the delegate needs a snapshot, it means that all the group members are requiring snapshots. Every commission type should be Snapshot and the delegate just broadcasts to others. -2. If the message declares that the delegate needs entries, it first tried to append incoming entries to its raft log. As the origin raft protocol. If the appending fails, it sends a rejecting message to the leader and then the leader will try to pick a new delegate to re-send commissions again. -3. If entires appending in step 2 succeeds, the delegate will try to execute all the commissions. And some of them could be failed due to the stale progress state in the leader node. 
The delegate will collect all the failed commissions and send them back to the leader to trigger a new broadcast message so that Log Replication is always ongoing.
+2. If the message declares that the delegate needs entries, just like the original raft protocol, it first tries to append the incoming entries to its unstable raft log. If a conflict occurs, it sends a rejecting message to the leader, and the leader will then try to pick a new delegate and re-send the commissions.
+3. If step 2 succeeds, the delegate will try to execute all the commissions by gathering the corresponding entries and sending them to each group member. During this time, some commissions could fail due to stale progress state on the leader node. The delegate will collect all the failed commissions and send them back to the leader to trigger a new broadcast message so that Log Replication is always ongoing.
+All the other group members have no idea where a `MsgAppend` or `MsgSnapshot` comes from and just handle the messages as normal followers, because the delegate will replace the `from` of each message with the leader ID when executing a commission.
+
+It’s important to update the inflights of the progress when the corresponding commission is generated and to roll back the inflights once the leader receives failed commissions in step 3 above, which also guarantees that the actual progress of group members (except the delegate) will not be stale.
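As a sketch of step 3 above: the delegate checks, against its own raft log, whether it can serve each commission, and the ones it cannot serve are collected and reported back to the leader. All names here are hypothetical and the availability check is simplified to a range test; the real delegate would build and send the actual `MsgAppend`/`MsgSnapshot` messages:

```rust
// Hypothetical sketch: the delegate executes commissions and returns
// the failed ones so the leader can trigger a new broadcast.

pub struct Commission {
    pub target: u64,
    pub last_index: u64, // where the leader believes `target`'s log ends
}

/// The delegate can serve a commission only if its own log still holds
/// every entry after `last_index` (nothing compacted away, and newer
/// entries actually exist). `first_index`/`last_log_index` bound the
/// delegate's own log.
fn can_serve(first_index: u64, last_log_index: u64, c: &Commission) -> bool {
    c.last_index + 1 >= first_index && c.last_index < last_log_index
}

/// Executes what it can (sending is elided here) and returns the failed
/// commissions, which the delegate sends back to the leader.
pub fn run_commissions(
    first_index: u64,
    last_log_index: u64,
    commissions: Vec<Commission>,
) -> Vec<Commission> {
    commissions
        .into_iter()
        .filter(|c| !can_serve(first_index, last_log_index, c))
        .collect()
}
```

For instance, with a delegate log spanning `[2, 5]`, a commission for a target whose log ends at index 0 fails (entry 1 is already compacted on the delegate), so it is returned to the leader, which can then send a snapshot itself.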
## Drawbacks From 1027e21a22de5b0fcdf396e589fd36c5c22879af Mon Sep 17 00:00:00 2001 From: Fullstop000 Date: Fri, 15 Nov 2019 22:45:56 +0800 Subject: [PATCH 5/8] revise detailed design Signed-off-by: Fullstop000 --- text/2019-11-13-follower-replication.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/2019-11-13-follower-replication.md b/text/2019-11-13-follower-replication.md index dfd87e4c..36b26f31 100644 --- a/text/2019-11-13-follower-replication.md +++ b/text/2019-11-13-follower-replication.md @@ -112,7 +112,7 @@ When the delegate receives a `MsgBroadcast`, it might meet any scenario below: 2. If the message declares that the delegate needs entries, just like the origin raft protocol, it first tries to append incoming entries to its unstable raft log. And if a conflict ocurrs, it sends a rejecting message to the leader and then the leader will try to pick a new delegate to re-send commissions again. 3. If step 2 succeeds, the delegate will try to execute all the commissions by gathering correspond entries and sending them to each group member. During this time, some of commissions could be failed due to the stale progress state in the leader node. The delegate will collect all the failed commissions and send them back to the leader to trigger a new broadcast message so that Log Replication is always ongoing. -All the other group members have no idea where a `MsgAppend` or `MsgSnapshot` comes from and they just handles messages as normal followers because the delegate will replace the `from` of a message with the leader ID when executing a commission. +All the other group members have no idea whether a `MsgAppend` or `MsgSnapshot` comes from the leader or the delegate because the delegate always replaces the `from` of the message with the leader ID when executing a commission. So they finally handle these messages like normal followers. 
It’s significant to update the inflight of progress when the corresponding commission is generated and rollback the inflight once the leader receives failed commissions in step3 above, which also guarantees that the actual progress of group members (except the delegate) will not be stale.

From 9c24706cafc4a64c01088ed420785220f7c22c81 Mon Sep 17 00:00:00 2001
From: Fullstop000
Date: Mon, 18 Nov 2019 17:10:01 +0800
Subject: [PATCH 6/8] address comments

Signed-off-by: Fullstop000
---
 text/2019-11-13-follower-replication.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/text/2019-11-13-follower-replication.md b/text/2019-11-13-follower-replication.md
index 36b26f31..5b786814 100644
--- a/text/2019-11-13-follower-replication.md
+++ b/text/2019-11-13-follower-replication.md
@@ -77,10 +77,10 @@ Since a delegate is responsible to send entries to other group members, it must
 2. The progress state should be `Replicate` but not `paused`
 3. The progress has the smallest `match_index`

-3. If no delegate is picked, the leader does Log Replication itself. Especially, if a group contains the leader it self, no delegate will be set by default except in some cases such as massively large group, which is able to be controlled by upper layer.
+3. If no delegate is picked, the leader does Log Replication itself. In particular, if a group contains the leader itself, no delegate will be set by default except in some cases, such as a massively large group. Whether a delegate can be chosen for such a 'leader group' is controlled by the upper layer.

-And let's talk about these rules carefully:
-For rule 1, It's obvious that a node requiring a snapshot is not a proper choice because its progress is too far behind and its raft logs are highly possible to be stale.
+
+For rule 1, it's obvious that a node requiring a snapshot is not a proper choice because its progress is too far behind and its raft logs are very likely to be stale.
In most cases, what `recent_active` is true means the node keeps communicating with the leader if `check_quorum` is enabled so rule 2.1 is reasonable.

@@ -92,7 +92,7 @@ If the delegate of a group is determined, it’ll be cached in memory for later

#### MsgBroadcast and Commission

-After the delegate is picked, the leader will prepare a `MsgBroadcast` message which is sent to the delegate. A MsgBroadcast just looks like a `MsgAppend` or `MsgSnapshot` to be sent to the delegate, which often only carries entries the delegate needs but also includes a collection of commissions. But if some members are requiring entries and others are requiring snapshots, the `MsgBroadcast` be generated with both entries and a snapshot.
+After the delegate is picked, the leader will prepare a `MsgBroadcast` message which is sent to the delegate. A MsgBroadcast just looks like a `MsgAppend` or `MsgSnapshot` to be sent to the delegate: it carries the entries the delegate itself needs and also includes a collection of commissions. And if some members are requiring entries while others are requiring snapshots, the `MsgBroadcast` will be generated with both entries and a snapshot.

A *commission* just describes the range of entries or a snapshot that should be sent to the target.
It looks like: ``` From 18542815c75390b75ccaa0768c1e10beaa0f0edb Mon Sep 17 00:00:00 2001 From: qupeng Date: Fri, 17 Jan 2020 13:54:04 +0800 Subject: [PATCH 7/8] update for the implementation is changed (#1) * update for the implementation is changed Signed-off-by: qupeng * address comments Signed-off-by: qupeng --- text/2019-11-13-follower-replication.md | 87 +++++++++---------------- 1 file changed, 32 insertions(+), 55 deletions(-) diff --git a/text/2019-11-13-follower-replication.md b/text/2019-11-13-follower-replication.md index 5b786814..7304542d 100644 --- a/text/2019-11-13-follower-replication.md +++ b/text/2019-11-13-follower-replication.md @@ -2,26 +2,28 @@ ## Summary -This RFC introduces a new mechanism in Raft Protocol which allows a follower to send raft logs to other followers and learners. The target of this feature is to reduce network transmission costs between different data centers in Log Replication. +This RFC introduces a new mechanism for Raft protocol which allows a follower to send raft logs and snapshots to other followers. The target of this feature is to reduce network transmission cost between different data centers in Log Replication. ## Motivation -In the origin raft protocol, a follower can only receive new raft logs and snapshots from the leader, which could be insufficient in some situations. For example, when a raft cluster is distributed in different data centers, log replication between a new node and the leader is expensive as they are located at different data centers. In this case, internal follower-to-follower transfer in one data center can be far more efficient than the traditionally stipulated leader-to-follower transfer. And in a cluster with massive nodes, the leader can have lower system load due to the less operations in network transmission. +In the origin Raft protocol, a follower can only receive raft logs and snapshots from its leader, which could be insufficient in some situations. 
For example, when a Raft group is distributed across different data centers, log replication between a new node and the leader could be expensive if the two sit in different data centers. In this case, internal follower-to-follower transfer in one data center can be far more efficient than the traditionally stipulated leader-to-follower transfer. And in a cluster with a massive number of nodes, the leader bears a lower system load because it performs fewer network transmissions.

## Detailed design

-There are two key concepts of design:
+There are four key concepts of design:

-- The leader must know each node’s data center in the raft cluster.
-- The leader is able to ask a follower to send raft logs or a snapshot to other followers.
+- Every peer in a Raft group is associated with a `group_id`.
+- Follower-to-follower data transfer is only allowed for peers in the same group.
+- Peers report their `group_id` to their leader through Raft messages.
+- The leader is able to ask a follower to replicate data to other followers.

### The Group and Groups

-We introduce a new concept *group* that represents all the raft nodes in a datacenter and any node which belongs to the group can be called a *group member*. Every raft node should have a new configuration called *Groups* which contains all the groups as the name is. For example, in a 2DC based raft cluster with 5 nodes ABCDE where A, B, and C are at DC1 and DE are at DC2, the `Groups` might be organized like:
+We introduce a new concept *group* that represents all the raft nodes in a data center, and any node which belongs to a group is called a *group member*. Every raft node should have a new configuration called *Groups* which, as the name suggests, contains all the groups.
For example, in a 2DC based raft cluster with 5 nodes ABCDE where A, B, and C are at DC1 and DE are at DC2, the `Groups` might be organized like:

```
-Group1: DC1 -> [A, B, C]
-Group2: DC2 -> [D, E]
+Group1: [A, B, C]
+Group2: [D, E]
```

The leader must be easy to know whether or not nodes belong to the same group or get all the group members. Since `Groups` is a configuration that will never be persistent in Storage and is volatile, a raft node exposes the ability to modify it on the fly with the benefit of flexibility on the upper layer.
@@ -30,9 +32,9 @@ The leader must be easy to know whether or not nodes belong to the same group or

### The Delegate and Commission

As the `Groups` is configured, the leader is able to choose a group member as a *delegate* to send entries to one or the rest group members in Log Replication, which is called Follower Replication. To implement Follower Replication, we basically need three main steps:

-1. The leader picks a group member as a delegate of the group
-2. The leader prepares and sends some *commission*s that indicate what entries the picked delegate will send to others by a new message type MsgBroadcast.
-3. The delegate receives this message and executes the commissions
+1. The leader picks a group member as the delegate of the group.
+2. When processing Log Replication to a delegate, the leader specifies the peers to whom the delegate should replicate logs; the leader itself never sends raft logs to those specified peers.
+3. The delegate replicates entries to the specified peers based on its own progress.
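The volatile group bookkeeping described above can be sketched as follows. This is a minimal illustration in Rust (the language of raft-rs); the `Groups` type and its `update`/`same_group` methods are hypothetical names for this sketch, not the raft-rs API:

```rust
use std::collections::HashMap;

/// Volatile group configuration kept on every peer (never persisted in
/// Storage). Maps are hypothetical stand-ins for whatever the upper layer
/// feeds in; peers and groups are identified by plain u64 IDs.
#[derive(Default)]
struct Groups {
    members: HashMap<u64, Vec<u64>>, // group_id -> member peer IDs
    by_peer: HashMap<u64, u64>,      // peer_id  -> group_id
}

impl Groups {
    /// Record (or update) the group a peer reported in its messages.
    fn update(&mut self, peer: u64, group: u64) {
        if let Some(old) = self.by_peer.insert(peer, group) {
            // Drop the peer from its previous group before re-inserting.
            if let Some(v) = self.members.get_mut(&old) {
                v.retain(|p| *p != peer);
            }
        }
        self.members.entry(group).or_default().push(peer);
    }

    /// Follower-to-follower transfer is only allowed inside one group.
    fn same_group(&self, a: u64, b: u64) -> bool {
        match (self.by_peer.get(&a), self.by_peer.get(&b)) {
            (Some(x), Some(y)) => x == y,
            _ => false,
        }
    }

    /// All known members of a group (empty if the group is unknown).
    fn group_members(&self, group: u64) -> Vec<u64> {
        self.members.get(&group).cloned().unwrap_or_default()
    }
}
```

Because the table is volatile and never persisted, the upper layer can rebuild or modify it on the fly, for example when placement information changes.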
Here is a diagram showing how Follower Replication works:

@@ -48,11 +50,11 @@ Here is a diagram showing how Follower Replication works:
 +------v-----+                       +-----+-----+
 |            |                       |           |
 |   leader   +--------------------------> B      | delegate
-|            |  MsgBroadcast{        |           |
+|            |  MsgAppend {          |           |
 +------^-----+    e[2,5]             +-----+-----+
-      |           Commission(C, [4,5])     |
-      |           Commission(D, [3,5])     |  MsgAppend(e[3,5])
-      |          }                         |
+      |           BcastTargets [C, D]      |
+      |          }                         |  MsgAppend(e[3,5])
+      |                                    |
       |                              +-----v-----+
       |                              |           |
       +--------------------------------+    D    |
@@ -64,64 +66,39 @@ Here is a diagram showing how Follower Replication works:

Note:

- `e[x, y]` stands for all the entries within an index range [x, y] (both inclusive)
-- `Commission(target, [x, y])` stands for a job that the delegate should send `e[x, y]` to the target
+- `BcastTargets` is the list of peers to which the delegate should replicate.

#### Choose a Delegate

-Since a delegate is responsible to send entries to other group members, it must contain entries as much as possible to complete the commission. The leader follows a bunch of rules to choose a delegate of a group:
+Since a delegate is responsible for sending entries to other group members, it should hold as many entries as possible to complete the commission. The leader follows a set of rules to choose the delegate of a group:

1. If all the members are requiring snapshots, choose the delegate randomly.
-2. Among the members who are requiring new entries, choose the node satisfies conditions below :
-
-   1. Must be `recent_active`
-   2. The progress state should be `Replicate` but not `paused`
-   3. The progress has the smallest `match_index`
-
-3. If no delegate is picked, the leader does Log Replication itself. Especially, if a group contains the leader it self, no delegate will be set by default except in some cases like meeting a massively large group. And the feature of enabling choosing delegate of such 'leader group' can be controlled by upper layer.
-
+2.
Among the members who are requiring new entries, choose the node that satisfies the conditions below:

-For rule 1, it's obvious that a node requiring a snapshot is not a proper choice because its progress is too far behind and its raft logs are highly possible to be stale.
+   1. The progress has the maximal `match_index`.

-In most cases, what `recent_active` is true means the node keeps communicating with the leader if `check_quorum` is enabled so rule 2.1 is reasonable.
-
-As for rule 2.2, the point is that in `raft-rs` the `Progress` of a node has a flow control mechanism and the leader shouldn't send messages to a node with `paused` Progress. And `Replicate` state indicates a node is continuously receiving raft logs from the leader, which means this node is somewhat 'healthy' in the viewpoint of the leader.
-
-The rule 2.3 is a little bit subtle. If a node passes rule 2.1 and 2.2, we can say it's an active node with a smooth network connection. Under these circumstances, the node with the smallest `match_index` may have the greatest chance of having enough entries to be sent to every other group member.
+3. If no delegate is picked, the leader does Log Replication itself. In particular, if a group contains the leader itself, no delegate will be set. Whether a delegate may be chosen for such a 'leader group' can be controlled by the upper layer.

If the delegate of a group is determined, it'll be cached in memory for later usage until Groups configuration is changed or the delegate is evicted by rejection message. And the flow control should be valid for any cached delegate.

-#### MsgBroadcast and Commission
+#### How the delegate does replication

-After the delegate is picked, the leader will prepare a `MsgBroadcast` message which is sent to the delegate. A MsgBroadcast just looks like a `MsgAppend` or `MsgSnapshot` to be sent to the delegate, which often only carries entries the delegate needs but also includes a collection of commissions.
But if some members are requiring entries and others are requiring snapshots, the `MsgBroadcast` with be generated with both entries and a snapshot.
-
-A *commission* just describes the range of entries or a snapshot that should be sent to the target. It looks like:
-
-```
-Commission {
-    type // Only two types: Append or Snapshot
-    target // The target group member
-    last_index
-    log_term
-}
-```
+We can add a new field into `Message`: `BcastTargets`, which is a list of peer IDs that the message receiver (the delegate) needs to replicate entries to. When a peer becomes a delegate for the first time, it doesn't know the followers' progress, so it sends only its last log entry to the other peers as a probe. After it receives `MsgAppendResponse` from those peers, it can update their progresses.

You can just treat a commission as the metadata of a MsgAppend or MsgSnapshot since the `last_index` and `log_term` are set by the leader according to its progress set.

-When the delegate receives a `MsgBroadcast`, it might meet any scenario below:
-
-1. If the message declares that the delegate needs a snapshot, it means that all the group members are requiring snapshots. Every commission type should be Snapshot and the delegate just broadcasts to others.
-2. If the message declares that the delegate needs entries, just like the origin raft protocol, it first tries to append incoming entries to its unstable raft log. And if a conflict ocurrs, it sends a rejecting message to the leader and then the leader will try to pick a new delegate to re-send commissions again.
-3. If step 2 succeeds, the delegate will try to execute all the commissions by gathering correspond entries and sending them to each group member. During this time, some of commissions could be failed due to the stale progress state in the leader node. The delegate will collect all the failed commissions and send them back to the leader to trigger a new broadcast message so that Log Replication is always ongoing.
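The fan-out behavior described above — the delegate replicating to each target based on its own view of that target's progress, and probing unknown targets with only its last entry — can be sketched as follows. `Msg` and `fan_out` are simplified, hypothetical stand-ins for the real raft-rs `Message` handling; entries are represented by their indexes only:

```rust
use std::collections::HashMap;

// Simplified stand-in for the raft-rs `Message` type (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
struct Msg {
    from: u64,
    to: u64,
    entries: Vec<u64>,       // indexes of the entries carried by the message
    bcast_targets: Vec<u64>, // peers the receiver (delegate) must replicate to
}

// The delegate turns one leader append into per-target appends. For a target
// with a known match index it sends the entries past that index; for an
// unknown target it sends only the last entry as a probe.
fn fan_out(delegate: u64, msg: &Msg, match_index: &HashMap<u64, u64>) -> Vec<Msg> {
    msg.bcast_targets
        .iter()
        .map(|&peer| {
            let entries = match match_index.get(&peer) {
                Some(&m) => msg.entries.iter().copied().filter(|&i| i > m).collect(),
                None => msg.entries.last().copied().into_iter().collect(),
            };
            Msg { from: delegate, to: peer, entries, bcast_targets: vec![] }
        })
        .collect()
}
```

Once the probed peers answer with `MsgAppendResponse`, the delegate records their match indexes and subsequent fan-outs carry the full missing ranges.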
-
-All the other group members have no idea whether a `MsgAppend` or `MsgSnapshot` comes from the leader or the delegate because the delegate always replaces the `from` of the message with the leader ID when executing a commission. So they finally handle these messages like normal followers.
-
-It's significant to update the inflight of progress when the corresponding commission is generated and rollback the inflight once the leader receives failed commissions in step3 above, which also guarantees that the actual progress of group members (except the delegate) will not be stale.
+Peers can know whether a `MsgAppend` comes from a delegate or the leader according to the `delegate` value in the message. If a peer receives a `MsgAppend` from a delegate, it needs to respond to both the leader and the delegate to sync the `Progress` on them.

## Drawbacks

-When there is a large group or many groups, the log committing speed could be slow as entries will be sent to delegate first.
+1. Committing speed could be slow if `quorum` must involve peers in different data centers.
+2. When handling a Raft message, a hashmap query to get a peer's group and delegate is introduced.
+3. When a delegate is dismissed due to a network partition, the leader resumes Log Replication to the peers in the delegate's group. But the `Inflights` information on the delegate is not inherited by the leader, so the leader starts sending messages with a fresh `Inflights`, which breaks the original flow control. And vice versa when the leader picks a delegate.

## Alternatives

-## Unresolved questions
+In the current design, the leader not only lets the delegate replicate logs to other peers but also delegates the whole flow control to it. We benefit from the lighter cross-data-center network transmission between the leader and peers, but suffer from the issues described in the **Drawbacks** section.
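The dual response described above — a follower acknowledging both the leader and the delegate so that both keep a fresh `Progress` — can be sketched like this; `Append`, `Resp`, and the `delegate` field are illustrative names for this sketch, not the actual raft-rs message schema:

```rust
// Incoming append as seen by a follower. When `delegate` is set, the message
// was relayed by a delegate rather than sent by the leader directly.
struct Append {
    index: u64,
    leader: u64,
    delegate: Option<u64>,
}

// A minimal acknowledgement carrying the acked index to one recipient.
struct Resp {
    to: u64,
    index: u64,
}

// Always acknowledge the leader; additionally acknowledge the delegate when
// the message came through one, so both sides can advance their `Progress`.
fn ack(msg: &Append) -> Vec<Resp> {
    let mut resps = vec![Resp { to: msg.leader, index: msg.index }];
    if let Some(d) = msg.delegate {
        if d != msg.leader {
            resps.push(Resp { to: d, index: msg.index });
        }
    }
    resps
}
```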
There is an alternative design in which the leader doesn't 'delegate' the flow control. When the leader sends `BcastTargets` to the delegate, it attaches the `last_index` of each member in `BcastTargets` to indicate that peer's progress from the leader's view. In this way, the message flow is always controlled by the leader and the `Inflights` inconsistency can be avoided.

-The rule 2.3 for choosing a delegate might need some rethink.
+## Unresolved questions

From dc858e5bece347a43972b81f631e1f47ee5c2d58 Mon Sep 17 00:00:00 2001
From: Fullstop000
Date: Fri, 24 Jul 2020 11:46:59 +0800
Subject: [PATCH 8/8] add compatibility section

Signed-off-by: Fullstop000
---
 text/2019-11-13-follower-replication.md | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/text/2019-11-13-follower-replication.md b/text/2019-11-13-follower-replication.md
index 7304542d..fe7d3794 100644
--- a/text/2019-11-13-follower-replication.md
+++ b/text/2019-11-13-follower-replication.md
@@ -74,12 +74,9 @@ Since a delegate is responsible to send entries to other group members, it must

1. If all the members are requiring snapshots, choose the delegate randomly.
2. Among the members who are requiring new entries, choose the node that satisfies the conditions below:
-
-   1. The progress has the maximal `match_index`.
-
3. If no delegate is picked, the leader does Log Replication itself. In particular, if a group contains the leader itself, no delegate will be set. Whether a delegate may be chosen for such a 'leader group' can be controlled by the upper layer.

-If the delegate of a group is determined, it'll be cached in memory for later usage until Groups configuration is changed or the delegate is evicted by rejection message.
+If the delegate of a group is determined, it'll be cached in memory for later usage until `Groups` configuration is changed or the delegate is evicted by a rejection message.
And the flow control should be valid for any cached delegate.

#### How the delegate does replication

@@ -89,6 +86,19 @@ You can just treat a commission as the metadata of a MsgAppend or MsgSnapshot si

Peers can know whether a `MsgAppend` comes from a delegate or the leader according to the `delegate` value in the message. If a peer receives a `MsgAppend` from a delegate, it needs to respond to both the leader and the delegate to sync the `Progress` on them.

+#### Compatibility
+A node in a group will send its `group_id` in messages, and the receiver will update the sender's group info accordingly. The leader only picks a delegate when it has enough group info. In a rolling-upgrade/downgrade situation, this can introduce several cases:
+
+##### Upgrade
+- If only the leader supports Follower Replication, it cannot know the group info until followers finish upgrading and send their `group_id`, so the leader uses the original log replication until then.
+- If only a follower or part of them support Follower Replication, the leader will just ignore the `group_id` in their messages, so no delegate will be picked and the original log replication keeps working.
+
+##### Downgrade
+- If only the leader is downgraded to the original log replication, the case is just like a common raft cluster and nothing special happens.
+- If a follower is downgraded and stops informing the leader of its `group_id`, the leader will remove it from the group system and send entries directly to it.
+
+By such a design, compatibility is guaranteed while nodes in the cluster use either the original log replication or Follower Replication.

## Drawbacks

1. Committing speed could be slow if `quorum` must involve peers in different data centers.
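The rolling-upgrade fallback described in the Compatibility section boils down to a single predicate: use a delegate only for a peer whose `group_id` is known and whose group already has a delegate other than that peer. A minimal sketch, with all names hypothetical:

```rust
use std::collections::HashMap;

// Decide whether the leader may route entries for `peer` through a delegate.
// A peer that never reported a `group_id` (e.g. one still running an old
// version) always falls back to the original leader-to-follower replication.
fn replicate_via_delegate(
    peer: u64,
    groups: &HashMap<u64, u64>,    // peer_id  -> reported group_id
    delegates: &HashMap<u64, u64>, // group_id -> chosen delegate
) -> bool {
    match groups.get(&peer) {
        // Group known: use the delegate unless the peer is the delegate itself.
        Some(g) => delegates.get(g).map_or(false, |&d| d != peer),
        // No group_id reported: original log replication.
        None => false,
    }
}
```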