An idea for a Leeds Code Dojo session - implement Paxos! In fact, just the interesting core: the Synod protocol.
Synod is a protocol for reaching agreement on a single value in a distributed system. Grant suggests we should use it to agree on a name for our new startup, so get thinking of a good name to propose!
The thing about distributed systems is that you have to assume the individual bits of the system are rather unreliable. In more detail:
- modules may fail (i.e. simply stop responding to messages) at any time
- modules may take arbitrarily long to respond
- messages may be dropped in transit (i.e. there is no guarantee that a sent message is ever received)
- messages may be delayed in transit and may not be received in the order in which they were sent
For many years it was believed that with all these things that could go wrong it wouldn't be possible to reliably do anything at all. This protocol was actually discovered in the process of trying to prove that no such protocol could exist.
Of course if things go too wrong then you can't really hope to achieve anything, but this protocol guarantees that the worst thing that can happen is that no agreement is reached, and this only happens while
- the network is dropping too many messages
- messages are being delayed too heavily in transit, or
- too many modules have failed
All of these problems tend to be transient: networks are eventually fixed and failed modules are eventually replaced, and when the situation returns to normal the protocol carries on and reaches agreement.
The system is unreliable but not actively malicious:
- modules all obey the protocol correctly
- messages may not be altered in transit
- message delivery is all-or-nothing
There are four kinds of module in the protocol:
- Nag
- Proposer
- Acceptor and
- Learner
All the Nag does is periodically broadcast a message indicating the current time, which is rather boring so has already been done. The intention is that each team implements one of the other three modules. We need at least one Learner, one Proposer and exactly three Acceptors:
When everything is working smoothly, the protocol runs as follows. First, the Nag broadcasts the current time period to the Acceptors:
When each Acceptor receives this broadcast, it sends a message to the Proposer, promising that in future it won't accept any proposals made before the current time period.
When the Proposer receives these promises messages from a majority of Acceptors (i.e. two of them), it proposes a value to agree upon in the current time period by broadcasting a message back to all the Acceptors.
Since the Acceptors have not made any further promises, they all accept the proposed value by sending messages to the Learner. Once the Learner receives these messages for the same time period from a majority of Acceptors (i.e. two of them), it has learned that this value is the one that the system has agreed upon.
There has to be exactly three Acceptors but there can be as many Learners and Proposers as desired. It's much more interesting if there are more than one of each:
-
With one Proposer there is only one value to be proposed (so inconsistency is impossible) and the system will stop working if this Proposer fails.
-
With one Learner there is only one place where a value is learned (so inconsistency is impossible) and the system will stop working if this Learner fails.
For the sake of interoperability, in our implementation the modules communicate with each other using JSON-formatted messages. This section gives the details of the messages each module may send and receive.
The Learner is the simplest module. Roughly speaking, it keeps track of which values have been accepted, and when it sees that a majority of Acceptors (i.e. two of them) are in agreement, it learns that consensus has been reached.
More precisely, it receives messages that look like this:
{"type":"accepted","timePeriod":$TIMEPERIOD,"by":$NAME,"value":$VALUE}
It doesn't send any messages, but should report to the user when it has learned
a value. It learns a value by receiving two messages for the same $TIMEPERIOD
(a positive integer) but different $NAME
s (which are strings). In this case,
the $VALUE
of the two messages will always be the same, so it doesn't matter
which one you choose to report.
A simple Learner implementation is to keep a list of all messages received and
check each new message against all the items that are already in the list,
looking for pairs that match on $TIMEPERIOD
but not on $NAME
. When such a
pair is found, simply print out the $VALUE
from either message.
Here is an example of the expected behaviour:
{"type":"accepted","timePeriod":1,"by":"alice","value":"value 1"}
// no value learned - no previous messages
{"type":"accepted","timePeriod":2,"by":"brian","value":"value 2"}
// no value learned - different $TIMEPERIOD from previous message
{"type":"accepted","timePeriod":2,"by":"brian","value":"value 2"}
// no value learned - same $NAME as previous messages
{"type":"accepted","timePeriod":2,"by":"chris","value":"value 2"}
-> 'value 2' // value learned - same $TIMEPERIOD but different $NAME compared with previous message
The Proposer is the next simplest module. Its job is to propose ideas for the
name of our startup. It knows your idea, which we'll call $MYVALUE
here.
Roughly speaking, it proposes $MYVALUE
if it hasn't receieved any other
ideas, but in the interests of harmony it prefers to agree with other
proposers' ideas when it hears about them. It also only makes a proposal when
it's received promises from a majority of Acceptors that they will accept it.
It is, in short, rather timid.
More precisely, it receives messages that look like one of these (which are sent by Acceptors):
{"type":"promised","timePeriod":$TIMEPERIOD,"by":$NAME,"haveAccepted":false}
{"type":"promised","timePeriod":$TIMEPERIOD,"by":$NAME,"lastAcceptedTimePeriod":$LATP,"lastAcceptedValue":$LAV}
When it has received two of these promised
messages for the same
$TIMEPERIOD
(a positive integer) with different $NAME
s (which are strings),
it should respond with a message like this:
{"type":"proposed","timePeriod":$TIMEPERIOD,"value":$VALUE}
In this situation, $VALUE
depends on the $LATP
and $LAV
values in
the two promised
messages as follows:
-
If neither message includes these values then use
$MYVALUE
. -
If just one of the
promised
messages includes these values then use its$LAV
. -
If both of the
promised
messages include these values then use the most recent$LAV
(i.e. the one with the greater value for$LATP
). In the case of a tie, either will do.
It must only ever send out a single proposed
message for each $TIMEPERIOD
.
A simple Proposer implementation is to keep
- a list of all messages received, and
- the latest
$TIMEPERIOD
for which aproposed
message have been sent.
When a new message is received, ignore it if its $TIMEPERIOD
value is no
greater than the latest-proposed $TIMEPERIOD
and otherwise check it against
all the other received messages. If any of them match on $TIMEPERIOD
but not
on $NAME
then take the first one and send a proposed
message with $VALUE
calculated as described above.
Here is an example of the expected behaviour:
{"type":"promised","timePeriod":1,"by":"alice","haveAccepted":false}
// nothing proposed - no previous messages
{"type":"promised","timePeriod":2,"by":"brian","haveAccepted":false}
// nothing proposed - different $TIMEPERIOD from previous message
{"type":"promised","timePeriod":2,"by":"brian","haveAccepted":false}
// nothing proposed - different $TIMEPERIOD from first message and same $NAME as second message
{"type":"promised","timePeriod":2,"by":"chris","haveAccepted":false}
-> {"type":"proposed","timePeriod":2,"value":"my awesome startup name"}
// proposal made using $MYVALUE as no $LAV given in any promises
{"type":"promised","timePeriod":2,"by":"alice","haveAccepted":false}
// nothing proposed - have already made a proposal for time period 2
{"type":"promised","timePeriod":3,"by":"brian","haveAccepted":false}
// nothing proposed - different $TIMEPERIOD from all earlier messages
{"type":"promised","timePeriod":3,"by":"alice","lastAcceptedTimePeriod":1,"lastAcceptedValue":"AliceCo"}
-> {"type":"proposed","timePeriod":3,"value":"AliceCo"}
// proposal made using $LAV from Alice's promise as it included a lastAcceptedValue field
{"type":"promised","timePeriod":4,"by":"alice","lastAcceptedTimePeriod":1,"lastAcceptedValue":"AliceCo"}
// nothing proposed - different $TIMEPERIOD from all earlier messages
{"type":"promised","timePeriod":4,"by":"brian","haveAccepted":false}
-> {"type":"proposed","timePeriod":4,"value":"AliceCo"}
// proposal made using $LAV from Alice's promise as it included a lastAcceptedValue field
{"type":"promised","timePeriod":5,"by":"alice","lastAcceptedTimePeriod":1,"lastAcceptedValue":"AliceCo"}
// nothing proposed - different $TIMEPERIOD from all earlier messages
{"type":"promised","timePeriod":5,"by":"brian","lastAcceptedTimePeriod":2,"lastAcceptedValue":"BrianCo"}
-> {"type":"proposed","timePeriod":5,"value":"BrianCo"}
// proposal made using $LAV from Brian's promise as it has the greater $LATP (so is fresher than Alice's)
{"type":"promised","timePeriod":6,"by":"brian","lastAcceptedTimePeriod":2,"lastAcceptedValue":"BrianCo"}
// nothing proposed - different $TIMEPERIOD from all earlier messages
{"type":"promised","timePeriod":6,"by":"alice","lastAcceptedTimePeriod":1,"lastAcceptedValue":"AliceCo"}
-> {"type":"proposed","timePeriod":6,"value":"BrianCo"}
// proposal made using $LAV from Brian's promise as it has the greater $LATP (so is fresher than Alice's)
The Acceptor is the most complicated module. Its job is to accept proposals, but it also makes promises about which proposals it will accept in the future in order to persuade the somewhat timid Proposer to make the proposals in the first place.
It has a name, $NAME
(one of "alice"
, "brian"
or "chris"
) which will be
agreed in advance as it must not clash with that of the other Acceptors. It
must include its name in the by
field of any messages it sends.
It receives two kinds of message:
{"type":"prepare","timePeriod":$TIMEPERIOD}
{"type":"proposed","timePeriod":$TIMEPERIOD,"value":$VALUE}
The prepare
message comes from the Nag and indicates the start of a new time
period. The proposed
message comes from a Proposer and proposes a value for
the given time period.
Under the conditions set out below it may respond to these with:
// in response to a 'prepare':
{"type":"promised","timePeriod":$TIMEPERIOD,"by":$NAME,"haveAccepted":false}
{"type":"promised","timePeriod":$TIMEPERIOD,"by":$NAME,"lastAcceptedTimePeriod":$LATP,"lastAcceptedValue":$LAV}
// in response to a 'proposed':
{"type":"accepted","timePeriod":$TIMEPERIOD,"by":$NAME,"value":$VALUE}
Throughout, $TIMEPERIOD
and $LATP
are positive integers, and all other
values are strings.
When it receives a prepare
message the Acceptor should compare it to the last
accepted
message it sent:
-
If it has not yet sent any
accepted
messages, it should respond with apromised
message without thelastAccepted*
fields. -
If the last
accepted
message was sent in a strictly earlier time period than theprepare
message's$TIMEPERIOD
then it should respond with apromised
message with thelastAccepted*
fields, where$LAV
and$LATP
are respectively set to the$VALUE
and$TIMEPERIOD
of the lastaccepted
message. -
If the last
accepted
message was sent in an equal or later time period than theprepare
message's$TIMEPERIOD
then nopromised
message is sent.
The purpose of the promised
message is to indicate that this Acceptor will no
longer accept any proposals from earlier time periods.
When it receives a proposed
message, it may respond with a corresponding
accepted
message as long as
- doing so does not break any previous promises (i.e. it has not sent a
promised
message for a strictly later time period) - doing so does not clash with any previous acceptances (i.e. all
accepted
messages sent so far were for strictly earlier time periods)
A simple Acceptor implementation is to track the greatest $TIMEPERIOD
for
which an accepted
message has been sent (along with the corresponding
$VALUE
) and also to track the greatest $TIMEPERIOD
for which a promised
message has been sent. Here is an example of the expected behaviour:
{"type":"prepare","timePeriod":2}
-> {"type":"promised","timePeriod":2,"by":"me","haveAccepted":false}
// NB no "lastAccepted*" as nothing accepted yet
{"type":"prepare","timePeriod":1}
-> {"type":"promised","timePeriod":1,"by":"me","haveAccepted":false}
// ok to send out an earlier promise too
{"type":"prepare","timePeriod":2}
-> {"type":"promised","timePeriod":2,"by":"me","haveAccepted":false}
// ok to send out a duplicate promise
{"type":"proposed","timePeriod":1,"value":"value 1"}
// have promised not to accept proposals from time periods before 2 so do not respond to this
{"type":"proposed","timePeriod":2,"value":"value 2"}
-> {"type":"accepted","timePeriod":2,"by":"me","value":"value 2"}
// ok to accept this proposal as it is consistent with all promises: it is not in a time period before 2
{"type":"prepare","timePeriod":1}
// last accepted message was sent in a later time period, so no response
{"type":"prepare","timePeriod":2}
// last accepted message was sent in this time period, so no response
{"type":"prepare","timePeriod":3}
-> {"type":"promised","timePeriod":3,"by":"me","lastAcceptedTimePeriod":2,"lastAcceptedValue":"value 2"}
// send out a promise, but now includes "lastAccepted*" fields
{"type":"proposed","timePeriod":4,"value":"value 4"}
-> {"type":"accepted","timePeriod":4,"by":"me","value":"value 4"}
// ok to accept this proposal as it is from time period >= 3 so consistent with all promises
{"type":"proposed","timePeriod":4,"value":"different value 4"}
// have already accepted proposal 4, so do not accept it again.
{"type":"proposed","timePeriod":3,"value":"value 3"}
// have already accepted proposal 4, so do not accept earlier-numbered
// proposals.
For the sake of simplicity, the modules will communicate with one another via a HTTP messaging system running on the internet. Each team will get their own URL for accessing the messaging system.
To get a pending message, perform a HTTP GET to your URL. If there is a message
waiting for you, it will immediately be returned with status 200 OK
. If there
are no messages then the GET will wait for a while to see if one turns up.
Eventually, it will return 204 No Content
if the wait was fruitless. You can
immediately retry the request at this point.
To send a message, perform a HTTP POST to your URL with Content-type: application/json
and the message in the body of the request. This will return
204 No Content
if successful or 400 Bad Request
if the message was not
correctly formatted.
The Nag is built into the messaging system, and it also has the facility to simulate network problems (message drops and delays) in order to experiment with the protocol's reaction to failures.
Here are some diagrams that illustrate how the modules described above all work together to achieve consensus on a single value.
First consider the simplest situation again: one Proposer, one Learner and three Acceptors.
The Nag sends broadcasts a prepare
message to the Acceptors for the first time
period.
In turn, the Acceptors send promised
messages to the Proposer. Note that none
of these have lastAccepted*
fields since no acceptor has accepted a value
yet.
When the Proposer receives two of these promised
messages, it broadcasts a
proposed
message back to all the Acceptors. Since none of the promised
messages have lastAccepted*
fields, the value proposed is $MYVALUE
.
The Acceptors may all accept this proposal as it's compatible with the promises
they made previously. They send accepted
messages to the Learner, and once
the Learner receives two of these messages it learns that consensus has been
achieved.
This demonstrates the basic message flow, but since there's only one Proposer and one the Learner the system cannot end up disagreeing on the value chosen. Consider a more complicated situation with two Proposers and two Learners as follows.
The Nag sends broadcasts a prepare
message to the Acceptors for the first time
period.
Again, the Acceptors all send out promised
messages, and again these do not
have lastAccepted*
fields since no value has been accepted.
Note also that this is not a broadcast: these messages are routed to a single
Proposer. In a real implementation, the choice of Proposer for this time period
would be made by the Nag in the very first step, and this choice would have to
be passed around in every message. This only serves to make it harder to
understand what is going on, so it has been omitted. Instead, since the Nag is
built into the messaging system it can route promised
messages to the right
proposer without bothering the individual modules with this detail.
If things carried on without a hitch then the second Proposer wouldn't get to
do anything, so let's consider the case where two of the promised
messages
are not delivered.
Since the Proposer has only received a single promised
message, it does
nothing. Eventually the Nag broadcasts another prepare
message for the second
time period.
The Acceptors all send out further promised
messages which are routed to the
second Proposer. Note that still no Acceptor has accepted a value yet, so
these promised
messages still do not have lastAccepted*
fields.
They arrive successfully, so the Proposer broadcasts a proposal
with its own
value.
The Acceptors may all accept this proposal as it's compatible with the promises
they made previously. They broadcast accepted
messages to the Learners, which
are all delivered successfully so both Learners learn the value chosen.
But there's more! Imagine that the undelivered promised
messages were merely
delayed and not dropped, and that one more of them is now delivered.
At this point the first Proposer broadcasts a proposed
message with its own
value, which is different from the other Proposer's value.
But this proposed
message is not compatible with the Acceptors' promises any
more: they promised to accept no propositions in time periods before the
second, but this proposal was for the first time period. The message is dropped
by all Acceptors, and there is no inconsistency in the values accepted.
Here is an illustration of another kind of network glitch, showing why
the Acceptors include the details of what they have previously accepted. Again,
it starts with the Nag broadcasting a prepare
message for the first time period.
The Acceptors send out promised
messages which are routed to the first
Proposer.
When the Proposer receives two, it proposes its value.
This proposal is compatible with earlier promises, so it is accepted by all the Acceptors. But this time the network glitch drops all the messages to the second learner. The first learner learns the proposed value, but the second is none the wiser.
Eventually, the Nag sends out another prepare
message for the second time
period.
The Acceptors send out promised
messages in reaction. However, since they
have all previously accepted a value, these messages include lastAccepted*
fields.
When the Proposer receives two of these promised
messages, it sends out a
proposal. Crucially, it may not propose its own value as it received
lastAccepted*
fields, which by the rules above it must propose instead. Thus
it proposes the same value as the one the other Proposer previously proposed.
The proposal is compatible with the promises, so it is accepted. This time, the
accepted
messages make it through to the other Learner so it now learns the
same value that the first Learner learned in the previous time period.
There are obviously many more ways that things can go wrong, but it is possible to prove that no matter how badly the network is performing everything still remains consistent.