-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
gnatsd: -health, -rank, -lease, -beat
options to control health monitoring. The InternalClient interface offers a general plugin interface for running internal clients within a gnatsd process. The -health flag to gnatsd starts an internal client that runs a leader election among the available gnatsd instances and publishes cluster membership changes to a set of cluster health topics. The -beat and -lease flags control how frequently health checks are run, and how long leader leases persist. The health agent can also be run standalone as healthcmd. See the main method in gnatsd/health/healthcmd. The -rank flag to gnatsd adds priority rank assignment from the command line. The lowest ranking gnatsd instance wins the lease on the current election. The election algorithm is described in gnatsd/health/ALGORITHM.md and is implemented in gnatsd/health/health.go. Fixes #433
- Loading branch information
Showing
22 changed files
with
2,893 additions
and
17 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,148 @@ | ||
# The ALLCALL leader election algorithm. | ||
|
||
Jason E. Aten | ||
|
||
February 2017 | ||
|
||
definition, with example value. | ||
----------- | ||
|
||
Let heartBeat = 3 sec. This is how frequently | ||
we will assess cluster health by | ||
sending out an allcall ping. | ||
|
||
givens | ||
-------- | ||
|
||
* Given: Let each server have a numeric integer rank, that is distinct | ||
and unique to that server. If necessary an extremely long | ||
true random number is used to break ties between server ranks, so | ||
that we may assert, with probability 1, that all ranks are distinct. | ||
Server can be put into a strict total order. | ||
|
||
* Rule: The lower the rank is preferred for being the leader. | ||
|
||
|
||
ALLCALL Algorithm | ||
=========================== | ||
|
||
### I. In a continuous loop | ||
|
||
The server always accepts and respond | ||
to allcall broadcasts from other cluster members. | ||
|
||
The allcall() ping asks each member of | ||
the server cluster to reply. Replies | ||
provide the responder's own assigned rank | ||
and identity. | ||
|
||
### II. Election | ||
|
||
Election are computed locally, after accumulating | ||
responses. | ||
|
||
After issuing an allcall, the server | ||
listens for a heartbeat interval. | ||
|
||
At the end of the interval, it sorts | ||
the respondents by rank. Since | ||
the replies are on a broadcast | ||
channel, more than one simultaneous | ||
allcall reply may be incorporated into | ||
the set of respondents. | ||
|
||
The lowest ranking server is elected | ||
as leader. | ||
|
||
## Safety/Convergence: ALLCALL converges to one leader | ||
|
||
Suppose two nodes are partitioned and so both are leaders on | ||
their own side of the network. Then suppose the network | ||
is joined again, so the two leaders are brought together | ||
by a healing of the network, or by adding a new link | ||
between the networks. The two nodes exchange Ids and | ||
ranks via responding to allcalls, and compute | ||
who is the new leader. | ||
|
||
Hence the two leader situation persists for at | ||
most one heartbeat term after the network join. | ||
|
||
## Liveness: a leader will be chosen | ||
|
||
Given the total order among nodes, exactly one | ||
will be lowest rank and thus be the preferred | ||
leader. If the leader fails, the next | ||
heartbeat of allcall will omit that | ||
server from the candidate list, and | ||
the next ranking server will be chosen. | ||
|
||
Hence, with at least one live | ||
node, the system can run for at most one | ||
heartbeat term before electing a leader. | ||
Since there is a total order on | ||
all live (non-failed) servers, only | ||
one will be chosen. | ||
|
||
## commentary | ||
|
||
ALLCALL does not guarantee that there will | ||
never be more than one leader. Availability | ||
in the face of network partition is | ||
desirable in many cases, and ALLCALL is | ||
appropriate for these. This is congruent | ||
with Nats design as an always-on system. | ||
|
||
ALLCALL does not guarantee that a | ||
leader will always be present, but | ||
with live nodes it does provide | ||
that the cluster will have a leader | ||
after one heartbeat term has | ||
been initiated and completed. | ||
|
||
By design, ALLCALL functions well | ||
in a cluster with any number of nodes. | ||
One and two nodes, or an even number | ||
of nodes, will work just fine. | ||
|
||
Compared to quorum based elections | ||
like raft and paxos, where an odd | ||
number of at least three | ||
nodes is required to make progress, | ||
this can be very desirable. | ||
|
||
ALLCALL is appropriate for AP, | ||
rather than CP, style systems, where | ||
availability is more important | ||
than having a single writer. When | ||
writes are idempotent or deduplicated | ||
downstream, this is typically preferred. | ||
|
||
prior art | ||
---------- | ||
|
||
ALLCALL is a simplified version of | ||
the well known Bully Algorithm[1][2] | ||
for election in a distributed system | ||
of arbitrary graph. | ||
|
||
ALLCALL does less bullying, and lets | ||
nodes arrive at their own conclusions. | ||
The essential broadcast and ranking | ||
mechanism, however, is identical. | ||
|
||
[1] Hector Garcia-Molina, Elections in a | ||
Distributed Computing System, IEEE | ||
Transactions on Computers, | ||
Vol. C-31, No. 1, January (1982) 48–59 | ||
|
||
[2] https://en.wikipedia.org/wiki/Bully_algorithm | ||
|
||
implementation | ||
------------ | ||
|
||
ALLCALL is implented on top of | ||
the Nats (https://nats.io) system, see the health/ | ||
subdirectory of | ||
|
||
https://github.com/nats-io/gnatsd | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,77 @@ | ||
package health | ||
|
||
import ( | ||
"fmt" | ||
"net" | ||
"time" | ||
|
||
"github.com/nats-io/gnatsd/server" | ||
) | ||
|
||
// Agent implements the InternalClient interface. | ||
// It provides health status checks and | ||
// leader election from among the candidate | ||
// gnatsd instances in a cluster. | ||
type Agent struct { | ||
opts *server.Options | ||
mship *Membership | ||
} | ||
|
||
// NewAgent makes a new Agent. | ||
func NewAgent(opts *server.Options) *Agent { | ||
return &Agent{ | ||
opts: opts, | ||
} | ||
} | ||
|
||
// Name should identify the internal client for logging. | ||
func (h *Agent) Name() string { | ||
return "health-agent" | ||
} | ||
|
||
// Start makes an internal | ||
// entirely in-process client that monitors | ||
// cluster health and manages group | ||
// membership functions. | ||
// | ||
func (h *Agent) Start( | ||
info server.Info, | ||
opts server.Options, | ||
logger server.Logger, | ||
|
||
) (net.Conn, error) { | ||
|
||
// To keep the health client fast and its traffic | ||
// internal-only, we use an bi-directional, | ||
// in-memory version of a TCP stream. | ||
// | ||
// The buffers really do have to be of | ||
// sufficient size, or we will | ||
// deadlock/livelock the system. | ||
// | ||
cli, srv, err := NewInternalClientPair() | ||
if err != nil { | ||
return nil, fmt.Errorf("NewInternalClientPair() returned error: %s", err) | ||
} | ||
|
||
rank := opts.HealthRank | ||
beat := opts.HealthBeat | ||
lease := opts.HealthLease | ||
|
||
cfg := &MembershipCfg{ | ||
MaxClockSkew: time.Second, | ||
BeatDur: beat, | ||
LeaseTime: lease, | ||
MyRank: rank, | ||
CliConn: cli, | ||
Log: logger, | ||
} | ||
h.mship = NewMembership(cfg) | ||
go h.mship.Start() | ||
return srv, nil | ||
} | ||
|
||
// Stop halts the background goroutine. | ||
func (h *Agent) Stop() { | ||
h.mship.Stop() | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,89 @@ | ||
package health | ||
|
||
import ( | ||
"encoding/json" | ||
"os" | ||
"time" | ||
|
||
"github.com/nats-io/go-nats" | ||
) | ||
|
||
// AgentLoc conveys to interested parties | ||
// the Id and location of one gnatsd | ||
// server in the cluster. | ||
type AgentLoc struct { | ||
ID string `json:"serverId"` | ||
Host string `json:"host"` | ||
Port int `json:"port"` | ||
|
||
// Are we the leader? | ||
IsLeader bool `json:"leader"` | ||
|
||
// LeaseExpires is zero for any | ||
// non-leader. For the leader, | ||
// LeaseExpires tells you when | ||
// the leaders lease expires. | ||
LeaseExpires time.Time `json:"leaseExpires"` | ||
|
||
// lower rank is leader until lease | ||
// expires. Ties are broken by ID. | ||
// Rank should be assignable on the | ||
// gnatsd command line with -rank to | ||
// let the operator prioritize | ||
// leadership for certain hosts. | ||
Rank int `json:"rank"` | ||
|
||
// Pid or process id is the only | ||
// way to tell apart two processes | ||
// sometimes, if they share the | ||
// same nats server. | ||
// | ||
// Pid is the one difference between | ||
// a nats.ServerLoc and a health.AgentLoc. | ||
// | ||
Pid int `json:"pid"` | ||
} | ||
|
||
func (s *AgentLoc) String() string { | ||
by, err := json.Marshal(s) | ||
panicOn(err) | ||
return string(by) | ||
} | ||
|
||
func (s *AgentLoc) fromBytes(by []byte) error { | ||
return json.Unmarshal(by, s) | ||
} | ||
|
||
func alocEqual(a, b *AgentLoc) bool { | ||
aless := AgentLocLessThan(a, b) | ||
bless := AgentLocLessThan(b, a) | ||
return !aless && !bless | ||
} | ||
|
||
func slocEqualIgnoreLease(a, b *AgentLoc) bool { | ||
a0 := *a | ||
b0 := *b | ||
a0.LeaseExpires = time.Time{} | ||
a0.IsLeader = false | ||
b0.LeaseExpires = time.Time{} | ||
b0.IsLeader = false | ||
|
||
aless := AgentLocLessThan(&a0, &b0) | ||
bless := AgentLocLessThan(&b0, &a0) | ||
return !aless && !bless | ||
} | ||
|
||
// the 2 types should be kept in sync. | ||
// We return a brand new &AgentLoc{} | ||
// with contents filled from loc. | ||
func natsLocConvert(loc *nats.ServerLoc) *AgentLoc { | ||
return &AgentLoc{ | ||
ID: loc.ID, | ||
Host: loc.Host, | ||
Port: loc.Port, | ||
IsLeader: loc.IsLeader, | ||
LeaseExpires: loc.LeaseExpires, | ||
Rank: loc.Rank, | ||
Pid: os.Getpid(), | ||
} | ||
} |
Oops, something went wrong.