
Server rebalance refactor WIP #1743

Closed
wants to merge 75 commits from the f-rebalance-worker branch
Conversation

slackpad (Contributor)

This is a work-in-progress PR; take a quick look and provide early feedback as it is finished up.

In cases where i+1 is a power of two, skip one modulo operation.
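(A minimal sketch of the trick, using a hypothetical helper rather than the PR's actual code: when n = i+1 is a power of two, x % n equals x & (n-1), so the modulo instruction can be skipped.)

```go
// fastIndex avoids the modulo instruction when n is a power of two,
// since x % n == x & (n-1) in that case. Assumes n > 0.
func fastIndex(x, n uint32) uint32 {
	if n&(n-1) == 0 { // n is a power of two
		return x & (n - 1)
	}
	return x % n
}
```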
Prep for breaking out maintenance of consuls into a new goroutine.
This mechanism isn't going to provide much value in the future.  Preemptively reduce the complexity of future work.
It is theoretically possible for the queue of serf events to back up.  If this happens, emit a warning message when there are more than 200 events in the queue.

Most notably, this can happen if `c.consulServerLock` is held for an "extended period of time".  The probability of anyone ever seeing this log message is hopefully low to nonexistent, but if it happens, the warning message indicating a large number of serf events fired while a lock was held is likely to be helpful (vs serf mysteriously blocking when attempting to add an event to a channel).
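(A minimal sketch of the warning, assuming a buffered event channel; the 200 threshold comes from the commit message, while the type and field names are illustrative stand-ins for the client's internals.)

```go
import (
	"log"

	"github.com/hashicorp/serf/serf"
)

const warnSerfQueueDepth = 200 // threshold from the commit message

// client is a stand-in; only the fields this sketch touches are shown.
type client struct {
	logger  *log.Logger
	eventCh chan serf.Event
}

// submitEvent warns when the buffered event channel has backed up
// (e.g. because a lock was held for an extended period), then
// performs the normal, possibly blocking, send.
func (c *client) submitEvent(e serf.Event) {
	if depth := len(c.eventCh); depth > warnSerfQueueDepth {
		c.logger.Printf("[WARN] consul: %d serf events queued", depth)
	}
	c.eventCh <- e
}
```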
Trivial change that makes it possible for developers to set an environment variable and change the output of `go test` to be detailed (e.g. `GOTEST_FLAGS=-v`).
Expanding the domain of lastServer beyond RPC() changes the meaning of this variable.  Rename accordingly to match the intent coming in a subsequent commit: a background thread will be in charge of rotating preferredServer.
A server is not normally disabled, but in the event of an RPC error, we want to mark a server as down to allow for fast failover to a different server.  This value must be an int in order to support atomic operations.

Additionally, this is the preliminary work required to bring up a server in a disabled state.  RPC health checks in the future could mark the server as alive, thereby creating an organic "slow start" feature for Consul.
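(A minimal sketch of the flag, assuming it lives on a server-details struct; the field and helper names are illustrative.)

```go
import "sync/atomic"

// ServerDetails is a stand-in; Disabled is a uint64 rather than a
// bool so sync/atomic can flip it without taking a lock.
type ServerDetails struct {
	Disabled uint64 // 0 = healthy, nonzero = marked down
}

// markDown flags the server after an RPC error so the client fails
// over to a different server quickly.
func markDown(s *ServerDetails) { atomic.StoreUint64(&s.Disabled, 1) }

// isDown reports whether the server is currently marked down.
func isDown(s *ServerDetails) bool { return atomic.LoadUint64(&s.Disabled) != 0 }
```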
Move the management of c.consulServers (formerly c.consuls) into consul/server_manager.go.

This commit brings in a background task that proactively manages the server list and:

*) reshuffles the list
*) manages the timer out of the RPC() path
*) uses atomics to detect that a server has failed

This is a WIP; more testing work needs to be completed.  A rough sketch of such a loop follows.
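(The sketch below shows what such a background loop can look like under the stated assumptions; the type, field, and method names are illustrative, not the PR's actual code.)

```go
import "time"

// ServerManager is a stand-in; only the fields the loop touches are shown.
type ServerManager struct {
	rebalanceInterval time.Duration
	shutdownCh        chan struct{}
}

func (sm *ServerManager) shuffleServers() { /* Fisher-Yates over the list */ }

// Start reshuffles the server list on a timer, keeping both the
// shuffle and the timer management out of the RPC() call path.
func (sm *ServerManager) Start() {
	timer := time.NewTimer(sm.rebalanceInterval)
	defer timer.Stop()
	for {
		select {
		case <-timer.C:
			sm.shuffleServers()
			timer.Reset(sm.rebalanceInterval)
		case <-sm.shutdownCh:
			return
		}
	}
}
```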
```go
		server = serverCfg.servers[i]
		break
	}
}
```
Contributor Author
You could end up falling through this loop and then hitting a nil pointer dereference below. Need to bail with a structs.ErrNoServers if no healthy server was found.
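(A sketch of the suggested fix, assuming the surrounding loop scans serverCfg.servers for a healthy entry; the variable names follow the excerpt above, and isHealthy is a hypothetical check.)

```go
var server *server_details.ServerDetails
for i := range serverCfg.servers {
	if isHealthy(serverCfg.servers[i]) { // hypothetical health check
		server = serverCfg.servers[i]
		break
	}
}
if server == nil {
	// Bail out instead of dereferencing a nil server below.
	return structs.ErrNoServers
}
```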

```diff
@@ -125,6 +130,11 @@ func isConsulServer(m serf.Member) (bool, *serverParts) {

	datacenter := m.Tags["dc"]
	_, bootstrap := m.Tags["bootstrap"]
	var disabled uint64 = 0
	_, disabledStr := m.Tags["disabled"]
```
Contributor Author
Note to self: where does this ever get set? We probably won't actually propagate this via tags.

Contributor
The only reason I pushed this into a tag was it seemed conceivable that we'd want to support a "consul server slow-start" where a server would come up in a disabled state. 100% okay with removing support for this.

This may be short-lived, but it also seems like this is going to lead us down a path where ServerDetails is going to evolve into a more powerful package that will encapsulate more behavior behind a coherent API.
Relocated to its own package, server_manager.  This now greatly simplifies the RPC() call path and appropriately hides the locking behind the package boundary.  More work needs to be done here; a sketch of the resulting API shape follows.
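(A sketch of how hiding the lock behind the package boundary can look; the method name and the string-based server list are illustrative simplifications.)

```go
package server_manager

import "sync"

// ServerManager owns the server list; the lock never escapes the package.
type ServerManager struct {
	serverConfigLock sync.RWMutex
	servers          []string // stand-in for the real server details
}

// FindServer hands the RPC() path a server without exposing any
// locking to the caller.
func (sm *ServerManager) FindServer() string {
	sm.serverConfigLock.RLock()
	defer sm.serverConfigLock.RUnlock()
	if len(sm.servers) == 0 {
		return "" // caller falls back to structs.ErrNoServers
	}
	return sm.servers[0]
}
```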
When first starting the server manager, it's possible that the rebalanceTimer in serverConfig will be nil; test accordingly.
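(A minimal sketch of guarding against the nil timer that exists before the first rebalance is scheduled; the helper name is illustrative.)

```go
import "time"

// resetRebalanceTimer creates the timer on first use and resets it
// thereafter, so callers never touch a nil *time.Timer.
func resetRebalanceTimer(t *time.Timer, d time.Duration) *time.Timer {
	if t == nil {
		return time.NewTimer(d)
	}
	t.Reset(d)
	return t
}
```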
Removing any ambiguity re: ownership of the mutated server lists is a win for maintenance and debugging.
Merge upstream consul into f-rebalance-worker (three times; conflicts in consul/server_manager/server_manager.go each time).
Instead of blocking the RPC call path and performing a potentially expensive calculation (including a call to `c.LANMembers()`), introduce a channel to request a rebalance.  Some events don't force a reshuffle; instead they extend the duration of the current rebalance window, because the environment has already thrashed enough to redistribute a client's load.
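(A sketch of the channel-based request, extending the ServerManager sketch above with a hypothetical rebalanceCh chan struct{} of capacity 1; the non-blocking send keeps the RPC path from ever waiting.)

```go
// requestRebalance asks the background loop to rebalance without
// blocking the RPC call path. rebalanceCh has capacity 1, so if a
// request is already pending this one is coalesced into it.
func (sm *ServerManager) requestRebalance() {
	select {
	case sm.rebalanceCh <- struct{}{}:
	default: // a rebalance is already queued; nothing to do
	}
}
```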
Debugging code crept into the actual test and hung out for much longer than it should have.
Rely on Serf for liveness.  In the event of a failure, simply cycle the server to the end of the list.  If the server is unhealthy, Serf will reap the dead server.

Additional simplifications:

*) Only rebalance servers based on timers, not when a new server is re-added to the cluster.
*) Back out the failure count in server_details.ServerDetails
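(A minimal sketch of the cycling behavior; the string-based list is an illustrative stand-in for the real server details.)

```go
// cycleServer returns a new list with the failed server (assumed to
// be at the front) moved to the end; Serf, not this code, reaps
// servers that are actually dead. Returning a fresh slice keeps
// ownership of the mutated list unambiguous.
func cycleServer(servers []string) []string {
	if len(servers) < 2 {
		return servers
	}
	out := make([]string, 0, len(servers))
	out = append(out, servers[1:]...)
	return append(out, servers[0])
}
```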
Use an interface instead of serf.Serf as arg to NewServerManager.  Bonus points for improved testability.

Pointed out by: @slackpad
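(A sketch of the interface swap under the stated assumptions; ConsulClusterInfo and NumNodes are illustrative names for whatever subset of serf.Serf the manager actually calls.)

```go
// ConsulClusterInfo replaces the concrete *serf.Serf argument; the
// manager only depends on what it actually uses (here, member count).
type ConsulClusterInfo interface {
	NumNodes() int
}

type ServerManager struct {
	clusterInfo ConsulClusterInfo
}

func NewServerManager(clusterInfo ConsulClusterInfo) *ServerManager {
	return &ServerManager{clusterInfo: clusterInfo}
}

// In tests, a trivial fake satisfies the interface:
type fakeCluster int

func (f fakeCluster) NumNodes() int { return int(f) }
```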
Prevent possible queueing behind serverConfigLock in the event that a server fails on a busy host.
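(One common way to prevent that queueing is to publish the config through a sync/atomic Value so readers never take the lock; a sketch under that assumption, with serverConfig as a hypothetical config struct.)

```go
import (
	"sync"
	"sync/atomic"
)

type serverConfig struct{ servers []string } // illustrative

type ServerManager struct {
	serverConfigValue atomic.Value // read side: lock-free Load
	serverConfigLock  sync.Mutex   // write side: serializes updates
}

// getServerConfig is the hot-path read: a single atomic load, so RPC
// callers never queue behind a writer. Assumes an initial Store at
// construction time.
func (sm *ServerManager) getServerConfig() serverConfig {
	return sm.serverConfigValue.Load().(serverConfig)
}

// saveServerConfig publishes a new config while holding the lock, so
// concurrent writers don't lose updates; readers never block.
func (sm *ServerManager) saveServerConfig(sc serverConfig) {
	sm.serverConfigLock.Lock()
	defer sm.serverConfigLock.Unlock()
	sm.serverConfigValue.Store(sc)
}
```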
There is no guarantee the server coming back is healthy.  It's apt to be healthy by virtue of its place in the server list, but it's not guaranteed.
Follow go style recommendations now that this has been refactored out of the consul package and doesn't need the qualifier in the name.
Matches the style of the rest of the repo
# Conflicts:
#	consul/client.go
Change the signature so it returns a value, so that this can be tested externally with mock data.  See the sample table in TestServerManagerInternal_refreshServerRebalanceTimer() for the rate at which it will back off.  This function mostly exists to avoid crippling large clusters in the event of a partition.
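(A sketch of the kind of scaling that test table exercises, assuming the interval grows with nodes-per-server; the formula and parameters are illustrative, not the PR's actual constants.)

```go
import "time"

// refreshServerRebalanceTimer returns (rather than sets) the next
// rebalance interval so tests can feed it mock data. Scaling the
// interval with nodes-per-server keeps clients in a large cluster
// from hammering the surviving servers after a partition.
func refreshServerRebalanceTimer(numNodes, numServers int, base time.Duration) time.Duration {
	if numServers < 1 {
		numServers = 1
	}
	d := base * time.Duration(numNodes/numServers)
	if d < base {
		d = base
	}
	return d
}
```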
@sean- mentioned this pull request on Mar 24, 2016
@sean- closed this on Mar 24, 2016
@sean- deleted the f-rebalance-worker branch on March 24, 2016 at 05:57