
Server rebalance refactor WIP #1743

Closed
wants to merge 75 commits from the f-rebalance-worker branch
Conversation

slackpad (Contributor)

This is a work-in-progress PR; take a quick look and provide early feedback as it is finished up.

In cases where i+1 is a power of two, skip one modulo operation.
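(A minimal sketch of the trick, using a hypothetical helper rather than the PR's actual code: when n = i+1 is a power of two, x % n equals x & (n-1), so the modulo instruction can be skipped.)

```go
// fastIndex avoids the modulo instruction when n is a power of two,
// since x % n == x & (n-1) in that case. Assumes n > 0.
func fastIndex(x, n uint32) uint32 {
	if n&(n-1) == 0 { // n is a power of two
		return x & (n - 1)
	}
	return x % n
}
```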
Prep for breaking out maintenance of consuls into a new goroutine.
This mechanism isn't going to provide much value in the future.  Preemptively reduce the complexity of future work.
It is theoretically possible for the queue of serf events to back up.  If this happens, emit a warning message when there are more than 200 events in the queue.

Most notably, this can happen if `c.consulServerLock` is held for an "extended period of time".  The probability of anyone ever seeing this log message is hopefully low to nonexistent, but if it happens, the warning message indicating a large number of serf events fired while a lock was held is likely to be helpful (vs serf mysteriously blocking when attempting to add an event to a channel).
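(A minimal sketch of the warning, assuming a buffered event channel; the 200 threshold comes from the commit message, while the type and field names are illustrative stand-ins for the client's internals.)

```go
import (
	"log"

	"github.com/hashicorp/serf/serf"
)

const warnSerfQueueDepth = 200 // threshold from the commit message

// client is a stand-in; only the fields this sketch touches are shown.
type client struct {
	logger  *log.Logger
	eventCh chan serf.Event
}

// submitEvent warns when the buffered event channel has backed up
// (e.g. because a lock was held for an extended period), then
// performs the normal, possibly blocking, send.
func (c *client) submitEvent(e serf.Event) {
	if depth := len(c.eventCh); depth > warnSerfQueueDepth {
		c.logger.Printf("[WARN] consul: %d serf events queued", depth)
	}
	c.eventCh <- e
}
```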
Trivial change that makes it possible for developers to set an environment variable and change the output of `go test` to be detailed (e.g. `GOTEST_FLAGS=-v`).
Expanding the domain of lastServer beyond RPC() changes the meaning of this variable.  Rename accordingly to match the intent coming in a subsequent commit: a background thread will be in charge of rotating preferredServer.
A server is not normally disabled, but in the event of an RPC error, we want to mark a server as down to allow for fast failover to a different server.  This value must be an int in order to support atomic operations.

Additionally, this is the preliminary work required to bring up a server in a disabled state.  RPC health checks in the future could mark the server as alive, thereby creating an organic "slow start" feature for Consul.
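(A minimal sketch of the flag, assuming it lives on a server-details struct; the field and helper names are illustrative.)

```go
import "sync/atomic"

// ServerDetails is a stand-in; Disabled is a uint64 rather than a
// bool so sync/atomic can flip it without taking a lock.
type ServerDetails struct {
	Disabled uint64 // 0 = healthy, nonzero = marked down
}

// markDown flags the server after an RPC error so the client fails
// over to a different server quickly.
func markDown(s *ServerDetails) { atomic.StoreUint64(&s.Disabled, 1) }

// isDown reports whether the server is currently marked down.
func isDown(s *ServerDetails) bool { return atomic.LoadUint64(&s.Disabled) != 0 }
```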
Move the management of c.consulServers (formerly c.consuls) into consul/server_manager.go.

This commit brings in a background task that proactively manages the server list and:

*) reshuffles the list
*) manages the timer out of the RPC() path
*) uses atomics to detect that a server has failed

This is a WIP; more testing work needs to be completed.  A rough sketch of such a loop follows.
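(The sketch below shows what such a background loop can look like under the stated assumptions; the type, field, and method names are illustrative, not the PR's actual code.)

```go
import "time"

// ServerManager is a stand-in; only the fields the loop touches are shown.
type ServerManager struct {
	rebalanceInterval time.Duration
	shutdownCh        chan struct{}
}

func (sm *ServerManager) shuffleServers() { /* Fisher-Yates over the list */ }

// Start reshuffles the server list on a timer, keeping both the
// shuffle and the timer management out of the RPC() call path.
func (sm *ServerManager) Start() {
	timer := time.NewTimer(sm.rebalanceInterval)
	defer timer.Stop()
	for {
		select {
		case <-timer.C:
			sm.shuffleServers()
			timer.Reset(sm.rebalanceInterval)
		case <-sm.shutdownCh:
			return
		}
	}
}
```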
```go
		server = serverCfg.servers[i]
		break
	}
}
```
Contributor Author
You could end up falling through this loop and then hitting a nil pointer dereference below. Need to bail with a structs.ErrNoServers if no healthy server was found.
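(A sketch of the suggested fix, assuming the surrounding loop scans serverCfg.servers for a healthy entry; the variable names follow the excerpt above, and isHealthy is a hypothetical check.)

```go
var server *server_details.ServerDetails
for i := range serverCfg.servers {
	if isHealthy(serverCfg.servers[i]) { // hypothetical health check
		server = serverCfg.servers[i]
		break
	}
}
if server == nil {
	// Bail out instead of dereferencing a nil server below.
	return structs.ErrNoServers
}
```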

```diff
@@ -125,6 +130,11 @@ func isConsulServer(m serf.Member) (bool, *serverParts) {

	datacenter := m.Tags["dc"]
	_, bootstrap := m.Tags["bootstrap"]
	var disabled uint64 = 0
	_, disabledStr := m.Tags["disabled"]
```
Contributor Author
Note to self: where does this ever get set? We probably won't actually propagate this via tags.

Contributor
The only reason I pushed this into a tag was it seemed conceivable that we'd want to support a "consul server slow-start" where a server would come up in a disabled state. 100% okay with removing support for this.

This may be short-lived, but it also seems like this is going to lead us down a path where ServerDetails is going to evolve into a more powerful package that will encapsulate more behavior behind a coherent API.
Relocated to its own package, server_manager.  This now greatly simplifies the RPC() call path and appropriately hides the locking behind the package boundary.  More work needs to be done here; a sketch of the resulting API shape follows.
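(A sketch of how hiding the lock behind the package boundary can look; the method name and the string-based server list are illustrative simplifications.)

```go
package server_manager

import "sync"

// ServerManager owns the server list; the lock never escapes the package.
type ServerManager struct {
	serverConfigLock sync.RWMutex
	servers          []string // stand-in for the real server details
}

// FindServer hands the RPC() path a server without exposing any
// locking to the caller.
func (sm *ServerManager) FindServer() string {
	sm.serverConfigLock.RLock()
	defer sm.serverConfigLock.RUnlock()
	if len(sm.servers) == 0 {
		return "" // caller falls back to structs.ErrNoServers
	}
	return sm.servers[0]
}
```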
When first starting the server manager, it's possible that the rebalanceTimer in serverConfig will be nil; test accordingly.
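(A minimal sketch of guarding against the nil timer that exists before the first rebalance is scheduled; the helper name is illustrative.)

```go
import "time"

// resetRebalanceTimer creates the timer on first use and resets it
// thereafter, so callers never touch a nil *time.Timer.
func resetRebalanceTimer(t *time.Timer, d time.Duration) *time.Timer {
	if t == nil {
		return time.NewTimer(d)
	}
	t.Reset(d)
	return t
}
```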
Removing any ambiguity re: ownership of the mutated server lists is a win for maintenance and debugging.
Merge upstream consul into f-rebalance-worker (three times; conflicts in consul/server_manager/server_manager.go each time).
Instead of blocking the RPC call path and performing a potentially expensive calculation (including a call to `c.LANMembers()`), introduce a channel to request a rebalance.  Some events don't force a reshuffle; instead they extend the duration of the current rebalance window, because the environment has already thrashed enough to redistribute a client's load.
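(A sketch of the channel-based request, extending the ServerManager sketch above with a hypothetical rebalanceCh chan struct{} of capacity 1; the non-blocking send keeps the RPC path from ever waiting.)

```go
// requestRebalance asks the background loop to rebalance without
// blocking the RPC call path. rebalanceCh has capacity 1, so if a
// request is already pending this one is coalesced into it.
func (sm *ServerManager) requestRebalance() {
	select {
	case sm.rebalanceCh <- struct{}{}:
	default: // a rebalance is already queued; nothing to do
	}
}
```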
Debugging code crept into the actual test and hung out for much longer than it should have.
Rely on Serf for liveness.  In the event of a failure, simply cycle the server to the end of the list.  If the server is unhealthy, Serf will reap the dead server.

Additional simplifications:

*) Only rebalance servers based on timers, not when a new server is re-added to the cluster.
*) Back out the failure count in server_details.ServerDetails
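(A minimal sketch of the cycling behavior; the string-based list is an illustrative stand-in for the real server details.)

```go
// cycleServer returns a new list with the failed server (assumed to
// be at the front) moved to the end; Serf, not this code, reaps
// servers that are actually dead. Returning a fresh slice keeps
// ownership of the mutated list unambiguous.
func cycleServer(servers []string) []string {
	if len(servers) < 2 {
		return servers
	}
	out := make([]string, 0, len(servers))
	out = append(out, servers[1:]...)
	return append(out, servers[0])
}
```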
Use an interface instead of serf.Serf as arg to NewServerManager.  Bonus points for improved testability.

Pointed out by: @slackpad
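(A sketch of the interface swap under the stated assumptions; ConsulClusterInfo and NumNodes are illustrative names for whatever subset of serf.Serf the manager actually calls.)

```go
// ConsulClusterInfo replaces the concrete *serf.Serf argument; the
// manager only depends on what it actually uses (here, member count).
type ConsulClusterInfo interface {
	NumNodes() int
}

type ServerManager struct {
	clusterInfo ConsulClusterInfo
}

func NewServerManager(clusterInfo ConsulClusterInfo) *ServerManager {
	return &ServerManager{clusterInfo: clusterInfo}
}

// In tests, a trivial fake satisfies the interface:
type fakeCluster int

func (f fakeCluster) NumNodes() int { return int(f) }
```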
Prevent possible queueing behind serverConfigLock in the event that a server fails on a busy host.
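(One common way to prevent that queueing is to publish the config through a sync/atomic Value so readers never take the lock; a sketch under that assumption, with serverConfig as a hypothetical config struct.)

```go
import (
	"sync"
	"sync/atomic"
)

type serverConfig struct{ servers []string } // illustrative

type ServerManager struct {
	serverConfigValue atomic.Value // read side: lock-free Load
	serverConfigLock  sync.Mutex   // write side: serializes updates
}

// getServerConfig is the hot-path read: a single atomic load, so RPC
// callers never queue behind a writer. Assumes an initial Store at
// construction time.
func (sm *ServerManager) getServerConfig() serverConfig {
	return sm.serverConfigValue.Load().(serverConfig)
}

// saveServerConfig publishes a new config while holding the lock, so
// concurrent writers don't lose updates; readers never block.
func (sm *ServerManager) saveServerConfig(sc serverConfig) {
	sm.serverConfigLock.Lock()
	defer sm.serverConfigLock.Unlock()
	sm.serverConfigValue.Store(sc)
}
```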
There is no guarantee the server coming back is healthy.  It's apt to be healthy by virtue of its place in the server list, but it's not guaranteed.
Follow go style recommendations now that this has been refactored out of the consul package and doesn't need the qualifier in the name.
Matches the style of the rest of the repo
# Conflicts:
#	consul/client.go
Change the signature so it returns a value, so that this can be tested externally with mock data.  See the sample table in TestServerManagerInternal_refreshServerRebalanceTimer() for the rate at which it will back off.  This function mostly exists to avoid crippling large clusters in the event of a partition.
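(A sketch of the kind of scaling that test table exercises, assuming the interval grows with nodes-per-server; the formula and parameters are illustrative, not the PR's actual constants.)

```go
import "time"

// refreshServerRebalanceTimer returns (rather than sets) the next
// rebalance interval so tests can feed it mock data. Scaling the
// interval with nodes-per-server keeps clients in a large cluster
// from hammering the surviving servers after a partition.
func refreshServerRebalanceTimer(numNodes, numServers int, base time.Duration) time.Duration {
	if numServers < 1 {
		numServers = 1
	}
	d := base * time.Duration(numNodes/numServers)
	if d < base {
		d = base
	}
	return d
}
```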
@sean- mentioned this pull request on Mar 24, 2016
@sean- closed this on Mar 24, 2016
@sean- deleted the f-rebalance-worker branch on March 24, 2016 at 05:57