Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove failed nodes from serfWAN #6028

Merged
merged 8 commits into from
Jun 28, 2019
Merged

Remove failed nodes from serfWAN #6028

merged 8 commits into from
Jun 28, 2019

Conversation

schristoff
Copy link
Contributor

Autopilot primarily focuses on serfLAN and places failed servers into a left state in that gossip pool, but does not do this for serfWAN.
Whenever serfWAN is in use, we should be removing any failed node from the gossip pool to create consistency between serfWAN and serfLAN.

In the force-leave command it is important to check if when passing nodenames to serfWAN it is formatted in the proper way (nodename.datacenter) or it will not properly be removed.

Fixes #3361

@schristoff schristoff requested review from hanshasselberg, a team and s-mang June 26, 2019 22:23
Copy link
Contributor

@s-mang s-mang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! Just a drive-by, blind review here. Definitely ignore my comments if they aren't relevant or helpful.

Also, maybe a test for this new behavior would be good!

Cheers! Thanks for inviting me / letting me participate :)

@@ -22,7 +22,8 @@ type Delegate interface {
NotifyHealth(OperatorHealthReply)
PromoteNonVoters(*Config, OperatorHealthReply) ([]raft.Server, error)
Raft() *raft.Raft
Serf() *serf.Serf
SerfLAN() *serf.Serf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who implements this interface? I can't find much in the codebase with grep, which makes me think it's the customer or external user who implements this?
I think this looks like an API-breaking change, since the type is exported we're changing the interface definition. Is that acceptable? How do you guys usually handle breaking changes to public interfaces?

Apologies if this comment seems irrelevant! Still gaining project context.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. This interface is implemented by AutopilotDelegate, which is used in the agent/consul pkg to initialize a new NewAutopilot.

External users shouldn't be implementing this, they should only be interacting with consul via the api pkg.

//is failed - then check to see if it is a serfLAN
//member, remove it.
func (a *Autopilot) pruneSerfWAN(serfWAN *serf.Serf, serfLAN *serf.Serf) {
for _, memberWAN := range serfWAN.Members() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have very limited context about what this code is doing, so totally ignore this if it seems inappropriate or not important!

This looks like it could potentially be an expensive operation with the 2 nested for loops. Maybe this isn't super important, but maybe we can make the algorithm more efficient? Perhaps by caching something from the first or second for loop, and then just having 2 side-by-side loops? Maybe first iterating through either the serfLAN or serfWAN members and caching our findings in a way that is easy to look up (like in a map)? (making the algorithm go from O(n^2) => O(2n)).

Let me know if I’m not making sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey -
It was super expensive (I whined about this function and it's complexity, as well as it's poor readability imo) - I wouldn't want to cache anything for Serf due to the possibility of harming our results.
What this is s'posed to be doing is retrieving all the Serf members from WAN and LAN, comparing the results, and removing any failed members. The reason we did it like this was because we wanted to find any members in LAN after they have been removed, and if we find one then we need to remove it from WAN. We were trying to not do the calculation for Raft twice (where we make sure we can still maintain a consensus).

Luckily the new implementation gets rid of this function entirely. :)

Copy link
Contributor

@freddygv freddygv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, almost there!

agent/consul/autopilot/autopilot.go Outdated Show resolved Hide resolved
agent/consul/autopilot/autopilot.go Outdated Show resolved Hide resolved
agent/consul/autopilot/autopilot.go Outdated Show resolved Hide resolved
agent/consul/server.go Outdated Show resolved Hide resolved
@@ -22,7 +22,8 @@ type Delegate interface {
NotifyHealth(OperatorHealthReply)
PromoteNonVoters(*Config, OperatorHealthReply) ([]raft.Server, error)
Raft() *raft.Raft
Serf() *serf.Serf
SerfLAN() *serf.Serf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. This interface is implemented by AutopilotDelegate, which is used in the agent/consul pkg to initialize a new NewAutopilot.

External users shouldn't be implementing this, they should only be interacting with consul via the api pkg.

agent/consul/autopilot/autopilot.go Show resolved Hide resolved
@schristoff schristoff requested a review from a team June 27, 2019 23:09
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @s-christoff 🎉

agent/consul/server.go Outdated Show resolved Hide resolved
Co-Authored-By: Paul Banks <banks@banksco.de>
@schristoff schristoff merged commit f09af53 into master Jun 28, 2019
@schristoff schristoff deleted the 3361-remove-from-failed branch June 28, 2019 17:45
@hanshasselberg
Copy link
Member

I happened to look at it again, and it looks great! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

recurring log "serf: attempting reconnect" to left server
5 participants