Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Makes server manager shift away from failed servers from Serf events. #3864

Merged
merged 1 commit into from
Feb 6, 2018

Conversation

slackpad
Copy link
Contributor

@slackpad slackpad commented Feb 6, 2018

Because this code was doing pointer equality checks, it would work for
the case of a failed attempted RPC because the objects are from the
manager itself:

https://github.com/hashicorp/consul/blob/v1.0.3/agent/consul/rpc.go#L283-L302

But the pointer check would always fail for events coming in from the
Serf path because the server object is newly-created:

https://github.com/hashicorp/consul/blob/v1.0.3/agent/router/serf_adapter.go#L14-L40

This means that we didn't proactively shift RPC traffic away from a
failed server, we'd have to wait for an RPC to fail, which exposes
the error to the calling client.

By switching over to a name check vs. a pointer check we get the correct
behavior. We added a DEBUG log as well to help observe this behavior during
integrated testing.

Related to #3863 since the fix here needed the same logic duplicated, owing
to the complicated atomic stuff.

/cc @dadgar for a heads up in case this also affects Nomad.

Because this code was doing pointer equality checks, it would work for
the case of a failed attempted RPC because the objects are from the
manager itself:

https://github.com/hashicorp/consul/blob/v1.0.3/agent/consul/rpc.go#L283-L302

But the pointer check would always fail for events coming in from the
Serf path because the server object is newly-created:

https://github.com/hashicorp/consul/blob/v1.0.3/agent/router/serf_adapter.go#L14-L40

This means that we didn't proactively shift RPC traffic away from a
failed server, we'd have to wait for an RPC to fail, which exposes
the error to the calling client.

By switching over to a name check vs. a pointer check we get the correct
behavior. We added a DEBUG log as well to help observe this behavior during
integrated testing.

Related to #3863 since the fix here needed the same logic duplicated, owing
to the complicated atomic stuff.

/cc @dadgar for a heads up in case this also affects Nomad.
@slackpad slackpad added this to the 1.0.4 milestone Feb 6, 2018
@slackpad slackpad merged commit 99a3651 into master Feb 6, 2018
@slackpad slackpad deleted the proactive-fail branch February 6, 2018 02:12
@dadgar
Copy link
Contributor

dadgar commented Feb 6, 2018

Thanks for the heads up. Nomad is unaffected by this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants