Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinguish errors between DC not existing and not available #6399

Merged
merged 4 commits into from
Sep 3, 2019

Conversation

pierresouchay
Copy link
Contributor

Following Discussion in hashicorp/consul-template#1250 and #5881 this PR want to give the ability to clients to distinguish Datacenter not existing at all from not currently available (because of downtime of the DC or temp link unavailability).

This cause some issues for instance in templates such as consul-template or consul-templaterb where it is difficult for the client to choose correct behavior (retry or fail).

This PR allow to diagnose whether the datacenter requested is temporary non-available or if the DC does not exists.

This will allow clients to behave differently if temp error or wrong DC.
@pierresouchay
Copy link
Contributor Author

Tests should be fine, but CI fails: https://circleci.com/gh/hashicorp/consul/69254

Force-pushing to replay tests

@pierresouchay
Copy link
Contributor Author

@freddygv License/CLA is now working while I signed it long time ago :(

@ShimmerGlass
Copy link
Contributor

lgtm

Copy link
Contributor

@freddygv freddygv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pierresouchay After discussing this we realized that issue in #5881 could be resolved in Consul rather than making clients deal with handling different error messages.

Here is a PR that will exclude servers that have left from /catalog/datacenters:
#6420

However, for the purpose of debugging cross-DC communication, I do think your PR would be useful, since the existing no path to dc message is not as helpful as it could be.

I made some comments inline.

agent/consul/rpc.go Outdated Show resolved Hide resolved
agent/router/router.go Outdated Show resolved Hide resolved
@@ -315,7 +315,11 @@ func (s *Server) getLeader() (bool, *metadata.Server) {
func (s *Server) forwardDC(method, dc string, args interface{}, reply interface{}) error {
manager, server, ok := s.router.FindRoute(dc)
if !ok {
s.logger.Printf("[WARN] consul.rpc: RPC request for DC %q, no path found", dc)
if s.router.IsDatacenterDefined(dc) {
s.logger.Printf("[WARN] consul.rpc: RPC request for DC %q temporary failure", dc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While these errors may often be temporary, I think this would also happen if servers in a remote DC go into a failed state. The router will still know about them since they haven't left, but they might not be coming back.

Suggested change
s.logger.Printf("[WARN] consul.rpc: RPC request for DC %q temporary failure", dc)
s.logger.Printf("[WARN] consul.rpc: RPC request for known DC %q, no servers available", dc)

Copy link
Contributor Author

@pierresouchay pierresouchay Aug 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well... in that case (not temporary), it means that someone is not doing it job (because a DC is definitely DOWN):

  • it means the whole remote DC is failing, someone should already being working on this
  • as explained below, I prefer to hide "the known DC"

What about:
s.logger.Printf("[WARN] consul.rpc: RPC request to DC %q is currently failing as no server can be reached", dc) ?

@@ -8,6 +8,7 @@ import (
const (
errNoLeader = "No cluster leader"
errNoDCPath = "No path to datacenter"
errDCNotAvailable = "Remote DC is temporary unavailable"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Suggested change
errDCNotAvailable = "Remote DC is temporary unavailable"
errDCNotAvailable = "Known DC has no servers available"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced your proposal, not easy to interpret "Known" for an Operator...
I think the term remote is important... what about

Remote DC has no server currently reachable ?

@pierresouchay
Copy link
Contributor Author

@freddygv Thank you for the review, yes, I saw #5881 and was about to do something similar after this PR :-)
I renamed method IsDatacenterDefined to HasDatacenter as you suggested, but I made some counter proposal to new error messages in comments:

I you are Ok, I'll do a commit with those messages, if not, I'll take your suggestions

@pierresouchay
Copy link
Contributor Author

@freddygv I applied the changes, are you ok with the fix?
Regards

Copy link
Contributor

@freddygv freddygv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@freddygv freddygv merged commit be50400 into hashicorp:master Sep 3, 2019
@pierresouchay
Copy link
Contributor Author

@freddygv Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants