
forwardRPC and globalRPC do not take LAN ServerLookup into account when the target datacenter is the local datacenter #8403

Closed
mkeeler opened this issue Jul 29, 2020 · 2 comments
Labels: theme/internals (Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics), type/bug (Feature does not function as expected)

@mkeeler (Member) commented Jul 29, 2020

consul/agent/consul/rpc.go (lines 604 to 669 in 4c8a15b):

```go
// forwardDC is used to forward an RPC call to a remote DC, or fail if no servers
func (s *Server) forwardDC(method, dc string, args interface{}, reply interface{}) error {
	manager, server, ok := s.router.FindRoute(dc)
	if !ok {
		if s.router.HasDatacenter(dc) {
			s.rpcLogger().Warn("RPC request to DC is currently failing as no server can be reached", "datacenter", dc)
			return structs.ErrDCNotAvailable
		}
		s.rpcLogger().Warn("RPC request for DC is currently failing as no path was found",
			"datacenter", dc,
			"method", method,
		)
		return structs.ErrNoDCPath
	}

	metrics.IncrCounterWithLabels([]string{"rpc", "cross-dc"}, 1,
		[]metrics.Label{{Name: "datacenter", Value: dc}})
	if err := s.connPool.RPC(dc, server.ShortName, server.Addr, method, args, reply); err != nil {
		manager.NotifyFailedServer(server)
		s.rpcLogger().Error("RPC failed to server in DC",
			"server", server.Addr,
			"datacenter", dc,
			"method", method,
			"error", err,
		)
		return err
	}
	return nil
}

// globalRPC is used to forward an RPC request to one server in each datacenter.
// This will only error for RPC-related errors. Otherwise, application-level
// errors can be sent in the response objects.
func (s *Server) globalRPC(method string, args interface{},
	reply structs.CompoundResponse) error {
	// Make a new request into each datacenter
	dcs := s.router.GetDatacenters()

	replies, total := 0, len(dcs)
	errorCh := make(chan error, total)
	respCh := make(chan interface{}, total)

	for _, dc := range dcs {
		go func(dc string) {
			rr := reply.New()
			if err := s.forwardDC(method, dc, args, &rr); err != nil {
				errorCh <- err
				return
			}
			respCh <- rr
		}(dc)
	}

	for replies < total {
		select {
		case err := <-errorCh:
			return err
		case rr := <-respCh:
			reply.Add(rr)
			replies++
		}
	}
	return nil
}
```

When performing a keyring listing we use the globalRPC function to issue the RPC against every datacenter and collect the results. globalRPC in turn simply calls forwardDC for each known datacenter, including the local one.

This doesn't sound too bad, but coupled with #8401 it can cause problems. In a single-node cluster (or a multi-node cluster where the servers' WAN gossip ports are firewalled from each other), the router used by forwardRPC to look up a server in the local DC will report that no servers in that datacenter are reachable and fail to perform the RPC.

We should either:

1. exclude the local DC from the globalRPC call,
2. ensure that forwardRPC uses the ServerLookup instead of the Router for finding a server to use in the local datacenter, or
3. avoid all the server lookups for the local DC and just field the RPC locally.

Right now I am leaning toward the second option as it seems most robust.
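A minimal sketch of what the second option could look like, assuming the LAN ServerLookup is consulted first for the local datacenter. All names here (`pickServer`, `routerView`, `lanView`) are hypothetical stand-ins for Consul's router and LAN ServerLookup, not its actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// routerView models the WAN router's view: dc -> reachable server addrs.
type routerView map[string][]string

// lanView models the LAN ServerLookup for the local datacenter.
type lanView []string

var errNoDCPath = errors.New("no path to datacenter")

// pickServer prefers the LAN ServerLookup when the target is the local DC,
// so a firewalled/empty WAN pool cannot fail RPCs to our own datacenter.
func pickServer(dc, localDC string, wan routerView, lan lanView) (string, error) {
	if dc == localDC && len(lan) > 0 {
		return lan[0], nil
	}
	if servers := wan[dc]; len(servers) > 0 {
		return servers[0], nil
	}
	return "", errNoDCPath
}

func main() {
	// Single-node cluster: the WAN view is empty, the LAN view has ourselves.
	addr, err := pickServer("dc1", "dc1", routerView{}, lanView{"server1.dc1:8300"})
	fmt.Println(addr, err)
}
```

With this shape, the failure mode described above disappears: the local DC is resolved via LAN membership even when the WAN router has no healthy entries.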

@dnephin added the theme/internals and type/bug labels Jul 29, 2020
hanshasselberg added a commit that referenced this issue Jul 30, 2020
This code started as an optimization to avoid doing an RPC ping to itself. But in a single-server cluster, the rebalancing logic was led to believe that there were no healthy servers because foundHealthyServer was not set. Now it is set properly.

Fixes #8401 and #8403.
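The fix described in the commit message above can be sketched roughly as follows. The types and names here (`findHealthyServer`, `server`) are hypothetical stand-ins, not Consul's actual rebalancer:

```go
package main

import "fmt"

type server struct {
	name string
}

// findHealthyServer models the rebalance pass: we skip pinging ourselves
// (the original optimization), but — the fix — still record that a healthy
// server exists, so a single-server cluster is not reported as unhealthy.
func findHealthyServer(self string, servers []server, ping func(server) bool) bool {
	foundHealthyServer := false
	for _, srv := range servers {
		if srv.name == self {
			// Don't RPC-ping ourselves, but count ourselves as healthy.
			foundHealthyServer = true
			continue
		}
		if ping(srv) {
			foundHealthyServer = true
		}
	}
	return foundHealthyServer
}

func main() {
	// A single-server cluster: even with no peers to ping, the local
	// server should be counted as healthy.
	servers := []server{{name: "server1"}}
	fmt.Println(findHealthyServer("server1", servers, func(server) bool { return false }))
}
```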
hanshasselberg added a commit that referenced this issue Aug 6, 2020
This code started as an optimization to avoid doing an RPC ping to itself. But in a single-server cluster, the rebalancing logic was led to believe that there were no healthy servers because foundHealthyServer was not set. Now it is set properly.

Fixes #8401 and #8403.
hashicorp-ci pushed a commit that referenced this issue Aug 6, 2020
This code started as an optimization to avoid doing an RPC ping to itself. But in a single-server cluster, the rebalancing logic was led to believe that there were no healthy servers because foundHealthyServer was not set. Now it is set properly.

Fixes #8401 and #8403.
@hanshasselberg (Member) commented:

This is the globalRPC function; it uses forwardDC even for its own DC:

consul/agent/consul/rpc.go (lines 641 to 660 in d1c879e):

```go
func (s *Server) globalRPC(method string, args interface{},
	reply structs.CompoundResponse) error {
	// Make a new request into each datacenter
	dcs := s.router.GetDatacenters()

	replies, total := 0, len(dcs)
	errorCh := make(chan error, total)
	respCh := make(chan interface{}, total)

	for _, dc := range dcs {
		go func(dc string) {
			rr := reply.New()
			if err := s.forwardDC(method, dc, args, &rr); err != nil {
				errorCh <- err
				return
			}
			respCh <- rr
		}(dc)
	}
```

We were unsure whether #8406 fixes this issue. After that fix, single-node DCs are healthy because a single server marks itself as healthy. That means even a call through the router should succeed.

@mkeeler suggested to:

> ensure that forwardRPC uses the ServerLookup instead of the Router for finding a server to use in the local datacenter

forwardRPC is only "misused" from globalRPC; in the other cases it properly checks whether forwarding is required. Changing this function to do the same check again doesn't seem right to me. If we were to optimize globalRPC, I would vote for checking against the local DC in that function and fielding the RPC directly.

I agree that a server shouldn't RPC itself, but apart from this missed optimization, there is no longer a bug.
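The alternative suggested here, a globalRPC-style fan-out that fields the local datacenter directly instead of routing through forwardDC, could be sketched like this. `fanOut` and its callbacks are illustrative names, not Consul's actual API:

```go
package main

import "fmt"

// fanOut issues one request per datacenter, handling the local DC in
// process instead of forwarding an RPC to ourselves.
func fanOut(localDC string, dcs []string, local func() string, forward func(dc string) string) []string {
	out := make([]string, 0, len(dcs))
	for _, dc := range dcs {
		if dc == localDC {
			// Field the request locally; no self-RPC and no router lookup.
			out = append(out, local())
			continue
		}
		out = append(out, forward(dc))
	}
	return out
}

func main() {
	replies := fanOut("dc1", []string{"dc1", "dc2"},
		func() string { return "local reply" },
		func(dc string) string { return "forwarded to " + dc })
	fmt.Println(replies)
}
```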

@mkeeler (Member, Author) commented Aug 10, 2020

So it turns out that the server does need to RPC itself.

The first server that receives the keyring request updates the WAN keyring and then issues globalRPC. That fan-out updates the LAN keyring of each DC. Therefore, to update the LAN keyring in the local DC, the second (self-directed) RPC is done.

Do we need an RPC to do this: No.
Is the code a little convoluted: Yes.
Is it functionally correct: Yes.
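The keyring flow described above can be modeled in miniature: the WAN ring is updated once, then every datacenter's LAN ring, including the local DC's, which is what the self-RPC accomplishes, receives the key. The types here (`keyring`, `installKey`) are hypothetical, not Consul's keyring implementation:

```go
package main

import "fmt"

// keyring models an installed-key set.
type keyring map[string]bool

// installKey: the coordinating server updates the WAN keyring, then the
// globalRPC fan-out installs the key into each DC's LAN keyring. The
// local DC appears in lanByDC too, which is why the self-RPC is needed.
func installKey(wan keyring, lanByDC map[string]keyring, key string) {
	wan[key] = true
	for _, lan := range lanByDC {
		lan[key] = true
	}
}

func main() {
	wan := keyring{}
	lans := map[string]keyring{"dc1": {}, "dc2": {}}
	installKey(wan, lans, "new-key")
	fmt.Println(wan["new-key"], lans["dc1"]["new-key"], lans["dc2"]["new-key"])
}
```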
