
forwardRPC and globalRPC do not take LAN ServerLookup into account when the target datacenter is the local datacenter #8403

Closed
mkeeler opened this issue Jul 29, 2020 · 2 comments
Labels: theme/internals (Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics), type/bug (Feature does not function as expected)

@mkeeler (Member) commented Jul 29, 2020

consul/agent/consul/rpc.go (lines 604 to 669 in 4c8a15b):

```go
// forwardDC is used to forward an RPC call to a remote DC, or fail if no servers
func (s *Server) forwardDC(method, dc string, args interface{}, reply interface{}) error {
	manager, server, ok := s.router.FindRoute(dc)
	if !ok {
		if s.router.HasDatacenter(dc) {
			s.rpcLogger().Warn("RPC request to DC is currently failing as no server can be reached", "datacenter", dc)
			return structs.ErrDCNotAvailable
		}
		s.rpcLogger().Warn("RPC request for DC is currently failing as no path was found",
			"datacenter", dc,
			"method", method,
		)
		return structs.ErrNoDCPath
	}

	metrics.IncrCounterWithLabels([]string{"rpc", "cross-dc"}, 1,
		[]metrics.Label{{Name: "datacenter", Value: dc}})
	if err := s.connPool.RPC(dc, server.ShortName, server.Addr, method, args, reply); err != nil {
		manager.NotifyFailedServer(server)
		s.rpcLogger().Error("RPC failed to server in DC",
			"server", server.Addr,
			"datacenter", dc,
			"method", method,
			"error", err,
		)
		return err
	}
	return nil
}

// globalRPC is used to forward an RPC request to one server in each datacenter.
// This will only error for RPC-related errors. Otherwise, application-level
// errors can be sent in the response objects.
func (s *Server) globalRPC(method string, args interface{},
	reply structs.CompoundResponse) error {
	// Make a new request into each datacenter
	dcs := s.router.GetDatacenters()

	replies, total := 0, len(dcs)
	errorCh := make(chan error, total)
	respCh := make(chan interface{}, total)

	for _, dc := range dcs {
		go func(dc string) {
			rr := reply.New()
			if err := s.forwardDC(method, dc, args, &rr); err != nil {
				errorCh <- err
				return
			}
			respCh <- rr
		}(dc)
	}

	for replies < total {
		select {
		case err := <-errorCh:
			return err
		case rr := <-respCh:
			reply.Add(rr)
			replies++
		}
	}
	return nil
}
```

When performing a keyring listing we use the globalRPC function to issue the RPC against every datacenter and collect the results. globalRPC in turn simply calls forwardDC for each known datacenter, including the local one.

This doesn't sound too bad, but coupled with #8401 it can cause problems. In a single-node cluster (or a multi-node cluster where the servers' WAN gossip ports are firewalled from each other), the router used by forwardRPC to look up a server in the local DC will report that no servers in that datacenter are reachable and fail to perform the RPC.

We should either:

1. exclude the local DC from the globalRPC call,
2. ensure that forwardRPC uses the ServerLookup instead of the Router for finding a server to use in the local datacenter, or
3. avoid all the server lookups for the local DC and just field the RPC locally.

Right now I am leaning toward the second option as it seems most robust.
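A minimal sketch of what the second option could look like, assuming the LAN ServerLookup is consulted first for the local datacenter. All names here (`pickServer`, `routerView`, `lanView`) are hypothetical stand-ins for Consul's router and LAN ServerLookup, not its actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// routerView models the WAN router's view: dc -> reachable server addrs.
type routerView map[string][]string

// lanView models the LAN ServerLookup for the local datacenter.
type lanView []string

var errNoDCPath = errors.New("no path to datacenter")

// pickServer prefers the LAN ServerLookup when the target is the local DC,
// so a firewalled/empty WAN pool cannot fail RPCs to our own datacenter.
func pickServer(dc, localDC string, wan routerView, lan lanView) (string, error) {
	if dc == localDC && len(lan) > 0 {
		return lan[0], nil
	}
	if servers := wan[dc]; len(servers) > 0 {
		return servers[0], nil
	}
	return "", errNoDCPath
}

func main() {
	// Single-node cluster: the WAN view is empty, the LAN view has ourselves.
	addr, err := pickServer("dc1", "dc1", routerView{}, lanView{"server1.dc1:8300"})
	fmt.Println(addr, err)
}
```

With this shape, the failure mode described above disappears: the local DC is resolved via LAN membership even when the WAN router has no healthy entries.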

@dnephin added the theme/internals and type/bug labels Jul 29, 2020
hanshasselberg added a commit that referenced this issue Jul 30, 2020
This code started as an optimization to avoid doing an RPC ping to itself. But in a single-server cluster, the rebalancing logic was led to believe that there were no healthy servers because foundHealthyServer was not set. Now it is set properly.

Fixes #8401 and #8403.
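The fix described in the commit message above can be sketched roughly as follows. The types and names here (`findHealthyServer`, `server`) are hypothetical stand-ins, not Consul's actual rebalancer:

```go
package main

import "fmt"

type server struct {
	name string
}

// findHealthyServer models the rebalance pass: we skip pinging ourselves
// (the original optimization), but — the fix — still record that a healthy
// server exists, so a single-server cluster is not reported as unhealthy.
func findHealthyServer(self string, servers []server, ping func(server) bool) bool {
	foundHealthyServer := false
	for _, srv := range servers {
		if srv.name == self {
			// Don't RPC-ping ourselves, but count ourselves as healthy.
			foundHealthyServer = true
			continue
		}
		if ping(srv) {
			foundHealthyServer = true
		}
	}
	return foundHealthyServer
}

func main() {
	// A single-server cluster: even with no peers to ping, the local
	// server should be counted as healthy.
	servers := []server{{name: "server1"}}
	fmt.Println(findHealthyServer("server1", servers, func(server) bool { return false }))
}
```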
hanshasselberg added a commit that referenced this issue Aug 6, 2020
This code started as an optimization to avoid doing an RPC ping to itself. But in a single-server cluster, the rebalancing logic was led to believe that there were no healthy servers because foundHealthyServer was not set. Now it is set properly.

Fixes #8401 and #8403.
hashicorp-ci pushed a commit that referenced this issue Aug 6, 2020
This code started as an optimization to avoid doing an RPC ping to itself. But in a single-server cluster, the rebalancing logic was led to believe that there were no healthy servers because foundHealthyServer was not set. Now it is set properly.

Fixes #8401 and #8403.
@hanshasselberg (Member) commented:

This is the globalRPC function; it uses forwardDC even for its own DC:

consul/agent/consul/rpc.go (lines 641 to 660 in d1c879e):

```go
func (s *Server) globalRPC(method string, args interface{},
	reply structs.CompoundResponse) error {
	// Make a new request into each datacenter
	dcs := s.router.GetDatacenters()

	replies, total := 0, len(dcs)
	errorCh := make(chan error, total)
	respCh := make(chan interface{}, total)

	for _, dc := range dcs {
		go func(dc string) {
			rr := reply.New()
			if err := s.forwardDC(method, dc, args, &rr); err != nil {
				errorCh <- err
				return
			}
			respCh <- rr
		}(dc)
	}
```

We were unsure whether #8406 fixes this issue. After that fix, single-node DCs are healthy because a single server marks itself as healthy. That means even a call through the router should succeed.

@mkeeler suggested to:

> ensure that forwardRPC uses the ServerLookup instead of the Router for finding a server to use in the local datacenter

forwardRPC is only "misused" from globalRPC; in the other cases it properly checks whether forwarding is required. Changing this function to do the same check again doesn't seem right to me. If we were to optimize globalRPC, I would vote for checking against the local DC in that function and fielding the RPC directly.

I agree that a server shouldn't RPC itself, but apart from this missed optimization, there is no longer a bug.
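The alternative suggested here, a globalRPC-style fan-out that fields the local datacenter directly instead of routing through forwardDC, could be sketched like this. `fanOut` and its callbacks are illustrative names, not Consul's actual API:

```go
package main

import "fmt"

// fanOut issues one request per datacenter, handling the local DC in
// process instead of forwarding an RPC to ourselves.
func fanOut(localDC string, dcs []string, local func() string, forward func(dc string) string) []string {
	out := make([]string, 0, len(dcs))
	for _, dc := range dcs {
		if dc == localDC {
			// Field the request locally; no self-RPC and no router lookup.
			out = append(out, local())
			continue
		}
		out = append(out, forward(dc))
	}
	return out
}

func main() {
	replies := fanOut("dc1", []string{"dc1", "dc2"},
		func() string { return "local reply" },
		func(dc string) string { return "forwarded to " + dc })
	fmt.Println(replies)
}
```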

@mkeeler (Member, Author) commented Aug 10, 2020

So it turns out that the server does need to RPC itself.

The first server that receives the keyring request updates the WAN keyring and then issues globalRPC. That fan-out updates the LAN keyring of each DC. Therefore, to update the LAN keyring in the local DC, the second (self-directed) RPC is done.

Do we need an RPC to do this: No.
Is the code a little convoluted: Yes.
Is it functionally correct: Yes.
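The keyring flow described above can be modeled in miniature: the WAN ring is updated once, then every datacenter's LAN ring, including the local DC's, which is what the self-RPC accomplishes, receives the key. The types here (`keyring`, `installKey`) are hypothetical, not Consul's keyring implementation:

```go
package main

import "fmt"

// keyring models an installed-key set.
type keyring map[string]bool

// installKey: the coordinating server updates the WAN keyring, then the
// globalRPC fan-out installs the key into each DC's LAN keyring. The
// local DC appears in lanByDC too, which is why the self-RPC is needed.
func installKey(wan keyring, lanByDC map[string]keyring, key string) {
	wan[key] = true
	for _, lan := range lanByDC {
		lan[key] = true
	}
}

func main() {
	wan := keyring{}
	lans := map[string]keyring{"dc1": {}, "dc2": {}}
	installKey(wan, lans, "new-key")
	fmt.Println(wan["new-key"], lans["dc1"]["new-key"], lans["dc2"]["new-key"])
}
```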
