grpc: ensure that streaming gRPC requests work over mesh gateway based wan federation #10838

Merged
merged 9 commits into from
Aug 24, 2021

Conversation

rboyer
Member

@rboyer rboyer commented Aug 12, 2021

Fixes #10796

@rboyer rboyer added type/bug Feature does not function as expected theme/streaming Related to Streaming connections between server and client backport/1.10 labels Aug 12, 2021
@rboyer rboyer requested a review from a team August 12, 2021 18:40
@rboyer rboyer self-assigned this Aug 12, 2021
@@ -4524,6 +4527,9 @@ LOOP:
}

// This is a mirror of a similar test in agent/consul/server_test.go
//
// TODO(rb): implement something similar to this as a full containerized test suite with proper
Member Author

Calling out this TODO, since I could not find a way to do the "firewalling" of the servers from each other in-process.

@@ -24,8 +24,10 @@ type Deps struct {
type GRPCClientConner interface {
ClientConn(datacenter string) (*grpc.ClientConn, error)
ClientConnLeader() (*grpc.ClientConn, error)
SetGatewayResolver(func(string) string)
Member Author

This gRPC client pool is created in agent/setup.go for both servers and clients, but only servers eventually create a GatewayResolver for themselves in agent/consul/server.go. For the regular RPC connection pool, the gateway resolver field is set after the struct is created (but before it's used). I did the same here to avoid heavily refactoring the agent construction code.
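
A minimal sketch of the late-set pattern described above. The type `clientConnPool`, the helper `gatewayFor`, and the address are hypothetical and are not Consul's actual code; the mutex is an extra precaution in this sketch, whereas the PR relies on the resolver being set before first use.

```go
package main

import (
	"fmt"
	"sync"
)

// clientConnPool is a hypothetical stand-in for the gRPC client pool.
type clientConnPool struct {
	mu              sync.RWMutex
	gatewayResolver func(datacenter string) string
}

// SetGatewayResolver installs the resolver after construction. In this
// sketch only servers call it; for clients the field stays nil.
func (p *clientConnPool) SetGatewayResolver(fn func(string) string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.gatewayResolver = fn
}

// gatewayFor returns the gateway address for a datacenter, or "" when no
// resolver was installed or the resolver knows no gateway.
func (p *clientConnPool) gatewayFor(datacenter string) string {
	p.mu.RLock()
	defer p.mu.RUnlock()
	if p.gatewayResolver == nil {
		return ""
	}
	return p.gatewayResolver(datacenter)
}

func main() {
	p := &clientConnPool{} // created early, shared by servers and clients
	fmt.Printf("%q\n", p.gatewayFor("dc2"))

	// Later, server-only setup wires in the resolver.
	p.SetGatewayResolver(func(dc string) string {
		if dc == "dc2" {
			return "10.0.0.5:8443" // made-up gateway address
		}
		return ""
	})
	fmt.Printf("%q\n", p.gatewayFor("dc2"))
}
```

The nil check in `gatewayFor` is what lets the same pool type serve agents that never install a resolver.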

@@ -293,7 +293,7 @@ func (s *Server) handleNativeTLS(conn net.Conn) {
s.handleSnapshotConn(tlsConn)

case pool.ALPN_RPCGRPC:
- s.grpcHandler.Handle(conn)
+ s.grpcHandler.Handle(tlsConn)
Member Author

This was an existing bug, likely due to a simple copy-paste gone awry. All of the other branches take a tlsConn instead of a conn to be on the correct side of the multiplex envelope.

@@ -390,6 +394,7 @@ func NewServer(config *Config, flat Deps) (*Server, error) {
s.config.PrimaryDatacenter,
)
s.connPool.GatewayResolver = s.gatewayLocator.PickGateway
s.grpcConnPool.SetGatewayResolver(s.gatewayLocator.PickGateway)
Member Author

Here's the late-set mentioned earlier.

}

return tlsConn, nil
func (t *Transport) dial(dc, nodeName, nextProto string) (net.Conn, error) {
Member Author

Instead of ending up with three similar copies of the same function structure, I refactored things so that there is now only one implementation of this dialing approach, shared by all three locations.

s.lock.RLock()
defer s.lock.RUnlock()

for _, server := range s.servers {
if server.Addr.String() == addr {
Member Author

Formerly addr was just the bare IP address of the server. When using wanfed over mesh gateways it's technically completely fine for a single server in each datacenter to have the same IP address, so it's no longer OK to attempt to look up a server based on the IP address alone.

Similar to how the grpc resolver logic prefixes the server IDs with the datacenter, I'm just prefixing these addrs with the datacenter since we also control the dialing side and can decode these as necessary before they actually get used to open sockets.
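
A rough sketch of the datacenter-prefixing idea. The helper names `encodeServerAddr` and `decodeServerAddr` and the `/` separator are assumptions made for illustration; the PR does not show the actual encoding Consul uses.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// sep is an assumed separator for this sketch, chosen because it cannot
// appear in a host:port string.
const sep = "/"

// encodeServerAddr prefixes a server address with its datacenter so that
// identical IPs in different datacenters map to distinct keys.
func encodeServerAddr(dc, addr string) string {
	return dc + sep + addr
}

// decodeServerAddr splits a prefixed address back into datacenter and the
// bare address, which is what would be used to open a socket.
func decodeServerAddr(s string) (dc, addr string, err error) {
	i := strings.Index(s, sep)
	if i <= 0 || i == len(s)-1 {
		return "", "", errors.New("address is missing a datacenter prefix or host")
	}
	return s[:i], s[i+1:], nil
}

func main() {
	// Two servers in different datacenters that share one IP address.
	servers := map[string]string{
		encodeServerAddr("dc1", "10.0.0.1:8300"): "server-in-dc1",
		encodeServerAddr("dc2", "10.0.0.1:8300"): "server-in-dc2",
	}
	fmt.Println(len(servers), "distinct entries")

	dc, addr, err := decodeServerAddr(encodeServerAddr("dc2", "10.0.0.1:8300"))
	fmt.Println(dc, addr, err)
}
```

Because the dialing side also controls decoding, the prefix never has to be a valid network address on its own.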

- RPCMultiplexV2 = 4
- RPCSnapshot = 5
- RPCGossip = 6
+ RPCRaft RPCType = 1
Member Author

Since an iota isn't used here, only RPCConsul had the type RPCType. Something I did in this PR surfaced this issue so I just fixed it in place.
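
To illustrate the typing issue: in a Go const block, a constant declared with its own `= value` and no explicit type is an untyped constant, regardless of the type on the line above it. The `Old...` names below are made up for the comparison and only two constants are shown.

```go
package main

import "fmt"

type RPCType byte

// Before: only the first constant carries the explicit type. The second
// has its own "= 1" expression, so it is an untyped integer constant.
const (
	OldRPCConsul RPCType = 0
	OldRPCRaft           = 1
)

// After: every constant states its type.
const (
	RPCConsul RPCType = 0
	RPCRaft   RPCType = 1
)

// typeName reports the dynamic type a value has once stored in an interface.
func typeName(v interface{}) string {
	return fmt.Sprintf("%T", v)
}

func main() {
	fmt.Println(typeName(OldRPCConsul)) // the named RPCType
	fmt.Println(typeName(OldRPCRaft))   // int, the default type of an untyped integer constant
	fmt.Println(typeName(RPCRaft))      // the named RPCType
}
```

The untyped form usually goes unnoticed because untyped constants convert implicitly wherever an RPCType is expected; it only surfaces where the constant's own type matters, such as type inference or interface values.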

agent/pool/pool.go (outdated, resolved)
if tcp, ok := rawConn.(*net.TCPConn); ok {
_ = tcp.SetKeepAlive(true)
_ = tcp.SetNoDelay(true)
if nextProto != ALPN_RPCGRPC {
Member Author

I thought I saw a reference in the grpc setup code to keepalives being controlled elsewhere, so since this wasn't being done in the existing grpc pool code I opted to keep this section off for grpc.

agent/consul/client_test.go (outdated, resolved)
agent/grpc/resolver/resolver.go (outdated, resolved)
agent/pool/pool.go (outdated, resolved)
agent/grpc/client.go (outdated, resolved)
agent/pool/pool.go (outdated, resolved)
@rboyer rboyer requested a review from freddygv August 20, 2021 20:23
return conn, err
}

d := net.Dialer{LocalAddr: cfg.SrcAddr, Timeout: pool.DefaultDialTimeout}
Member Author

Made the non-gateway version of grpc also have a non-infinite dial timeout like all of the others.

)

- const defaultDialTimeout = 10 * time.Second
+ const DefaultDialTimeout = 10 * time.Second
Member Author

Exported this so it could be used from the agent/grpc package as well.

gwAddr := gatewayResolver(dc)
if gwAddr == "" {
return nil, nil, structs.ErrDCNotAvailable
}

- dialer := &net.Dialer{LocalAddr: src, Timeout: defaultDialTimeout}
+ dialer := &net.Dialer{LocalAddr: srcAddr, Timeout: DefaultDialTimeout}
Member Author

All rpc types get a 10s timeout now.

Contributor

@freddygv freddygv left a comment

LGTM, thanks!

agent/consul/client_test.go (outdated, resolved)
…lient-to-server so this does not make sense to do
@rboyer
Member Author

rboyer commented Aug 24, 2021

@freddygv I added a test in agent/grpc that covers having the gRPC dialing code correctly dial a server via a gateway using the ALPN approach.

@rboyer rboyer merged commit 5b6d96d into main Aug 24, 2021
@rboyer rboyer deleted the fix-grpc-over-mgw-wanfed branch August 24, 2021 21:28
@hc-github-team-consul-core
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/432909.

@hc-github-team-consul-core
Contributor

🍒❌ Cherry pick of commit 5b6d96d onto release/1.10.x failed! Build Log

rboyer added a commit that referenced this pull request Aug 24, 2021
…eway based wan federation

Backport of #10838 to 1.10.x
rboyer added a commit that referenced this pull request Aug 24, 2021
…eway based wan federation

Backport of #10838 to 1.10.x
rboyer added a commit that referenced this pull request Aug 25, 2021
Labels
theme/streaming Related to Streaming connections between server and client type/bug Feature does not function as expected
Development

Successfully merging this pull request may close these issues.

Consul 1.10.1 does not support WAN-fed-via-mesh-gateways without disabling streaming
3 participants