Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TCP Keepalives to Envoy -> Contour xDS comms #1744

Closed
youngnick opened this issue Oct 21, 2019 · 5 comments · Fixed by #1746, #1758 or #1766
Closed

Add TCP Keepalives to Envoy -> Contour xDS comms #1744

youngnick opened this issue Oct 21, 2019 · 5 comments · Fixed by #1746, #1758 or #1766
Assignees
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone

Comments

@youngnick
Copy link
Member

By default, Envoy will not tear down half open TCP connections for xDS, and will believe that it's connected.
See envoyproxy/envoy#6323 and envoyproxy/envoy#5173 for more background.

We should configure tcp keepalives for the Envoy -> Contour connection to ensure that half-open connections are flushed.

A sample config from envoyproxy#5173:

    static_resources:
      clusters:
      - name: {{ cluster_name }}_ads
        connect_timeout: { seconds: 1 }
        dns_refresh_rate: { seconds: 10 }
        type: STRICT_DNS
        ...
        upstream_connection_options:
          tcp_keepalive:
            keepalive_probes: 3
            keepalive_time: 30
            keepalive_interval: 5

Thanks to @ravilr for noting this one.

It's also possible that #1523 might be related to this.

@youngnick youngnick added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 21, 2019
@youngnick youngnick added this to the 1.0.0-rc.2 milestone Oct 21, 2019
@youngnick youngnick self-assigned this Oct 21, 2019
youngnick pushed a commit to youngnick/contour that referenced this issue Oct 21, 2019
Fixes projectcontour#1744.

Signed-off-by: Nick Young <ynick@vmware.com>
youngnick pushed a commit that referenced this issue Oct 21, 2019
Fixes #1744.

Signed-off-by: Nick Young <ynick@vmware.com>
@bgagnon
Copy link
Contributor

bgagnon commented Oct 21, 2019

@youngnick I was just about to comment on #1523 to say we were successful in reproducing the issue by turning off our forced EDS updates for ~48 hours -- this caused 15 out of 16 envoys to stop receiving updates from Contour.

I immediately started looking into timeout issues and found out the bootstrap was missing TCP keepalive options in 0.15. I was going to open the exact same merge request at #1746 next.

In any case, I'll add more details to #1523 but it's looking more and more like you've fixed it with b183d04 🎉

@bgagnon
Copy link
Contributor

bgagnon commented Oct 21, 2019

@youngnick sorry for asking, but do you think there's a chance this could be back-ported to the 0.15.x series? we'd like to test this important bug fix prior to our beta and 1.0 changes

youngnick pushed a commit to youngnick/contour that referenced this issue Oct 22, 2019
Fixes projectcontour#1744

Signed-off-by: Nick Young <ynick@vmware.com>
youngnick pushed a commit to youngnick/contour that referenced this issue Oct 22, 2019
Fixes projectcontour#1744

Signed-off-by: Nick Young <ynick@vmware.com>
youngnick pushed a commit that referenced this issue Oct 22, 2019
Fixes #1744

Signed-off-by: Nick Young <ynick@vmware.com>
@bgagnon
Copy link
Contributor

bgagnon commented Oct 22, 2019

Should the PermitWithoutStream option be enabled as well? This is what we've put in place in our XDS proxy:

	grpcServer := grpc.NewServer(
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			PermitWithoutStream: true, // Allow pings even when there are no active streams
		}),
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    60 * time.Second, // Ping the client if it is idle for 60 seconds to ensure the connection is still active
			Timeout: 5 * time.Second,  // Wait 5 seconds for the ping ack before assuming the connection is dead
		}))

Given the design of XDS I guess we almost always have a stream open... so I am not sure about this.

@davecheney
Copy link
Contributor

@bgagnon this probably won't hurt.

@youngnick
Copy link
Member Author

@bgagnon thanks, I missed that. D'oh.

@youngnick youngnick reopened this Oct 22, 2019
youngnick pushed a commit to youngnick/contour that referenced this issue Oct 22, 2019
Fixes projectcontour#1744 properly this time.

Signed-off-by: Nick Young <ynick@vmware.com>
youngnick pushed a commit that referenced this issue Oct 22, 2019
Fixes #1744 properly this time.

Signed-off-by: Nick Young <ynick@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
None yet
3 participants