xds: prefer fed state gateway definitions if they're fresher #11522

boxofrad · 2021-11-08T14:55:58Z

Fixes an issue (described in #10132) where, if two DCs are WAN federated over mesh gateways and the gateway in the non-primary DC is terminated and receives a new IP address (as is commonly the case when running them on ephemeral compute instances), the primary DC is unable to re-establish its connection until the agent running on its own gateway is restarted.

This was happening because we always preferred gateways discovered by the `Internal.ServiceDump` RPC (which would fail because there's no way to dial the remote DC) over those discovered in the federation state, which is replicated as long as the primary DC's gateway is reachable. See the comment in `endpointsFromSnapshotMeshGateway`.

agent/xds/endpoints.go

eculver · 2021-11-08T17:03:49Z

Nice work. This looks correct to me. I'd try to get @freddygv and @rboyer to sign off too.

agent/xds/endpoints.go

rboyer · 2021-11-08T17:43:52Z

agent/xds/endpoints.go

-			if !ok { // not possible
-				s.Logger.Error("skipping mesh gateway endpoints because no definition found", "datacenter", key)
-				continue
+		endpoints := cfgSnap.MeshGateway.GatewayGroups[key.String()].ShallowClone()


How would you feel about this alternative?

- Collect the max inner `Service.RaftIndex.ModifyIndex` value from all MGW instances in `GatewayGroups`.
- Do the same for `FedStateGateways` instances.
- Depending upon which of the two data sources has the larger value, wholesale use the entire slice of data that came from it.

The argument here is that, on its own, each of these data sets comes from a single RPC call awoken from a single blocking query. They are each independently self-consistent, so whichever one has more recent data would by necessity be more correct in aggregate, and there's no need to do a deep merge like this. Doing the either/or merge instead would also be easier to grok.
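The either/or selection proposed above can be sketched roughly as follows. This is a minimal illustration and not the code that was merged: `gatewayInstance`, `maxModifyIndex`, and `pickFresher` are hypothetical names, the struct models only the fields needed for the comparison, and sending ties to the `GatewayGroups` data is an assumption.

```go
package main

// gatewayInstance is a hypothetical stand-in for a mesh gateway service
// instance; only the fields needed for the comparison are modelled.
type gatewayInstance struct {
	Address     string
	ModifyIndex uint64 // stands in for Service.RaftIndex.ModifyIndex
}

// maxModifyIndex returns the highest ModifyIndex found in instances,
// or 0 when the slice is empty.
func maxModifyIndex(instances []gatewayInstance) uint64 {
	var highest uint64
	for _, inst := range instances {
		if inst.ModifyIndex > highest {
			highest = inst.ModifyIndex
		}
	}
	return highest
}

// pickFresher returns, wholesale, whichever slice contains the highest
// ModifyIndex. No per-instance merge is attempted. On a tie the
// GatewayGroups data is returned (an assumption made for this sketch).
func pickFresher(gatewayGroups, fedStateGateways []gatewayInstance) []gatewayInstance {
	if maxModifyIndex(fedStateGateways) > maxModifyIndex(gatewayGroups) {
		return fedStateGateways
	}
	return gatewayGroups
}
```

With this shape, a gateway that has come back on a new IP address is picked up as soon as the federation state carrying it has a higher index than the stale data from the other source.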

That's much easier to reason about, thanks! 🙌🏻

dhiaayachi

I have a question about a possible missing test, but other than that it LGTM!!

agent/proxycfg/testing.go

rboyer

LGTM

hc-github-team-consul-core · 2021-11-09T16:46:19Z

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/497440.

hc-github-team-consul-core · 2021-11-09T16:46:23Z

🍒❌ Cherry pick of commit 50a1f20 onto release/1.10.x failed! Build Log

hc-github-team-consul-core · 2021-11-09T16:46:24Z

🍒❌ Cherry pick of commit 50a1f20 onto release/1.9.x failed! Build Log

hc-github-team-consul-core · 2021-11-09T16:46:27Z

🍒❌ Cherry pick of commit 50a1f20 onto release/1.8.x failed! Build Log

github-actions bot added the theme/envoy/xds label (Related to Envoy support) Nov 8, 2021

boxofrad commented Nov 8, 2021

agent/xds/endpoints.go (outdated, resolved)

boxofrad added 2 commits November 8, 2021 15:43

Slightly easier to read conditional

ef68264

Changelog entry

19f68d9

vercel bot temporarily deployed to Preview – consul-ui-staging November 8, 2021 15:51 Inactive

vercel bot temporarily deployed to Preview – consul November 8, 2021 15:51 Inactive

boxofrad requested a review from rboyer November 8, 2021 15:52

rboyer reviewed Nov 8, 2021

agent/xds/endpoints.go (outdated, resolved)

rboyer reviewed Nov 8, 2021

Pick whichever set of gateways contains the highest ModifyIndex

f9989cd

vercel bot temporarily deployed to Preview – consul November 9, 2021 10:56 Inactive

vercel bot temporarily deployed to Preview – consul-ui-staging November 9, 2021 10:56 Inactive

boxofrad requested a review from rboyer November 9, 2021 10:57

boxofrad added backport/1.10 labels Nov 9, 2021

dhiaayachi approved these changes Nov 9, 2021

agent/proxycfg/testing.go (resolved)

Add test for when the data in FedStateGateways is staler

75b3b61

vercel bot temporarily deployed to Preview – consul November 9, 2021 16:24 Inactive

vercel bot temporarily deployed to Preview – consul-ui-staging November 9, 2021 16:24 Inactive

rboyer approved these changes Nov 9, 2021

boxofrad merged commit 50a1f20 into main Nov 9, 2021

boxofrad deleted the boxofrad/issue-10132 branch November 9, 2021 16:45

boxofrad mentioned this pull request Nov 9, 2021

mesh-federated wan clusters do not reconnect on an secondary mesh-gateway outage #10132

Closed

boxofrad mentioned this pull request Nov 9, 2021

Backport 1.10.x: xds: prefer fed state gateway definitions if they're fresher #11531

Merged

boxofrad mentioned this pull request Nov 9, 2021

Backport 1.9.x: xds: prefer fed state gateway definitions if they're fresher #11532

Merged

boxofrad mentioned this pull request Nov 9, 2021

Backport 1.8.x: xds: prefer fed state gateway definitions if they're fresher #11534

Merged
