Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: prefer fed state gateway definitions if they're fresher #11522

Merged
merged 5 commits into from
Nov 9, 2021

Conversation

boxofrad
Copy link
Contributor

@boxofrad boxofrad commented Nov 8, 2021

Fixes an issue (described in #10132) where if two DCs are WAN federated over mesh gateways, and the gateway in the non-primary DC is terminated and receives a new IP address (as is commonly the case when running them
on ephemeral compute instances) the primary DC is unable to re-establish its connection until the agent running on its own gateway is restarted.

This was happening because we always preferred gateways discovered by the Internal.ServiceDump RPC (which would fail because there's no way to dial the remote DC) over those discovered in the federation state, which is replicated as long as the primary DC's gateway is reachable - see the comment in endpointsFromSnapshotMeshGateway.

Fixes an issue described in #10132, where if two DCs are WAN federated
over mesh gateways, and the gateway in the non-primary DC is terminated
and receives a new IP address (as is commonly the case when running them
on ephemeral compute instances) the primary DC is unable to re-establish
its connection until the agent running on its own gateway is restarted.

This was happening because we always preferred gateways discovered by
the `Internal.ServiceDump` RPC (which would fail because there's no way
to dial the remote DC) over those discovered in the federation state,
which is replicated as long as the primary DC's gateway is reachable.
@github-actions github-actions bot added the theme/envoy/xds Related to Envoy support label Nov 8, 2021
agent/xds/endpoints.go Outdated Show resolved Hide resolved
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging November 8, 2021 15:51 Inactive
@vercel vercel bot temporarily deployed to Preview – consul November 8, 2021 15:51 Inactive
@boxofrad boxofrad requested a review from rboyer November 8, 2021 15:52
@eculver
Copy link
Contributor

eculver commented Nov 8, 2021

Nice work. This looks correct to me. I'd try to get @freddygv and @rboyer to sign off too.

agent/xds/endpoints.go Outdated Show resolved Hide resolved
if !ok { // not possible
s.Logger.Error("skipping mesh gateway endpoints because no definition found", "datacenter", key)
continue
endpoints := cfgSnap.MeshGateway.GatewayGroups[key.String()].ShallowClone()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you feel about this alternative?

  1. collect the max inner Service.RaftIndex.ModifyIndex value from all MGW instances in GatewayGroups
  2. do the same for FedStateGateways instances
  3. depending upon which of the two data sources has the larger value, wholesale use the entire slice of data that came from it

The argument here is that on their own, each of these data sets come from singular RPC calls awoken from single blocking queries. They are each independently self-consistent, whichever one has more recent data by necessity would be in aggregate more correct, so there's no need to do a deep merge like this. Doing the either/or merge instead would also be easier to grok.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's much easier to reason about, thanks! 🙌🏻

@vercel vercel bot temporarily deployed to Preview – consul November 9, 2021 10:56 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging November 9, 2021 10:56 Inactive
@boxofrad boxofrad requested a review from rboyer November 9, 2021 10:57
Copy link
Contributor

@dhiaayachi dhiaayachi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question about a possible missing test, but other then that it LGTM!!

agent/proxycfg/testing.go Show resolved Hide resolved
@vercel vercel bot temporarily deployed to Preview – consul November 9, 2021 16:24 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging November 9, 2021 16:24 Inactive
Copy link
Member

@rboyer rboyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@boxofrad boxofrad merged commit 50a1f20 into main Nov 9, 2021
@boxofrad boxofrad deleted the boxofrad/issue-10132 branch November 9, 2021 16:45
@hc-github-team-consul-core
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/497440.

@hc-github-team-consul-core
Copy link
Contributor

🍒❌ Cherry pick of commit 50a1f20 onto release/1.10.x failed! Build Log

@hc-github-team-consul-core
Copy link
Contributor

🍒❌ Cherry pick of commit 50a1f20 onto release/1.9.x failed! Build Log

@hc-github-team-consul-core
Copy link
Contributor

🍒❌ Cherry pick of commit 50a1f20 onto release/1.8.x failed! Build Log

boxofrad added a commit that referenced this pull request Nov 9, 2021
Fixes an issue described in #10132, where if two DCs are WAN federated
over mesh gateways, and the gateway in the non-primary DC is terminated
and receives a new IP address (as is commonly the case when running them
on ephemeral compute instances) the primary DC is unable to re-establish
its connection until the agent running on its own gateway is restarted.

This was happening because we always preferred gateways discovered by
the `Internal.ServiceDump` RPC (which would fail because there's no way
to dial the remote DC) over those discovered in the federation state,
which is replicated as long as the primary DC's gateway is reachable.
boxofrad added a commit that referenced this pull request Nov 9, 2021
Fixes an issue described in #10132, where if two DCs are WAN federated
over mesh gateways, and the gateway in the non-primary DC is terminated
and receives a new IP address (as is commonly the case when running them
on ephemeral compute instances) the primary DC is unable to re-establish
its connection until the agent running on its own gateway is restarted.

This was happening because we always preferred gateways discovered by
the `Internal.ServiceDump` RPC (which would fail because there's no way
to dial the remote DC) over those discovered in the federation state,
which is replicated as long as the primary DC's gateway is reachable.
boxofrad added a commit that referenced this pull request Nov 9, 2021
Fixes an issue described in #10132, where if two DCs are WAN federated
over mesh gateways, and the gateway in the non-primary DC is terminated
and receives a new IP address (as is commonly the case when running them
on ephemeral compute instances) the primary DC is unable to re-establish
its connection until the agent running on its own gateway is restarted.

This was happening because we always preferred gateways discovered by
the `Internal.ServiceDump` RPC (which would fail because there's no way
to dial the remote DC) over those discovered in the federation state,
which is replicated as long as the primary DC's gateway is reachable.
boxofrad added a commit that referenced this pull request Nov 9, 2021
…#11532)

Fixes an issue described in #10132, where if two DCs are WAN federated
over mesh gateways, and the gateway in the non-primary DC is terminated
and receives a new IP address (as is commonly the case when running them
on ephemeral compute instances) the primary DC is unable to re-establish
its connection until the agent running on its own gateway is restarted.

This was happening because we always preferred gateways discovered by
the `Internal.ServiceDump` RPC (which would fail because there's no way
to dial the remote DC) over those discovered in the federation state,
which is replicated as long as the primary DC's gateway is reachable.
dhiaayachi pushed a commit that referenced this pull request Nov 9, 2021
…#11534)

Fixes an issue described in #10132, where if two DCs are WAN federated
over mesh gateways, and the gateway in the non-primary DC is terminated
and receives a new IP address (as is commonly the case when running them
on ephemeral compute instances) the primary DC is unable to re-establish
its connection until the agent running on its own gateway is restarted.

This was happening because we always preferred gateways discovered by
the `Internal.ServiceDump` RPC (which would fail because there's no way
to dial the remote DC) over those discovered in the federation state,
which is replicated as long as the primary DC's gateway is reachable.
boxofrad added a commit that referenced this pull request Nov 9, 2021
Fixes an issue described in #10132, where if two DCs are WAN federated
over mesh gateways, and the gateway in the non-primary DC is terminated
and receives a new IP address (as is commonly the case when running them
on ephemeral compute instances) the primary DC is unable to re-establish
its connection until the agent running on its own gateway is restarted.

This was happening because we always preferred gateways discovered by
the `Internal.ServiceDump` RPC (which would fail because there's no way
to dial the remote DC) over those discovered in the federation state,
which is replicated as long as the primary DC's gateway is reachable.
boxofrad added a commit that referenced this pull request Nov 10, 2021
Fixes an issue described in #10132, where if two DCs are WAN federated
over mesh gateways, and the gateway in the non-primary DC is terminated
and receives a new IP address (as is commonly the case when running them
on ephemeral compute instances) the primary DC is unable to re-establish
its connection until the agent running on its own gateway is restarted.

This was happening because we always preferred gateways discovered by
the `Internal.ServiceDump` RPC (which would fail because there's no way
to dial the remote DC) over those discovered in the federation state,
which is replicated as long as the primary DC's gateway is reachable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/envoy/xds Related to Envoy support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants