-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xds: prefer fed state gateway definitions if they're fresher #11522
Conversation
Fixes an issue described in #10132, where if two DCs are WAN federated over mesh gateways, and the gateway in the non-primary DC is terminated and receives a new IP address (as is commonly the case when running them on ephemeral compute instances) the primary DC is unable to re-establish its connection until the agent running on its own gateway is restarted. This was happening because we always preferred gateways discovered by the `Internal.ServiceDump` RPC (which would fail because there's no way to dial the remote DC) over those discovered in the federation state, which is replicated as long as the primary DC's gateway is reachable.
agent/xds/endpoints.go
Outdated
if !ok { // not possible | ||
s.Logger.Error("skipping mesh gateway endpoints because no definition found", "datacenter", key) | ||
continue | ||
endpoints := cfgSnap.MeshGateway.GatewayGroups[key.String()].ShallowClone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you feel about this alternative?
- collect the max inner
Service.RaftIndex.ModifyIndex
value from all MGW instances inGatewayGroups
- do the same for
FedStateGateways
instances - depending upon which of the two data sources has the larger value, wholesale use the entire slice of data that came from it
The argument here is that on their own, each of these data sets come from singular RPC calls awoken from single blocking queries. They are each independently self-consistent, whichever one has more recent data by necessity would be in aggregate more correct, so there's no need to do a deep merge like this. Doing the either/or merge instead would also be easier to grok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's much easier to reason about, thanks! 🙌🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a question about a possible missing test, but other then that it LGTM!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/497440. |
Fixes an issue described in #10132, where if two DCs are WAN federated over mesh gateways, and the gateway in the non-primary DC is terminated and receives a new IP address (as is commonly the case when running them on ephemeral compute instances) the primary DC is unable to re-establish its connection until the agent running on its own gateway is restarted. This was happening because we always preferred gateways discovered by the `Internal.ServiceDump` RPC (which would fail because there's no way to dial the remote DC) over those discovered in the federation state, which is replicated as long as the primary DC's gateway is reachable.
Fixes an issue described in #10132, where if two DCs are WAN federated over mesh gateways, and the gateway in the non-primary DC is terminated and receives a new IP address (as is commonly the case when running them on ephemeral compute instances) the primary DC is unable to re-establish its connection until the agent running on its own gateway is restarted. This was happening because we always preferred gateways discovered by the `Internal.ServiceDump` RPC (which would fail because there's no way to dial the remote DC) over those discovered in the federation state, which is replicated as long as the primary DC's gateway is reachable.
Fixes an issue described in #10132, where if two DCs are WAN federated over mesh gateways, and the gateway in the non-primary DC is terminated and receives a new IP address (as is commonly the case when running them on ephemeral compute instances) the primary DC is unable to re-establish its connection until the agent running on its own gateway is restarted. This was happening because we always preferred gateways discovered by the `Internal.ServiceDump` RPC (which would fail because there's no way to dial the remote DC) over those discovered in the federation state, which is replicated as long as the primary DC's gateway is reachable.
…#11532) Fixes an issue described in #10132, where if two DCs are WAN federated over mesh gateways, and the gateway in the non-primary DC is terminated and receives a new IP address (as is commonly the case when running them on ephemeral compute instances) the primary DC is unable to re-establish its connection until the agent running on its own gateway is restarted. This was happening because we always preferred gateways discovered by the `Internal.ServiceDump` RPC (which would fail because there's no way to dial the remote DC) over those discovered in the federation state, which is replicated as long as the primary DC's gateway is reachable.
…#11534) Fixes an issue described in #10132, where if two DCs are WAN federated over mesh gateways, and the gateway in the non-primary DC is terminated and receives a new IP address (as is commonly the case when running them on ephemeral compute instances) the primary DC is unable to re-establish its connection until the agent running on its own gateway is restarted. This was happening because we always preferred gateways discovered by the `Internal.ServiceDump` RPC (which would fail because there's no way to dial the remote DC) over those discovered in the federation state, which is replicated as long as the primary DC's gateway is reachable.
Fixes an issue described in #10132, where if two DCs are WAN federated over mesh gateways, and the gateway in the non-primary DC is terminated and receives a new IP address (as is commonly the case when running them on ephemeral compute instances) the primary DC is unable to re-establish its connection until the agent running on its own gateway is restarted. This was happening because we always preferred gateways discovered by the `Internal.ServiceDump` RPC (which would fail because there's no way to dial the remote DC) over those discovered in the federation state, which is replicated as long as the primary DC's gateway is reachable.
Fixes an issue described in #10132, where if two DCs are WAN federated over mesh gateways, and the gateway in the non-primary DC is terminated and receives a new IP address (as is commonly the case when running them on ephemeral compute instances) the primary DC is unable to re-establish its connection until the agent running on its own gateway is restarted. This was happening because we always preferred gateways discovered by the `Internal.ServiceDump` RPC (which would fail because there's no way to dial the remote DC) over those discovered in the federation state, which is replicated as long as the primary DC's gateway is reachable.
Fixes an issue (described in #10132) where if two DCs are WAN federated over mesh gateways, and the gateway in the non-primary DC is terminated and receives a new IP address (as is commonly the case when running them
on ephemeral compute instances) the primary DC is unable to re-establish its connection until the agent running on its own gateway is restarted.
This was happening because we always preferred gateways discovered by the
Internal.ServiceDump
RPC (which would fail because there's no way to dial the remote DC) over those discovered in the federation state, which is replicated as long as the primary DC's gateway is reachable - see the comment inendpointsFromSnapshotMeshGateway
.