Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: prefer fed state gateway definitions if they're fresher #11522

Merged
merged 5 commits into from
Nov 9, 2021
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .changelog/11522.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
```release-note:bug
xds: fixes a bug where replacing a mesh gateway node used for WAN federation (with another that has a different IP) could leave gateways in the other DC unable to re-establish the connection
```
41 changes: 41 additions & 0 deletions agent/proxycfg/testing.go
Original file line number Diff line number Diff line change
Expand Up @@ -1506,6 +1506,47 @@ func TestConfigSnapshotMeshGatewayUsingFederationStates(t testing.T) *ConfigSnap
return testConfigSnapshotMeshGateway(t, true, true)
}

func TestConfigSnapshotMeshGatewayNewerInformationInFederationStates(t testing.T) *ConfigSnapshot {
dhiaayachi marked this conversation as resolved.
Show resolved Hide resolved
snap := TestConfigSnapshotMeshGateway(t)

csn1 := snap.MeshGateway.GatewayGroups["dc2"][0]
csn2 := snap.MeshGateway.GatewayGroups["dc2"][1]

// Create a duplicate entry in FedStateGateways, with a higher ModifyIndex, to
// verify that its IP is chosen over the one in GatewayGroups (198.18.1.1).
svc1 := structs.TestNodeServiceMeshGatewayWithAddrs(t,
"10.0.1.3", 8443,
structs.ServiceAddress{Address: "10.0.1.3", Port: 8443},
structs.ServiceAddress{Address: "198.18.1.3", Port: 443},
)
svc1.RaftIndex.ModifyIndex = csn1.Service.ModifyIndex + 1

// Create another duplicate entry, this time without increasing its ModifyIndex,
// to verify that the IP in GatewayGroups (198.18.1.2) is chosen.
svc2 := structs.TestNodeServiceMeshGatewayWithAddrs(t,
"10.0.1.4", 8443,
structs.ServiceAddress{Address: "10.0.1.4", Port: 8443},
structs.ServiceAddress{Address: "198.18.1.4", Port: 443},
)

snap.MeshGateway.FedStateGateways = map[string]structs.CheckServiceNodes{
"dc2": {
{
Node: csn1.Node,
Service: svc1,
Checks: csn1.Checks,
},
{
Node: csn2.Node,
Service: svc2,
Checks: csn2.Checks,
},
},
}

return snap
}

func TestConfigSnapshotMeshGatewayNoServices(t testing.T) *ConfigSnapshot {
return testConfigSnapshotMeshGateway(t, false, false)
}
Expand Down
61 changes: 55 additions & 6 deletions agent/xds/endpoints.go
Original file line number Diff line number Diff line change
Expand Up @@ -123,15 +123,64 @@ func (s *ResourceGenerator) endpointsFromSnapshotMeshGateway(cfgSnap *proxycfg.C
continue
}

endpoints, ok := cfgSnap.MeshGateway.GatewayGroups[key.String()]
if !ok {
endpoints, ok = cfgSnap.MeshGateway.FedStateGateways[key.String()]
if !ok { // not possible
s.Logger.Error("skipping mesh gateway endpoints because no definition found", "datacenter", key)
continue
endpoints := cfgSnap.MeshGateway.GatewayGroups[key.String()].ShallowClone()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you feel about this alternative?

  1. collect the max inner Service.RaftIndex.ModifyIndex value from all MGW instances in GatewayGroups
  2. do the same for FedStateGateways instances
  3. depending upon which of the two data sources has the larger value, wholesale use the entire slice of data that came from it

The argument here is that on their own, each of these data sets come from singular RPC calls awoken from single blocking queries. They are each independently self-consistent, whichever one has more recent data by necessity would be in aggregate more correct, so there's no need to do a deep merge like this. Doing the either/or merge instead would also be easier to grok.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's much easier to reason about, thanks! 🙌🏻


// Merge in gateways from the federation state, handling duplicates by picking
// the one with the greatest ModifyIndex.
//
// Mesh gateways in remote DCs are discovered in two ways:
//
// 1. Via an Internal.ServiceDump RPC in the remote DC (GatewayGroups).
// 2. In the federation state that is replicated from the primary DC (FedStateGateways).
//
// Previously, GatewayGroups was always given presedence over FedStateGateways
// but this is problematic when using mesh gateways for WAN federation.
//
// Consider the following example:
//
// - Primary and Secondary DCs are WAN Federated via local mesh gateways.
//
// - Secondary DC's mesh gateway is running on an ephemeral compute instance
// and is abruptly terminated and rescheduled with a *new IP address*.
//
// - Primary DC's mesh gateway is no longer able to connect to the Secondary
// DC as its proxy is configured with the old IP address. Therefore any RPC
// from the Primary to the Secondary DC will fail (including the one to
// discover the gateway's new IP address).
//
// - Secondary DC performs its regular anti-entropy of federation state data
// to the Primary DC (this succeeds as there is still connectivity in this
// direction).
//
// - At this point the Primary DC's mesh gateway should observe the new IP
// address and reconfigure its proxy, however as we always prioritised
// GatewayGroups this didn't happen and the connection remained severed.
fedStateEndpoints := cfgSnap.MeshGateway.FedStateGateways[key.String()]
for _, fse := range fedStateEndpoints {
idx := -1

for i, e := range endpoints {
boxofrad marked this conversation as resolved.
Show resolved Hide resolved
if e.Service.Service == fse.Service.Service && e.Node.ID == fse.Node.ID {
idx = i
break
}
}

switch {
case idx == -1:
// Definition is only present in the federation state, add it wholesale.
endpoints = append(endpoints, fse)
case endpoints[idx].Service.RaftIndex.ModifyIndex < fse.Service.RaftIndex.ModifyIndex:
boxofrad marked this conversation as resolved.
Show resolved Hide resolved
// Definition in the federation state is fresher, prefer it over the other.
endpoints[idx] = fse
}
}

if len(endpoints) == 0 {
s.Logger.Error("skipping mesh gateway endpoints because no definition found", "datacenter", key)
continue
}

{ // standard connect
clusterName := connect.GatewaySNI(key.Datacenter, key.Partition, cfgSnap.Roots.TrustDomain)

Expand Down
4 changes: 4 additions & 0 deletions agent/xds/endpoints_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -245,6 +245,10 @@ func TestEndpointsFromSnapshot(t *testing.T) {
create: proxycfg.TestConfigSnapshotMeshGatewayUsingFederationStates,
setup: nil,
},
{
name: "mesh-gateway-newer-information-in-federation-states",
create: proxycfg.TestConfigSnapshotMeshGatewayNewerInformationInFederationStates,
},
{
name: "mesh-gateway-no-services",
create: proxycfg.TestConfigSnapshotMeshGatewayNoServices,
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,145 @@
{
"versionInfo": "00000001",
"resources": [
{
"@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
"clusterName": "bar.default.dc1.internal.11111111-2222-3333-4444-555555555555.consul",
"endpoints": [
{
"lbEndpoints": [
{
"endpoint": {
"address": {
"socketAddress": {
"address": "172.16.1.6",
"portValue": 2222
}
}
},
"healthStatus": "HEALTHY",
"loadBalancingWeight": 1
},
{
"endpoint": {
"address": {
"socketAddress": {
"address": "172.16.1.7",
"portValue": 2222
}
}
},
"healthStatus": "HEALTHY",
"loadBalancingWeight": 1
},
{
"endpoint": {
"address": {
"socketAddress": {
"address": "172.16.1.8",
"portValue": 2222
}
}
},
"healthStatus": "HEALTHY",
"loadBalancingWeight": 1
}
]
}
]
},
{
"@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
"clusterName": "dc2.internal.11111111-2222-3333-4444-555555555555.consul",
"endpoints": [
{
"lbEndpoints": [
{
"endpoint": {
"address": {
"socketAddress": {
"address": "198.18.1.3",
"portValue": 443
}
}
},
"healthStatus": "HEALTHY",
"loadBalancingWeight": 1
},
{
"endpoint": {
"address": {
"socketAddress": {
"address": "198.18.1.2",
"portValue": 443
}
}
},
"healthStatus": "HEALTHY",
"loadBalancingWeight": 1
}
]
}
]
},
{
"@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
"clusterName": "foo.default.dc1.internal.11111111-2222-3333-4444-555555555555.consul",
"endpoints": [
{
"lbEndpoints": [
{
"endpoint": {
"address": {
"socketAddress": {
"address": "172.16.1.3",
"portValue": 2222
}
}
},
"healthStatus": "HEALTHY",
"loadBalancingWeight": 1
},
{
"endpoint": {
"address": {
"socketAddress": {
"address": "172.16.1.4",
"portValue": 2222
}
}
},
"healthStatus": "HEALTHY",
"loadBalancingWeight": 1
},
{
"endpoint": {
"address": {
"socketAddress": {
"address": "172.16.1.5",
"portValue": 2222
}
}
},
"healthStatus": "HEALTHY",
"loadBalancingWeight": 1
},
{
"endpoint": {
"address": {
"socketAddress": {
"address": "172.16.1.9",
"portValue": 2222
}
}
},
"healthStatus": "HEALTHY",
"loadBalancingWeight": 1
}
]
}
]
}
],
"typeUrl": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
"nonce": "00000001"
}