Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: initialize mgw-wanfed to use local gateways more on startup #9528

Merged
merged 4 commits into from
Jan 25, 2021

Conversation

rboyer
Copy link
Member

@rboyer rboyer commented Jan 7, 2021

Fixes #9342

@rboyer rboyer requested a review from a team January 7, 2021 22:09
@rboyer rboyer self-assigned this Jan 7, 2021
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The atomic alignment thing seems worth a quick fix although it's in test code unless I'm mistaken there.

Overall I think I understand what's going on here and the tests look thorough. I'm not super confident I understand the intricacies of the WAN Fed process enough to be sure this is correct or complete overall but it appears to do something shaped like a solution to the issue you filed!

I think if you have specific concerns or uncertainties about it then I'd be happy to spend more time digging through and trying things out to give a more confident approval, but I trust your judgement if you have high confidence this is correct now!

agent/consul/gateway_locator_test.go Outdated Show resolved Hide resolved
@@ -93,17 +106,25 @@ func TestGatewayLocator(t *testing.T) {
)
g.SetUseReplicationSignal(isLeader)

t.Run("before first run", func(t *testing.T) {
assert.True(t, g.DialPrimaryThroughLocalGateway()) // defaults to sure!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As someone not super familiar with this code. It's not very clear to me why this test case should expect to dial through gateways when it has no data?

Doesn't that mean it doesn't know where primary gateways are yet and so it should try servers direct?

If this is correct because we just opportunistically dial gateways and fail if there are none a comment here might help!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So DialPrimaryThroughLocalGateway just keeps track of the simple failure detector logic and does not include information about federation state catalog data.

It primarily gets used this way:

		if g.datacenter == g.primaryDatacenter {
			addrs = g.primaryGateways
		} else if g.DialPrimaryThroughLocalGateway() && len(g.localGateways) > 0 {
			addrs = g.localGateways
		} else {
			// Note calling StringSliceMergeSorted only works because both
			// inputs are pre-sorted. If for some reason one of the lists has
			// *duplicates* (which shouldn't happen) it's not great but it
			// won't break anything other than biasing our eventual random
			// choice a little bit.
			addrs = stringslice.MergeSorted(g.primaryGateways, g.PrimaryGatewayFallbackAddresses())
		}

In this PR we explicitly initialize this failure detector to the success state on startup:

@@ -283,6 +289,8 @@ func NewGatewayLocator(
                primaryGatewaysReadyCh: make(chan struct{}),
        }
        g.logPrimaryDialingMessage(g.DialPrimaryThroughLocalGateway())
+       // initialize
+       g.SetLastFederationStateReplicationError(nil, false)
        return g
 }

before actually making an observation. That's the main change. This means that the gateway locator will just assume that the local gateways are the better option UNTIL they fail, to encourage the system to stick with the local gateways even on startup. If there aren't actually any local gateways then the calling code will ignore this decision and do something else.

Base automatically changed from mgw-wanfed-followers to master January 22, 2021 16:03
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging January 25, 2021 21:11 Inactive
@vercel vercel bot temporarily deployed to Preview – consul January 25, 2021 21:11 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging January 25, 2021 21:34 Inactive
@vercel vercel bot temporarily deployed to Preview – consul January 25, 2021 21:34 Inactive
@rboyer rboyer merged commit c608dc0 into master Jan 25, 2021
@rboyer rboyer deleted the mgw-wanfed-startup branch January 25, 2021 23:30
@hashicorp-ci
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/317129.

@hashicorp-ci
Copy link
Contributor

🍒✅ Cherry pick of commit c608dc0 onto release/1.9.x succeeded!

@hashicorp-ci
Copy link
Contributor

🍒✅ Cherry pick of commit c608dc0 onto release/1.8.x succeeded!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

wanfed via mesh gateways in secondary datacenters should use gateways on startup if configured
3 participants