wan federation via mesh gateways re-bootstrapping #7339

Closed
rboyer opened this issue Feb 21, 2020 · 1 comment · Fixed by #7931
Labels: post-beta, theme/federation-usability (Anything related to Federation)

Comments

rboyer (Member) commented Feb 21, 2020

There's a recovery-type situation for wanfed via gateways that we probably need to have a story around:

  1. user initializes with the -primary-gateway args
  2. system uses fallback addresses from (1) to do the initial wan join
  3. after wan join completes and the rest of the secondary bootstrapping happens we can replicate federation states
  4. now we can exclusively use the primary's federation state list of gateways instead of the fallback.

But...what if there was some non-graceful rotation of the primary gateways that didn't give the secondaries time to replicate at least one of the new mesh-gateway addresses? Then the secondary's replicated federation state information is useless AND, since having any primary federation state trumps the flag/config fallback data, it will try to use gateways that don't exist "forever".

We probably need some sort of active process in the GatewayLocator that periodically does a simple, light RPC (status check?) to the wildcard primary server address through each mesh-gateway listed in the state store for the primary. It needs to heuristically determine whether the state machine gateways are still usable and, if not, go back to using the fallback ones until the state machine ones work again (presumably after we can hook up federation state replication streams to the primary again).
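To make the proposal concrete, here is a minimal Go sketch of the "probe, then fall back" idea. It is not Consul's actual GatewayLocator code: `pickGateways`, `probeFunc`, `errUnreachable`, and the addresses are all hypothetical names invented for illustration, and the probe stands in for whatever light status RPC would really be used.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable is a hypothetical error returned by a failed probe.
var errUnreachable = errors.New("gateway unreachable")

// probeFunc is a hypothetical stand-in for the light status-check RPC sent
// to the primary's servers through a single mesh gateway address.
type probeFunc func(gatewayAddr string) error

// pickGateways returns the replicated gateway addresses that still answer
// the probe. If none of them do, it returns the fallback addresses (the ones
// supplied via flags/config) and reports that the fallback is in use.
func pickGateways(replicated, fallback []string, probe probeFunc) (addrs []string, usingFallback bool) {
	var healthy []string
	for _, addr := range replicated {
		if probe(addr) == nil {
			healthy = append(healthy, addr)
		}
	}
	if len(healthy) > 0 {
		return healthy, false
	}
	return fallback, true
}

func main() {
	// Simulate a non-graceful rotation: the replicated addresses are gone,
	// and only the address from the flags still answers.
	alive := map[string]bool{"10.0.0.9:8443": true}
	probe := func(addr string) error {
		if alive[addr] {
			return nil
		}
		return errUnreachable
	}

	replicated := []string{"10.0.0.1:8443", "10.0.0.2:8443"}
	fallback := []string{"10.0.0.9:8443"}

	addrs, usingFallback := pickGateways(replicated, fallback, probe)
	fmt.Println(addrs, usingFallback)
}
```

In a real agent this check would presumably run on a timer, so that the locator switches back to the replicated list once federation state replication has caught up.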

@rboyer rboyer added this to the 1.8.0 milestone Feb 21, 2020
@rboyer rboyer self-assigned this Feb 21, 2020

oleksiyp commented Mar 7, 2020

A great feature to have. Is there any ETA for associated PR to be merged and released?

@rboyer rboyer added the theme/federation-usability Anything related to Federation label Mar 9, 2020
rboyer added a commit that referenced this issue May 19, 2020
… the contents of the primary-gateways flag

Fixes #7339
rboyer added a commit that referenced this issue May 20, 2020
…eration via mesh gateways is configured

The main fix here is to always union the `primary-gateways` list with
the list of mesh gateways in the primary returned from the replicated
federation states list. This will allow any replicated (incorrect) state
to be supplemented with user-configured (correct) state in the config
file. Eventually the game of random selection whack-a-mole will pick a
winning entry and re-replicate the latest federation states from the
primary. If the user-configured state is actually the incorrect one,
then the same eventual correct selection process will work in that case,
too.
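
As an illustration of the union-and-retry idea in the paragraph above, the following Go sketch merges the two address lists and tries candidates in random order. It is an assumption-laden sketch, not the code from the commit: `unionGateways`, `dialAny`, the `dial` callback, and the addresses are hypothetical, and the repeated random picks are compressed into a single shuffled pass.

```go
package main

import (
	"fmt"
	"math/rand"
)

// unionGateways merges the replicated primary gateway addresses with the
// user-configured primary-gateways list, dropping duplicates while keeping
// first-seen order (replicated entries first).
func unionGateways(configured, replicated []string) []string {
	seen := make(map[string]struct{}, len(configured)+len(replicated))
	var out []string
	for _, list := range [][]string{replicated, configured} {
		for _, addr := range list {
			if _, ok := seen[addr]; ok {
				continue
			}
			seen[addr] = struct{}{}
			out = append(out, addr)
		}
	}
	return out
}

// dialAny tries the candidates in random order and returns the first address
// for which the hypothetical dial callback reports success.
func dialAny(candidates []string, dial func(addr string) bool) (string, bool) {
	for _, i := range rand.Perm(len(candidates)) {
		if dial(candidates[i]) {
			return candidates[i], true
		}
	}
	return "", false
}

func main() {
	// The replicated state is stale; only the configured address works.
	replicated := []string{"10.0.0.1:8443", "10.0.0.2:8443"}
	configured := []string{"10.0.0.9:8443", "10.0.0.1:8443"}

	candidates := unionGateways(configured, replicated)
	dial := func(addr string) bool { return addr == "10.0.0.9:8443" }

	addr, ok := dialAny(candidates, dial)
	fmt.Println(candidates, addr, ok)
}
```

Because the union contains both sources, it does not matter which one is stale: as long as one entry in either list is reachable, some attempt eventually lands on it.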

The secondary fix is to finish making wanfed-via-mgws work as
originally designed. Once a secondary datacenter has replicated
federation states for the primary AND managed to stand up its own local
mesh gateways then all of the RPCs from a secondary to the primary
SHOULD go through two sets of mesh gateways to arrive in the consul
servers in the primary (one hop for the secondary datacenter's mesh
gateway, and one hop through the primary datacenter's mesh gateway).
This was neglected in the initial implementation. While everything
works, ideally we should treat communications that go around the mesh
gateways as just provided for bootstrapping purposes.

Now we heuristically use the success/failure history of the federation
state replicator goroutine loop to determine if our current mesh gateway
route is working as intended. If it is, we try using the local gateways,
and if those don't work we fall back on trying the primary via the union
of the replicated state and the go-discover configuration flags.
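
One way to picture the heuristic described above is a small state holder that is fed the outcome of each replication attempt. The sketch below is only a guess at the shape of such logic: `routeChooser`, `recordReplication`, the consecutive-failure threshold, and the route names are hypothetical and are not taken from the Consul source.

```go
package main

import "fmt"

// route says which set of mesh gateways a secondary dials to reach the primary.
type route int

const (
	// viaPrimaryGateways dials the primary's gateways directly, using the
	// union of replicated state and configured addresses.
	viaPrimaryGateways route = iota
	// viaLocalGateways goes out through the secondary's own mesh gateways.
	viaLocalGateways
)

// routeChooser tracks recent replication outcomes. The threshold of
// consecutive failures is an assumed tuning knob for this sketch.
type routeChooser struct {
	maxFailures int
	failures    int
	current     route
}

func newRouteChooser(maxFailures int) *routeChooser {
	return &routeChooser{maxFailures: maxFailures, current: viaPrimaryGateways}
}

// recordReplication is called after each federation state replication
// attempt and returns the route to use for the next one.
func (c *routeChooser) recordReplication(ok, haveLocalGateways bool) route {
	if ok {
		c.failures = 0
		// Replication works, so prefer the local gateways when they exist.
		if haveLocalGateways {
			c.current = viaLocalGateways
		}
		return c.current
	}
	c.failures++
	if c.failures >= c.maxFailures {
		// Too many failures in a row: fall back to dialing the primary.
		c.current = viaPrimaryGateways
	}
	return c.current
}

func main() {
	c := newRouteChooser(2)
	outcomes := []bool{true, false, false, true}
	for _, ok := range outcomes {
		fmt.Println(ok, c.recordReplication(ok, true) == viaLocalGateways)
	}
}
```

Starting on the direct-to-primary route mirrors the bootstrapping case; the commit message notes that initializing to the local route when replicated state already exists is left as a future improvement.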

This can be improved slightly in the future by possibly initializing the
gateway choice to local on startup if we already have replicated state.
This PR does not address that improvement.

Fixes #7339
rboyer added a commit that referenced this issue May 27, 2020
…eration via mesh gateways is configured (#7931)

hashicorp-ci pushed a commit that referenced this issue May 27, 2020
…eration via mesh gateways is configured (#7931)

3 participants