agent: handle re-bootstrapping in a secondary datacenter when WAN federation via mesh gateways is configured #7931
@@ -133,7 +238,6 @@ func getRandomItem(items []string) string {

type serverDelegate interface {
	blockingQuery(queryOpts structs.QueryOptionsCompat, queryMeta structs.QueryMetaCompat, fn queryFn) error
	PrimaryGatewayFallbackAddresses() []string
A stray piece of dead code left over from an earlier refactoring session.
// and does a zipper merge of the two sorted slices, removing any cross-slice
// duplicates. If any individual slice contained duplicates those will be
// retained.
func StringSliceMergeSorted(a, b []string) []string {
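The hunk above cuts off at the function signature. As a rough illustration of the behavior the doc comment describes, here is a minimal sketch of a zipper merge over two sorted slices. The name `mergeSorted` is hypothetical, and this is not the body from the PR, only one way to meet the documented contract (cross-slice duplicates removed, duplicates within a single slice retained).

```go
package main

import "fmt"

// mergeSorted is a hypothetical stand-in for the function in the hunk.
// It assumes both inputs are already sorted in ascending order.
func mergeSorted(a, b []string) []string {
	out := make([]string, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			out = append(out, a[i])
			i++
		case a[i] > b[j]:
			out = append(out, b[j])
			j++
		default:
			// Same value at the head of both slices: emit it once and
			// advance both cursors, which drops the cross-slice duplicate.
			out = append(out, a[i])
			i++
			j++
		}
	}
	// At most one of these tails is non-empty. Duplicates that live
	// inside a single slice pass through untouched.
	out = append(out, a[i:]...)
	out = append(out, b[j:]...)
	return out
}

func main() {
	fmt.Println(mergeSorted(
		[]string{"a", "c", "c"},
		[]string{"b", "c", "d"},
	)) // prints "[a b c c d]"
}
```

The second "c" survives because it is a duplicate within the first slice, not across the two.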
The `lib` package seems to violate the recommended practice for package names (see the "bad package names" guidance on blog.golang.org).
I think we should avoid adding anything new to the `lib` package, and attempt to move things out whenever possible.
How likely do you think it is that this function gets re-used? Maybe it can be unexported next to the one caller?
Alternatively maybe it could go into a `lib/stringsort` package.
While I agree with you in general, I'd like to avoid doing a refactor like that at the same time as implementing a feature. There are more functions than just this one that would end up in a package like `lib/stringslice`.
#7934 (to be merged after this one)
The main fix here is to always union the `primary-gateways` list with
the list of mesh gateways in the primary returned from the replicated
federation states list. This will allow any replicated (incorrect) state
to be supplemented with user-configured (correct) state in the config
file. Eventually the game of random selection whack-a-mole will pick a
winning entry and re-replicate the latest federation states from the
primary. If the user-configured state is actually the incorrect one,
then the same eventual correct selection process will work in that case,
too.
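To make the "union, then random selection" idea concrete, here is a minimal sketch. The helper names `unionAddrs` and `dialAny`, the addresses, and the attempt limit are all assumptions made for illustration; none of them come from the PR or from Consul's API.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"sort"
)

// unionAddrs returns the sorted, de-duplicated union of the user-configured
// gateway addresses and the ones learned from replicated federation states.
// (Hypothetical helper, not the PR's implementation.)
func unionAddrs(configured, replicated []string) []string {
	seen := make(map[string]struct{}, len(configured)+len(replicated))
	out := make([]string, 0, len(configured)+len(replicated))
	for _, list := range [][]string{configured, replicated} {
		for _, addr := range list {
			if _, ok := seen[addr]; ok {
				continue
			}
			seen[addr] = struct{}{}
			out = append(out, addr)
		}
	}
	sort.Strings(out)
	return out
}

// dialAny picks random candidates until one dial succeeds or the attempt
// budget runs out. A stale entry only costs a retry, it never blocks progress.
func dialAny(candidates []string, attempts int, dial func(addr string) error) (string, error) {
	if len(candidates) == 0 {
		return "", errors.New("no gateway candidates")
	}
	lastErr := errors.New("no attempts made")
	for n := 0; n < attempts; n++ {
		addr := candidates[rand.Intn(len(candidates))]
		if err := dial(addr); err != nil {
			lastErr = err
			continue
		}
		return addr, nil
	}
	return "", fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	configured := []string{"10.0.0.1:8443"}                  // assumed correct, from the config file
	replicated := []string{"10.9.9.9:8443", "10.0.0.1:8443"} // includes one assumed stale entry
	candidates := unionAddrs(configured, replicated)

	// Fake dialer: only the address assumed to be correct answers.
	dial := func(addr string) error {
		if addr == "10.0.0.1:8443" {
			return nil
		}
		return errors.New("connection refused")
	}

	addr, err := dialAny(candidates, 50, dial)
	fmt.Println(candidates, addr, err)
}
```

Because the correct address is always in the candidate set, it does not matter whether the replicated state or the configured state is the wrong one. Random selection reaches a working entry with high probability within a few attempts.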
The secondary fix is to finish making wanfed-via-mgws actually work as
originally designed. Once a secondary datacenter has replicated
federation states for the primary AND managed to stand up its own local
mesh gateways, all of the RPCs from the secondary to the primary
SHOULD go through two sets of mesh gateways to arrive at the Consul
servers in the primary (one hop through the secondary datacenter's mesh
gateway, and one hop through the primary datacenter's mesh gateway).
This was neglected in the initial implementation. Everything works
without it, but ideally communication that goes around the mesh
gateways should be treated as provided for bootstrapping purposes only.
Now we heuristically use the success/failure history of the federation
state replicator goroutine loop to determine if our current mesh gateway
route is working as intended. If it is, we try using the local gateways,
and if those don't work we fall back on trying the primary via the union
of the replicated state and the go-discover configuration flags.
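The heuristic described above can be sketched roughly as follows. The type `routeChooser`, its fields, and the consecutive-failure threshold are assumptions made for illustration; the PR text only says that the replicator's success/failure history drives the choice, not how that history is weighed.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errReplication is a placeholder error used by the demo below.
var errReplication = errors.New("replication failed")

// routeChooser is a hypothetical sketch: it records outcomes of the
// federation state replication loop and decides which gateways to try.
type routeChooser struct {
	mu          sync.Mutex
	useLocal    bool // false until replication has succeeded at least once
	failures    int  // consecutive replication failures
	maxFailures int  // assumed threshold before falling back to the primary

	localGateways   []string // mesh gateways in this (secondary) datacenter
	primaryGateways []string // union of replicated and configured primary gateways
}

// reportReplication is called once per replication attempt.
func (c *routeChooser) reportReplication(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if err == nil {
		// The current route works, so prefer the local gateways from now on.
		c.failures = 0
		c.useLocal = true
		return
	}
	c.failures++
	if c.failures >= c.maxFailures {
		// The route looks broken: go back to dialing the primary directly.
		c.useLocal = false
	}
}

// candidates returns the gateway addresses to try for the next RPC.
func (c *routeChooser) candidates() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.useLocal && len(c.localGateways) > 0 {
		return c.localGateways
	}
	return c.primaryGateways
}

func main() {
	c := &routeChooser{
		maxFailures:     3,
		localGateways:   []string{"local-gw:8443"},
		primaryGateways: []string{"primary-gw:8443"},
	}
	fmt.Println(c.candidates()) // bootstrapping: primary gateways

	c.reportReplication(nil)
	fmt.Println(c.candidates()) // replication worked: local gateways

	for i := 0; i < 3; i++ {
		c.reportReplication(errReplication)
	}
	fmt.Println(c.candidates()) // repeated failures: back to the primary
}
```

Starting with `useLocal` set to false mirrors the bootstrapping case, where the secondary has no working local route yet and must reach the primary directly.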
This can be improved slightly in the future by possibly initializing the
gateway choice to local on startup if we already have replicated state.
This PR does not address that improvement.
Fixes #7339