
WIP Centralize SNI config #6237

Closed
wants to merge 1 commit

Conversation

@banks (Member) commented Jul 29, 2019

Currently we rely on "integration" code in the xDS package "knowing" how to build the right SNI for a given cluster. This is kinda gross because that same logic would need to be copied into any other proxy integration in the future. We should centralize it so that integrations can just consume it.

I spent a long time thinking about where semantically this should be exposed in our API. Initially it seemed like it could be exposed during service discovery, but we can't do that because the service discovery query itself doesn't currently know which specific subset is being addressed. We also need to resolve exactly one SNI value for all instances per cluster, which means it would really need to appear at the top level of the response, but our HTTP API currently returns a list at the top level 😭.

This means that tools that federate workloads from external systems will need to register the service instances in Consul and then create or modify the service-defaults for the service to set the ExternalSNI param, such that outbound connections get the correct (non-Connect) SNI value set. For now we assume that any single service (even if there are multiple instances of that service or multiple gateways through which it can be accessed) will have a single SNI value to address it, which seems in line with other mesh implementations to date.
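
A minimal sketch of what that config-entry step might look like from a federation tool, assuming the `ExternalSNI` param lands on `service-defaults` roughly as described and ends up exposed through the Go `api` client (the `ExternalSNI` field here is an assumption based on this PR, not a shipped API):

```go
package sketch

import "github.com/hashicorp/consul/api"

// setExternalSNI writes a service-defaults entry for an externally federated
// service so that outbound connections use the given (non-Connect) SNI.
// The ExternalSNI field mirrors the WIP change in this PR; the final field
// name and semantics may differ.
func setExternalSNI(client *api.Client, service, sni string) error {
	entry := &api.ServiceConfigEntry{
		Kind:        api.ServiceDefaults,
		Name:        service,
		ExternalSNI: sni, // assumed param, per this PR
	}
	_, _, err := client.ConfigEntries().Set(entry, nil)
	return err
}
```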

TODO:

 - thread `CompileRequest.CurrentTrustDomain` through other tests and real call sites
 - add `Compile` tests for other permutations of overridden SNI
 - fix xDS package to use the new `DiscoveryTarget.SNI` field where it can. Probably needs some of the other SNI stuff put back to cover code paths not using discovery chain yet.
 - maybe fix other tests that broke in API etc?

@banks banks requested a review from rboyer July 29, 2019 15:24
@rboyer (Member) commented Jul 30, 2019

#6242 introduces a DiscoveryTargetConfig map in the compiler results. I think the target-specific value payloads like the SNI should likely end up there.

```go
}

func (c *compiler) sniForTarget(t structs.DiscoveryTarget, externalSNI string) string {
	return connect.ServiceSNI(t.Service, t.ServiceSubset, t.Namespace,
```

Member:

Should this function return externalSNI if it's not ""?

banks (Member Author):

Yes, I guess that is still a WIP part :)

@mkeeler (Member) left a review comment:

I have a couple thoughts on this.

  1. I think the correct integration point for proxies (including if we were to split the xDS stuff out of the Consul binary) would be to provide an agent endpoint to pull the generic proxy config, looking something like the proxycfg.ConfigSnapshot struct (probably not identical, because that currently has a lot of tracking information in it for watches). This endpoint would need to return all the listener, cluster, endpoint and routing information that the proxy would need to configure itself (a rough sketch of such a response follows after this list).

  2. Yes much of this will need to be put back in place for all the things which do not use the discovery chain (all of Mesh Gateways and prepared query upstreams to name two of them).
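
A very rough sketch of the kind of response point 1 describes; none of these type or field names exist anywhere, they only illustrate the idea of a proxycfg-shaped endpoint without the internal watch bookkeeping:

```go
package sketch

// ProxyConfigResponse is a purely hypothetical shape for the generic
// proxy-config endpoint floated above: roughly proxycfg.ConfigSnapshot minus
// its watch/tracking state, carrying everything a proxy or mesh gateway
// would need to configure itself.
type ProxyConfigResponse struct {
	Listeners []ListenerConfig
	Clusters  []ClusterConfig
	Endpoints map[string][]EndpointConfig // keyed by cluster name / SNI
	Routes    []RouteConfig
}

type ListenerConfig struct {
	Name        string
	BindAddress string
	Port        int
}

type ClusterConfig struct {
	Name string
	SNI  string // centralized here so integrations don't re-derive it
}

type EndpointConfig struct {
	Address string
	Port    int
}

type RouteConfig struct {
	Name string
}
```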


```go
// Work out if SNI has been overridden for this service
externalSNI := ""
if serviceDefaults != nil && serviceDefaults.ExternalSNI != "" {
```

Member:

Are we not going to allow modifying the SNI at the service instance level? I haven't thought through completely whether that would be good or bad (I have a hunch that instance-level SNI is probably not a good UX).

banks (Member Author):

Not for now, because in our own model the SNI needs to vary based on the discovery path taken (e.g. to identify the intended subset), which an individual instance can't know about, so it seems like it would get horribly confusing.

@banks (Member Author) commented Aug 14, 2019

Hey @mkeeler, I'm trying to catch up with your thought process on those comments.

  1. I think the correct integration point for proxies...

I think this is in response to "I spent a long time thinking about where semantically this should be exposed in our API." The integration point we settled on for external proxies that want to support L7 features (including routing via gateways) is the new HTTP API RB added to the RC, which is effectively a cleaned-up Compile endpoint. That leaves proxy integrations doing a bunch of the proxycfg work themselves though. I think you have a good point that it would be a much cleaner integration API if we actually exposed the output of proxycfg directly, so a proxy implementation basically gets the same interface that the existing xDS package has and all necessary info is present. One possible downside is that delivery over HTTP long poll could be costly, since it's a potentially large result for a proxy with many upstreams and would need to be redelivered in its entirety on every single upstream instance health status change. This is all over localhost so it might not be insurmountable, but it's worth considering.

I think for now the HTTP API for the compiled chain at least makes it possible for external integrations to have equivalent functionality to Envoy. Long term I think we want to make that a more concrete plugin API that would not necessarily use HTTP and so could bypass the potential efficiency concern above. It would also need to treat each watched resource more independently than the whole proxy snapshot to avoid needing to resend the complete thing on each update though?
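
For illustration, a minimal sketch of an external integration pulling the compiled chain over HTTP, assuming the endpoint in question is the `/v1/discovery-chain/<service>` one added with the discovery chain work (the response is decoded loosely here since the schema was still settling at the time):

```go
package sketch

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// fetchCompiledChain asks the local agent for the compiled discovery chain of
// a service. An external proxy integration could use this to learn cluster
// targets and their SNI values without re-implementing the compile logic.
func fetchCompiledChain(service string) (map[string]interface{}, error) {
	resp, err := http.Get(fmt.Sprintf("http://127.0.0.1:8500/v1/discovery-chain/%s", service))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var chain map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&chain); err != nil {
		return nil, err
	}
	return chain, nil
}
```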

In the interest of making it possible to do this for the upcoming integration deadline, are you happy with moving forward with adding it here?

  2. Yes much of this will need to be put back in place for all the things which do not use the discovery chain (all of Mesh Gateways and prepared query upstreams to name two of them).

Does your PR fix this, RB?

@mkeeler (Member) commented Aug 14, 2019

@banks That was in response to the "how should we expose this in our API" bit that I saw.

The discovery chain is fine as an integration point for things that want to support L7 features. However, if someone wanted to implement their own Mesh Gateway, that endpoint is useless. None of the Mesh Gateway stuff cares about the discovery chain (nor should it). That's why exposing the proxycfg somehow seems like the correct place: it's the one place that can cover both normal proxies and mesh gateways.

I don't think that has a big impact on this PR (maybe none at all).

One problem I can see that the PR doesn't address yet (it is WIP) is how Mesh Gateways should route traffic with overridden SNI values.

  1. How does the Mesh Gateway know what SNI value should route to a given service in its local DC? It could watch the service-defaults (it doesn't currently need to) and apply the same logic that you have already implemented in getGroupResolverNode. If so, that logic probably needs to be split out to somewhere not in the discovery chain.

  2. For an overridden SNI value in the service-defaults, what is the expected interaction with subsets from a service-resolver? Should it just prepend a <subset>. to the overridden SNI value?

  3. How can a Mesh Gateway know how to route traffic for an overridden SNI value between datacenters? Right now it uses wildcard rules that look for *.<datacenter>.internal.<trust domain>. This is probably the hardest problem of the 3 to solve. We could have exact matches for every overridden SNI value in remote datacenters and send them to the correct place. That would involve watching all the service-defaults, and for every one of those that overrides the SNI, also watching that service in all datacenters. There is still one problem though: how can we differentiate between that service running in one DC and that service running in another DC? I believe this could end up in an endless gateway loop. Let's say service foo exists in dc1 and dc2 with some overridden SNI value. If something in dc1 attempts to contact foo in dc2 (maybe due to failover), the gateway in dc1 would see the SNI and route it towards dc2. The gateway in dc2 would see the SNI and turn it around and route it back towards dc1. This is because dc-local service SNIs are only considered after cross-dc wildcard SNIs (and potentially overridden SNIs for services in other dcs). A rough sketch of this matching problem follows after this comment.

Maybe we just say that if you are using an external SNI then you cannot have the same service in multiple datacenters? Another alternative would be to define some sort of routing rules so that we could send a specific SNI to a specific datacenter even when it hasn't encoded the datacenter in the name.
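
To make point 3 concrete, a hypothetical sketch (all names invented here) of why a gateway can route Connect-shaped SNIs by datacenter but has no datacenter signal for overridden SNIs:

```go
package sketch

import "strings"

// routeForSNI illustrates the problem above. Connect-derived SNIs embed the
// datacenter, so a gateway can route them with wildcard-style matching.
// Overridden (external) SNIs carry no datacenter label, so the gateway would
// need an exact-match table built by watching every service-defaults entry,
// and even then two DCs advertising the same SNI are indistinguishable.
func routeForSNI(sni, trustDomain string, externalSNIToDC map[string]string) (datacenter string, ok bool) {
	// Connect-shaped names look like [<subset>.]<service>.<namespace>.<dc>.internal.<trust-domain>
	if idx := strings.Index(sni, ".internal."+trustDomain); idx >= 0 {
		labels := strings.Split(sni[:idx], ".")
		return labels[len(labels)-1], true // the label just before ".internal." is the datacenter
	}
	// Overridden SNIs can only be matched exactly; which DC to send them to is ambiguous.
	dc, ok := externalSNIToDC[sni]
	return dc, ok
}
```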

@rboyer (Member) commented Aug 14, 2019

One piece of awkwardness about the compiler-based approach: the normal trust domain has to be known before the discovery chain compiler request is triggered. This means the initial watch on the compiler output has to wait until we first receive the roots back from that notify call before we can even put a request for the discovery chain in flight. Since those two would be interlinked, we'd need to track the CancelFunc for the watch should the roots change, like we already do for upstream queries when the chain changes.
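
A rough sketch of that interlinking, assuming a plain `context.CancelFunc` is tracked so the chain watch can be torn down and re-issued whenever the trust domain changes (names are illustrative, not Consul's actual ones):

```go
package sketch

import "context"

// chainWatcher sketches the interlinked-watch pattern described above: the
// discovery-chain request can only be issued once the CA roots (and so the
// trust domain) are known, and must be cancelled and re-issued if the trust
// domain ever changes.
type chainWatcher struct {
	trustDomain string
	cancelChain context.CancelFunc
}

func (w *chainWatcher) onRootsUpdate(
	ctx context.Context,
	trustDomain string,
	startChainWatch func(ctx context.Context, trustDomain string),
) {
	if trustDomain == w.trustDomain && w.cancelChain != nil {
		return // trust domain unchanged; keep the existing chain watch running
	}
	if w.cancelChain != nil {
		w.cancelChain() // tear down the watch keyed to the old trust domain
	}
	chainCtx, cancel := context.WithCancel(ctx)
	w.trustDomain = trustDomain
	w.cancelChain = cancel
	go startChainWatch(chainCtx, trustDomain)
}
```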

@banks (Member Author) commented Aug 14, 2019

Good questions @mkeeler.

My thinking was that the real use case here was only for external services and that we'd never try to route them through our own mesh gateways, e.g. assume that anything with an ExternalSNI set is going directly to some SNI gateway outside of Consul. In that sense our gateways don't need to change at all.

That said, in the current formulation here there is nothing stopping you from setting ExternalSNI on a regular service, which does indeed have confusing semantics when it comes to gateways and routing.

I'd be happy to find the least-bad minimal change we can make so we don't have to support any customisation of SNI through mesh gateways for now and still keep the UX not terrible. Perhaps we can validate that ExternalSNI is only set for external services somehow at config write time. Or perhaps we can define that setting that value explicitly declares the service to be external and fundamentally alters how we handle it - i.e. we explicitly treat all instances returned by the resolver as external gateways so they never go via our own gateways and never fail over, etc.?

@rboyer interesting. I don't have good ideas about how to fix that really. Do you?
