
Address SNI binding and bypass for TLS listeners #72

Closed
jpeach opened this issue Feb 11, 2020 · 25 comments
Assignees: youngnick
Labels: documentation, kind/feature, lifecycle/frozen
Milestone: v1alpha2

Comments
jpeach (Contributor) commented Feb 11, 2020

What would you like to be added:

In the current API, certificates are a property of the Listener and Routes are a property of the Gateway. Since there is no direct relationship between the certificates and the backend service (i.e. the Route), this structure encourages the separation of SNI routing from HTTP request routing. This can lead to a number of different security issues.

Why is this needed:

Operators need the ability to pin the set of certificates used for a specific backend service. That is, for a TLS session bound to the SNI name foo.example.com, always forward HTTP requests to the backend service that is bound to the foo.example.com host name.

For a more thorough discussion of SNI bypass, see https://hal.inria.fr/hal-01202712/document

xref #49

jpeach added the kind/feature label on Feb 11, 2020
jpeach (Contributor, Author) commented Feb 11, 2020

See also RFC 6066, which strongly suggests that the HTTP host should be bound to the SNI name:

Since it is possible for a client to present a different server_name in the application protocol, application server implementations that rely upon these names being the same MUST check to make sure the client did not present a different name in the application protocol.
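
To make that check concrete, here is a minimal Go net/http sketch (illustrative only, not Envoy's, Contour's, or any other data plane's actual implementation): it compares the request authority against the negotiated SNI name and answers 421 Misdirected Request on a mismatch, so a well-behaved client can retry on a new connection. The cert.pem/key.pem paths and port are placeholders.

```go
package main

import (
	"log"
	"net"
	"net/http"
	"strings"
)

// enforceSNIBinding rejects any request whose HTTP authority does not match
// the SNI name that the TLS handshake was negotiated for, as RFC 6066 says
// servers that rely on the two names matching must check.
func enforceSNIBinding(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.TLS == nil || r.TLS.ServerName == "" {
			next.ServeHTTP(w, r) // plaintext listener, or no SNI presented
			return
		}
		host := r.Host
		if h, _, err := net.SplitHostPort(host); err == nil {
			host = h // strip any :port before comparing
		}
		if !strings.EqualFold(host, r.TLS.ServerName) {
			// 421 asks the client to retry on a new connection (RFC 7540,
			// section 9.1.2) rather than silently routing to another host.
			http.Error(w, "misdirected request", http.StatusMisdirectedRequest)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello from " + r.Host + "\n"))
	})
	// cert.pem and key.pem are placeholder file names for this sketch.
	srv := &http.Server{Addr: ":8443", Handler: enforceSNIBinding(mux)}
	log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
}
```

The same comparison could just as well live in a proxy filter or in the route-matching step; the point is only that it happens before the request is forwarded to a backend.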

bowei (Contributor) commented Feb 13, 2020

  • How do the major proxy implementations handle this?
  • Is there possibly a use case for allowing an SNI/HTTP Host mismatch? (I'm guessing no.)
  • Is this dependent on how you set up your config (e.g., if we mandate such behavior, is it possible to achieve)?

youngnick (Contributor) commented:

Contour intends to stop this from happening, but in practice it's very easy to do it accidentally, via implementation details.

There are some use cases for allowing this (basically, domain fronting and similar), but I think they are specialised enough that we should default to not supporting it and figure it out later.

The key thing here is that, if we include the possibility of client certificates as part of the TLS config, SNI/HTTP Host mismatch is a big security risk. If you allow the TLS negotiation to succeed for host foo, including the client certificate exchange, and that exchange is being used for authentication, then allowing the inner HTTP transaction to go to a different host completely bypasses it.

The use case here is that I'm using a Gateway for ultrasecure.foo.com, which requires TLS with a client certificate, and the same Gateway for insecure.foo.com, which requires serving TLS only. TLS is terminated at the Gateway, and may be re-encrypted afterwards.

In this case, requests with an SNI of insecure.foo.com will complete the TLS handshake with the proxy, and, if you send a Host header of ultrasecure.foo.com, you will completely bypass the client certificate step.

I would argue this is very unexpected behaviour that creates a big problem for users, one they may not even realise they have.

I think this problem is dependent on how we set up the constructs. As @jpeach said earlier, there needs to be a tight binding between TLS certificate config and the allowed hostnames associated with that certificate, for HTTP Routes. If we're talking about TCP with SNI routing only, then there is no Host, and thus this problem is not visible at this layer. (The service forwarded to may have it, but that is not our responsibility).
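
To illustrate that handshake/authorization split, here is a small Go crypto/tls sketch (an assumption-laden illustration, not Contour's or Envoy's code; the hostname comes from the example above and the clientCAFile path is hypothetical). The client-certificate requirement is selected from the SNI name alone, which is exactly why the HTTP layer must also re-check Host against the negotiated SNI name.

```go
package tlspolicy

import (
	"crypto/tls"
	"crypto/x509"
	"os"
)

// newServerTLSConfig requires a client certificate only when the client's
// SNI is ultrasecure.foo.com. Because the decision keys off SNI alone, a
// handshake for insecure.foo.com followed by "Host: ultrasecure.foo.com"
// bypasses the client-certificate check unless the HTTP layer verifies
// that Host matches the negotiated SNI name.
func newServerTLSConfig(clientCAFile string) (*tls.Config, error) {
	caPEM, err := os.ReadFile(clientCAFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	base := &tls.Config{MinVersion: tls.VersionTLS12}
	base.GetConfigForClient = func(hello *tls.ClientHelloInfo) (*tls.Config, error) {
		cfg := base.Clone()
		if hello.ServerName == "ultrasecure.foo.com" {
			// Only the sensitive virtual host demands a client certificate.
			cfg.ClientAuth = tls.RequireAndVerifyClientCert
			cfg.ClientCAs = pool
		}
		return cfg, nil
	}
	return base, nil
}
```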

bowei (Contributor) commented Feb 14, 2020

It's not clear why this is not an artifact of a specific implementation as opposed to something that surfaces at the API layer.
For example, if your proxy carried the SNI field as context to the HTTP protocol processor, it could reject such mismatches if configured to do so.

Can you sketch what the Kubernetes API validation would look like that would prevent this?
Or is the statement that the user can guarantee such a thing cannot happen by crafting their configuration with a given shape?
There is nothing to prevent users from configuring a server cert that doesn't match their HTTPRoute, even within a Listener block.
Is the suggestion that we parse the cert and do validation on each of the configured domains?

As an aside, it seems like, for the ultrasecure.foo.com vs insecure.foo.com use case, we would recommend that users create two Gateways or even separate GatewayClasses that correspond to completely isolated infrastructures. From a compliance perspective (e.g. PCI), this would be pretty much mandated.
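
One possible shape for the "parse the cert and validate the configured domains" check asked about above, as a hedged Go sketch (a hypothetical helper, not an actual Gateway API admission webhook): parse the Listener's serving certificate and confirm every route hostname is covered by one of its SANs, wildcards included.

```go
package validation

import (
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
)

// routeHostnamesCoveredByCert is a hypothetical validation helper: it checks
// that every hostname configured on a route is covered by the server
// certificate bound to the Listener, so SNI routing and HTTP routing cannot
// silently diverge.
func routeHostnamesCoveredByCert(certPEM []byte, hostnames []string) error {
	block, _ := pem.Decode(certPEM)
	if block == nil {
		return errors.New("no PEM block found in certificate")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return fmt.Errorf("parsing certificate: %w", err)
	}
	for _, h := range hostnames {
		// VerifyHostname understands wildcard SANs such as *.example.com.
		if err := cert.VerifyHostname(h); err != nil {
			return fmt.Errorf("hostname %q is not covered by the certificate: %w", h, err)
		}
	}
	return nil
}
```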

jpeach (Contributor, Author) commented Feb 16, 2020

The first thing that we need in this issue is agreement that binding the HTTP Host/Authority to the certificate is the correct behavior. My contention that this is the case is supported by RFC 6066 and RFC 2818.

Now, as to the API implications of this, I think that the first step is for the API to be able to express the relationship between a virtual host and its TLS configuration. Once we can do this, the controller implementation has been told the user's intention, and can do a lot more to ensure the correct result.
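
Purely as an illustration of expressing that relationship (a hypothetical Go shape for discussion, not the actual v1alpha1 types):

```go
package api

// VirtualHostTLS is a hypothetical shape shown only to illustrate the kind
// of binding being discussed: the hostname and the certificate that serves
// it live together, so a controller knows which routes the certificate is
// allowed to front.
type VirtualHostTLS struct {
	// Hostname is the SNI name / HTTP authority this configuration serves.
	Hostname string `json:"hostname"`
	// CertificateSecretName references the Secret holding the serving
	// certificate and key for Hostname.
	CertificateSecretName string `json:"certificateSecretName"`
	// RouteNames lists the routes that may receive traffic for Hostname;
	// an implementation would reject routes whose hostnames do not match.
	RouteNames []string `json:"routeNames"`
}
```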

bowei (Contributor) commented Mar 4, 2020

Example: every certificate comes with the set of hostnames that SHOULD be used.

fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jun 3, 2020
robscott (Member) commented Jun 3, 2020

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Jun 3, 2020
fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Sep 1, 2020
szuecs commented Sep 2, 2020

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Sep 2, 2020
robscott (Member) commented:

@jpeach Do we still need to do anything here or did the Gateway restructuring help? If so, what would we need for v1alpha1?

robscott (Member) commented:

Consensus seems to be we need to add some documentation here.

hbagdi added the documentation label on Nov 13, 2020
fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Feb 11, 2021
fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Mar 13, 2021
robscott (Member) commented Apr 8, 2021

/remove-lifecycle rotten

k8s-ci-robot removed the lifecycle/rotten label on Apr 8, 2021
fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jul 7, 2021
hbagdi (Contributor) commented Jul 8, 2021

@jpeach @youngnick It seems that we have consensus here and need to add documentation. Since both of you have the most context, would either of you mind taking this up and driving it to completion?

youngnick (Contributor) commented:

I can take this one.

/assign

k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Aug 14, 2021
bowei (Contributor) commented Aug 16, 2021

/lifecycle frozen

k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/rotten label on Aug 16, 2021
hbagdi (Contributor) commented Aug 16, 2021

@robscott @youngnick While this is a clarification, can this be considered a borderline breaking change by any chance?
I'm asking to understand whether this is something to pull into v1a2 or not.

youngnick (Contributor) commented:

I agree that we should sort this out in v1a2, since it could be a breaking change. I'll add it to the milestone to make sure we don't miss it.

youngnick added this to the v1alpha2 milestone on Aug 18, 2021
robscott (Member) commented:

/assign @youngnick

youngnick (Contributor) commented:

Updating this with a bit more context.

I'm going to quote from Contour's docs here, since @jpeach wrote this very precisely:

The HTTP/2 specification allows user agents (browsers) to re-use TLS sessions to different hostnames as long as they share an IP address and a TLS server certificate (see RFC 7540). Sharing a TLS certificate typically uses a wildcard certificate, or a certificate containing multiple alternate names. If this kind of session reuse is not supported by the server, it sends a “421 Misdirected Request”, and the user agent may retry the request with a new TLS session.

If we don't enforce a very tight binding between the SNI hostname and the HTTP hostname, then it's possible to trivially circumvent any certificate-based authentication, at least in Envoy's implementation, like this:

  • Alice stands up secure.example.net, using a full TLS handshake, including required client certificates (yes, I know we don't have a way to specify that yet). This routes to a backend service via an HTTPRoute with the hostname secure.example.net.
  • Eve stands up gotcha.example.net, using a server-cert-only TLS handshake, but specifies an HTTPRoute with the hostname secure.example.net.

Depending on how the underlying implementation handles HTTPRoutes that route to the same hostname, Eve's gotcha service could end up with none, some, or all of the traffic bound for secure.example.net.

Now, there are legitimate uses for this kind of domain fronting, mainly around caching and content distribution, but those uses are pretty niche. In almost all cases, if you are specifying a terminated TLS session via an HTTPRoute, those hostnames should match (in the domain-name match sense we already have defined in the spec). So, by mandating this, we remove some weird security gotchas and make the API behave more like people would expect.

I think this comes under the heading of removing weird edge cases that most people won't want until someone actually asks for them.
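
For reference, a simplified Go sketch of an exact-or-wildcard hostname match of the kind referred to above (the wildcard semantics here are an illustrative assumption, with "*.example.com" covering any name with at least one extra left-most label and never the bare "example.com"; the spec text is authoritative on the exact rules):

```go
package hostmatch

import "strings"

// hostnameMatches compares a configured hostname (for example a Listener's)
// against a route or request hostname. Exact match, or a leading "*." label
// that covers one or more left-most labels, per the assumption stated above.
func hostnameMatches(pattern, hostname string) bool {
	pattern = strings.ToLower(pattern)
	hostname = strings.ToLower(hostname)
	if !strings.HasPrefix(pattern, "*.") {
		return pattern == hostname
	}
	suffix := pattern[1:] // e.g. ".example.com"
	return strings.HasSuffix(hostname, suffix) && len(hostname) > len(suffix)
}
```

An implementation could apply the same check both when binding an HTTPRoute to a Listener and again per request before forwarding, alongside the 421 check sketched earlier in the thread.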

youngnick (Contributor) commented:

After more consideration and rereading of #839, I think that we can probably close this out now, since we are mandating that the TLS hostname and the Listener hostname must match. Thanks for pointing this out, @hbagdi!
