
tls: add support for client-side session resumption. #4791

Merged
merged 18 commits into master on Nov 26, 2018

Conversation

PiotrSikora
Contributor

Risk Level: Low
Testing: bazel test //test/...
Docs Changes: Added
Release Notes: Added

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
@PiotrSikora
Contributor Author

cc @htuch @lizan @julia-stripe

if (max_session_keys_ > 0) {
  absl::WriterMutexLock l(&session_keys_mu_);
  if (!session_keys_.empty()) {
    SSL_SESSION* session = session_keys_.front().get();
Contributor Author

Note, similarly to @julia-stripe's PR, this assumes a single session store per cluster, not per endpoint.

Member

@PiotrSikora can you talk a bit about the rationale for that assumption?

Contributor Author

The assumption is wrong most of the time (i.e. we should save sessions per endpoint, not per cluster), which is one of the reasons why this was sitting in my local tree until now.

However, the per-cluster store works perfectly fine for (a) single-endpoint clusters, and (b) deployments using a shared cache and/or session tickets, where the same session can be resumed across all endpoints. So it's a stepping stone in the right direction, and a per-endpoint store can be added a bit later.

Contributor

@PiotrSikora can you clarify what the next stones are towards per-endpoint stores? I think @julia-stripe and I were under the impression there was a single ClientContextImpl per endpoint, rather than per cluster, which it sounds like isn't true. As you point out, this PR as it stands won't work well for multi-endpoint clusters without shared session ticket keys, which is our big use case.

Contributor Author

@bobby-stripe it's either:

  1. Storing sessions in the endpoint object, so that a session's lifetime is the same as the endpoint's.
  2. Adding the ability to configure a "session cache key", so that sessions stored in ClientContextImpl can be retrieved by a specific key (e.g. %UPSTREAM_CLUSTER% - which is effectively what we have right now, %UPSTREAM_HOST%, or %REQ(:AUTHORITY)%).

I'm leaning towards the second option, since it's much more flexible, but I haven't had time to work on this yet.
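For illustration, option (2) could look something like this (a purely hypothetical sketch — resolveSessionCacheKey and its parameters are invented for this example and are not part of this PR or any existing Envoy API):

#include <string>

// Hypothetical helper: resolve a configured "session cache key" format into
// the concrete key under which sessions would be stored and retrieved.
std::string resolveSessionCacheKey(const std::string& format,
                                   const std::string& cluster_name,
                                   const std::string& host_address,
                                   const std::string& authority) {
  if (format == "%UPSTREAM_CLUSTER%") {
    return cluster_name; // Effectively the current per-cluster behavior.
  }
  if (format == "%UPSTREAM_HOST%") {
    return host_address; // Would give a per-endpoint session store.
  }
  if (format == "%REQ(:AUTHORITY)%") {
    return authority; // Keyed by the requested authority.
  }
  return format; // Anything else is treated as a literal key.
}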

cc @julia-stripe @ggreenway

Member

@PiotrSikora how do you want to move forward with this? Merge this PR and then do 1/2 above? I'm guessing you get most of your wins by just doing (1) above.

Contributor Author

I'm leaning towards (2), since it's a much more flexible solution.

But yeah, let's definitely merge this PR as-is (feature-wise).

Member

Can you open up a ticket for the per-endpoint continuation to track?

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
// Maximum number of session keys (Pre-Shared Keys for TLSv1.3+, Session IDs and Session Tickets
// for TLSv1.2 and older) to store for the purpose of session resumption.
//
// Defaults to 1, setting this to 0 disables session resumption.
google.protobuf.UInt32Value max_session_keys = 4;
Contributor

Would it make sense (in the future) to have a corresponding server-side setting for how many session tickets a TLS 1.3 server will issue at a time? the RFC says:

Servers that issue tickets SHOULD offer at least as many tickets as the number of connections that a client might use; for example, a web browser using HTTP/1.1 [RFC7230] might open six connections to a server.

Contributor Author

Yes, but it's currently hardcoded to static const int kNumTickets = 2 in BoringSSL (see: https://boringssl.googlesource.com/boringssl/+/c0c9001440db8121bdc1ff1307b3a9aedf26fcd8/ssl/tls13_server.cc#165). cc @davidben

// Maximum number of session keys (Pre-Shared Keys for TLSv1.3+, Session IDs and Session Tickets
// for TLSv1.2 and older) to store for the purpose of session resumption.
//
// Defaults to 1, setting this to 0 disables session resumption.
google.protobuf.UInt32Value max_session_keys = 4;
Contributor Author

Q: Should this feature be disabled or enabled by default?

Member

Are there any security implications from enabling this by default? E.g. are we materially increasing the amount of code that might be subject to compromise in BoringSSL, etc.? I have zero clue on this, but my inclination, if there is a tradeoff, would be to sacrifice performance (i.e. the resumption) for an improved default security posture.

Contributor Author

There is a case of a "privacy leak" in TLS versions older than 1.3: a passive observer can correlate connections from the same user by looking at the session that's being resumed, which is sent unencrypted in the ClientHello. But that's mostly a threat to end-users (so also to a single-user client-side proxy) and not to middle/edge proxies and/or service mesh, so I don't think it justifies having this off by default in Envoy.

Note: TLSv1.3 sends single-use sessions, so the default of 1 is probably too small. Perhaps we could make it vary by default, i.e. 1 if tls_maximum_protocol_version is smaller than TLSv1.3, and 4(?) otherwise.

Member

I think the common case is service mesh and middle/edge proxies, so we should optimize for that. It's probably not great to be optimizing for TLS 1.3 quite yet; I don't think it's anywhere near universal.

Contributor Author

Sorry, I somehow missed this comment earlier.

What I meant regarding TLS v1.3 is basically:

if upstream_tls_context.tls_params.tls_maximum_protocol_version == TLSv1_3:
    max_session_keys = 4
else:
    max_session_keys = 1

But I'm fine leaving the default at 1 for the time being, and we can revisit it later, when we enable TLSv1.3 by default.

@stale

stale bot commented Oct 31, 2018

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Oct 31, 2018
…ent_session_reuse

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Nov 5, 2018
@PiotrSikora
Contributor Author

@lizan @ggreenway could you take a look? Thanks!

const std::string server_name_indication_;
const bool allow_renegotiation_;
const size_t max_session_keys_;
mutable absl::Mutex session_keys_mu_;
mutable std::deque<bssl::UniquePtr<SSL_SESSION>> session_keys_ GUARDED_BY(session_keys_mu_);
Member

Making this mutable doesn't look correct. Should we just make newSsl non-const? SslSocket holds a non-const shared pointer, so it should be OK.

Contributor Author

Fixed, thanks!

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
lizan
lizan previously approved these changes Nov 16, 2018
Member

@htuch htuch left a comment

I'll step in to do a final pass. Some of my comments will show my ignorance of how BoringSSL works, so if you want to respond to them with more verbose code comments, that would make life easier for the next neophyte who steps into this code :)

  if (!parsed_alpn_protocols_.empty()) {
    int rc = SSL_CTX_set_alpn_protos(ctx_.get(), &parsed_alpn_protocols_[0],
                                     parsed_alpn_protocols_.size());
    RELEASE_ASSERT(rc == 0, "");
  }

  if (max_session_keys_ > 0) {
    SSL_CTX_set_session_cache_mode(ctx_.get(), SSL_SESS_CACHE_CLIENT);
Member

Should we check or ASSERT errors for these BoringSSL calls?

Contributor Author

Not for those. SSL_CTX_set_session_cache_mode() cannot fail and returns the previously configured mode; SSL_CTX_sess_set_new_cb() cannot fail and returns void.
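For context, the wiring under discussion looks roughly like this (a sketch assuming ClientContextImpl registers itself as the SSL_CTX app data and exposes the newSessionKey() member shown later in this review; not the exact PR code):

// Enable client-side session caching; this call cannot fail and merely
// returns the previously configured mode.
SSL_CTX_set_session_cache_mode(ctx_.get(), SSL_SESS_CACHE_CLIENT);

// Register a callback invoked whenever a new session is established; this
// call cannot fail and returns void.
SSL_CTX_sess_set_new_cb(ctx_.get(), [](SSL* ssl, SSL_SESSION* session) -> int {
  // Recover the owning context; assumes it was stored earlier via
  // SSL_CTX_set_app_data().
  ClientContextImpl* context =
      static_cast<ClientContextImpl*>(SSL_CTX_get_app_data(SSL_get_SSL_CTX(ssl)));
  // Returning 1 tells BoringSSL that the callback took ownership of the session.
  return context->newSessionKey(session);
});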


if (max_session_keys_ > 0) {
  absl::WriterMutexLock l(&session_keys_mu_);
  if (!session_keys_.empty()) {
    SSL_SESSION* session = session_keys_.front().get();
Member

Can you add some comments here on why picking the front of the queue is the right thing to do? I.e. why not the third item in the Q?

Contributor Author

Done.
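A guess at the rationale behind the comment that was added (not necessarily the exact text that landed): the deque is kept in most-recently-used order, so the front is the freshest session.

// newSessionKey() pushes incoming sessions onto the front of the deque, so
// session_keys_.front() is always the most recently negotiated session --
// the one least likely to have expired or been rotated out on the server.
if (!session_keys_.empty()) {
  SSL_SESSION* session = session_keys_.front().get();
  SSL_set_session(ssl_con.get(), session);
}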

  return ssl_con;
}

int ClientContextImpl::newSessionKey(SSL_SESSION* session) {
Member

I think we are safe for the multi-cert work, given that client contexts will continue to have a single cert, but could you comment here on whether anything needs to change if we ever support multiple client certs in the future?

Contributor Author

I don't think that we'll ever support multiple client certificates that can affect sessions, since client certificates are not revalidated by the server during session resumption.

In any case, this would be covered by #5073.

@@ -510,9 +520,31 @@ bssl::UniquePtr<SSL> ClientContextImpl::newSsl() const {
    SSL_set_renegotiate_mode(ssl_con.get(), ssl_renegotiate_freely);
  }

  if (max_session_keys_ > 0) {
    absl::WriterMutexLock l(&session_keys_mu_);
Member

Sad that we need to take a writer mutex on a data path operation here. I assume that we're not that concerned because we expect connections to be relatively long lived. Is there a case for being able to take a reader mutex on the common path?

Contributor Author

Well, we don't really need to take it. The alternative is to use a per-worker lock-less session cache, but that would result in higher memory usage and a much lower hit rate, so I think using a shared cache is a good trade-off.

In theory, we only need write/write locks when we store single-use session keys (TLS 1.3), so I've added an optimization to use read/write locks for the other cases.

Thanks!
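A minimal sketch of that optimization, assuming the members quoted earlier (session_keys_mu_ plus the deque) and BoringSSL's SSL_SESSION_should_be_single_use(); illustrative rather than the exact PR code:

if (max_session_keys_ > 0) {
  {
    // Multi-use sessions (TLSv1.2 and older) can be shared across
    // connections, so the common path only needs a shared (reader) lock.
    absl::ReaderMutexLock l(&session_keys_mu_);
    if (!session_keys_.empty() &&
        !SSL_SESSION_should_be_single_use(session_keys_.front().get())) {
      SSL_set_session(ssl_con.get(), session_keys_.front().get());
      return ssl_con;
    }
  }
  // Single-use sessions (TLSv1.3) must be consumed, which mutates the deque
  // and therefore requires an exclusive (writer) lock. Re-check the state,
  // since it may have changed between dropping one lock and taking the other.
  absl::WriterMutexLock l(&session_keys_mu_);
  if (!session_keys_.empty() &&
      SSL_SESSION_should_be_single_use(session_keys_.front().get())) {
    SSL_set_session(ssl_con.get(), session_keys_.front().get());
    session_keys_.pop_front();
  }
}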

Member

Thanks for the explanation and for switching to a reader mutex. I think this is fine for scalability, since it only impacts initial connection latency, not per-request latency, and we'll eventually move to a shared connection pool model anyway.

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Member

@htuch htuch left a comment

LGTM, some final nits and we can ship, thanks.

true);
NiceMock<Network::MockListenerCallbacks> callbacks;
Network::MockConnectionHandler connection_handler;
DangerousDeprecatedTestTime test_time;
Member

Do we need this? Can we use the simulated time_system above?

Contributor Author

No, it just shows how old the code really is. Thanks!

client_connection->connect();

size_t connect_count = 0;
auto connectSecondTime = [&]() {
Member

Nit: slight preference for an explicit capture list here; I don't mind & in tests if it really makes them a lot less verbose (as opposed to regular code, where we should avoid wildcards), but here it doesn't help much.

Contributor Author

Done. I updated those functions, as well as other lambdas in this file, but the list is a bit ridiculous, to be honest...

client_connection->connect();

size_t connect_count = 0;
auto connectSecondTime = [&]() {
Member

Nit: this should be connect_second_time, as it's still a variable, albeit a lambda.

Contributor Author

It's also a function, and connectSecondTime() looks nicer than connect_second_time(), IMHO.

But I changed it anyway.

}
};

EXPECT_CALL(client_connection_callbacks, onEvent(Network::ConnectionEvent::Connected))
Member

Do you think we could InSequence these?

Contributor Author

Done, but because the ordering of events depends on the TLS protocol version and on whether or not the session was successfully resumed, this resulted in quite a lot of extra code.

See: 4aeefe4
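For readers unfamiliar with the gmock idiom, the request amounts to something like this (a simplified sketch; the actual expectations in 4aeefe4 branch on TLS version and resumption outcome, and server_connection_callbacks is assumed here to mirror the client-side mock):

// With InSequence in scope, these expectations must be satisfied in exactly
// this order; any out-of-order event fails the test.
testing::InSequence s;
EXPECT_CALL(client_connection_callbacks, onEvent(Network::ConnectionEvent::Connected));
EXPECT_CALL(server_connection_callbacks, onEvent(Network::ConnectionEvent::Connected));
EXPECT_CALL(client_connection_callbacks, onEvent(Network::ConnectionEvent::LocalClose));
EXPECT_CALL(server_connection_callbacks, onEvent(Network::ConnectionEvent::RemoteClose));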

testClientSessionResumption(server_ctx_yaml, client_ctx_yaml, true, GetParam());
}

TEST_P(SslSocketTest, ClientSessionResumptionEnabledTls13) {
Member

Nit: prefer a // one liner explaining in plain text what all these tests do.

Contributor Author

Done.

…ent_session_reuse

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
htuch
htuch previously approved these changes Nov 23, 2018
Member

@htuch htuch left a comment

LGTM, thanks!

auto stopSecondTime = [&]() {
  if (++counter == 2) {

size_t connect_count = 0;
auto connect_second_time = [&connect_count, &dispatcher, &server_connection, &client_connection,
Member

Actually, I agree it's a bit ridiculous to have explicit capture lists in tests when they get this long. My rule of thumb is: in production code, make the list explicit (to avoid unintended mistakes that can creep in); in test code, make it explicit when it's short, otherwise you can wildcard it. I had thought the example I pointed at would only be two items long, but I guess that's not the case across this file.
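Concretely, the rule of thumb amounts to something like this (illustrative snippets, not code from this PR):

#include <cstddef>

void captureListExample() {
  size_t connect_count = 0;
  bool closed = false; // Stand-in for the real connection object.

  // Short capture list: keep it explicit, even in test code.
  auto connect_second_time = [&connect_count, &closed]() {
    if (++connect_count == 2) {
      closed = true;
    }
  };
  connect_second_time();
  connect_second_time();

  // Once the explicit list would grow to half a dozen locals, a wildcard
  // capture is acceptable in tests.
  auto reset = [&]() {
    connect_count = 0;
    closed = false;
  };
  reset();
}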

Contributor Author

Should I revert it or leave it as-is, then? I'm fine with either.

Member

Yeah, if you could revert the ones that are actually ridiculous (and leave the shorter ones as-is), that would be great.

Contributor Author

Done.

@htuch
Member

htuch commented Nov 23, 2018

@PiotrSikora the coverage failure seems legitimate; not sure why this pure virtual method call is happening. Can you take a look?

@PiotrSikora
Contributor Author

It's not legitimate; if you look at the coverage results just before fix_format, it was fine: https://circleci.com/gh/envoyproxy/envoy/126542. I'll re-kick the CI.

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
@htuch
Member

htuch commented Nov 26, 2018

@PiotrSikora thanks for looking into the failure. I'll keep an eye out for these, being on Envoy maintainer duty this week.

Member

@htuch htuch left a comment

Thanks!

@htuch htuch merged commit 97fa885 into envoyproxy:master Nov 26, 2018
@mattklein123
Member

@PiotrSikora I think we missed the release note for this. Do you mind doing a follow up PR with a release note? Thank you.

fredlas pushed a commit to fredlas/envoy that referenced this pull request Mar 5, 2019
Risk Level: Low
Testing: bazel test //test/...

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>