
tls: add support for client-side session resumption. #4791

Merged
merged 18 commits into master on Nov 26, 2018

Conversation

PiotrSikora
Contributor

Risk Level: Low
Testing: bazel test //test/...
Docs Changes: Added
Release Notes: Added

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
@PiotrSikora
Contributor Author

cc @htuch @lizan @julia-stripe

if (max_session_keys_ > 0) {
  absl::WriterMutexLock l(&session_keys_mu_);
  if (!session_keys_.empty()) {
    SSL_SESSION* session = session_keys_.front().get();
Contributor Author

Note, similarly to @julia-stripe's PR, this assumes a single session store per cluster, not per endpoint.

Member

@PiotrSikora can you talk a bit about the rationale for that assumption?

Contributor Author

The assumption is wrong most of the time (i.e. we should save sessions per endpoint, not per cluster), which is one of the reasons why this was sitting in my local tree until now.

However, the per-cluster store works perfectly fine for (a) single-endpoint clusters, and (b) deployments using a shared cache and/or session tickets, where the same session can be resumed across all endpoints. So it's a stepping stone in the right direction, and a per-endpoint store can be added a bit later.

Contributor

@PiotrSikora can you clarify what the next stones are towards per-endpoint stores? I think @julia-stripe and I were under the impression there was a single ClientContextImpl per endpoint, rather than per cluster, which it sounds like isn't true. As you point out, this PR as it stands won't work well for multi-endpoint clusters without shared session ticket keys, which is our big use case.

Contributor Author

@bobby-stripe it's either:

  1. Storing sessions in the endpoint object, so that a session's lifetime is the same as the endpoint's.
  2. Adding the ability to configure a "session cache key", so that sessions stored in ClientContextImpl can be retrieved by a specific key (e.g. %UPSTREAM_CLUSTER% - which is effectively what we have right now, %UPSTREAM_HOST%, or %REQ(:AUTHORITY)%).

I'm leaning towards the second option, since it's much more flexible, but I haven't had time to work on this yet.
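For illustration, option (2) could look something like this (a purely hypothetical sketch — resolveSessionCacheKey and its parameters are invented for this example and are not part of this PR or any existing Envoy API):

#include <string>

// Hypothetical helper: resolve a configured "session cache key" format into
// the concrete key under which sessions would be stored and retrieved.
std::string resolveSessionCacheKey(const std::string& format,
                                   const std::string& cluster_name,
                                   const std::string& host_address,
                                   const std::string& authority) {
  if (format == "%UPSTREAM_CLUSTER%") {
    return cluster_name; // Effectively the current per-cluster behavior.
  }
  if (format == "%UPSTREAM_HOST%") {
    return host_address; // Would give a per-endpoint session store.
  }
  if (format == "%REQ(:AUTHORITY)%") {
    return authority; // Keyed by the requested authority.
  }
  return format; // Anything else is treated as a literal key.
}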

cc @julia-stripe @ggreenway

Member

@PiotrSikora how do you want to move forward with this? Merge this PR and then do 1/2 above? I'm guessing you get most of your wins by just doing (1) above.

Contributor Author

I'm leaning towards (2), since it's a much more flexible solution.

But yeah, let's definitely merge this PR as-is (feature-wise).

Member

Can you open up a ticket for the per-endpoint continuation to track?

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
// Maximum number of session keys (Pre-Shared Keys for TLSv1.3+, Session IDs and Session Tickets
// for TLSv1.2 and older) to store for the purpose of session resumption.
//
// Defaults to 1, setting this to 0 disables session resumption.
google.protobuf.UInt32Value max_session_keys = 4;
Contributor

Would it make sense (in the future) to have a corresponding server-side setting for how many session tickets a TLS 1.3 server will issue at a time? the RFC says:

Servers that issue tickets SHOULD offer at least as many tickets as the number of connections that a client might use; for example, a web browser using HTTP/1.1 [RFC7230] might open six connections to a server.

Contributor Author

Yes, but it's currently hardcoded to static const int kNumTickets = 2 in BoringSSL (see: https://boringssl.googlesource.com/boringssl/+/c0c9001440db8121bdc1ff1307b3a9aedf26fcd8/ssl/tls13_server.cc#165). cc @davidben

// Maximum number of session keys (Pre-Shared Keys for TLSv1.3+, Session IDs and Session Tickets
// for TLSv1.2 and older) to store for the purpose of session resumption.
//
// Defaults to 1, setting this to 0 disables session resumption.
google.protobuf.UInt32Value max_session_keys = 4;
Contributor Author

Q: Should this feature be disabled or enabled by default?

Member

Are there any security implications from enabling this by default? E.g. are we materially increasing the amount of code that might be subject to compromise in BoringSSL, etc.? I have zero clue on this, but my inclination, if there is a tradeoff, would be to sacrifice performance (i.e. the resumption) for an improved default security posture.

Contributor Author

There is a case of a "privacy leak" in TLS versions older than 1.3: a passive observer can correlate connections from the same user by looking at the session that's being resumed, which is sent unencrypted in the ClientHello. But that's mostly a threat to end-users (so also to a single-user client-side proxy) and not to middle/edge proxies and/or service mesh, so I don't think it justifies having this off by default in Envoy.

Note: TLSv1.3 sends single-use sessions, so the default of 1 is probably too small. Perhaps we could make it vary by default, i.e. 1 if tls_maximum_protocol_version is smaller than TLSv1.3, and 4(?) otherwise.

Member

I think the common case is service mesh and middle/edge proxies, so we should optimize for that. It's probably not great to be optimizing for TLS 1.3 quite yet; I don't think it's anywhere near universal.

Contributor Author

Sorry, I somehow missed this comment earlier.

What I meant regarding TLS v1.3 is basically:

if upstream_tls_context.tls_params.tls_maximum_protocol_version == TLSv1_3:
    max_session_keys = 4
else:
    max_session_keys = 1

But I'm fine leaving the default at 1 for the time being, and we can revisit it later, when we enable TLSv1.3 by default.

@stale

stale bot commented Oct 31, 2018

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Oct 31, 2018
…ent_session_reuse

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Nov 5, 2018
@PiotrSikora
Contributor Author

@lizan @ggreenway could you take a look? Thanks!

const std::string server_name_indication_;
const bool allow_renegotiation_;
const size_t max_session_keys_;
mutable absl::Mutex session_keys_mu_;
mutable std::deque<bssl::UniquePtr<SSL_SESSION>> session_keys_ GUARDED_BY(session_keys_mu_);
Member

Making this mutable doesn't look correct. Should we just make newSsl non-const? SslSocket holds a non-const shared pointer, so it should be OK.

Contributor Author

Fixed, thanks!

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
lizan
lizan previously approved these changes Nov 16, 2018
Member

@htuch htuch left a comment

I'll step in to do a final pass. Some of my comments will show my ignorance of how BoringSSL works, so if you want to respond to them with more verbose code comments, that would make life easier for the next neophyte who steps into this code :)

  if (!parsed_alpn_protocols_.empty()) {
    int rc = SSL_CTX_set_alpn_protos(ctx_.get(), &parsed_alpn_protocols_[0],
                                     parsed_alpn_protocols_.size());
    RELEASE_ASSERT(rc == 0, "");
  }

  if (max_session_keys_ > 0) {
    SSL_CTX_set_session_cache_mode(ctx_.get(), SSL_SESS_CACHE_CLIENT);
Member

Should we check or ASSERT errors for these BoringSSL calls?

Contributor Author

Not for those. SSL_CTX_set_session_cache_mode() cannot fail and returns the previously configured mode; SSL_CTX_sess_set_new_cb() cannot fail and returns void.
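For context, the wiring under discussion looks roughly like this (a sketch assuming ClientContextImpl registers itself as the SSL_CTX app data and exposes the newSessionKey() member shown later in this review; not the exact PR code):

// Enable client-side session caching; this call cannot fail and merely
// returns the previously configured mode.
SSL_CTX_set_session_cache_mode(ctx_.get(), SSL_SESS_CACHE_CLIENT);

// Register a callback invoked whenever a new session is established; this
// call cannot fail and returns void.
SSL_CTX_sess_set_new_cb(ctx_.get(), [](SSL* ssl, SSL_SESSION* session) -> int {
  // Recover the owning context; assumes it was stored earlier via
  // SSL_CTX_set_app_data().
  ClientContextImpl* context =
      static_cast<ClientContextImpl*>(SSL_CTX_get_app_data(SSL_get_SSL_CTX(ssl)));
  // Returning 1 tells BoringSSL that the callback took ownership of the session.
  return context->newSessionKey(session);
});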


if (max_session_keys_ > 0) {
  absl::WriterMutexLock l(&session_keys_mu_);
  if (!session_keys_.empty()) {
    SSL_SESSION* session = session_keys_.front().get();
Member

Can you add some comments here on why picking the front of the queue is the right thing to do? I.e. why not the third item in the Q?

Contributor Author

Done.
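A guess at the rationale behind the comment that was added (not necessarily the exact text that landed): the deque is kept in most-recently-used order, so the front is the freshest session.

// newSessionKey() pushes incoming sessions onto the front of the deque, so
// session_keys_.front() is always the most recently negotiated session --
// the one least likely to have expired or been rotated out on the server.
if (!session_keys_.empty()) {
  SSL_SESSION* session = session_keys_.front().get();
  SSL_set_session(ssl_con.get(), session);
}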

  return ssl_con;
}

int ClientContextImpl::newSessionKey(SSL_SESSION* session) {
Member

I think we are safe for the multi-cert work, given that client contexts will continue to have a single cert, but could you comment here on whether anything needs to change if we ever support multiple client certs in the future?

Contributor Author

I don't think that we'll ever support multiple client certificates that can affect sessions, since client certificates are not revalidated by the server during session resumption.

In any case, this would be covered by #5073.

@@ -510,9 +520,31 @@ bssl::UniquePtr<SSL> ClientContextImpl::newSsl() const {
    SSL_set_renegotiate_mode(ssl_con.get(), ssl_renegotiate_freely);
  }

  if (max_session_keys_ > 0) {
    absl::WriterMutexLock l(&session_keys_mu_);
Member

Sad that we need to take a writer mutex on a data path operation here. I assume that we're not that concerned because we expect connections to be relatively long lived. Is there a case for being able to take a reader mutex on the common path?

Contributor Author

Well, we don't really need to take it. The alternative is to use a per-worker lock-less session cache, but that would result in higher memory usage and a much lower hit rate, so I think using a shared cache is a good trade-off.

In theory, we only need write/write locks when we store single-use session keys (TLS 1.3), so I've added an optimization to use read/write locks for the other cases.

Thanks!
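A minimal sketch of that optimization, assuming the members quoted earlier (session_keys_mu_ plus the deque) and BoringSSL's SSL_SESSION_should_be_single_use(); illustrative rather than the exact PR code:

if (max_session_keys_ > 0) {
  {
    // Multi-use sessions (TLSv1.2 and older) can be shared across
    // connections, so the common path only needs a shared (reader) lock.
    absl::ReaderMutexLock l(&session_keys_mu_);
    if (!session_keys_.empty() &&
        !SSL_SESSION_should_be_single_use(session_keys_.front().get())) {
      SSL_set_session(ssl_con.get(), session_keys_.front().get());
      return ssl_con;
    }
  }
  // Single-use sessions (TLSv1.3) must be consumed, which mutates the deque
  // and therefore requires an exclusive (writer) lock. Re-check the state,
  // since it may have changed between dropping one lock and taking the other.
  absl::WriterMutexLock l(&session_keys_mu_);
  if (!session_keys_.empty() &&
      SSL_SESSION_should_be_single_use(session_keys_.front().get())) {
    SSL_set_session(ssl_con.get(), session_keys_.front().get());
    session_keys_.pop_front();
  }
}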

Member

Thanks for the explanation and for switching to a reader mutex. I think this is fine for scalability, since it only impacts initial connection latency, not per-request latency, and we'll eventually move to a shared connection pool model anyway.

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Member

@htuch htuch left a comment

LGTM, some final nits and we can ship, thanks.

true);
NiceMock<Network::MockListenerCallbacks> callbacks;
Network::MockConnectionHandler connection_handler;
DangerousDeprecatedTestTime test_time;
Member

Do we need this? Can we use the simulated time_system above?

Contributor Author

No, it just shows how old the code really is. Thanks!

client_connection->connect();

size_t connect_count = 0;
auto connectSecondTime = [&]() {
Member

Nit: slight preference for an explicit capture list here; I don't mind & in tests if it really makes them a lot less verbose (as opposed to regular code, where we should avoid wildcards), but here it doesn't help much.

Contributor Author

Done. I updated those functions, as well as other lambdas in this file, but the list is a bit ridiculous, to be honest...

client_connection->connect();

size_t connect_count = 0;
auto connectSecondTime = [&]() {
Member

Nit: this should be connect_second_time, as it's still a variable, albeit a lambda.

Contributor Author

It's also a function, and connectSecondTime() looks nicer than connect_second_time(), IMHO.

But I changed it anyway.

}
};

EXPECT_CALL(client_connection_callbacks, onEvent(Network::ConnectionEvent::Connected))
Member

Do you think we could InSequence these?

Contributor Author

Done, but because the ordering of events depends on the TLS protocol version and on whether or not the session was successfully resumed, this resulted in quite a lot of extra code.

See: 4aeefe4
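For readers unfamiliar with the gmock idiom, the request amounts to something like this (a simplified sketch; the actual expectations in 4aeefe4 branch on TLS version and resumption outcome, and server_connection_callbacks is assumed here to mirror the client-side mock):

// With InSequence in scope, these expectations must be satisfied in exactly
// this order; any out-of-order event fails the test.
testing::InSequence s;
EXPECT_CALL(client_connection_callbacks, onEvent(Network::ConnectionEvent::Connected));
EXPECT_CALL(server_connection_callbacks, onEvent(Network::ConnectionEvent::Connected));
EXPECT_CALL(client_connection_callbacks, onEvent(Network::ConnectionEvent::LocalClose));
EXPECT_CALL(server_connection_callbacks, onEvent(Network::ConnectionEvent::RemoteClose));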

testClientSessionResumption(server_ctx_yaml, client_ctx_yaml, true, GetParam());
}

TEST_P(SslSocketTest, ClientSessionResumptionEnabledTls13) {
Member

Nit: prefer a // one liner explaining in plain text what all these tests do.

Contributor Author

Done.

…ent_session_reuse

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
htuch
htuch previously approved these changes Nov 23, 2018
Member

@htuch htuch left a comment

LGTM, thanks!

auto stopSecondTime = [&]() {
  if (++counter == 2) {

size_t connect_count = 0;
auto connect_second_time = [&connect_count, &dispatcher, &server_connection, &client_connection,
Member

Actually, I agree it's a bit ridiculous to have explicit capture lists in tests when they get this long. My rule of thumb is: in production code, make the list explicit (to avoid unintended mistakes that can creep in); in test code, make it explicit when it's short, otherwise you can wildcard it. I had thought the example I pointed at would only be two items long, but I guess that's not the case across this file.
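Concretely, the rule of thumb amounts to something like this (illustrative snippets, not code from this PR):

#include <cstddef>

void captureListExample() {
  size_t connect_count = 0;
  bool closed = false; // Stand-in for the real connection object.

  // Short capture list: keep it explicit, even in test code.
  auto connect_second_time = [&connect_count, &closed]() {
    if (++connect_count == 2) {
      closed = true;
    }
  };
  connect_second_time();
  connect_second_time();

  // Once the explicit list would grow to half a dozen locals, a wildcard
  // capture is acceptable in tests.
  auto reset = [&]() {
    connect_count = 0;
    closed = false;
  };
  reset();
}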

Contributor Author

Should I revert it or leave it as-is, then? I'm fine with either.

Member

Yeah, if you could revert the ones that are actually ridiculous (and leave the shorter ones as-is), that would be great.

Contributor Author

Done.

@htuch
Member

htuch commented Nov 23, 2018

@PiotrSikora the coverage failure seems legitimate; not sure why this pure virtual method call is happening. Can you take a look?

@PiotrSikora
Contributor Author

It's not legitimate; if you look at the coverage results just before fix_format, it was fine: https://circleci.com/gh/envoyproxy/envoy/126542. I'll re-kick the CI.

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
@htuch
Member

htuch commented Nov 26, 2018

@PiotrSikora thanks for looking into the failure. I'll keep an eye out for these, being on Envoy maintainer duty this week.

Member

@htuch htuch left a comment

Thanks!

@htuch htuch merged commit 97fa885 into envoyproxy:master Nov 26, 2018
@mattklein123
Member

@PiotrSikora I think we missed the release note for this. Do you mind doing a follow up PR with a release note? Thank you.

fredlas pushed a commit to fredlas/envoy that referenced this pull request Mar 5, 2019
Risk Level: Low
Testing: bazel test //test/...

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>