Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tls resume PoC (server stateless) #57079

Merged
merged 15 commits into from
Sep 3, 2021
Merged

Tls resume PoC (server stateless) #57079

merged 15 commits into from
Sep 3, 2021

Conversation

wfurt
Copy link
Member

@wfurt wfurt commented Aug 9, 2021

contributes to #22977
fixes #49845

This still needs some more testing and cleanup but it would be great to get some early feedback @bartonjs and @stephentoub

With OpenSSL the general strategy is to create SSL_CTX and set some parameters. Then that is used to create individual SSL sessions. That does not really fit into .NET API surface where one create SslStream and then AuthenticateAs* is called with varying parameters. For that reason (most likely) we always create new SSL_CTX for each stream. Aside from doing more work, this prevents TLS resume to work as the session cache and whatever is needed is linked to SSL_CTX.

This PR is trying to reuse SSL_CTX in some cases when it seems to be safe to do so.
Here is impact on connectionclosehttps benchmark from ASP.NET on Linux.
RPS jumps ~ 21x.

crank compare main.json my-release.json

| load                   | main   | my-release |            |
| ---------------------- | ------ | ---------- | ---------- |
| CPU Usage (%)          |     99 |         96 |     -3.03% |
| Cores usage (%)        |  1,183 |      1,149 |     -2.87% |
| Working Set (MB)       |     84 |         49 |    -41.67% |
| Private Memory (MB)    |    435 |        357 |    -17.93% |
| Start Time (ms)        |      0 |          0 |            |
| First Request (ms)     |    246 |        273 |    +10.98% |
| Requests/sec           |    774 |     16,824 | +2,073.92% |
| Requests               | 11,685 |    253,760 | +2,071.67% |
| Mean latency (ms)      |   4.06 |       0.74 |    -81.82% |
| Max latency (ms)       |  31.46 |      36.03 |    +14.53% |
| Bad responses          |      0 |          0 |            |
| Socket errors          |      0 |          0 |            |
| Read throughput (MB/s) |   0.11 |       2.42 | +2,071.47% |
| Latency 50th (ms)      |   2.19 |       0.43 |    -80.27% |
| Latency 75th (ms)      |   6.79 |       0.57 |    -91.61% |
| Latency 90th (ms)      |  10.65 |       0.86 |    -91.92% |
| Latency 99th (ms)      |  16.59 |       9.32 |    -43.82% |

Windows runs always timed out for me. With #49845 claiming 15K RPS we should be on par with 16K+ unless there was another jump with 6.0. (I'll try to follow up on this with @sebastienros

At this point .NET client still does not work and for that reason performance microbenchmarks are not available.
The reason what ASP.NET shows improvement is because it is using different tool. (wrk)

Since it is getting late I'm wondering if we should add AptContext switch and ENV variable to be able to opt-out in case session reuse create problem for somebody. (not done in this PR)

TL;DR

This is somewhat arbitrary split but I decided to loosely follow Credentials lookup and use certificate, enabled TLS protocols and EncryptionPolicy as unique combination.

Since EncryptionPolicy.NoEncryption and EncryptionPolicy.AllowNoEncryption are somewhat lame and obscure I decided to avoid them for purpose of resume and I use protocols as key to cache attached to SslCertificateContext e.g. unique server certificate. When SSL_CTX exist for given combination we would use it instead of creating new one when creating SafeSslHandle.

With that, many operation we did on SSL_CTX needs to move to SSL itself. Luckily OpenSSL have basically everyone in pars e.g. each operation can be done on either one. There is lot of turn in PAL code but that should mostly be 1:1 change. I also try to use consistently naming CryptoNative_SslCtx* for operations working on SSL_CTX to make it more obvious.

The notable exception is SSL_CTX_set_alpn_select_cb that can be set only on SSL_CTX and there is no matching SSL_set_alpn_select_cb. Since the logic is common I use SslSetData and SslGetData to attache the specific ALPN list to SSL itself and I grab it again in AlpnServerSelectCallback.
That seems to work quite well.
The resining problem is that once the callback is sett it MUST choose ALPN and it cannot wet run NULL.
That creates problem for cases when ALPN is not requested. I don't know how common that is but for now I exclude that from cache lookup. That can be easily fixed if needed either be factoring ALPN to cache lookup or simply by creating parallel cache. (would cost extra dictionary per server/SslCertificateContext)
maybe @Tratcher has some idea how important that is.

I was also wondering if we need any other locking or grabbing extra reference on the SafeSslContextHandle.
Since we are new freeing it immediately it feels like we don't as life cycle will be linked to the SslCertificateContext.
That is not IDisposable so it will live as long as there is reference to it.

@wfurt wfurt added area-System.Net.Security os-linux Linux OS (any supported distro) labels Aug 9, 2021
@wfurt wfurt self-assigned this Aug 9, 2021
@ghost
Copy link

ghost commented Aug 9, 2021

Tagging subscribers to this area: @dotnet/ncl, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details

contributes to #22977
fixes #49845

This still needs some more testing and cleanup but it would be great to get some early feedback @bartonjs and @stephentoub

With OpenSSL the general strategy is to create SSL_CTX and set some parameters. Then that is used to create individual SSL sessions. That does not really fit into .NET API surface where one create SslStream and then AuthenticateAs* is called with varying parameters. For that reason (most likely) we always create new SSL_CTX for each stream. Aside from doing more work, this prevents TLS resume to work as the session cache and whatever is needed is linked to SSL_CTX.

This PR is trying to reuse SSL_CTX in some cases when it seems to be safe to do so.
Here is impact on connectionclosehttps benchmark from ASP.NET on Linux.
RPS jumps ~ 21x.

crank compare main.json my-release.json

| load                   | main   | my-release |            |
| ---------------------- | ------ | ---------- | ---------- |
| CPU Usage (%)          |     99 |         96 |     -3.03% |
| Cores usage (%)        |  1,183 |      1,149 |     -2.87% |
| Working Set (MB)       |     84 |         49 |    -41.67% |
| Private Memory (MB)    |    435 |        357 |    -17.93% |
| Start Time (ms)        |      0 |          0 |            |
| First Request (ms)     |    246 |        273 |    +10.98% |
| Requests/sec           |    774 |     16,824 | +2,073.92% |
| Requests               | 11,685 |    253,760 | +2,071.67% |
| Mean latency (ms)      |   4.06 |       0.74 |    -81.82% |
| Max latency (ms)       |  31.46 |      36.03 |    +14.53% |
| Bad responses          |      0 |          0 |            |
| Socket errors          |      0 |          0 |            |
| Read throughput (MB/s) |   0.11 |       2.42 | +2,071.47% |
| Latency 50th (ms)      |   2.19 |       0.43 |    -80.27% |
| Latency 75th (ms)      |   6.79 |       0.57 |    -91.61% |
| Latency 90th (ms)      |  10.65 |       0.86 |    -91.92% |
| Latency 99th (ms)      |  16.59 |       9.32 |    -43.82% |

Windows runs always timed out for me. With #49845 claiming 15K RPS we should be on par with 16K+ unless there was another jump with 6.0. (I'll try to follow up on this with @sebastienros

At this point .NET client still does not work and for that reason performance microbenchmarks are not available.
The reason what ASP.NET shows improvement is because it is using different tool. (wrk)

Since it is getting late I'm wondering if we should add AptContext switch and ENV variable to be able to opt-out in case session reuse create problem for somebody. (not done in this PR)

TL;DR

This is somewhat arbitrary split but I decided to loosely follow Credentials lookup and use certificate, enabled TLS protocols and EncryptionPolicy as unique combination.

Since EncryptionPolicy.NoEncryption and EncryptionPolicy.AllowNoEncryption are somewhat lame and obscure I decided to avoid them for purpose of resume and I use protocols as key to cache attached to SslCertificateContext e.g. unique server certificate. When SSL_CTX exist for given combination we would use it instead of creating new one when creating SafeSslHandle.

With that, many operation we did on SSL_CTX needs to move to SSL itself. Luckily OpenSSL have basically everyone in pars e.g. each operation can be done on either one. There is lot of turn in PAL code but that should mostly be 1:1 change. I also try to use consistently naming CryptoNative_SslCtx* for operations working on SSL_CTX to make it more obvious.

The notable exception is SSL_CTX_set_alpn_select_cb that can be set only on SSL_CTX and there is no matching SSL_set_alpn_select_cb. Since the logic is common I use SslSetData and SslGetData to attache the specific ALPN list to SSL itself and I grab it again in AlpnServerSelectCallback.
That seems to work quite well.
The resining problem is that once the callback is sett it MUST choose ALPN and it cannot wet run NULL.
That creates problem for cases when ALPN is not requested. I don't know how common that is but for now I exclude that from cache lookup. That can be easily fixed if needed either be factoring ALPN to cache lookup or simply by creating parallel cache. (would cost extra dictionary per server/SslCertificateContext)
maybe @Tratcher has some idea how important that is.

I was also wondering if we need any other locking or grabbing extra reference on the SafeSslContextHandle.
Since we are new freeing it immediately it feels like we don't as life cycle will be linked to the SslCertificateContext.
That is not IDisposable so it will live as long as there is reference to it.

Author: wfurt
Assignees: wfurt
Labels:

area-System.Net.Security, os-linux

Milestone: -

@wfurt wfurt closed this Aug 10, 2021
@wfurt wfurt reopened this Aug 10, 2021
@wfurt
Copy link
Member Author

wfurt commented Aug 10, 2021

/azp run runtime

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@wfurt wfurt marked this pull request as ready for review August 16, 2021 04:32
@karelz
Copy link
Member

karelz commented Aug 17, 2021

Browser runs are hitting #57501, unrelated to this PR.
runtime-staging failures are also unrelated to the PR.

@karelz
Copy link
Member

karelz commented Aug 17, 2021

@stephentoub if we get your approval for code-review, we can merge (CI is OK).

@karelz karelz added this to the 7.0.0 milestone Aug 17, 2021
@wfurt
Copy link
Member Author

wfurt commented Aug 27, 2021

/azp run runtime-libraries stress-ssl

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@wfurt
Copy link
Member Author

wfurt commented Aug 27, 2021

This should be ready for any final pass @stephentoub or anybody from @dotnet/ncl
CI is clean.
I addressed all @bartonjs feedback - either by changing code or providing explanation.

  • this batch has some bigger naming cleanup that makes the diff bigger but hopefully easier to understand the code
  • I re-factored some of the CipherPolicy/protocols so separate helper function (keeping actual logic same)
  • filed ALPN processing on Linux can be improved #58299 to track further ALPN improvements

I was planning to run another round of perf tests just to be sure but I have some difficulties - most likely because main moved to 7.0.

@@ -513,6 +519,33 @@ int32_t CryptoNative_SetCiphers(SSL_CTX* ctx, const char* cipherList, const char
return ret;
}

int32_t CryptoNative_SetCiphers(SSL* ssl, const char* cipherList, const char* cipherSuites)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth asserting (cipherList != NULL) ^ (cipherSuites != NULL) ?

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@halter73
Copy link
Member

halter73 commented Sep 3, 2021

That creates problem for cases when ALPN is not requested.

Does this only create a problem when SslServerAuthenticationOptions.ApplicationProtocols is unset? Or is it a problem if the client does not request ALPN? I take it from wrk showing an improvement that ALPN only needs to be requested by the server. If so, that's not problem for Kestrel since we always set this list even if only for "http/1.1".

@wfurt
Copy link
Member Author

wfurt commented Sep 3, 2021

The problem I bump into as with SslServerAuthenticationOptions.ApplicationProtocols unset @halter73. The current behavior is that we don't set the callback and OpenSSL basically ignores ALPN and you get null on both sides. When the callback is set, OpenSSL expects that the callback picks something or the handshake will fail. I did not find a way how to not choose and preserve old behavior.

@wfurt wfurt merged commit 6262ae8 into dotnet:main Sep 3, 2021
@wfurt wfurt deleted the tlsResumeLinux branch September 3, 2021 22:03
@stephentoub
Copy link
Member

@wfurt, this closed #49845. Don't we still need to track the client side as well?

@wfurt
Copy link
Member Author

wfurt commented Sep 3, 2021

#49845 was specifically about Kestrel regression and this change should be sufficient IMHO as the test does not use .NET client. I left #22977 open to track the remaining work e.g. client support.

@sebastienros
Copy link
Member

Is it getting backported to 6.0 ?

@karelz
Copy link
Member

karelz commented Sep 7, 2021

@stephentoub you originally wanted to consider it for 6.0 backport. Do we have enough wins here (e.g. in TechEmpower benchmark) to justify it?

@stephentoub
Copy link
Member

While it'd be great to see this in 6.0, at this point I'm skeptical there's enough runway to properly validate the change with low enough risk. I think we should probably just get it in for 7.0. @danmoseley, @geoffkizer, dissent?

@karelz
Copy link
Member

karelz commented Sep 7, 2021

@davidfowl might want to chime in here as well ;)
BTW: @geoffkizer is OOF this week

@davidfowl
Copy link
Member

Maybe good to follow up with the customers that pushed for this change.

@karelz
Copy link
Member

karelz commented Sep 8, 2021

@davidfowl I can't find any customer pushing for the change.

@DavidKarlas
Copy link
Contributor

DavidKarlas commented Sep 12, 2021

Hi @wfurt as part of integration tests for https://github.com/dotnet/templating we have arcade bot bumping runtime for us so our main keeps running on main of runtime to catch issues asap, it seems like this PR started causing our CI to fail(no worries its just in PR that bumps runtime so no urgency).

Error is: Cannot get required symbol SSL_set_alpn_protos from libssl
PR in question: dotnet/templating#3865

Is this runtime bug or our CI needs to update libssl or something else?

@wfurt
Copy link
Member Author

wfurt commented Sep 12, 2021

It seems like you use Ubuntu 14.04 with OpenSSL 1.0.1? They are long out of support IMHO. (even Ubuntu 16 is getting pretty old) Is there chance you can bump OpenSSL to at least 1.0.2 or update the base container @DavidKarlas ?

@ghost ghost locked as resolved and limited conversation to collaborators Oct 12, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net.Security os-linux Linux OS (any supported distro)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTTPS handshake performance
9 participants