Speed up contended HTTP/2 frame writing #40407

Merged: halter73 merged 4 commits into main from halter73/30235 on Feb 28, 2022

Conversation

@halter73 (Member) commented Feb 25, 2022

This PR changes Http2Connection/Http2FrameWriter so that it dispatches SslStream.WriteAsync to the thread pool and releases the Http2FrameWriter._writeLock immediately. This improves our "gRPC h2 70x1" scenario (70 streams over one HTTP/2 connection with TLS) by 500% as measured by RPS.

The following YARP profile, focused on Http2OutputProducer.WriteDataToPipeAsync, shows how much time is currently spent spinning on the _writeLock and waiting on TLS operations when many streams write to a single HTTP/2 connection.

[Screenshot: CPU profile of Http2OutputProducer.WriteDataToPipeAsync, captured 2022-02-24]

This partially addresses #30235. It doesn't go as far as using Channels as suggested in the issue, so HPACK encoding and frame writing/copying still happen while holding the _writeLock. However, the more expensive TLS operations do get dispatched. This allows contending streams to more quickly write their own frames to an output buffer or await without spinning on the lock as frequently.

This does introduce an extra copy similar to what we already have for HTTP/2 input, but the benchmark results clearly show this is worthwhile in order to offload the TLS work to a thread that doesn't block other HTTP/2 streams. We could avoid this copy by updating ConcurrentPipeWriter to dispatch calls to FlushAsync and WriteAsync. I didn't do that for this initial iteration because we'd want to use a pooled IValueTaskSource to support this. We'd also want to make ConcurrentPipeWriter aware of the MaxResponseBufferSize so it wouldn't always return incomplete ValueTasks for dispatched writes.
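To make the pattern concrete, here is a minimal sketch of the approach described above, assuming an intermediate Pipe between the frame writer and the connection stream. The class and member names are illustrative only and do not match the real Http2FrameWriter:

```csharp
using System;
using System.Buffers;
using System.IO;
using System.IO.Pipelines;
using System.Threading.Tasks;

// Illustrative sketch only; not the actual Http2FrameWriter implementation.
public sealed class FrameWriterSketch
{
    private readonly object _writeLock = new();
    private readonly Pipe _outputPipe = new(new PipeOptions(useSynchronizationContext: false));
    private readonly Task _copyTask;

    public FrameWriterSketch(Stream connectionStream)
    {
        // Drains the intermediate pipe on a thread-pool thread, so a slow TLS write
        // (e.g. SslStream.WriteAsync blocked on TCP backpressure) no longer runs
        // while the write lock is held.
        _copyTask = Task.Run(() => _outputPipe.Reader.CopyToAsync(connectionStream));
    }

    public void WriteFrame(ReadOnlySpan<byte> encodedFrame)
    {
        lock (_writeLock)
        {
            // Only the in-memory copy (the "extra copy" mentioned above) happens
            // under the lock, so contending streams are released quickly.
            _outputPipe.Writer.Write(encodedFrame);

            // Fire-and-forget for brevity; the real writer would observe the flush
            // result to apply backpressure rather than buffering without bound.
            _ = _outputPipe.Writer.FlushAsync();
        }
    }

    public async Task CompleteAsync()
    {
        await _outputPipe.Writer.CompleteAsync();
        await _copyTask;
    }
}
```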

> crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/grpc.benchmarks.yml --scenario grpcaspnetcoreserver-grpcnetclient --profile aspnet-citrine-lin --variable protocol=h2 --variable connections=1 --variable streams=70
| application | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 13 | 48 | +269.23% |
| Cores usage (%) | 370 | 1,355 | +266.22% |
| Working Set (MB) | 189 | 411 | +117.46% |
| Private Memory (MB) | 1,195 | 1,418 | +18.66% |
| Build Time (ms) | 4,624 | 4,432 | -4.15% |
| Start Time (ms) | 309 | 347 | +12.30% |
| Published Size (KB) | 91,293 | 91,293 | 0.00% |
| .NET Core SDK Version | 7.0.100-preview.3.22123.26 | 7.0.100-preview.3.22123.26 | |

| load | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 28 | 90 | +221.43% |
| Cores usage (%) | 781 | 2,512 | +221.64% |
| Working Set (MB) | 386 | 414 | +7.25% |
| Private Memory (MB) | 1,397 | 1,422 | +1.79% |
| Build Time (ms) | 4,660 | 4,573 | -1.87% |
| Start Time (ms) | 183 | 186 | +1.64% |
| Published Size (KB) | 80,367 | 80,367 | 0.00% |
| .NET Core SDK Version | 6.0.200 | 6.0.200 | |
| Max RPS | 14,485 | 86,983 | +500.49% |
| Requests | 72,413 | 434,308 | +499.77% |
| Bad responses | 0 | 0 | |
| Mean latency (ms) | 4.83 | 0.80 | -83.36% |
| Max latency (ms) | 37.28 | 8.31 | -77.71% |

Even when the client can open multiple connections to reduce contention, this improves performance.

> crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/grpc.benchmarks.yml --scenario grpcaspnetcoreserver-grpcnetclient --profile aspnet-citrine-lin --variable protocol=h2 --variable streams=70
| application | baseline100 | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 66 | 84 | +27.27% |
| Cores usage (%) | 1,856 | 2,361 | +27.21% |
| Working Set (MB) | 1,610 | 1,843 | +14.47% |
| Private Memory (MB) | 2,658 | 3,184 | +19.79% |
| Build Time (ms) | 4,624 | 4,933 | +6.68% |
| Start Time (ms) | 303 | 336 | +10.89% |
| Published Size (KB) | 91,293 | 91,293 | 0.00% |
| .NET Core SDK Version | 7.0.100-preview.3.22123.26 | 7.0.100-preview.3.22123.26 | |

| load | baseline100 | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 88 | 97 | +10.23% |
| Cores usage (%) | 2,459 | 2,715 | +10.41% |
| Working Set (MB) | 3,470 | 4,126 | +18.90% |
| Private Memory (MB) | 4,551 | 5,130 | +12.72% |
| Build Time (ms) | 4,287 | 4,658 | +8.65% |
| Start Time (ms) | 181 | 177 | -2.21% |
| Published Size (KB) | 80,367 | 80,367 | 0.00% |
| .NET Core SDK Version | 6.0.200 | 6.0.200 | |
| Max RPS | 339,472 | 426,355 | +25.59% |
| Requests | 1,714,671 | 2,158,208 | +25.87% |
| Bad responses | 0 | 0 | |
| Mean latency (ms) | 20.52 | 16.35 | -20.32% |
| Max latency (ms) | 91.55 | 186.20 | +103.38% |

I also verified the non-TLS "h2c" performance does not regress.

@JamesNK (Member) commented Feb 25, 2022

Nice! Doing stuff in that write lock on a busy connection really is a killer 😮

How does the TLS 70x1 gRPC benchmark compare now with the non-TLS 70x1? And HTTP/3?
I'm guessing connection memory use will be higher with the extra pipe.
If we moved to using channels and eliminated the write lock altogether, would we keep the output pipe copy?

@@ -83,8 +85,34 @@ public Http2Connection(HttpConnectionContext context)
// Capture the ExecutionContext before dispatching HTTP/2 middleware. Will be restored by streams when processing request
_context.InitialExecutionContext = ExecutionContext.Capture();

var inputPipeOptions = new PipeOptions(pool: context.MemoryPool,

Member commented:

Does this introduce a new pipe... That's unfortunate.

halter73 (Member, Author) replied:

It does introduce a new pipe. This part of the PR description describes how we could avoid that.

> This does introduce an extra copy similar to what we already have for HTTP/2 input, but the benchmark results clearly show this is worthwhile in order to offload the TLS work to a thread that doesn't block other HTTP/2 streams. We could avoid this copy by updating ConcurrentPipeWriter to dispatch calls to FlushAsync and WriteAsync. I didn't do that for this initial iteration because we'd want to use a pooled IValueTaskSource to support this. We'd also want to make ConcurrentPipeWriter aware of the MaxResponseBufferSize so it wouldn't always return incomplete ValueTasks for dispatched writes.

I wanted to keep this simple for now, though, so we can possibly service it. Long term we can do the custom PipeWriter, or better yet get rid of the _writeLock altogether using Channels.

Member replied:

I don't feel like a lot of time should be invested in what should hopefully be replaced by the best solution (channels).

@halter73 (Member, Author) commented:

> How does the TLS 70x1 gRPC benchmark compare now with the non-TLS 70x1? And HTTP/3?

h3 70x1

> crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/grpc.benchmarks.yml --scenario grpcaspnetcoreserver-grpcnetclient --profile aspnet-citrine-lin --variable protocol=h3 --variable connections=1 --variable streams=70
| application | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 13 | 25 | +92.31% |
| Cores usage (%) | 370 | 696 | +88.11% |
| Working Set (MB) | 189 | 374 | +97.88% |
| Private Memory (MB) | 1,195 | 1,446 | +21.00% |
| Build Time (ms) | 4,624 | 4,636 | +0.26% |
| Start Time (ms) | 309 | 363 | +17.48% |
| Published Size (KB) | 91,293 | 91,293 | 0.00% |
| .NET Core SDK Version | 7.0.100-preview.3.22123.26 | 7.0.100-preview.3.22123.26 | |

| load | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 28 | 17 | -39.29% |
| Cores usage (%) | 781 | 476 | -39.05% |
| Working Set (MB) | 386 | 394 | +2.07% |
| Private Memory (MB) | 1,397 | 1,410 | +0.93% |
| Build Time (ms) | 4,660 | 4,132 | -11.33% |
| Start Time (ms) | 183 | 184 | +0.55% |
| Published Size (KB) | 80,367 | 80,367 | 0.00% |
| .NET Core SDK Version | 6.0.200 | 6.0.200 | |
| Max RPS | 14,485 | 13,045 | -9.95% |
| Requests | 72,413 | 65,107 | -10.09% |
| Bad responses | 0 | 0 | |
| Mean latency (ms) | 4.83 | 5.37 | +11.06% |
| Max latency (ms) | 37.28 | 18.40 | -50.64% |

h3 was slower in this scenario than h2 was even before this change, though it does have lower max latency. I almost didn't believe this one, thinking it must be falling back to h2 or something, but these numbers are consistently low even when I benchmark h3 on the PR branch.

h2c 70x1 (main)

> crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/grpc.benchmarks.yml --scenario grpcaspnetcoreserver-grpcnetclient --profile aspnet-citrine-lin --variable protocol=h2c --variable connections=1 --variable streams=70
| application | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 13 | 41 | +215.38% |
| Cores usage (%) | 370 | 1,140 | +208.11% |
| Working Set (MB) | 189 | 393 | +107.94% |
| Private Memory (MB) | 1,195 | 1,405 | +17.57% |
| Build Time (ms) | 4,624 | 4,644 | +0.43% |
| Start Time (ms) | 309 | 238 | -22.98% |
| Published Size (KB) | 91,293 | 91,293 | 0.00% |
| .NET Core SDK Version | 7.0.100-preview.3.22123.26 | 7.0.100-preview.3.22123.26 | |

| load | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 28 | 96 | +242.86% |
| Cores usage (%) | 781 | 2,686 | +243.92% |
| Working Set (MB) | 386 | 401 | +3.89% |
| Private Memory (MB) | 1,397 | 1,417 | +1.43% |
| Build Time (ms) | 4,660 | 4,462 | -4.25% |
| Start Time (ms) | 183 | 177 | -3.28% |
| Published Size (KB) | 80,367 | 80,367 | 0.00% |
| .NET Core SDK Version | 6.0.200 | 6.0.200 | |
| Max RPS | 14,485 | 97,970 | +576.33% |
| Requests | 72,413 | 490,342 | +577.15% |
| Bad responses | 0 | 0 | |
| Mean latency (ms) | 4.83 | 0.71 | -85.23% |
| Max latency (ms) | 37.28 | 11.98 | -67.85% |

h2c 70x1 (halter73/30235)

| application | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 13 | 46 | +253.85% |
| Cores usage (%) | 370 | 1,287 | +247.84% |
| Working Set (MB) | 189 | 396 | +109.52% |
| Private Memory (MB) | 1,195 | 1,317 | +10.21% |
| Build Time (ms) | 4,624 | 4,502 | -2.64% |
| Start Time (ms) | 309 | 275 | -11.00% |
| Published Size (KB) | 91,293 | 91,293 | 0.00% |
| .NET Core SDK Version | 7.0.100-preview.3.22123.26 | 7.0.100-preview.3.22123.26 | |

| load | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 28 | 96 | +242.86% |
| Cores usage (%) | 781 | 2,691 | +244.56% |
| Working Set (MB) | 386 | 399 | +3.37% |
| Private Memory (MB) | 1,397 | 1,414 | +1.22% |
| Build Time (ms) | 4,660 | 4,127 | -11.44% |
| Start Time (ms) | 183 | 184 | +0.55% |
| Published Size (KB) | 80,367 | 80,367 | 0.00% |
| .NET Core SDK Version | 6.0.200 | 6.0.200 | |
| Max RPS | 14,485 | 99,935 | +589.89% |
| Requests | 72,413 | 499,373 | +589.62% |
| Bad responses | 0 | 0 | |
| Mean latency (ms) | 4.83 | 0.70 | -85.52% |
| Max latency (ms) | 37.28 | 16.27 | -56.36% |

> I'm guessing connection memory use will be higher with the extra pipe.

By default, the theoretical memory use will be up to 64 KB higher per HTTP/2 connection experiencing TCP backpressure. We already have a higher 1 MB default limit for buffering the read side at this layer, and no extra memory is used when there is HTTP/2 flow-control backpressure. The benchmark results do show increases in working set and private memory, but not any more than can be explained by the increased CPU usage.
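For context, 64 KB matches Kestrel's default MaxResponseBufferSize. A bound like that would typically be expressed through the intermediate pipe's pause/resume thresholds; the options below are a hypothetical illustration (the thresholds, resume value, and shared pool are assumptions, not the exact values in this PR), following the same shape as the inputPipeOptions shown in the diff above:

```csharp
using System.Buffers;
using System.IO.Pipelines;

// Hypothetical options for the intermediate output pipe. The real code would use the
// connection's memory pool (context.MemoryPool); the thresholds here are illustrative.
var outputPipeOptions = new PipeOptions(
    pool: MemoryPool<byte>.Shared,
    pauseWriterThreshold: 64 * 1024,   // FlushAsync stops completing synchronously past ~64 KB buffered
    resumeWriterThreshold: 32 * 1024,  // assumed resume point once the copy loop drains the pipe
    useSynchronizationContext: false);
```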

> If we moved to using channels and eliminated the write lock altogether, would we keep the output pipe copy?

We'd get rid of the output pipe copy. This is just to get the expensive TLS operations out of the lock. If we do the ConcurrentPipeWriter thing I mention in the PR description, we might not even have to go as far as using Channels to avoid the copy. I still think using Channels is likely the best option given infinite time.
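As a rough, hypothetical sketch of that ConcurrentPipeWriter idea (this wrapper is not part of the PR, and the real change would use a pooled IValueTaskSource and honor MaxResponseBufferSize rather than allocating a Task per flush), a PipeWriter wrapper could dispatch FlushAsync to the thread pool like this:

```csharp
using System;
using System.IO.Pipelines;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper: dispatches flushes (and therefore the TLS write they trigger)
// to the thread pool so the caller can release its lock without waiting on the stream.
internal sealed class DispatchingPipeWriter : PipeWriter
{
    private readonly PipeWriter _inner;

    public DispatchingPipeWriter(PipeWriter inner) => _inner = inner;

    public override ValueTask<FlushResult> FlushAsync(CancellationToken cancellationToken = default)
        // Task.Run moves the inner flush off the calling thread; a pooled
        // IValueTaskSource would avoid this per-flush Task allocation.
        => new(Task.Run(() => _inner.FlushAsync(cancellationToken).AsTask(), cancellationToken));

    public override void Advance(int bytes) => _inner.Advance(bytes);
    public override Memory<byte> GetMemory(int sizeHint = 0) => _inner.GetMemory(sizeHint);
    public override Span<byte> GetSpan(int sizeHint = 0) => _inner.GetSpan(sizeHint);
    public override void CancelPendingFlush() => _inner.CancelPendingFlush();
    public override void Complete(Exception? exception = null) => _inner.Complete(exception);
}
```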

@Tratcher (Member) left a comment:

This kind of change should stabilize in main for at least one preview release before we consider backporting it.

Co-authored-by: Aditya Mandaleeka <adityamandaleeka@users.noreply.github.com>
halter73 changed the title from "Speed up contended HTTP/2 frame writing (500% gRPC h2 70x1 improvement)" to "Speed up contended HTTP/2 frame writing" on Feb 28, 2022
halter73 merged commit 8af6420 into main on Feb 28, 2022
halter73 deleted the halter73/30235 branch on February 28, 2022
ghost added this to the 7.0-preview3 milestone on Feb 28, 2022
amcasey added the area-networking label and removed the area-runtime label on Jun 6, 2023
github-actions bot locked and limited conversation to collaborators on Dec 8, 2023