Speed up contended HTTP/2 frame writing #40407

Merged: halter73 merged 4 commits into main from halter73/30235 on Feb 28, 2022

Conversation

@halter73 (Member) commented Feb 25, 2022

This PR changes Http2Connection/Http2FrameWriter so that it dispatches SslStream.WriteAsync to the thread pool and releases the Http2FrameWriter._writeLock immediately. This improves our "gRPC h2 70x1" scenario (70 streams over one HTTP/2 connection with TLS) by 500% as measured by RPS.

The following YARP profile, focused on Http2OutputProducer.WriteDataToPipeAsync, shows how much time is currently spent spinning on the _writeLock and waiting on TLS operations when many streams write to a single HTTP/2 connection.

[Screenshot: CPU profile of Http2OutputProducer.WriteDataToPipeAsync, captured 2022-02-24]

This partially addresses #30235. It doesn't go as far as using Channels as suggested in the issue, so HPACK encoding and frame writing/copying still happen while holding the _writeLock. However, the more expensive TLS operations do get dispatched. This allows contending streams to more quickly write their own frames to an output buffer or await without spinning on the lock as frequently.

This does introduce an extra copy similar to what we already have for HTTP/2 input, but the benchmark results clearly show this is worthwhile in order to offload the TLS work to a thread that doesn't block other HTTP/2 streams. We could avoid this copy by updating ConcurrentPipeWriter to dispatch calls to FlushAsync and WriteAsync. I didn't do that for this initial iteration because we'd want to use a pooled IValueTaskSource to support this. We'd also want to make ConcurrentPipeWriter aware of the MaxResponseBufferSize so it wouldn't always return incomplete ValueTasks for dispatched writes.
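To make the pattern concrete, here is a minimal sketch of the approach described above, assuming an intermediate Pipe between the frame writer and the connection stream. The class and member names are illustrative only and do not match the real Http2FrameWriter:

```csharp
using System;
using System.Buffers;
using System.IO;
using System.IO.Pipelines;
using System.Threading.Tasks;

// Illustrative sketch only; not the actual Http2FrameWriter implementation.
public sealed class FrameWriterSketch
{
    private readonly object _writeLock = new();
    private readonly Pipe _outputPipe = new(new PipeOptions(useSynchronizationContext: false));
    private readonly Task _copyTask;

    public FrameWriterSketch(Stream connectionStream)
    {
        // Drains the intermediate pipe on a thread-pool thread, so a slow TLS write
        // (e.g. SslStream.WriteAsync blocked on TCP backpressure) no longer runs
        // while the write lock is held.
        _copyTask = Task.Run(() => _outputPipe.Reader.CopyToAsync(connectionStream));
    }

    public void WriteFrame(ReadOnlySpan<byte> encodedFrame)
    {
        lock (_writeLock)
        {
            // Only the in-memory copy (the "extra copy" mentioned above) happens
            // under the lock, so contending streams are released quickly.
            _outputPipe.Writer.Write(encodedFrame);

            // Fire-and-forget for brevity; the real writer would observe the flush
            // result to apply backpressure rather than buffering without bound.
            _ = _outputPipe.Writer.FlushAsync();
        }
    }

    public async Task CompleteAsync()
    {
        await _outputPipe.Writer.CompleteAsync();
        await _copyTask;
    }
}
```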

> crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/grpc.benchmarks.yml --scenario grpcaspnetcoreserver-grpcnetclient --profile aspnet-citrine-lin --variable protocol=h2 --variable connections=1 --variable streams=70
| application | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 13 | 48 | +269.23% |
| Cores usage (%) | 370 | 1,355 | +266.22% |
| Working Set (MB) | 189 | 411 | +117.46% |
| Private Memory (MB) | 1,195 | 1,418 | +18.66% |
| Build Time (ms) | 4,624 | 4,432 | -4.15% |
| Start Time (ms) | 309 | 347 | +12.30% |
| Published Size (KB) | 91,293 | 91,293 | 0.00% |
| .NET Core SDK Version | 7.0.100-preview.3.22123.26 | 7.0.100-preview.3.22123.26 | |

| load | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 28 | 90 | +221.43% |
| Cores usage (%) | 781 | 2,512 | +221.64% |
| Working Set (MB) | 386 | 414 | +7.25% |
| Private Memory (MB) | 1,397 | 1,422 | +1.79% |
| Build Time (ms) | 4,660 | 4,573 | -1.87% |
| Start Time (ms) | 183 | 186 | +1.64% |
| Published Size (KB) | 80,367 | 80,367 | 0.00% |
| .NET Core SDK Version | 6.0.200 | 6.0.200 | |
| Max RPS | 14,485 | 86,983 | +500.49% |
| Requests | 72,413 | 434,308 | +499.77% |
| Bad responses | 0 | 0 | |
| Mean latency (ms) | 4.83 | 0.80 | -83.36% |
| Max latency (ms) | 37.28 | 8.31 | -77.71% |

Even when the client can open multiple connections to reduce contention, this improves performance.

> crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/grpc.benchmarks.yml --scenario grpcaspnetcoreserver-grpcnetclient --profile aspnet-citrine-lin --variable protocol=h2 --variable streams=70
| application | baseline100 | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 66 | 84 | +27.27% |
| Cores usage (%) | 1,856 | 2,361 | +27.21% |
| Working Set (MB) | 1,610 | 1,843 | +14.47% |
| Private Memory (MB) | 2,658 | 3,184 | +19.79% |
| Build Time (ms) | 4,624 | 4,933 | +6.68% |
| Start Time (ms) | 303 | 336 | +10.89% |
| Published Size (KB) | 91,293 | 91,293 | 0.00% |
| .NET Core SDK Version | 7.0.100-preview.3.22123.26 | 7.0.100-preview.3.22123.26 | |

| load | baseline100 | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 88 | 97 | +10.23% |
| Cores usage (%) | 2,459 | 2,715 | +10.41% |
| Working Set (MB) | 3,470 | 4,126 | +18.90% |
| Private Memory (MB) | 4,551 | 5,130 | +12.72% |
| Build Time (ms) | 4,287 | 4,658 | +8.65% |
| Start Time (ms) | 181 | 177 | -2.21% |
| Published Size (KB) | 80,367 | 80,367 | 0.00% |
| .NET Core SDK Version | 6.0.200 | 6.0.200 | |
| Max RPS | 339,472 | 426,355 | +25.59% |
| Requests | 1,714,671 | 2,158,208 | +25.87% |
| Bad responses | 0 | 0 | |
| Mean latency (ms) | 20.52 | 16.35 | -20.32% |
| Max latency (ms) | 91.55 | 186.20 | +103.38% |

I also verified the non-TLS "h2c" performance does not regress.

@JamesNK (Member) commented Feb 25, 2022

Nice! Doing stuff in that write lock on a busy connection really is a killer 😮

How does the TLS 70x1 gRPC benchmark compare now with the non-TLS 70x1? And HTTP/3?
I'm guessing connection memory use will be higher with the extra pipe.
If we moved to using channels and eliminated the write lock altogether, would we keep the output pipe copy?

@@ -83,8 +85,34 @@ public Http2Connection(HttpConnectionContext context)
// Capture the ExecutionContext before dispatching HTTP/2 middleware. Will be restored by streams when processing request
_context.InitialExecutionContext = ExecutionContext.Capture();

var inputPipeOptions = new PipeOptions(pool: context.MemoryPool,

Member commented:

Does this introduce a new pipe... That's unfortunate.

halter73 (Member, Author) replied:

It does introduce a new pipe. This part of the PR description describes how we could avoid that.

> This does introduce an extra copy similar to what we already have for HTTP/2 input, but the benchmark results clearly show this is worthwhile in order to offload the TLS work to a thread that doesn't block other HTTP/2 streams. We could avoid this copy by updating ConcurrentPipeWriter to dispatch calls to FlushAsync and WriteAsync. I didn't do that for this initial iteration because we'd want to use a pooled IValueTaskSource to support this. We'd also want to make ConcurrentPipeWriter aware of the MaxResponseBufferSize so it wouldn't always return incomplete ValueTasks for dispatched writes.

I wanted to keep this simple for now, though, so we can possibly service it. Long term we can do the custom PipeWriter, or better yet get rid of the _writeLock altogether using Channels.

Member replied:

I don't feel like a lot of time should be invested in what should hopefully be replaced by the best solution (channels).

@halter73 (Member, Author) commented:

> How does the TLS 70x1 gRPC benchmark compare now with the non-TLS 70x1? And HTTP/3?

h3 70x1

> crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/grpc.benchmarks.yml --scenario grpcaspnetcoreserver-grpcnetclient --profile aspnet-citrine-lin --variable protocol=h3 --variable connections=1 --variable streams=70
| application | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 13 | 25 | +92.31% |
| Cores usage (%) | 370 | 696 | +88.11% |
| Working Set (MB) | 189 | 374 | +97.88% |
| Private Memory (MB) | 1,195 | 1,446 | +21.00% |
| Build Time (ms) | 4,624 | 4,636 | +0.26% |
| Start Time (ms) | 309 | 363 | +17.48% |
| Published Size (KB) | 91,293 | 91,293 | 0.00% |
| .NET Core SDK Version | 7.0.100-preview.3.22123.26 | 7.0.100-preview.3.22123.26 | |

| load | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 28 | 17 | -39.29% |
| Cores usage (%) | 781 | 476 | -39.05% |
| Working Set (MB) | 386 | 394 | +2.07% |
| Private Memory (MB) | 1,397 | 1,410 | +0.93% |
| Build Time (ms) | 4,660 | 4,132 | -11.33% |
| Start Time (ms) | 183 | 184 | +0.55% |
| Published Size (KB) | 80,367 | 80,367 | 0.00% |
| .NET Core SDK Version | 6.0.200 | 6.0.200 | |
| Max RPS | 14,485 | 13,045 | -9.95% |
| Requests | 72,413 | 65,107 | -10.09% |
| Bad responses | 0 | 0 | |
| Mean latency (ms) | 4.83 | 5.37 | +11.06% |
| Max latency (ms) | 37.28 | 18.40 | -50.64% |

h3 was slower in this scenario than h2 was even before this change, though it does have lower max latency. I almost didn't believe this one, thinking it must be falling back to h2 or something, but these numbers are consistently low even when I benchmark h3 on the PR branch.

h2c 70x1 (main)

> crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/grpc.benchmarks.yml --scenario grpcaspnetcoreserver-grpcnetclient --profile aspnet-citrine-lin --variable protocol=h2c --variable connections=1 --variable streams=70
| application | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 13 | 41 | +215.38% |
| Cores usage (%) | 370 | 1,140 | +208.11% |
| Working Set (MB) | 189 | 393 | +107.94% |
| Private Memory (MB) | 1,195 | 1,405 | +17.57% |
| Build Time (ms) | 4,624 | 4,644 | +0.43% |
| Start Time (ms) | 309 | 238 | -22.98% |
| Published Size (KB) | 91,293 | 91,293 | 0.00% |
| .NET Core SDK Version | 7.0.100-preview.3.22123.26 | 7.0.100-preview.3.22123.26 | |

| load | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 28 | 96 | +242.86% |
| Cores usage (%) | 781 | 2,686 | +243.92% |
| Working Set (MB) | 386 | 401 | +3.89% |
| Private Memory (MB) | 1,397 | 1,417 | +1.43% |
| Build Time (ms) | 4,660 | 4,462 | -4.25% |
| Start Time (ms) | 183 | 177 | -3.28% |
| Published Size (KB) | 80,367 | 80,367 | 0.00% |
| .NET Core SDK Version | 6.0.200 | 6.0.200 | |
| Max RPS | 14,485 | 97,970 | +576.33% |
| Requests | 72,413 | 490,342 | +577.15% |
| Bad responses | 0 | 0 | |
| Mean latency (ms) | 4.83 | 0.71 | -85.23% |
| Max latency (ms) | 37.28 | 11.98 | -67.85% |

h2c 70x1 (halter73/30235)

| application | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 13 | 46 | +253.85% |
| Cores usage (%) | 370 | 1,287 | +247.84% |
| Working Set (MB) | 189 | 396 | +109.52% |
| Private Memory (MB) | 1,195 | 1,317 | +10.21% |
| Build Time (ms) | 4,624 | 4,502 | -2.64% |
| Start Time (ms) | 309 | 275 | -11.00% |
| Published Size (KB) | 91,293 | 91,293 | 0.00% |
| .NET Core SDK Version | 7.0.100-preview.3.22123.26 | 7.0.100-preview.3.22123.26 | |

| load | baseline | Current | Change |
| --- | --- | --- | --- |
| CPU Usage (%) | 28 | 96 | +242.86% |
| Cores usage (%) | 781 | 2,691 | +244.56% |
| Working Set (MB) | 386 | 399 | +3.37% |
| Private Memory (MB) | 1,397 | 1,414 | +1.22% |
| Build Time (ms) | 4,660 | 4,127 | -11.44% |
| Start Time (ms) | 183 | 184 | +0.55% |
| Published Size (KB) | 80,367 | 80,367 | 0.00% |
| .NET Core SDK Version | 6.0.200 | 6.0.200 | |
| Max RPS | 14,485 | 99,935 | +589.89% |
| Requests | 72,413 | 499,373 | +589.62% |
| Bad responses | 0 | 0 | |
| Mean latency (ms) | 4.83 | 0.70 | -85.52% |
| Max latency (ms) | 37.28 | 16.27 | -56.36% |

> I'm guessing connection memory use will be higher with the extra pipe.

By default, the theoretical memory use will be up to 64 KB higher per HTTP/2 connection experiencing TCP backpressure. We already have a higher 1 MB default limit for buffering the read side at this layer, and no extra memory is used when there is HTTP/2 flow-control backpressure. The benchmark results do show increases in working set and private memory, but not any more than can be explained by the increased CPU usage.
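For context, 64 KB matches Kestrel's default MaxResponseBufferSize. A bound like that would typically be expressed through the intermediate pipe's pause/resume thresholds; the options below are a hypothetical illustration (the thresholds, resume value, and shared pool are assumptions, not the exact values in this PR), following the same shape as the inputPipeOptions shown in the diff above:

```csharp
using System.Buffers;
using System.IO.Pipelines;

// Hypothetical options for the intermediate output pipe. The real code would use the
// connection's memory pool (context.MemoryPool); the thresholds here are illustrative.
var outputPipeOptions = new PipeOptions(
    pool: MemoryPool<byte>.Shared,
    pauseWriterThreshold: 64 * 1024,   // FlushAsync stops completing synchronously past ~64 KB buffered
    resumeWriterThreshold: 32 * 1024,  // assumed resume point once the copy loop drains the pipe
    useSynchronizationContext: false);
```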

> If we moved to using channels and eliminated the write lock altogether, would we keep the output pipe copy?

We'd get rid of the output pipe copy. This is just to get the expensive TLS operations out of the lock. If we do the ConcurrentPipeWriter thing I mention in the PR description, we might not even have to go as far as using Channels to avoid the copy. I still think using Channels is likely the best option given infinite time.
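As a rough, hypothetical sketch of that ConcurrentPipeWriter idea (this wrapper is not part of the PR, and the real change would use a pooled IValueTaskSource and honor MaxResponseBufferSize rather than allocating a Task per flush), a PipeWriter wrapper could dispatch FlushAsync to the thread pool like this:

```csharp
using System;
using System.IO.Pipelines;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper: dispatches flushes (and therefore the TLS write they trigger)
// to the thread pool so the caller can release its lock without waiting on the stream.
internal sealed class DispatchingPipeWriter : PipeWriter
{
    private readonly PipeWriter _inner;

    public DispatchingPipeWriter(PipeWriter inner) => _inner = inner;

    public override ValueTask<FlushResult> FlushAsync(CancellationToken cancellationToken = default)
        // Task.Run moves the inner flush off the calling thread; a pooled
        // IValueTaskSource would avoid this per-flush Task allocation.
        => new(Task.Run(() => _inner.FlushAsync(cancellationToken).AsTask(), cancellationToken));

    public override void Advance(int bytes) => _inner.Advance(bytes);
    public override Memory<byte> GetMemory(int sizeHint = 0) => _inner.GetMemory(sizeHint);
    public override Span<byte> GetSpan(int sizeHint = 0) => _inner.GetSpan(sizeHint);
    public override void CancelPendingFlush() => _inner.CancelPendingFlush();
    public override void Complete(Exception? exception = null) => _inner.Complete(exception);
}
```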

@Tratcher (Member) left a comment:

This kind of change should stabilize in main for at least one preview release before we consider backporting it.

Co-authored-by: Aditya Mandaleeka <adityamandaleeka@users.noreply.github.com>
halter73 changed the title from "Speed up contended HTTP/2 frame writing (500% gRPC h2 70x1 improvement)" to "Speed up contended HTTP/2 frame writing" on Feb 28, 2022
halter73 merged commit 8af6420 into main on Feb 28, 2022
halter73 deleted the halter73/30235 branch on February 28, 2022
ghost added this to the 7.0-preview3 milestone on Feb 28, 2022
amcasey added the area-networking label and removed the area-runtime label on Jun 6, 2023
github-actions bot locked and limited conversation to collaborators on Dec 8, 2023