Reduce byte array allocations when reading/writing packets #6023
Conversation
Byte arrays are a major source of LOH allocations when streaming logging events across nodes. Allocating a large MemoryStream once and then growing it as needed almost completely removes allocations for byte arrays. This should significantly improve memory traffic during large builds.
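A rough sketch of the idea (hypothetical names, not the actual MSBuild types): keep one MemoryStream per connection, reset it between packets, and let it retain its grown backing array.

```csharp
using System;
using System.IO;

// Hypothetical sketch of the reuse pattern described above.
class PacketWriter
{
    // Allocated once per connection and grown on demand.
    private readonly MemoryStream _writeStream = new MemoryStream();

    public void WritePacket(Action<Stream> serialize, Stream pipe)
    {
        // SetLength(0) rewinds the stream but keeps the grown backing
        // array, so large packets stop producing fresh LOH allocations.
        _writeStream.SetLength(0);
        serialize(_writeStream);
        pipe.Write(_writeStream.GetBuffer(), 0, (int)_writeStream.Length);
    }
}
```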
I'm looking at Mono failures... presumably I broke something around FEATURE_APM... a packet arrives, of the right length, but contains junk...
Part 3 works, everything else fails. This commit breaks it: A lock around the method doesn't help. I'm stumped! Will go for a walk and try to attack it tomorrow.
Hey @dsplaisted, do you have any wisdom for me? This commit is trying to reuse the MemoryStream, and it breaks Mono; see the build failures. I've isolated it to this change. I'd try a pooled memory stream, but first I want to understand what is going on. Are two of these methods running simultaneously? Why does the lock not help?
It looks like the initial commit is your first attempt and then you're generally just backing off the changes to try to get it to work? I'm not sure why it's not working. I don't think multiple reads or writes should be going on at the same time in a single process. Maybe a read and a write can happen at the same time and that's causing an issue? But I don't know.
What if you make the fields thread-local? Every thread would get its own cached byte array and avoid data races. If that fixes the problem, it means the method is indeed multithreaded, which would be a surprise.
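For reference, a minimal sketch of the thread-local suggestion (field and property names are made up):

```csharp
using System;
using System.IO;

class PacketWriter
{
    // Each thread gets its own stream, so concurrent callers can't race
    // on a shared buffer. Note: [ThreadStatic] fields must not use an
    // inline initializer; it would run only on the first thread.
    [ThreadStatic]
    private static MemoryStream t_writeStream;

    private static MemoryStream WriteStream => t_writeStream ??= new MemoryStream();
}
```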
The command line I use locally for testing on Mac:
Here's the raw MSBuild invocation:
OK, I understand what's happening: this async call doesn't await and returns (msbuild/src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs, line 762 in 34bbbed), and then, since the array is reused, the Unix implementation of PipeStream trips over itself: we queue a new call to SendAsync before the previous call returns and frees up our array. So we can't reuse the array unless we await on line 762 or use the synchronous Write method on the NamedPipeClientStream.
If we use WriteAsync and don't await it, a subsequent WriteAsync may be called before the first one completes. If both calls share the same buffer and they overlap, we overwrite the data in the buffer and junk arrives at the receiver.
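Condensed into a hypothetical repro (the method and its parameters are placeholders, not the MSBuild code):

```csharp
using System;
using System.IO;

static void SendBoth(Stream pipe, byte[] packetA, byte[] packetB)
{
    byte[] shared = new byte[Math.Max(packetA.Length, packetB.Length)];

    packetA.CopyTo(shared, 0);
    _ = pipe.WriteAsync(shared, 0, packetA.Length);  // fire-and-forget: not awaited

    // This may run while write #1 is still reading `shared`, so it
    // overwrites packet A's bytes and the receiver gets junk.
    packetB.CopyTo(shared, 0);
    _ = pipe.WriteAsync(shared, 0, packetB.Length);
}
```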
Force-pushed from 0760a0d to 6eb445d
OK folks, note one significant change: on Mono (== when FEATURE_APM is not set) we now make the last write of the packet synchronous as well. The other codepath (used by Windows) doesn't seem to have a problem sharing the array, because it ends up in a WriteFileNative call where we fully read the array segment we're passed before returning from BeginWrite. On Mono it used to be a fire-and-forget async task (WriteAsync was not awaited), which led to new writes being issued before the previous write returned. Not only does that get in the way of reusing the array, but more importantly I think it can explain some weirdness I saw with Mono parallel builds: mono/mono#20669. Since the Mono parallel build is already slow enough, I think making this change won't make things much worse. There are much more impactful things we can do to improve perf there.
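The shape of the split, paraphrased (not the exact MSBuild source; `stream`, `buffer`, and the callback are stand-ins):

```csharp
#if FEATURE_APM
    // Windows: BeginWrite ends up in WriteFileNative, which fully reads
    // the array segment before returning, so sharing the buffer is safe.
    stream.BeginWrite(buffer, offset, count, writeCallback, state);
#else
    // Mono: a fire-and-forget WriteAsync may still be reading the shared
    // buffer when the next write starts, so write synchronously instead.
    stream.Write(buffer, offset, count);
#endif
```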
@KirillOsenkov Great job on figuring out what was wrong on Mono. However, it seems like we're relying on the implementation of
I had the same concern. It would make things much simpler if the last write were synchronous as well. Are we sure that making that last write fire-and-forget async is significant for perf?
Given the excellent documentation, there's no way to know but to try it out. You can try building Orchard Core with /m and do an RPS run. My gut feeling is that it won't affect perf too much because most packets are fairly short. The biggest ones that I'm aware of are ProjectInstance and BuildResult objects, and there are only two of those per built project. ProjectInstance objects are not sent during cmdline builds, only in VS builds, so running RPS would be good. Logging packets are the most frequent, but how large can those be? 😊 And besides, AFAIK packet sending happens on a separate thread from the RequestBuilder, so it won't hold up builds; worst case it will congest the send queue.
So I ran some numbers on a medium-size fully built incremental solution (so the C# compiler is never invoked).
The first column is current master. The second is the PR before I made the writing synchronous. The last one is the last commit, where I write the entire packet synchronously. Of course this is just a toy experiment and we'd need to run RPS, but it confirms my intuition that making things sync here likely won't create much contention.
@cdmihai logging packets can easily get huge, ~5 MB, for logging of task inputs, task outputs, and item adds and removes (all items with metadata are concatenated into a giant 5 MB string and sent across the wire in a single BuildMessageEventArgs). You can see some stats here: Here's a sample part of a single packet sent by RAR:
Looks good!
```diff
-int lengthToWrite = Math.Min((int)writeStream.Length - i, MaxPacketWriteSize);
-if ((int)writeStream.Length - i <= MaxPacketWriteSize)
+int lengthToWrite = Math.Min(writeStreamLength - i, MaxPacketWriteSize);
+if (writeStreamLength - i <= MaxPacketWriteSize)
```
This `if` looks redundant now; both branches do the same thing.
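With the branch collapsed, the chunking loop could reduce to something like this (paraphrased; `stream` and `writeStreamBuffer` are assumed locals):

```csharp
for (int i = 0; i < writeStreamLength; i += MaxPacketWriteSize)
{
    // Both former branches did the same thing, so one write suffices.
    int lengthToWrite = Math.Min(writeStreamLength - i, MaxPacketWriteSize);
    stream.Write(writeStreamBuffer, i, lengthToWrite);
}
```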
good point
This is not ready yet - we have a real perf regression identified by @cdmihai:
Additionally, we have an RPS failure, because we now preallocate two 1 MB byte arrays per worker node, which RPS caught. I'll need to confirm the overall memory savings on larger builds, and if they are as significant as I saw, we'll need to ask for an RPS exception to allow us to take an initial hit in exchange for almost no byte[] allocations down the line in this area.
Also, I'll need to look at whether I can make the entire SendData() async (fire-and-forget) so that we don't block on it finishing. If switching to sync for the last part of the packet caused such a dramatic slowdown, maybe we can squeeze more out of it by moving the entire SendData() call to a different async queue, so it queues up a packet and returns immediately.
Would it make sense to start with smaller arrays then, and pay the resizing cost only in scenarios where a lot of data moves between nodes?
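Sketched, the suggestion is just a smaller initial capacity (the 4 KB figure is an arbitrary example; the 1 MB comes from the RPS report above):

```csharp
// Start small; MemoryStream doubles its capacity on demand, so only
// nodes that actually move a lot of data grow toward the 1 MB size.
private readonly MemoryStream _writeStream = new MemoryStream(4096);
```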
@ladipro certainly a simple way to trick RPS ;) I'll try that.
reminds me of Volkswagen for some reason
This avoids blocking the main loop. Roughly equivalent to writing the packet asynchronously using fire-and-forget (BeginWrite).
Force-pushed from dc5f1fb to c89545c
I'm experimenting with moving SendData() off the main loop. Measuring this is incredibly tricky. I've spent two full days trying out various approaches, measuring various codebases, with various flags (/bl, /graph), debug vs. release, last write sync vs. async, MemoryStreamPool vs. a single memory stream, attempts at optimizing DrainPacketQueue(), etc. The numbers fluctuate and are highly inconclusive. There are outliers and no clear picture. But the trend does show a little bit, so at least you can compare each pair of approaches.
This is a rough draft, and if the approach looks OK I'll rebase, clean up the code, and add comments and such.
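A rough sketch of the queue-and-drain idea (illustrative names; the real DrainPacketQueue differs in its details): SendData only enqueues and returns, and a chained drainer task owns the pipe so writes never overlap.

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class PacketSendQueue
{
    private readonly ConcurrentQueue<MemoryStream> _queue = new();
    private readonly object _gate = new();
    private Task _drainer = Task.CompletedTask;
    private readonly Stream _pipe;

    public PacketSendQueue(Stream pipe) => _pipe = pipe;

    public void Enqueue(MemoryStream packet)
    {
        _queue.Enqueue(packet);
        lock (_gate)
        {
            // Chain each drain onto the previous one so writes are
            // serialized; the caller returns immediately.
            _drainer = _drainer.ContinueWith(_ => Drain(), TaskScheduler.Default);
        }
    }

    private void Drain()
    {
        while (_queue.TryDequeue(out MemoryStream packet))
        {
            _pipe.Write(packet.GetBuffer(), 0, (int)packet.Length);
        }
    }
}
```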
Here are the final numbers I got that matter the most:
70d3538 is the base commit of this PR. 6eb445d reuses MemoryStreams. c89545c is the last commit, which makes SendData return immediately and writes the packets asynchronously on a separate thread.
I pushed this branch to exp/kirillo/memoryStream to see if RPS is happier with it this way.
Unfortunately this is a no-go for now. On a single-threaded build this PR is noticeably slower than master:
I'll need to profile and investigate why.
Hmm, I've remeasured and now it's on par with master for single-node: 32.847. I profiled and saw nothing under the profiler in the code I touch. I even turned off Defender for the path.
Without /bl and with /m the numbers are ever so slightly more favorable for this PR:
This should be ready for final review.
@ladipro yup, Tasks were designed to point forward (to continuations), but not backward (a Task doesn't know what started it and doesn't hold on to its predecessors). So the chain just falls off and gets GCed.
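A minimal illustration of why such a chain doesn't leak:

```csharp
using System.Threading.Tasks;

Task chain = Task.CompletedTask;
for (int i = 0; i < 1_000_000; i++)
{
    // Only the head of the chain stays referenced. A completed task holds
    // no pointer back to its antecedent, so finished links become garbage.
    chain = chain.ContinueWith(_ => { /* send one packet */ });
}
chain.Wait();
```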
@KirillOsenkov This looks awesome. What's holding this up from getting in? Are we targeting 16.9 or 16.10?
16.10. I’m waiting for you guys to merge this :)
Is this targeting 16.10 because of the risk?
Yeah, there's not enough runway in 16.9 to get some thorough testing and dogfooding time.