Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce byte array allocations when reading/writing packets #6023

Merged
merged 8 commits into from
Feb 9, 2021

Conversation

KirillOsenkov
Copy link
Member

Byte arrays are a major source of LOH allocations when streaming logging events across nodes. Allocating a large MemoryStream once and then growing it as needed almost completely removes allocations for byte arrays.

This should significantly improve memory traffic during large builds.

Before:

image

After:

image

Byte arrays are a major source of LOH allocations when streaming logging events across nodes. Allocating a large MemoryStream once and then growing it as needed almost completely removes allocations for byte arrays.

This should significantly improve memory traffic during large builds.
@KirillOsenkov
Copy link
Member Author

This should significantly improve memory pressure during large builds:
#2168

Strings and char[] arrays will still be allocated so it's not a full solution. But some low hanging fruit that I hope will bring some improvement.

@mmitche FYI

@KirillOsenkov
Copy link
Member Author

I'm looking at Mono failures... presumably I broke something around FEATURE_APM... a packet arrives, of the right length, but contains junk...

@KirillOsenkov
Copy link
Member Author

KirillOsenkov commented Jan 12, 2021

Part 3 works, everything else fails.

This commit breaks it:
9b123c4

A lock around the method doesn't help.

I'm stumped! Will go for a walk and try to attack it tomorrow.

@KirillOsenkov
Copy link
Member Author

Hey @dsplaisted do you have any wisdom for me? This commit is trying to reuse the MemoryStream:
9b123c4

and it breaks Mono, see build failures. I've isolated it to this change. I'd try a pooled memory stream but first I want to understand what is going on? Are two of these methods running simultaneously? Why does the lock not help?

@dsplaisted
Copy link
Member

It looks like the initial commit is your first attempt and then you're generally just backing off the changes to try to get it to work?

I'm not sure why it's not working. I don't think multiple reads or writes should be going on at the same time in a single process. Maybe a read and write can be happening at the same time and that's causing an issue? But I don't know.

@cdmihai
Copy link
Contributor

cdmihai commented Jan 14, 2021

What if you make the fields thread local? Every thread would get its own cached byte array and avoid data races. If that fixes the problem it means that that method is indeed multi threaded, which is a surprise

@KirillOsenkov
Copy link
Member Author

KirillOsenkov commented Jan 17, 2021

The command line I use locally for testing on Mac:

eng/cibuild_bootstrapped_msbuild.sh --host_type mono /p:SuppressLicenseValidation=true

@KirillOsenkov
Copy link
Member Author

KirillOsenkov commented Jan 17, 2021

Here's the raw MSBuild invocation:

./build.sh /p:CreateBootstrap=true /p:MonoBuild=true

mono /Users/kirill/MSB/artifacts/bin/bootstrap/net472/MSBuild/Current/Bin/MSBuild.dll /bl /clp:Summary /m /nr:false /p:ContinuousIntegrationBuild=true /p:Configuration=Debug-MONO /p:RepoRoot=/Users/kirill/MSB /p:Restore=true /p:Build=true /p:Rebuild=false /p:Pack=false /p:IntegrationTest=false /p:PerformanceTest=false /p:Sign=false /p:Publish=false /p:CreateBootstrap=false /p:SuppressLicenseValidation=true /p:MSBuildExtensionsPath=/Users/kirill/MSB/stage1/bin/bootstrap/net472/MSBuild/Current/Bin/Extensions /p:MSBuildExtensionsPath32=/Users/kirill/MSB/stage1/bin/bootstrap/net472/MSBuild/Current/Bin/Extensions /p:MSBuildExtensionsPath64=/Users/kirill/MSB/stage1/bin/bootstrap/net472/MSBuild/Current/Bin/Extensions /p:DeterministicSourcePaths=false /v:minimal /warnaserror /Users/kirill/.nuget/packages/microsoft.dotnet.arcade.sdk/1.0.0-beta.20580.3/tools/Build.proj

@KirillOsenkov
Copy link
Member Author

OK I understand what's happening:

this async call doesn't await and returns:

_serverToClientStream.WriteAsync(writeStreamBuffer, i, lengthToWrite);

and then since the array is reused the Unix implementation of PipeStream trips over itself:
https://github.com/mono/corefx/blob/7c24bb0756fd39fbf09b8777f25c15d21d78eb46/src/System.IO.Pipes/src/System/IO/Pipes/PipeStream.Unix.cs#L263

Because we queue a new call to SendAsync before the previous call returns and frees up our array.

So we can't reuse the array if we don't await on line 762 or use the synchronous Write method on the NamedPipeClientStream.

If we use WriteAsync and don't await it, then subsequent WriteAsync may be called before the first continuation returns. If both calls share the same buffer and they overlap, we will overwrite the data in the buffer and cause junk to arrive at receiver.
@KirillOsenkov
Copy link
Member Author

OK folks, note one significant change: on Mono (== when FEATURE_APM is not set) we switch the last write of the packet to be synchronous as well:
6eb445d

The other codepath (used by Windows) doesn't seem to have a problem sharing the array because it ends up in a WriteFileNative call where we fully read the array segment we're passed before returning from BeginWrite.

On Mono it used to be a fire-and-forget async task (WriteAsync was not awaited), and it led to new writes being called when the previous write didn't return. Not only it gets in the way of reusing the array, but more importantly I think it can explain some weirdness I saw with Mono parallel builds: mono/mono#20669

Since the Mono parallel build is already slow enough, I think making this change won't make things much worse. There are much more impactful things we can do to improve perf there.

@dsplaisted
Copy link
Member

@KirillOsenkov Great job on figuring out what was wrong on Mono.

However, it seems like we're relying on the implementation of BeginWrite to copy the data out of the buffer we pass before returning. Is it safe to assume that it will do that? Should we instead switch to a buffer pool where the buffer is released in PacketWriteComplete?

@KirillOsenkov
Copy link
Member Author

I had the same concern.

It would make things much simpler if the last write was synchronous as well. Are we sure that making that last write fire-and-forget async is significant for perf?

@cdmihai
Copy link
Contributor

cdmihai commented Jan 18, 2021

Are we sure that making that last write fire-and-forget async is significant for perf?

Given the excellent documentation, there's no way to know but to try it out. You can try and build orchard core with /m and do an RPS run. My gut feeling is that it won't affect perf too much because most packets are fairly short. The biggest ones that I'm aware of are ProjectInstance and BuildResult objects, and there's only two of those per built project. ProjectInstance objects are not sent during cmdline builds, only in VS builds, so running RPS would be good. Logging packets are the most frequent, but how large can those be? 😊. And besides, AFAIK packet sending happens on a separate thread than the RequestBuilder so it won't hold up builds, worst case it will congest the send queue.

@KirillOsenkov
Copy link
Member Author

KirillOsenkov commented Jan 18, 2021

So I ran some numbers on a medium-size fully built incremental solution (so the C# compiler is never invoked).

master: memoryStream: sync:
21.699 19.871 23.905
22.32 16.993 19.276
20.119 21.856 21.166
20.414 21.569 20.705
22.196 19.274 19.269

First column is current master. Second is the PR before I made the writing synchronous. The last one is the last commit where I write the entire packet synchronously. Of course this is just a toy experiment and we'd need to run RPS but this confirms my intuition that making things sync here likely won't create much of contention.

@cdmihai logging packets can easily get huge, ~ 5MB, for logging of task inputs, task outputs, item adds and removes (all items with metadata are concatenated into a giant 5MB string and sent across the wire in a single BuildMessageEventArgs). You can see some stats here:
https://github.com/KirillOsenkov/MSBuildStructuredLog/wiki/BinLog-Stats

Here's a sample part of a single packet sent by RAR:
https://gist.github.com/KirillOsenkov/8aa84151b92b3837dbd24227e834df58 :)

Copy link
Member

@ladipro ladipro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

int lengthToWrite = Math.Min((int)writeStream.Length - i, MaxPacketWriteSize);
if ((int)writeStream.Length - i <= MaxPacketWriteSize)
int lengthToWrite = Math.Min(writeStreamLength - i, MaxPacketWriteSize);
if (writeStreamLength - i <= MaxPacketWriteSize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if looks redundant now, both branches do the same thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point

@KirillOsenkov
Copy link
Member Author

This is not ready yet - we have a real perf regression identified by @cdmihai:

70d3538
21.79
22.31
22.13
22.41
20.58
 
6eb445d (Write the last part of the packet synchronously on Mono.)
22.94
20.68
21.03
22.46
21.54

dc5f1fb7 (Write the entire packet in SendData() synchronously.)
25.29
26.14
23.74
25.42
23.57

Additionally, we have an RPS failure, because we now preallocate two 1 MB byte arrays per worker node, which RPS caught:
image

I'll need to confirm the overall memory savings on larger builds and if it is as significant as I saw, we'll need to ask for an RPS exception to allow us to take an initial hit, in exchange for almost no byte[] allocations down the line in this area.

Also I'll need to look if I can make the entire SendData() async (fire-and-forget) so that we don't block on it finishing. If switching to sync for the last part of the packet has caused such a dramatic slowdown maybe we can squeeze more out of it if we move the entire SendData() call to a different async queue to queue up a packet and return immediately.

@ladipro
Copy link
Member

ladipro commented Jan 22, 2021

Additionally, we have an RPS failure, because we now preallocate two 1 MB byte arrays per worker node

Would it make sense to start with smaller arrays then and pay the resizing cost in scenarios where a lot of data is moved between nodes?

@KirillOsenkov
Copy link
Member Author

@ladipro certainly a simple way to trick RPS ;) I'll try that.

@KirillOsenkov
Copy link
Member Author

reminds me of Volkswagen for some reason

This avoids blocking the main loop. Roughly equivalent to writing the packet asynchronously using fire-and-forget (BeginWrite).
@KirillOsenkov KirillOsenkov force-pushed the dev/kirillo/memoryStream branch from dc5f1fb to c89545c Compare January 25, 2021 03:08
@KirillOsenkov
Copy link
Member Author

I'm experimenting with moving SendData() off the main loop (to keep BuildManager.ProcessWorkQueue() unblocked).
So far I'm seeing decent numbers (slightly better than anything seen so far).

Measuring this is incredibly tricky. I've spent two full days trying out various approaches, measuring various codebases, with various flags (/bl, /graph), debug vs. release, last write sync vs. async, MemoryStreamPool vs. a single memory stream, attempts at optimizing DrainPacketQueue(), etc. The numbers fluctuate and are highly inconclusive. There would be outliers and no clear picture. But the trend does show a little bit, so at least you can compare each pair of approaches.

This is a rough draft, and if the approach looks OK I'll rebase and cleanup the code and add comments and such.

Here are my crazy numbers from the last two days:
image

@KirillOsenkov
Copy link
Member Author

Here are the final numbers I got that matter the most:

70d3538 6eb445d c89545c
23.817 22.403 21.04
22.167 22.893 19.805
21.023 20.327 20.531
22.526 21.806 20.294
18.419 21.037 18.998
19.652 19.931 20.176

70d3538 is the base commit of this PR. 6eb445d reuses MemoryStreams. c89545c is the last commit which makes SendData return immediately and writes the packets on a separate thread asynchronously.

@KirillOsenkov
Copy link
Member Author

I pushed this branch to exp/kirillo/memoryStream to see if RPS is more happy this way

@KirillOsenkov
Copy link
Member Author

Unfortunately this is a no-go for now. On a single threaded build this PR is noticeably slower than master:

master #6023
32.377 38.573
31.758 34.411
31.481 33.057
32.62 33.376
32.495 32.679
32.6 33.203

I'll need to profile and investigate why.

@KirillOsenkov
Copy link
Member Author

KirillOsenkov commented Jan 26, 2021

Hmm, I've remeasured again and now it's on par with master for single-node:

32.847
32.151
32.023
32.876
32.028

I profiled and saw nothing under profiler that I touch. I even turned off Defender for the path.

@KirillOsenkov
Copy link
Member Author

Without /bl and with /m the numbers are ever so slightly more favorable for this PR:

master #6023
11.324 11.057
11.809 10.743
10.833 10.339
12.028 11.245
10.762 10.976
10.359 10.931

@KirillOsenkov KirillOsenkov removed the WIP label Jan 26, 2021
@KirillOsenkov
Copy link
Member Author

This should be ready for final review.

@KirillOsenkov
Copy link
Member Author

@ladipro yup, Tasks were designed to point forward (to continuations), but not backward (a Task doesn’t know what started it and doesn’t hold on to predecessors). So the chain would just fall off and get GCed.

@Forgind Forgind added the merge-when-branch-open PRs that are approved, except that there is a problem that means we are not merging stuff right now. label Jan 29, 2021
@davkean
Copy link
Member

davkean commented Feb 8, 2021

@KirillOsenkov This looks awesome. What's holding this up from getting in? Are we targeting 16.9 or 16.10?

@KirillOsenkov
Copy link
Member Author

16.10. I’m waiting for you guys to merge this :)

@davkean
Copy link
Member

davkean commented Feb 8, 2021

Is this targeting 16.10 because of the risk?

@KirillOsenkov
Copy link
Member Author

Yeah, not enough runway for 16.9 to get some thorough testing and dogfooding time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-when-branch-open PRs that are approved, except that there is a problem that means we are not merging stuff right now. performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants