BinaryWriter perf and memory improvements #47316

Conversation
@GrabYourPitchforks thank you for another amazing perf improvement!

Could you please extend the benchmarks with Write(double), Write(float), and Write(short) and contribute them to the performance repo? If we merge the benchmarks before this change, the perf infra will show improvements (or regressions) for x64, x86, and ARM64.
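A rough sketch of what such benchmarks could look like; the class name, setup, and literal values below are assumptions and may not match what ends up in dotnet/performance:

using System.IO;
using BenchmarkDotNet.Attributes;

// Hypothetical shape of the requested benchmarks for the primitive overloads.
[MemoryDiagnoser]
public class BinaryWriterPrimitiveBenchmarks
{
    private BinaryWriter _writer;

    [GlobalSetup]
    public void Setup() => _writer = new BinaryWriter(new MemoryStream());

    // Rewinding inside each benchmark keeps the MemoryStream from growing;
    // the cost of the seek is identical across all three measurements.
    [Benchmark]
    public void WriteDouble() { _writer.BaseStream.Position = 0; _writer.Write(3.14159d); }

    [Benchmark]
    public void WriteSingle() { _writer.BaseStream.Position = 0; _writer.Write(3.14159f); }

    [Benchmark]
    public void WriteInt16() { _writer.BaseStream.Position = 0; _writer.Write((short)12345); }
}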
@@ -15,32 +15,30 @@ namespace System.IO
//
public class BinaryWriter : IDisposable, IAsyncDisposable
{
    private const int MaxArrayPoolRentalSize = 1024 * 1024; // ArrayPool<T>.Shared allocates beyond this point
It would have been great if ArrayPool exposed this value as an internal const.
public void Ctor_Utf8EncodingDerivedTypeWithWrongCodePage_DoesNotUseFastUtf8()
{
    Mock<UTF8Encoding> mockEncoding = new Mock<UTF8Encoding>();
    mockEncoding.Setup(o => o.CodePage).Returns(65000 /* UTF-7 code page */);
I do not think it is worth it to pick up this heavy dependency here just to save something like 3 lines.
For cases where it is really worth it, we just need a way to conditionally disable tests that use techniques incompatible with the runtime mode (single file, trimming, no JIT, no reflection emit, no private reflection, etc.).
We use Moq in a few other test projects in this repo (see search results). I can remove the dependency for this project, but what does that mean for the general test framework guidance?
See Steve's comment at #47316 (comment) and my response there for a little more context on why I'm using (and mocking) the CodePage property in the first place. We could tweak that logic and render the whole thing moot.
I think using a heavy test framework for testing high-level libraries like Microsoft.Extensions.* is fine. I do not think it is good practice to use these heavy test frameworks for testing the core platform (i.e. stuff in CoreLib).

I agree that we should have this test (as long as the implementation stays what it is). Do you agree that the use of Moq saves you something like 3 lines of code in this case?
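For comparison, a Moq-free version could use a tiny derived type; Encoding.CodePage is virtual, so a subclass can simply report the UTF-7 code page. The type name below is hypothetical and not code from this PR:

using System.Text;

// Hypothetical Moq-free alternative: a UTF8Encoding-derived type that reports
// the UTF-7 code page, exercising the same "wrong code page" branch.
internal sealed class WrongCodePageUtf8Encoding : UTF8Encoding
{
    public override int CodePage => 65000; // UTF-7 code page
}

// The test would then construct the writer with:
//     new BinaryWriter(new MemoryStream(), new WrongCodePageUtf8Encoding())
// and assert that the fast UTF-8 path is not used.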
        Assert.Equal(3_000_000_000, outStream.Position);
    }
}
thank you for writing all the tests! and especially covering this particular edge case! 👍
// We prefer GetMaxByteCount because it's a constant-time operation.

int maxByteCount = _encoding.GetMaxByteCount(chars.Length);
if (maxByteCount <= MaxArrayPoolRentalSize)
Would it be possible (and worth it) to add a stackalloc code path for small char arrays, similar to what you have done for small strings in Write(string value)?
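A rough sketch of what such a fast path could look like inside BinaryWriter, assuming a hypothetical MaxStackAllocBytes threshold (this is not code from the PR, and argument validation is elided):

public virtual void Write(char[] chars)
{
    const int MaxStackAllocBytes = 256; // assumed threshold; would need benchmarking

    int maxByteCount = _encoding.GetMaxByteCount(chars.Length);
    if (maxByteCount <= MaxStackAllocBytes)
    {
        // Small input: encode onto the stack and write directly, skipping both
        // the per-call allocation and the ArrayPool rent/return round trip.
        Span<byte> buffer = stackalloc byte[MaxStackAllocBytes];
        int actualByteCount = _encoding.GetBytes(chars, buffer);
        OutStream.Write(buffer.Slice(0, actualByteCount));
        return;
    }

    // Larger inputs would continue down the existing ArrayPool rental path shown above.
}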
@@ -15,32 +15,30 @@ namespace System.IO
//
public class BinaryWriter : IDisposable, IAsyncDisposable
{
    private const int MaxArrayPoolRentalSize = 1024 * 1024; // ArrayPool<T>.Shared allocates beyond this point
This should not really depend on ArrayPool implementation details. It would be better to set this to the size where we start to see diminishing returns. I would expect it to be something like 64kB.
Also, very large buffers tend to not work that well since they do not fit into processor cache.
Would it make sense to refactor the Stream copy buffer size const out into an internal field and then reference it from here? That would give our code a single place to look when it needs to figure out a good default buffer size.
const int DefaultCopyBufferSize = 81920;
I am not sure. It is not clear whether DefaultCopyBufferSize is actually a good default buffer size.
The original justification for 81920 was that it is right under the default LOH threshold and good for the GC. This argument does not hold with ArrayPool, which was not used originally: ArrayPool will round the request up to the next power of 2, so 81920 turns into 128kB, which is right above the LOH threshold...
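A quick illustration of that rounding, assuming the default shared pool's power-of-two bucketing:

using System;
using System.Buffers;

// Renting 81,920 bytes from the default shared pool returns the next
// power-of-two bucket: a 131,072-byte array, which lands on the LOH
// (default threshold is about 85,000 bytes).
byte[] rented = ArrayPool<byte>.Shared.Rent(81920);
Console.WriteLine(rented.Length); // 131072
ArrayPool<byte>.Shared.Return(rented);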
FWIW, some time last year I tried decreasing the size so it would be back under the LOH threshold even after the pool rounded up, but there were quite measurable regressions for certain operations on microbenchmarks due to the much smaller buffer size, so I left it as is until we had a pressing scenario highlighting it was worth a change.
src/libraries/System.Private.CoreLib/src/System/IO/BinaryWriter.cs (outdated review thread, resolved)
{
#if !NETCOREAPP
    RuntimeHelpers.PrepareConstrainedRegions();
#endif
This is just a test file... is this really necessary?
No. But everything is non-shipping code until it gets copied & pasted into a shipping product. :)
src/libraries/System.Private.CoreLib/src/System/IO/BinaryWriter.cs (outdated review thread, resolved)
@adamsitnik I sent a PR at dotnet/performance#1639 with these tests, added the tests you suggested, and fleshed out a few others.
@adamsitnik @jkotas I didn't see much of a difference between using a 32kB vs. a 64kB max rental size. Perf results below.

The strlen = 8192, 64kB buffer test has a large stddev, so I'm not worrying too much about it. The strlen = 16384 test is very different between 32kB and 64kB because with 64kB the data fits into a single buffer, while with 32kB it does not, so we need to go down the slow "two-pass" path. I think this means we can stick with 64kB.
src/libraries/System.Private.CoreLib/src/System/IO/BinaryWriter.cs (outdated review thread, resolved)
Android test runner seems to be crashing on the test that attempts to allocate 6.5GB of memory. I have a log excerpt:

01-29 06:08:59.833 7947 8489 I DOTNET : Test collection for System.IO.Tests.BinaryWriter_EncodingTests
01-29 06:08:59.841 7947 8489 I DOTNET : [PASS] System.IO.Tests.BinaryWriter_EncodingTests.Ctor_NewUtf8Encoding_UsesFastUtf8(emitIdentifier: False, throwOnInvalidBytes: False)
01-29 06:08:59.842 7947 8489 I DOTNET : [PASS] System.IO.Tests.BinaryWriter_EncodingTests.Ctor_NewUtf8Encoding_UsesFastUtf8(emitIdentifier: True, throwOnInvalidBytes: True)
01-29 06:08:59.842 7947 8489 I DOTNET : [PASS] System.IO.Tests.BinaryWriter_EncodingTests.Ctor_NewUtf8Encoding_UsesFastUtf8(emitIdentifier: True, throwOnInvalidBytes: False)
01-29 06:08:59.842 7947 8489 I DOTNET : [PASS] System.IO.Tests.BinaryWriter_EncodingTests.Ctor_NewUtf8Encoding_UsesFastUtf8(emitIdentifier: False, throwOnInvalidBytes: True)
01-29 06:09:00.841 7947 8489 I DOTNET : [PASS] System.IO.Tests.BinaryWriter_EncodingTests.WriteChars_FastUtf8(stringLengthInChars: 262144)
01-29 06:09:00.965 7947 8489 I DOTNET : [PASS] System.IO.Tests.BinaryWriter_EncodingTests.WriteChars_FastUtf8(stringLengthInChars: 32768)
01-29 06:09:00.997 7947 8489 I DOTNET : [PASS] System.IO.Tests.BinaryWriter_EncodingTests.WriteChars_FastUtf8(stringLengthInChars: 8192)
01-29 06:09:01.000 7947 8489 I DOTNET : [PASS] System.IO.Tests.BinaryWriter_EncodingTests.WriteSingleChar_FastUtf8(ch: 'é')
01-29 06:09:01.001 7947 8489 I DOTNET : [PASS] System.IO.Tests.BinaryWriter_EncodingTests.WriteSingleChar_FastUtf8(ch: 'x')
01-29 06:09:01.002 7947 8489 I DOTNET : [PASS] System.IO.Tests.BinaryWriter_EncodingTests.WriteSingleChar_FastUtf8(ch: 'ℰ')
01-29 06:09:03.457 857 857 E lowmemorykiller: Kill 'com.google.android.ims' (5004), uid 10147, oom_adj 999 to free 37460kB
01-29 06:09:03.462 857 857 I lowmemorykiller: Reclaimed 37460kB, cache(318216kB) and free(46940kB)-reserved(45844kB) below min(322560kB) for oom_adj 950
01-29 06:09:03.474 1384 1721 D ConnectivityService: ConnectivityService NetworkRequestInfo binderDied(NetworkRequest [ TRACK_DEFAULT id=35, [ Capabilities: INTERNET&NOT_RESTRICTED&TRUSTED Uid: 10147] ], android.os.BinderProxy@6fbe136)
01-29 06:09:03.475 857 857 E lowmemorykiller: Kill 'com.qualcomm.telephony' (6855), uid 10087, oom_adj 999 to free 25108kB
01-29 06:09:03.475 857 857 I lowmemorykiller: Reclaimed 25108kB, cache(286964kB) and free(43812kB)-reserved(45844kB) below min(322560kB) for oom_adj 950
01-29 06:09:03.476 800 800 I Zygote : Process 5004 exited due to signal 9 (Killed)

Is there a recommendation for how I can work around this? Best I can think of is to skip the test on Android, but that doesn't seem like the right solution.
Why not skip that test on Android? Is it likely there will be an Android-specific bug in that one jumbo-allocating test?
@danmosemsft I ended up taking your advice. It just makes me feel dirty to hard-code a platform block rather than query the environment about whether something will succeed or fail.
@GrabYourPitchforks I had to do something similar in 5aef85a because Ubuntu 18.04 specifically was more aggressive with the OOM killer. I felt OK about it because the chances of an OS-specific Regex bug are very low, and lower still in these specific tests.
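For reference, one common shape for such a platform block in this repo's test projects looks roughly like the sketch below; the exact attribute, message, and test name here are assumptions rather than what this PR actually uses:

using Xunit;

public partial class BinaryWriterTests
{
    // Hypothetical: skip the jumbo-allocation test on Android, where the
    // low-memory killer terminates the test runner before it can finish.
    [Fact]
    [SkipOnPlatform(TestPlatforms.Android, "Allocating several GB trips the Android low-memory killer")]
    public void Write_VeryLargePayload_PositionIsCorrect()
    {
        // ... write past the 2 GB mark and assert the final outStream.Position ...
    }
}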
@GrabYourPitchforks The System.IO.Tests.BinaryWriterExtendedTests.WriteAsciiCharArray(StringLengthInChars: 32) benchmark appears to have regressed.
@adamsitnik I guess that's not too surprising for small inputs. The old code allocated small arrays every time, and the new code uses the array pool. There's certainly some overhead from fetching and returning pooled arrays. That said, I don't know a good non-breaking way to resolve this without reintroducing the intermediate allocations.

So while this might be a regression for this scenario, I think we can say that the regression is small (~10 - 20 ns fixed overhead) and the scenario is rare, so we may want to just swallow it.
This addresses some low-hanging fruit in the BinaryWriter class, reducing overall memory footprint and wall clock time for common operations. It also removes use of the unsafe keyword where possible.

I spoke offline with @adamsitnik about the consequences of changing a bunch of Stream.Write(byte[], int, int) call sites to Stream.Write(ROS<byte>) instead. Technically this could result in worse performance if the wrapped stream doesn't override the ROS<byte>-based overloads, since the default implementations of those overloads will rent from the array pool, copy, and forward to the array-based overloads. But honestly, it's 2021: most of the built-in stream types override these methods correctly, and we're already discussing ways to flag with warnings user-defined types which don't override these methods. I don't think we should handicap the common case of using fully-compliant built-in stream types just on the off-chance somebody might have used a custom type.

Perf results:
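To make the overload concern above concrete, here is a rough sketch (a hypothetical type, not code from this PR) of a user-defined stream that only overrides the array-based Write. Callers of Write(ReadOnlySpan<byte>) on such a type fall back to Stream's default implementation, which rents a pooled array, copies, and forwards to the array-based overload:

using System;
using System.IO;

// Hypothetical wrapper: only the array-based Write is overridden, so the
// span-based Write(ReadOnlySpan<byte>) uses Stream's default rent/copy/forward
// implementation before reaching the method below.
public sealed class ArrayOnlyWriteStream : Stream
{
    private readonly Stream _inner;

    public ArrayOnlyWriteStream(Stream inner) => _inner = inner;

    public override void Write(byte[] buffer, int offset, int count)
        => _inner.Write(buffer, offset, count);

    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => _inner.Length;
    public override long Position
    {
        get => _inner.Position;
        set => throw new NotSupportedException();
    }
    public override void Flush() => _inner.Flush();
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}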
The WriteChars tests for small values require further discussion. The original implementation allocates a new array on each invocation, while the new implementation uses the array pool. The indirection through the array pool adds a few nanoseconds of fixed overhead, which causes the ratio difference between the old and new code to be exaggerated. I believe the new code is more appropriate for the common case since it reduces the overall memory footprint of the application, even with this overhead.

There is also some overhead due to the delegate invocation in the workhorse routine. When the delegate is first created, it points to a stub routine rather than directly to the target method, adding a few extra jumps. This is a long-standing behavioral nit in delegates, and if it's solved all-up in the runtime then we'll just get the benefits here for free.
Benchmark code below.