BinaryWriter perf and memory improvements #47316

Merged · 8 commits · Jan 29, 2021

Conversation

GrabYourPitchforks (Member, PR author):

This addresses some low-hanging fruit in the BinaryWriter class, reducing overall memory footprint and wall clock time for common operations. It also removes use of the unsafe keyword where possible.

I spoke offline with @adamsitnik about the consequences of changing a bunch of Stream.Write(byte[], int, int) call sites to call Stream.Write(ROS<byte>) instead. Technically this could result in worse performance if the wrapped stream doesn't override the ROS<byte>-based overloads, since the default implementations of those overloads will rent from the array pool, copy, and forward to the array-based overloads. But honestly, it's 2021, most of the built-in stream types override these methods correctly, and we're already discussing ways to warn on user-defined types that don't override them. I don't think we should handicap the common case of using fully-compliant built-in stream types just on the off-chance somebody might have used a custom type.
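
For context, the base-class fallback being described works roughly like this (a from-memory sketch, not the exact corelib source; assumes `using System.Buffers;`):

```cs
// Rough shape of Stream.Write(ReadOnlySpan<byte>) when a derived stream only
// overrides Write(byte[], int, int). Illustrative sketch, not the real corelib code.
public virtual void Write(ReadOnlySpan<byte> buffer)
{
    byte[] array = ArrayPool<byte>.Shared.Rent(buffer.Length);
    try
    {
        buffer.CopyTo(array);            // extra copy
        Write(array, 0, buffer.Length);  // forwards to the array-based overload
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(array);
    }
}
```

Streams that override the span-based overload directly avoid both the rental and the copy, which is what this change is betting on.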

Perf results:

| Method | Job | Toolchain | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| DefaultCtor | Job-LUELTV | master | 28.972 ns | 0.5700 ns | 0.5332 ns | 1.00 | 0.0181 | - | - | 152 B |
| DefaultCtor | Job-GIFVTY | pr | 13.240 ns | 0.3233 ns | 0.3594 ns | 0.46 | 0.0048 | - | - | 40 B |
| WriteUInt32 | Job-LUELTV | master | 2.665 ns | 0.0159 ns | 0.0133 ns | 1.00 | - | - | - | - |
| WriteUInt32 | Job-GIFVTY | pr | 2.090 ns | 0.0352 ns | 0.0294 ns | 0.78 | - | - | - | - |
| WriteUInt64 | Job-LUELTV | master | 1.876 ns | 0.0130 ns | 0.0109 ns | 1.00 | - | - | - | - |
| WriteUInt64 | Job-GIFVTY | pr | 2.101 ns | 0.0137 ns | 0.0121 ns | 1.12 | - | - | - | - |

| Method | Job | Toolchain | StringLengthInChars | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WriteCharArray | Job-XZHCJM | master | 4 | 24.38 ns | 0.208 ns | 0.194 ns | 24.35 ns | 1.00 | 0.00 | 0.0038 | - | - | 32 B |
| WriteCharArray | Job-EAPOXA | pr | 4 | 35.05 ns | 0.113 ns | 0.100 ns | 35.02 ns | 1.44 | 0.01 | - | - | - | - |
| WriteString | Job-XZHCJM | master | 4 | 20.39 ns | 0.066 ns | 0.058 ns | 20.39 ns | 1.00 | 0.00 | - | - | - | - |
| WriteString | Job-EAPOXA | pr | 4 | 14.80 ns | 0.034 ns | 0.026 ns | 14.81 ns | 0.73 | 0.00 | - | - | - | - |
| WriteCharArray | Job-XZHCJM | master | 16 | 27.92 ns | 0.176 ns | 0.165 ns | 27.93 ns | 1.00 | 0.00 | 0.0048 | - | - | 40 B |
| WriteCharArray | Job-EAPOXA | pr | 16 | 35.11 ns | 0.115 ns | 0.108 ns | 35.14 ns | 1.26 | 0.01 | - | - | - | - |
| WriteString | Job-XZHCJM | master | 16 | 24.15 ns | 0.088 ns | 0.078 ns | 24.13 ns | 1.00 | 0.00 | - | - | - | - |
| WriteString | Job-EAPOXA | pr | 16 | 16.96 ns | 0.111 ns | 0.098 ns | 16.93 ns | 0.70 | 0.00 | - | - | - | - |
| WriteCharArray | Job-XZHCJM | master | 512 | 75.80 ns | 0.923 ns | 0.864 ns | 75.51 ns | 1.00 | 0.00 | 0.0640 | - | - | 536 B |
| WriteCharArray | Job-EAPOXA | pr | 512 | 52.93 ns | 0.242 ns | 0.227 ns | 52.92 ns | 0.70 | 0.01 | - | - | - | - |
| WriteString | Job-XZHCJM | master | 512 | 137.70 ns | 0.903 ns | 0.845 ns | 137.74 ns | 1.00 | 0.00 | - | - | - | - |
| WriteString | Job-EAPOXA | pr | 512 | 52.54 ns | 0.537 ns | 0.449 ns | 52.53 ns | 0.38 | 0.00 | - | - | - | - |
| WriteCharArray | Job-XZHCJM | master | 10000 | 1,044.99 ns | 26.915 ns | 78.936 ns | 1,023.76 ns | 1.00 | 0.00 | 1.1959 | - | - | 10024 B |
| WriteCharArray | Job-EAPOXA | pr | 10000 | 358.84 ns | 1.177 ns | 1.044 ns | 358.70 ns | 0.35 | 0.02 | - | - | - | - |
| WriteString | Job-XZHCJM | master | 10000 | 2,748.17 ns | 24.340 ns | 22.767 ns | 2,748.62 ns | 1.00 | 0.00 | - | - | - | - |
| WriteString | Job-EAPOXA | pr | 10000 | 352.63 ns | 1.601 ns | 1.498 ns | 353.18 ns | 0.13 | 0.00 | - | - | - | - |
| WriteCharArray | Job-XZHCJM | master | 100000 | 39,499.04 ns | 784.557 ns | 1,894.794 ns | 39,615.28 ns | 1.00 | 0.00 | 31.1890 | 31.1890 | 31.1890 | 100025 B |
| WriteCharArray | Job-EAPOXA | pr | 100000 | 3,193.04 ns | 18.645 ns | 17.441 ns | 3,192.28 ns | 0.08 | 0.00 | - | - | - | - |
| WriteString | Job-XZHCJM | master | 100000 | 27,025.32 ns | 139.589 ns | 123.742 ns | 26,993.91 ns | 1.00 | 0.00 | - | - | - | - |
| WriteString | Job-EAPOXA | pr | 100000 | 3,483.60 ns | 20.989 ns | 19.633 ns | 3,485.38 ns | 0.13 | 0.00 | - | - | - | - |
| WriteCharArray | Job-XZHCJM | master | 500000 | 90,273.26 ns | 1,036.222 ns | 918.584 ns | 90,217.61 ns | 1.00 | 0.00 | 113.7695 | 113.7695 | 113.7695 | 500021 B |
| WriteCharArray | Job-EAPOXA | pr | 500000 | 17,491.07 ns | 69.793 ns | 65.285 ns | 17,526.21 ns | 0.19 | 0.00 | - | - | - | 48 B |
| WriteString | Job-XZHCJM | master | 500000 | 137,453.00 ns | 649.887 ns | 607.905 ns | 137,337.48 ns | 1.00 | 0.00 | - | - | - | - |
| WriteString | Job-EAPOXA | pr | 500000 | 30,467.77 ns | 182.114 ns | 170.350 ns | 30,530.05 ns | 0.22 | 0.00 | - | - | - | 48 B |
| WriteCharArray | Job-XZHCJM | master | 2000000 | 766,902.63 ns | 15,059.390 ns | 23,445.656 ns | 769,048.83 ns | 1.00 | 0.00 | 95.7031 | 95.7031 | 95.7031 | 2000032 B |
| WriteCharArray | Job-EAPOXA | pr | 2000000 | 71,244.09 ns | 236.232 ns | 209.414 ns | 71,242.82 ns | 0.09 | 0.00 | - | - | - | 160 B |
| WriteString | Job-XZHCJM | master | 2000000 | 645,989.09 ns | 4,823.895 ns | 4,512.275 ns | 647,063.96 ns | 1.00 | 0.00 | - | - | - | - |
| WriteString | Job-EAPOXA | pr | 2000000 | 126,507.40 ns | 719.938 ns | 673.430 ns | 126,473.10 ns | 0.20 | 0.00 | - | - | - | 160 B |

The WriteChars tests for small values require further discussion. The original implementation allocates a new array on each invocation, while the new implementation uses the array pool. The indirection through the array pool adds a few nanoseconds of fixed overhead, which exaggerates the ratio difference between the old and new code. I believe the new code is more appropriate for the common case since it reduces the overall memory footprint of the application, even with this overhead. There is also some overhead due to the delegate invocation in the workhorse routine: when the delegate is first created, it points to a stub routine rather than directly to the target method, adding a few extra jumps. This is a long-standing behavioral nit in delegates, and if it's solved all-up in the runtime then we'll just get the benefits here for free.
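
In outline, the pooled-buffer shape being described is the following (a sketch only; `_encoding` and `OutStream` stand in for the writer's fields, and the actual code in the PR differs):

```cs
// Old shape (sketch): byte[] buffer = _encoding.GetBytes(chars); OutStream.Write(buffer, 0, buffer.Length);
// New shape (sketch): rent, encode, write, return. No per-call allocation, but a small
// fixed cost for the pool round-trip, which dominates for very small inputs.
byte[] rented = ArrayPool<byte>.Shared.Rent(_encoding.GetMaxByteCount(chars.Length));
try
{
    int actualByteCount = _encoding.GetBytes(chars, rented);
    OutStream.Write(rented, 0, actualByteCount);
}
finally
{
    ArrayPool<byte>.Shared.Return(rented);
}
```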

Benchmark code below.

```cs
using BenchmarkDotNet.Attributes;
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

namespace ConsoleAppBenchmark
{
    [MemoryDiagnoser]
    public class BinaryWriterRunner
    {
        private BinaryWriter _bw;

        [GlobalSetup]
        public void Setup()
        {
            _bw = new BinaryWriter(new NullWriteStream());
        }

        [Benchmark]
        public BinaryWriter DefaultCtor() => new BinaryWriter(Stream.Null);

        [Benchmark]
        public void WriteUInt32()
        {
            _bw.Write((uint)0xdeadbeef);
        }

        [Benchmark]
        public void WriteUInt64()
        {
            _bw.Write((ulong)0xdeadbeef_aabbccdd);
        }
    }

    [MemoryDiagnoser]
    public class BinaryWriterRunner_Extended
    {
        private string _input;
        private char[] _inputAsChars;
        private readonly BinaryWriter _bw;

        [Params(4, 16, 512, 10_000, 100_000, 500_000, 2_000_000)]
        public int StringLengthInChars;

        public BinaryWriterRunner_Extended()
        {
            _bw = new BinaryWriter(new NullWriteStream());
        }

        [GlobalSetup]
        public void Setup()
        {
            _input = new string('x', StringLengthInChars);
            _inputAsChars = _input.ToCharArray();
        }

        [Benchmark]
        public void WriteCharArray()
        {
            _bw.Write(_inputAsChars);
        }

        [Benchmark]
        public void WriteString()
        {
            _bw.Write(_input);
        }
    }

    internal class NullWriteStream : Stream
    {
        public override bool CanRead => false;

        public override bool CanSeek => false;

        public override bool CanWrite => true;

        public override long Length => throw new NotSupportedException();

        public override long Position { get => throw new NotSupportedException(); set => throw new NotSupportedException(); }

        public override void Flush() { }

        public override int Read(byte[] buffer, int offset, int count)
        {
            throw new NotSupportedException();
        }

        public override long Seek(long offset, SeekOrigin origin)
        {
            throw new NotSupportedException();
        }

        public override void SetLength(long value)
        {
            throw new NotSupportedException();
        }

        public override void Write(byte[] buffer, int offset, int count) { }

        public override void Write(ReadOnlySpan<byte> buffer) { }

        public override Task WriteAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
        {
            return Task.CompletedTask;
        }

        public override void WriteByte(byte value) { }

        public override ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken = default)
        {
            return ValueTask.CompletedTask;
        }
    }
}
```

@adamsitnik (Member) left a comment:

@GrabYourPitchforks thank you for another amazing perf improvement!

Could you please extend the benchmarks with Write(double), Write(float), and Write(short), and contribute them to the performance repo? If we merge the benchmarks before this change, the perf infra will show improvements (or regressions) for x64, x86, and ARM64.
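
For reference, the requested additions would be small benchmarks along these lines (names and literal values here are illustrative, not necessarily the ones that landed in dotnet/performance), added to the BinaryWriter benchmark class shown later in this thread:

```cs
// Hypothetical additions; _bw is the BinaryWriter field from the existing benchmark class.
[Benchmark]
public void WriteDouble() => _bw.Write(3.14159);

[Benchmark]
public void WriteSingle() => _bw.Write(3.14159f);

[Benchmark]
public void WriteInt16() => _bw.Write((short)0x1234);
```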

```
@@ -15,32 +15,30 @@ namespace System.IO
    //
    public class BinaryWriter : IDisposable, IAsyncDisposable
    {
        private const int MaxArrayPoolRentalSize = 1024 * 1024; // ArrayPool<T>.Shared allocates beyond this point
```
Member:

It would have been great if ArrayPool exposed this value as an internal const.

```cs
public void Ctor_Utf8EncodingDerivedTypeWithWrongCodePage_DoesNotUseFastUtf8()
{
    Mock<UTF8Encoding> mockEncoding = new Mock<UTF8Encoding>();
    mockEncoding.Setup(o => o.CodePage).Returns(65000 /* UTF-7 code page */);
```
Member:

I am ok with using Mock<T> as long as we don't have any AOT test suite that is going to fail.

(image attachment omitted)

Member:

I do not think it is worth it to pick up this heavy dependency here to just save like 3 lines.

For cases where it is really worth it, we just need a way to conditionally disable tests that use techniques incompatible with the runtime mode (single file, trimming, no JIT, no reflection emit, no private reflection, etc.).

GrabYourPitchforks (author):

We use Moq in a few other test projects in this repo (see search results). I can remove the dependency for this project, but what does that mean for the general test framework guidance?

See Steve's comment at #47316 (comment) and my response there for a little more context on why I'm using (and mocking) the CodePage property in the first place. We could tweak that logic and render the whole thing moot.

Member:

I think using a heavy test framework for testing high-level libraries like Microsoft.Extensions.* is fine.

I do not think it is good practice to use these heavy test frameworks for testing the core platform (i.e., stuff in CoreLib).

I agree that we should have this test (as long as the implementation stays what it is). Do you agree that the use of Moq saves you like 3 lines of code in this case?
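
For illustration, the Moq-free alternative under discussion would be a small private type along these lines (hypothetical code, not what the PR contains):

```cs
// Hypothetical stand-in for the mocked encoding: Encoding.CodePage is virtual,
// so a derived type can simply report a non-UTF-8 code page.
private sealed class Utf8EncodingWithWrongCodePage : UTF8Encoding
{
    public override int CodePage => 65000; // UTF-7 code page
}
```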


```cs
Assert.Equal(3_000_000_000, outStream.Position);
}
}
```
Member:

thank you for writing all the tests! and especially covering this particular edge case! 👍

```cs
// We prefer GetMaxByteCount because it's a constant-time operation.

int maxByteCount = _encoding.GetMaxByteCount(chars.Length);
if (maxByteCount <= MaxArrayPoolRentalSize)
```
Member:

Would it be possible (and worth it) to add a stackalloc code path for small char arrays, similar to what you have done for small strings in Write(string value)?
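
Something like the following is presumably what's being suggested (a sketch; the threshold and member names are assumptions, not code from the PR):

```cs
// Possible stackalloc fast path for small char counts, mirroring the small-string path.
// MaxStackallocBytes is a hypothetical constant, not something from the PR.
const int MaxStackallocBytes = 256;
if (_encoding.GetMaxByteCount(chars.Length) <= MaxStackallocBytes)
{
    Span<byte> buffer = stackalloc byte[MaxStackallocBytes];
    int actualByteCount = _encoding.GetBytes(chars, buffer);
    OutStream.Write(buffer.Slice(0, actualByteCount));
}
```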

```
@@ -15,32 +15,30 @@ namespace System.IO
    //
    public class BinaryWriter : IDisposable, IAsyncDisposable
    {
        private const int MaxArrayPoolRentalSize = 1024 * 1024; // ArrayPool<T>.Shared allocates beyond this point
```
Member:

This should not really depend on ArrayPool implementation details. It would be better to set this to the size where we start to see diminishing returns. I would expect it to be around 64kB.

Member:

Also, very large buffers tend to not work that well since they do not fit into processor cache.

GrabYourPitchforks (author):

Would it make sense to refactor the Stream copy buffer size const out into an internal field, then reference it from here? That would provide a single place to look across our code when we need to figure out a good default buffer size.

```cs
const int DefaultCopyBufferSize = 81920;
```

Member:

I am not sure. It is not clear whether the DefaultCopyBufferSize is actually a good default buffer size.

Member:

The original justification for 81920 was that it is right under the default LOH threshold and therefore good for the GC. That argument does not hold with ArrayPool, which was not used originally: ArrayPool will round the request up to the next power of 2, so 81920 turns into 128k, which is right above the LOH threshold...
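
Spelled out (ArrayPool<T>.Shared sizes its buckets in powers of two, and the default LOH threshold is 85,000 bytes):

```cs
using System;

static class LohRoundingMath
{
    // Tiny illustration of the rounding argument above; not code from the PR.
    static int RoundUpToPowerOfTwo(int value)
    {
        int result = 1;
        while (result < value) result <<= 1;
        return result;
    }

    static void Main()
    {
        Console.WriteLine(RoundUpToPowerOfTwo(81_920)); // 131072 (128k): above the 85,000-byte LOH threshold
        Console.WriteLine(RoundUpToPowerOfTwo(65_536)); // 65536  (64k):  below it
    }
}
```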

Member:

FWIW, some time last year I tried decreasing the size so it would be back under the LOH threshold even after the pool rounded up, but there were quite measurable regressions for certain operations on microbenchmarks due to the much smaller buffer size, so I left it as is until we had a pressing scenario highlighting it was worth a change.

```cs
{
#if !NETCOREAPP
    RuntimeHelpers.PrepareConstrainedRegions();
#endif
```
Member:

This is just a test file... is this really necessary?

GrabYourPitchforks (author):

No. But everything is non-shipping code until it gets copied & pasted into a shipping product. :)

@GrabYourPitchforks (author):

@adamsitnik I sent a PR at dotnet/performance#1639 with these tests, added the ones you suggested, and fleshed out a few others.

@GrabYourPitchforks (author):

@adamsitnik @jkotas I didn't see much of a difference between using a 32k vs. a 64k max rental size. Perf results below.

| Method | Toolchain | StringLengthInChars | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WriteCharArray | pr32k | 4 | 33.00 ns | 1.138 ns | 0.062 ns | 0.99 | 0.01 | - | - | - | - |
| WriteCharArray | pr64k | 4 | 33.27 ns | 2.128 ns | 0.117 ns | 1.00 | 0.00 | - | - | - | - |
| WriteString | pr32k | 4 | 12.04 ns | 0.612 ns | 0.034 ns | 0.99 | 0.00 | - | - | - | - |
| WriteString | pr64k | 4 | 12.12 ns | 1.073 ns | 0.059 ns | 1.00 | 0.00 | - | - | - | - |
| WriteCharArray | pr32k | 16 | 34.69 ns | 0.626 ns | 0.034 ns | 1.00 | 0.01 | - | - | - | - |
| WriteCharArray | pr64k | 16 | 34.85 ns | 7.338 ns | 0.402 ns | 1.00 | 0.00 | - | - | - | - |
| WriteString | pr32k | 16 | 13.27 ns | 0.396 ns | 0.022 ns | 1.00 | 0.00 | - | - | - | - |
| WriteString | pr64k | 16 | 13.23 ns | 0.745 ns | 0.041 ns | 1.00 | 0.00 | - | - | - | - |
| WriteCharArray | pr32k | 512 | 59.53 ns | 5.954 ns | 0.326 ns | 1.00 | 0.04 | - | - | - | - |
| WriteCharArray | pr64k | 512 | 59.61 ns | 37.954 ns | 2.080 ns | 1.00 | 0.00 | - | - | - | - |
| WriteString | pr32k | 512 | 51.82 ns | 4.056 ns | 0.222 ns | 0.99 | 0.01 | - | - | - | - |
| WriteString | pr64k | 512 | 52.21 ns | 2.390 ns | 0.131 ns | 1.00 | 0.00 | - | - | - | - |
| WriteCharArray | pr32k | 8192 | 302.99 ns | 9.689 ns | 0.531 ns | 1.02 | 0.00 | - | - | - | - |
| WriteCharArray | pr64k | 8192 | 295.62 ns | 8.826 ns | 0.484 ns | 1.00 | 0.00 | - | - | - | - |
| WriteString | pr32k | 8192 | 296.68 ns | 47.442 ns | 2.600 ns | 0.93 | 0.11 | - | - | - | - |
| WriteString | pr64k | 8192 | 320.96 ns | 701.383 ns | 38.445 ns | 1.00 | 0.00 | - | - | - | - |
| WriteCharArray | pr32k | 16384 | 632.55 ns | 9.539 ns | 0.523 ns | 1.03 | 0.01 | 0.0057 | - | - | 48 B |
| WriteCharArray | pr64k | 16384 | 614.72 ns | 156.201 ns | 8.562 ns | 1.00 | 0.00 | - | - | - | - |
| WriteString | pr32k | 16384 | 1,054.91 ns | 93.902 ns | 5.147 ns | 1.67 | 0.01 | 0.0057 | - | - | 48 B |
| WriteString | pr64k | 16384 | 630.59 ns | 21.631 ns | 1.186 ns | 1.00 | 0.00 | - | - | - | - |
| WriteCharArray | pr32k | 131072 | 4,646.23 ns | 182.919 ns | 10.026 ns | 1.05 | 0.00 | 0.0153 | - | - | 160 B |
| WriteCharArray | pr64k | 131072 | 4,427.34 ns | 126.210 ns | 6.918 ns | 1.00 | 0.00 | 0.0153 | - | - | 160 B |
| WriteString | pr32k | 131072 | 8,025.57 ns | 377.238 ns | 20.678 ns | 1.00 | 0.00 | 0.0153 | - | - | 160 B |
| WriteString | pr64k | 131072 | 8,025.43 ns | 364.648 ns | 19.988 ns | 1.00 | 0.00 | 0.0153 | - | - | 160 B |
| WriteCharArray | pr32k | 1048576 | 38,643.52 ns | 9,283.063 ns | 508.836 ns | 1.03 | 0.02 | - | - | - | 160 B |
| WriteCharArray | pr64k | 1048576 | 37,528.87 ns | 6,376.444 ns | 349.514 ns | 1.00 | 0.00 | - | - | - | 160 B |
| WriteString | pr32k | 1048576 | 68,994.16 ns | 4,548.266 ns | 249.306 ns | 1.03 | 0.01 | - | - | - | 160 B |
| WriteString | pr64k | 1048576 | 67,256.93 ns | 3,381.513 ns | 185.352 ns | 1.00 | 0.00 | - | - | - | 161 B |

The strlen = 8192, 64kb buffer test has a large stddev, so I'm not worrying too much about it. The strlen = 16384 test differs so much between 32kb and 64kb because with 64kb the data fits into a single buffer, while with 32kb it does not, so we need to go down the slow "two-pass" path.
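
The boundary can be seen directly from the worst-case byte budget (the values below are what UTF8Encoding.GetMaxByteCount reports on current .NET; the approximate sizes in the comments are mine):

```cs
using System;
using System.Text;

class RentalSizeBoundary
{
    static void Main()
    {
        // Worst-case UTF-8 budget is roughly 3 bytes per char plus a little slack.
        Console.WriteLine(Encoding.UTF8.GetMaxByteCount(8_192));  // ~24 KB: fits a 32 KB rental either way
        Console.WriteLine(Encoding.UTF8.GetMaxByteCount(16_384)); // ~48 KB: needs the 64 KB rental,
                                                                  // otherwise the encode is split across buffers
    }
}
```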

I think this means we can stick with 64k.

@GrabYourPitchforks (author) commented Jan 29, 2021:

The Android test runner seems to be crashing on the test that attempts to allocate 6.5GB of memory. I have a try / catch (OOM) around the test itself to bail if we're in a low-mem condition, but it looks like the code is dying before it even gets to that point. The watchdog appears to be killing other processes on the box, which eventually results in the entire test infrastructure falling over.

```
01-29 06:08:59.833  7947  8489 I DOTNET  : Test collection for System.IO.Tests.BinaryWriter_EncodingTests
01-29 06:08:59.841  7947  8489 I DOTNET  : 	[PASS] System.IO.Tests.BinaryWriter_EncodingTests.Ctor_NewUtf8Encoding_UsesFastUtf8(emitIdentifier: False, throwOnInvalidBytes: False)
01-29 06:08:59.842  7947  8489 I DOTNET  : 	[PASS] System.IO.Tests.BinaryWriter_EncodingTests.Ctor_NewUtf8Encoding_UsesFastUtf8(emitIdentifier: True, throwOnInvalidBytes: True)
01-29 06:08:59.842  7947  8489 I DOTNET  : 	[PASS] System.IO.Tests.BinaryWriter_EncodingTests.Ctor_NewUtf8Encoding_UsesFastUtf8(emitIdentifier: True, throwOnInvalidBytes: False)
01-29 06:08:59.842  7947  8489 I DOTNET  : 	[PASS] System.IO.Tests.BinaryWriter_EncodingTests.Ctor_NewUtf8Encoding_UsesFastUtf8(emitIdentifier: False, throwOnInvalidBytes: True)
01-29 06:09:00.841  7947  8489 I DOTNET  : 	[PASS] System.IO.Tests.BinaryWriter_EncodingTests.WriteChars_FastUtf8(stringLengthInChars: 262144)
01-29 06:09:00.965  7947  8489 I DOTNET  : 	[PASS] System.IO.Tests.BinaryWriter_EncodingTests.WriteChars_FastUtf8(stringLengthInChars: 32768)
01-29 06:09:00.997  7947  8489 I DOTNET  : 	[PASS] System.IO.Tests.BinaryWriter_EncodingTests.WriteChars_FastUtf8(stringLengthInChars: 8192)
01-29 06:09:01.000  7947  8489 I DOTNET  : 	[PASS] System.IO.Tests.BinaryWriter_EncodingTests.WriteSingleChar_FastUtf8(ch: 'é')
01-29 06:09:01.001  7947  8489 I DOTNET  : 	[PASS] System.IO.Tests.BinaryWriter_EncodingTests.WriteSingleChar_FastUtf8(ch: 'x')
01-29 06:09:01.002  7947  8489 I DOTNET  : 	[PASS] System.IO.Tests.BinaryWriter_EncodingTests.WriteSingleChar_FastUtf8(ch: 'ℰ')
01-29 06:09:03.457   857   857 E lowmemorykiller: Kill 'com.google.android.ims' (5004), uid 10147, oom_adj 999 to free 37460kB
01-29 06:09:03.462   857   857 I lowmemorykiller: Reclaimed 37460kB, cache(318216kB) and free(46940kB)-reserved(45844kB) below min(322560kB) for oom_adj 950
01-29 06:09:03.474  1384  1721 D ConnectivityService: ConnectivityService NetworkRequestInfo binderDied(NetworkRequest [ TRACK_DEFAULT id=35, [ Capabilities: INTERNET&NOT_RESTRICTED&TRUSTED Uid: 10147] ], android.os.BinderProxy@6fbe136)
01-29 06:09:03.475   857   857 E lowmemorykiller: Kill 'com.qualcomm.telephony' (6855), uid 10087, oom_adj 999 to free 25108kB
01-29 06:09:03.475   857   857 I lowmemorykiller: Reclaimed 25108kB, cache(286964kB) and free(43812kB)-reserved(45844kB) below min(322560kB) for oom_adj 950
01-29 06:09:03.476   800   800 I Zygote  : Process 5004 exited due to signal 9 (Killed)
```

Is there a recommendation for how I can work around this? The best I can think of is to skip the test on Android, but that doesn't seem like the right solution.

@danmoseley (Member):

Why not skip that test on Android? Is it likely there will be an Android-specific bug in that one jumbo-allocating test?

@adamsitnik added this to the 6.0.0 milestone Jan 29, 2021
@GrabYourPitchforks (author):

@danmosemsft I ended up taking your advice. It just makes me feel dirty to hard-code a platform block rather than to query the environment about whether something will succeed or fail.
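
For reference, the kind of hard-coded platform block being described is typically spelled like this in dotnet/runtime tests (the attribute, condition, and test name here are illustrative guesses, not necessarily what the PR used):

```cs
// Illustrative only: keep the jumbo-allocation test off Android, where the
// lowmemorykiller reaps the process before the OutOfMemoryException handler can run.
[Fact]
[PlatformSpecific(~TestPlatforms.Android)]
public void WriteChars_VeryLargePayload()
{
    // test body elided
}
```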

@danmoseley (Member):

@GrabYourPitchforks I had to do something similar in 5aef85a because Ubuntu 18.04 specifically was more aggressive with the OOM killer. I felt OK about it because the chances of an OS-specific Regex bug are very low, and in these specific tests, even lower.

@GrabYourPitchforks merged commit 55a5a0c into dotnet:master Jan 29, 2021
@GrabYourPitchforks deleted the binarywriter branch January 29, 2021 20:06
@ghost locked as resolved and limited conversation to collaborators Feb 28, 2021
@adamsitnik (Member):

@GrabYourPitchforks The WriteAsciiCharArray benchmark has regressed for small inputs; I assume this was a by-design tradeoff?

System.IO.Tests.BinaryWriterExtendedTests.WriteAsciiCharArray(StringLengthInChars: 32)

| Result | Base | Diff | Ratio | Alloc Delta | Modality | Operating System | Bit | Processor Name | Base V | Diff V |
|---|---|---|---|---|---|---|---|---|---|---|
| Same | 38.12 | 41.05 | 0.93 | -56 | | Windows 10.0.19042 | X64 | AMD Ryzen Threadripper 2990WX | 5.0.421.11614 | 6.0.21.16201 |
| Slower | 27.45 | 38.27 | 0.72 | -56 | | Windows 10.0.21337 | X64 | AMD Ryzen 9 3900X | 5.0.421.11614 | 6.0.21.16701 |
| Slower | 27.98 | 40.61 | 0.69 | -56 | | Windows 10.0.21337 | X64 | AMD Ryzen Threadripper 3990X | 5.0.421.11614 | 6.0.21.16701 |
| Slower | 34.75 | 45.44 | 0.76 | -56 | | Windows 10.0.18363.1440 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.421.11614 | 6.0.21.16201 |
| Same | 316.55 | 223.49 | 1.42 | -56 | bimodal | Windows 10.0.21337 | X64 | Intel Core i5-4300U CPU 1.90GHz (Haswell) | 5.0.421.11614 | 6.0.21.16701 |
| Slower | 30.64 | 38.92 | 0.79 | -56 | | Windows 10.0.19042 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | 5.0.421.11614 | 6.0.21.16201 |
| Slower | 31.43 | 41.54 | 0.76 | -56 | | Windows 10.0.19042 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 5.0.421.11614 | 6.0.21.16201 |
| Same | 41.36 | 45.42 | 0.91 | -56 | bimodal | Windows 10.0.19042 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | 5.0.421.11614 | 6.0.21.16408 |
| Slower | 28.70 | 37.92 | 0.76 | -56 | | Windows 10.0.19042 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) | 5.0.421.11614 | 6.0.21.16201 |
| Slower | 28.69 | 34.34 | 0.84 | -56 | | Windows 10.0.19042 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) | 5.0.421.11614 | 6.0.21.16408 |
| Same | 153.43 | 166.67 | 0.92 | -56 | | Windows 10.0.19042 | X64 | Intel Atom x7-Z8700 CPU 1.60GHz | 5.0.421.11614 | 6.0.21.16309 |
| Slower | 39.35 | 45.83 | 0.86 | -56 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.421.11614 | 6.0.21.16309 |
| Slower | 37.91 | 49.86 | 0.76 | -56 | | alpine 3.11 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 5.0.421.11614 | 6.0.21.16601 |
| Slower | 113.27 | 135.60 | 0.84 | -56 | | ubuntu 16.04 | Arm64 | Unknown processor | 5.0.421.11614 | 6.0.21.17806 |
| Slower | 115.08 | 139.23 | 0.83 | -56 | | ubuntu 16.04 | Arm64 | Unknown processor | 5.0.421.11614 | 6.0.21.17806 |
| Slower | 49.36 | 66.11 | 0.75 | -56 | | Windows 10.0.19042 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.421.11614 | 6.0.21.16309 |
| Slower | 51.67 | 67.08 | 0.77 | -56 | | Windows 10.0.19042 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.421.11614 | 6.0.21.16201 |
| Slower | 44.80 | 55.94 | 0.80 | -44 | | Windows 10.0.18363.1440 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.421.11614 | 6.0.21.16701 |
| Slower | 94.83 | 108.70 | 0.87 | -44 | | Windows 10.0.19042.867 | Arm | Microsoft SQ1 3.0 GHz | 5.0.421.11614 | 6.0.21.17905 |
| Slower | 53.46 | 62.62 | 0.85 | -56 | | macOS 11.2 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 5.0.421.11614 | 6.0.21.16408 |
| Slower | 54.03 | 61.54 | 0.88 | -56 | | macOS 11.2.3 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 5.0.421.11614 | 6.0.21.16601 |
| Slower | 44.18 | 51.20 | 0.86 | -56 | | macOS 11.2.2 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) | 5.0.421.11614 | 6.0.21.16601 |
| Slower | 45.26 | 57.47 | 0.79 | -56 | | macOS Mojave 10.14.5 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.421.11614 | 6.0.21.16309 |

@GrabYourPitchforks (author):

@adamsitnik I guess that's not too surprising for small inputs. The old code allocated small arrays every time, and the new code uses the array pool. There's certainly some overhead from fetching and returning pooled arrays.

That said, I don't know a good non-breaking way to resolve this without reintroducing the intermediate allocations. And it looks like BinaryWriter.Write(char[]) is a very infrequently used API anyway. (This makes sense given that the calling pattern to read the data back via BinaryReader.Read(char[], ...) is sloppy, so it doesn't surprise me that very few people actually do this in practice.)

So while this might be a regression for this scenario, I think we can say that the regression is small (~10-20 ns of fixed overhead) and the scenario is rare, so we may want to just swallow it.

Labels: area-System.IO, tenet-performance (Performance related issue)