
Perf: use manual masking on state flags to avoid boxing #1197

Merged
merged 2 commits into dotnet:main on Feb 11, 2022

Conversation

@Wraith2 (Contributor) commented Jul 31, 2021

While profiling something unrelated I noticed that the highest allocation was the boxing of an enum used with HasFlag. This is netcore, so it should be using the optimized code path, but as noted in #1054 that isn't happening. This PR changes from HasFlag to manual masking.

before: [screenshot: snapshot flags allocation profile, before]

after: [screenshot: snapshot flags allocation profile, after]
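For illustration, the shape of the change is roughly the following; the enum and property names here are assumptions based on this thread rather than the exact SqlClient code:

using System;

[Flags]
internal enum SnapshottedStateFlags : byte
{
    None = 0,
    PendingData = 1 << 1,
}

internal sealed class StateHolder
{
    private SnapshottedStateFlags _snapshottedState;

    // before: Enum.HasFlag boxes on netfx (and in unoptimized netcore code)
    internal bool HasPendingDataViaHasFlag
        => _snapshottedState.HasFlag(SnapshottedStateFlags.PendingData);

    // after: manual masking never boxes, on any runtime
    internal bool HasPendingDataViaMask
        => (_snapshottedState & SnapshottedStateFlags.PendingData) == SnapshottedStateFlags.PendingData;
}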

@JRahnama (Contributor) commented Aug 4, 2021

LGTM.

@cheenamalhotra (Member) commented Aug 10, 2021

I thought we were all in favor of HasFlag as the recommended solution in this case.
Do we want to compromise on that over allocations?

Also, can you link the test you used in this case to compare the difference?

@Wraith2 (Contributor, Author) commented Aug 10, 2021

In netfx, flags usage should always be by masking, because the compiler generates box instructions and the jit wasn't able to understand the pattern well enough to remove them. Masking means the enum is treated as its integer base type, and those don't require boxing. Here's an example of simple use:

using System;
public class C {
    [Flags]
    public enum MyFlags
    {
        None = 0,
        Flag1 = 2, 
        Flag2 = 4
    }
    
    public bool Masked(MyFlags flags) {
        return (flags & MyFlags.Flag1) == MyFlags.Flag1;
    }
    
    public bool HasFlag(MyFlags flags)
    {
        return flags.HasFlag(MyFlags.Flag1);
    }
}

If you look at the IL generated for the HasFlag case you see this:

        IL_0000: ldarg.1
        IL_0001: box C/MyFlags
        IL_0006: ldc.i4.2
        IL_0007: box C/MyFlags
        IL_000c: call instance bool [System.Private.CoreLib]System.Enum::HasFlag(class [System.Private.CoreLib]System.Enum)
        IL_0011: ret

and those box instructions are the problem; each one is a short-lived heap allocation.

If you compare the assembly generated on netfx for the mask and HasFlag approaches there's a clear difference, and masking is superior:

C.Masked(MyFlags)
    L0000: test dl, 0x2
    L0003: setnz al
    L0006: movzx eax, al
    L0009: ret

C.HasFlag(MyFlags)
    L0000: push rdi
    L0001: push rsi
    L0002: sub rsp, 0x28
    L0006: mov esi, edx
    L0008: mov rcx, 0x7ffc50029748
    L0012: call 0x7ffca8682540
    L0017: mov rdi, rax
    L001a: mov [rdi+0x8], esi
    L001d: mov rcx, 0x7ffc50029748
    L0027: call 0x7ffca8682540
    L002c: mov rdx, rax
    L002f: mov dword [rdx+0x8], 0x2
    L0036: mov rcx, rdi
    L0039: call System.Enum.HasFlag(System.Enum)
    L003e: movzx eax, al
    L0041: add rsp, 0x28
    L0045: pop rsi
    L0046: pop rdi
    L0047: ret

In netcore 2.1 (I think; certainly around that timeframe) the jit team made Enum.HasFlag a jit intrinsic, which means the jit has a known pattern it emits for that instruction sequence, and this allows it to simplify the surrounding box instructions. So the assembly on netcore is considerably better:

C.Masked(MyFlags)
    L0000: test dl, 2
    L0003: setne al
    L0006: movzx eax, al
    L0009: ret

C.HasFlag(MyFlags)
    L0000: test dl, 2
    L0003: setne al
    L0006: movzx eax, al
    L0009: ret

In general value.HasFlag(MyFlags.Flag1) is more easily read and understood than ((value & MyFlags.Flag1) == MyFlags.Flag1). I know I've made the mistake of missing a subtle `!=` in the mask version.
Because of this, the advice on netcore is to use HasFlag wherever possible. The code changes in the netcore version of this library originate from the work I did when it was part of the corefx repo and was netcore only, and at that time I was advised to use HasFlag instead of masking.

Today when profiling under netcore I can see that the HasFlag version is allocating, which means the jit intrinsic is ending up in an instruction sequence where it isn't possible to eliminate the boxes. Changing to masking will eliminate the boxes and will use the mask asm seen above, which is good.

Reproducing the problem in a way that the jit team can sensibly investigate will require running sqlclient on a custom-built checked runtime so that the jit debug flags that allow dumping compilation debugging information are available. I'm capable of doing this in theory, but it's a daunting and time-consuming process, and even if it is fixed in .NET 7 that wouldn't improve perf on netcore 3.1 through net6. Masking will.

So the short answer is that it should work but doesn't, for a reason that's hard to track down. Masking reduces allocations, improving memory performance, and the cost is in readability. I think everyone working on this library is used to dealing with much more complex code patterns than masking vs HasFlag.

@roji (Member) commented Aug 10, 2021

Today when profiling under netcore I can see that the HasFlag version is allocating, which means the jit intrinsic is ending up in an instruction sequence where it isn't possible to eliminate the boxes. Changing to masking will eliminate the boxes and will use the mask asm seen above, which is good.

Are you sure you weren't running in Debug or something? I'm not seeing any allocations with HasFlag, at least not on .NET 6 - see benchmark code below (with MemoryDiagnoser).

I mean, if the idea is to avoid HasFlag because you want to share code between netfx and netcore, and want netfx to be optimized, that's one thing (though as I previously wrote - with the general perf you get from netfx, I do doubt that this would be significant). But if you really think you're seeing a problem with HasFlag on netcore, it's probably worth confirming exactly what's going on before making a decision based on that.

BenchmarkDotNet=v0.13.0, OS=ubuntu 21.04
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.100-rc.1.21379.2
[Host] : .NET 6.0.0 (6.0.21.37802), X64 RyuJIT
DefaultJob : .NET 6.0.0 (6.0.21.37802), X64 RyuJIT

| Method  |      Mean |     Error |    StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------- |----------:|----------:|----------:|------:|------:|------:|----------:|
| HasFlag | 0.2700 ns | 0.0094 ns | 0.0079 ns |     - |     - |     - |         - |
| Bitwise | 0.0118 ns | 0.0050 ns | 0.0041 ns |     - |     - |     - |         - |
Benchmark code
BenchmarkRunner.Run<Program>();

[MemoryDiagnoser]
public class Program
{
    public MyFlags Flags { get; set; }

    [Benchmark]
    public bool HasFlag()
        => Flags.HasFlag(MyFlags.Flag1);

    [Benchmark]
    public bool Bitwise()
        => (Flags & MyFlags.Flag1) == MyFlags.Flag1;
}

[Flags]
public enum MyFlags
{
    None = 0,
    Flag1 = 2,
    Flag2 = 4
}

@Wraith2 (Contributor, Author) commented Aug 10, 2021

Yes. I'm sure I'm using the release build. I know HasFlag is supposed to generate non boxing code on netcore but I can see that in this case it is generating boxing code. In every non-sqlclient case I've tried it has generated non-boxing code. I've discussed it with some of the runtime team and in order to debug it needs a clear replication which is why I'd need a checked runtime.

This method is portable between netfx and netcore and despite being slightly less readable it will perform well. I don't really understand why I'm getting so much pushback. If you don't trust my profiling results then do your own and see what you find.

@roji (Member) commented Aug 16, 2021

@Wraith2 regardless of the specific discussion here, from some Npgsql internal discussions (with @NinoFloris and @vonzshik) we have seen some cases where a profiler shows allocations where BDN does not; this may be due to tiered compilation somehow not working under the profiler (but definitely working under BDN). Bottom line: when in doubt, I'd advise confirming that allocations indeed occur via BDN.

@Wraith2 (Contributor, Author) commented Aug 16, 2021

How do I identify if a specific allocation from all the other required allocations is taking place using BDN?

If the profiler(s) aren't working then that needs to be identified and reported. It's their job to get this stuff right. I'm profiling against a release build from the repo build script used through the nuget package so there are no weird debug or visual studio things going on. It's an as-release environment.

@roji (Member) commented Aug 16, 2021

How do I identify if a specific allocation from all the other required allocations is taking place using BDN?

You're right about that - that needs a close-to-minimal repro, where you can compare two pieces of code which are identical except for the thing you're trying to test (or a really minimal piece of code which should have zero allocations); one way to do that is sketched at the end of this comment.

If the profiler(s) aren't working then that needs to be identified and reported. It's their job to get this stuff right.

Agreed, but the first step for that is to make a reproducible code sample - which only you can do as you're the one running into it. The moment some sample program using SqlClient shows that particular allocation when profiled, that can be investigated.

Note that I don't have a confirmed case of profiler malfunction, just some possible weirdness.
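As a rough sketch of such a minimal check outside a profiler (this is illustrative, not code from this thread; the delegate passed in is whatever call path you suspect), you can compare thread-local allocated bytes around a tight loop on netcore:

using System;
using System.Threading;

public static class AllocationCheck
{
    // Returns the number of bytes allocated on this thread while running the
    // action `iterations` times. The warm-up loop and the sleep give tiered
    // compilation a chance to promote the hot path before measuring.
    public static long Measure(Action action, int iterations = 1_000_000)
    {
        for (int i = 0; i < 100_000; i++)
        {
            action();
        }
        Thread.Sleep(200); // let any background tier1 recompilation finish

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}

A near-zero result for both the HasFlag path and the masked path would indicate the box is being eliminated; a result that is a large multiple of the iteration count would indicate it isn't.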

@Wraith2 (Contributor, Author) commented Aug 16, 2021

@cmeyertons you reproduced a similar behaviour in #1054; is there any chance you could see if you can repro this one? It's a simple benchmark which I'll paste below, and I can see it boxing and unboxing SnapshottedStateFlags constantly on netcore:

	[MemoryDiagnoser]
	public class NonQueryBenchmarks
	{
		private string _connectionString = "Data Source=(local);Initial Catalog=Scratch;Trusted_Connection=true;Connect Timeout=1";

		private SqlConnection _connection;
		private SqlCommand _command;
		private CancellationTokenSource _cts;

		[GlobalSetup]
		public void GlobalSetup()
		{
			_connection = new SqlConnection(_connectionString);
			_connection.Open();
			_command = new SqlCommand("SELECT @@version", _connection);
			_cts = new CancellationTokenSource();
		}

		[GlobalCleanup]
		public void GlobalCleanup()
		{
			_connection.Dispose();
			_cts.Dispose();
		}

		[Benchmark]
		public async ValueTask<int> AsyncWithCancellation()
		{
			return await _command.ExecuteNonQueryAsync(_cts.Token);
		}
	}

@Wraith2 (Contributor, Author) commented Aug 16, 2021

Issue opened on dotMemory: https://youtrack.jetbrains.com/issue/PROF-1157
The answer to getting the disassembly was to add the DisassemblyDiagnoser with a large depth (a rough sketch follows the codegen below) and then sift through the output to find what I wanted. Codegen was/is:

; Microsoft.Data.SqlClient.TdsParserStateObject.get_HasPendingData()
       movzx     eax,byte ptr [rcx+19F]
       test      al,2
       setne     al
       movzx     eax,al
       ret
; Total bytes of code 16

Which is fairly self-explanatory and clearly doesn't box (no helper calls).
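For reference, wiring the diagnoser up looks roughly like this; the flags benchmark is a stand-in rather than the SqlClient one, and the DisassemblyDiagnoser attribute's parameters vary a little between BenchmarkDotNet versions:

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
[DisassemblyDiagnoser(maxDepth: 4)] // dump the benchmark plus the methods it calls, several levels deep
public class FlagDisassemblyBenchmarks
{
    [Flags]
    public enum MyFlags { None = 0, Flag1 = 2, Flag2 = 4 }

    public MyFlags Flags { get; set; } = MyFlags.Flag1;

    [Benchmark]
    public bool HasFlag() => Flags.HasFlag(MyFlags.Flag1);

    [Benchmark]
    public bool Masked() => (Flags & MyFlags.Flag1) == MyFlags.Flag1;
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<FlagDisassemblyBenchmarks>();
}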

Regardless of that; I think that this update is portable between netfx and netcore and has only a small reduction in readability so I'd still like to proceed.

@Wraith2 (Contributor, Author) commented Aug 27, 2021

The response from JetBrains, and careful investigation of the allocations in certain time windows of the traces, shows that tiering does eventually give optimized non-boxing code, but that it starts off with tier0 boxing. We can just make it non-boxing from the start and do the same on netfx. So masking is still the optimal choice in this situation.

@Wraith2 (Contributor, Author) commented Oct 20, 2021

CI is green, no conflicts. Anyone?

@Wraith2 (Contributor, Author) commented Oct 27, 2021

1 file, 7 lines changed, CI is green, what is blocking this please?

@David-Engel (Contributor) commented

1 file, 7 lines changed, CI is green, what is blocking this please?

I'm guessing uncertainty.

(Preface: I'm just here trying to mediate. I don't know whether we should merge this or not, but I do appreciate the contribution.)

From what I understand, this is a reversing of a previous perf/readability change and there is uncertainty around the scope of the perf problem. HasFlags is the forward-looking and recommended pattern for .NET and is much easier to read than masking. If it's incorrectly boxing in some scenarios, it will eventually be fixed to not box in the future, right? Perhaps even in an update? So then some time in the future we would want to reverse this reversal?

I can see that in your run and the scope you are showing, there appears to be a significant difference in allocations. However, what is the overall perf improvement in an end-to-end use case? Are we arguing over tens of thousands of allocations among millions? 10,000 rows/sec versus 10,001 in some hypothetical scenario? I'm honestly not sure which way to argue for this PR. I know we try really hard to squeeze perf out where we can, but given the discussion, it doesn't sound like everyone was convinced.

Regards,
David

@Wraith2 (Contributor, Author) commented Oct 29, 2021

These flags will be checked multiple times per query and on each field access, because they contain various pieces of important state. They are masked together instead of being individual boolean fields for two reasons: 1) it makes the state object smaller, and 2) they are copied as a group into async snapshots, and the save/restore time is noticeable (but not a dominator) in profiles.

Every field access will need to check whether the reader has advanced to the metadata, and that will cause a boxed enum value on every call to reader.Get* or reader.GetFieldValue<*>, so the more you do the more garbage gets generated. They're gen0 allocations and very short-lived, so they're cleaned up quickly; however, the optimal case for garbage collection is for it never to happen. The more time spent in GC, the less time you can spend doing user work.

In netfx, up to and including the current version, using Enum.HasFlag will box. On netcore 3.1 and later, where tiered compilation is available and enabled, the tier0 (quickly produced, relatively unoptimized) code generated by the jit will box in the same way as netfx. The tier1 code is produced once a method's execution count hits a specific level, and in the tier1 compilation there is an optimization that recognises the HasFlag pattern and removes the box. Manual masking will generate optimal code even at tier0 on netcore and will generate box-free code on netfx.

I would prefer to use HasFlag if readability were the only concern. I acknowledge that the manual masking version is slightly less readable, but I feel that the slight drawback for the small number of people who work on this code is heavily outweighed by the efficiency that all users of this library gain. It's a case of <10 people taking a few extra seconds to think through the masking vs many thousands of people spending less electricity to query their data.

@roji (Member) commented Oct 30, 2021

In netfx, up to and including the current version, using Enum.HasFlag will box. On netcore 3.1 and later, where tiered compilation is available and enabled, the tier0 (quickly produced, relatively unoptimized) code generated by the jit will box in the same way as netfx. The tier1 code is produced once a method's execution count hits a specific level, and in the tier1 compilation there is an optimization that recognises the HasFlag pattern and removes the box. Manual masking will generate optimal code even at tier0 on netcore and will generate box-free code on netfx.

This .NET Core angle doesn't make sense to me. Tiered compilation is a completely standard feature of JITing, and trying to micro-optimize something which only occurs in tier0 is, well, odd. The whole point of tiered compilation is that the slow code gets optimized automatically by the JIT after a few iterations, so any perf gains are completely negligible and don't justify worse readability...

For .NET Framework, if you believe reducing these gen0 allocations justifies worse readability, then this may make sense. But I'd argue that .NET Framework is already inefficient in so many ways (compared to .NET Core) that this is again unlikely to matter. Anyone who cares even a little about perf should already be on .NET Core for a thousand reasons which are more important than this allocation.

@Wraith2 (Contributor, Author) commented Oct 30, 2021

On netfx we still use individual boolean fields with internal access and touch them directly. The change to add the properties and implement them using flags was made in netcore only, because at the time it was part of corefx. netfx currently doesn't incur the cost of boxing here, and it will be a reduction in performance when the two are merged to use the common codebase.

In general I don't put a lot of effort into netfx perf improvements for exactly the reason you stated, there are bigger problems. Not using an approach which is superior on all targets in various degrees just because it benefits netfx slightly is punitive and discriminatory. There are people who use and will continue to use netfx and if we can make their code more efficient with minor effort on our part we should.

@jkotas (Member) commented Dec 17, 2021

You can define a private bool GetSnapshottedState(...) method that parallels SetSnapshottedState to address the readability concern.

FWIW, Enum.HasFlag vs. manual masking is a personal style choice. I do not think we have a guideline anywhere that HasFlag is the recommended pattern. In fact, the guideline has been the exact opposite for quite a while, since the JIT was not optimizing it as an intrinsic. The JIT in .NET Core/5+ is optimizing it now, but this optimization is not guaranteed to happen. In addition to tiered compilation, there are other situations (e.g. a method that is too complex) where this optimization may not happen. If you want to write code that is fast by construction, avoid Enum.HasFlag.
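A minimal sketch of the accessor shape being suggested follows; the field and enum member names are assumptions rather than the actual SqlClient members:

using System;

[Flags]
internal enum SnapshottedStateFlags : byte
{
    None = 0,
    PendingData = 1 << 1,
    OpenResult = 1 << 2,
}

internal sealed class StateObjectSketch
{
    private SnapshottedStateFlags _snapshottedState;

    // Reads a flag by masking; box-free on every runtime and at every tier.
    private bool GetSnapshottedState(SnapshottedStateFlags flag)
        => (_snapshottedState & flag) == flag;

    private void SetSnapshottedState(SnapshottedStateFlags flag, bool value)
    {
        if (value)
        {
            _snapshottedState |= flag;
        }
        else
        {
            _snapshottedState &= ~flag;
        }
    }

    // Call sites keep a readable property while the masking lives in one place.
    internal bool HasPendingData
    {
        get => GetSnapshottedState(SnapshottedStateFlags.PendingData);
        set => SetSnapshottedState(SnapshottedStateFlags.PendingData, value);
    }
}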

@Eli-Black-Work commented Dec 20, 2021

@jkotas Not to digress too much from the main topic, but I'm curious why the Enum.HasFlag optimization in .NET Core/5+ was implemented as a JIT intrinsic instead of a compiler intrinsic? 🙂

@jkotas (Member) commented Dec 20, 2021

I assume that you mean a C# compiler (Roslyn) intrinsic. Roslyn has very minimal understanding of the base class library behaviors. It does not expand any methods as intrinsics, with a very few exceptions for methods on core types, e.g. the == operator on strings. It is an intentional design decision, to avoid duplicating logic from the runtime libraries.

@Eli-Black-Work commented

Ah, yes, that's what I meant. Got it, thanks! 🙂

@Wraith2 (Contributor, Author) commented Jan 21, 2022

Updated to use an accessor function GetSnapshottedState as suggested. CI is failing on unrelated MARS things, as it often does. 8 lines, coming up to 6 months to review them. If we're working at one line per month, have I just added another 4 months of wait time? Or are { }'s free, in which case it's just 2?

@johnnypham added this to the 5.0.0-preview1 milestone Jan 21, 2022
@DavoudEshtehari added the 📈 Performance and Area\Netcore labels Feb 11, 2022
@DavoudEshtehari merged commit 35d9ffe into dotnet:main Feb 11, 2022
@Wraith2 deleted the perf-snapshotflags branch February 11, 2022 20:59
@Wraith2 (Contributor, Author) commented Feb 11, 2022

Yay!
