Skip to content

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Jul 13, 2025

Inline write barriers with covariance checks (in case if inliner decides that it's profitable) - this allows to remove redundant write barriers (previously we could only do that in early phases) and range checks (previously they were inside the helper call and now JIT can fold them).

Related: #9159

Benchmark

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);

public class Benchmarks
{
    static object[] _strings = new string[4];
    static object _x = "";

    [Benchmark]
    public void AssignString()
    {
        var arr = _strings;
        if (arr.Length >= 4)
        {
            // We now can remove write barriers and redundant range checks if StelemRef gets inlined
            arr[0] = "";
            arr[1] = "";
            arr[2] = "";
            arr[3] = "";
        }
    }

    [Benchmark]
    public void SwapElements()
    {
        var arr = _strings;
        (arr[1], arr[0]) = (arr[0], arr[1]);
    }

    [Benchmark]
    public void SingleAssignmentCns()
    {
        _strings[0] = "";
    }

    [Benchmark]
    public void SingleAssignmentVar()
    {
        _strings[0] = _x;
    }
}

Linux-AMD (Genoa):

Method Toolchain Mean Error Ratio
AssignString Main 7.5633 ns 0.0022 ns 1.00
AssignString PR 1.3160 ns 0.0033 ns 0.17
SwapElements Main 2.9979 ns 0.0008 ns 1.00
SwapElements PR 1.0471 ns 0.0012 ns 0.35
SingleAssignmentCns Main 1.9075 ns 0.0005 ns 1.00
SingleAssignmentCns PR 0.2723 ns 0.0002 ns 0.14
SingleAssignmentVar Main 1.9080 ns 0.0003 ns 1.00
SingleAssignmentVar PR 1.6356 ns 0.0005 ns 0.86

Linux-ARM64 (Cobalt100):

Method Toolchain Mean Error Ratio
AssignString Main 10.3296 ns 0.0032 ns 1.00
AssignString PR 1.8738 ns 0.0006 ns 0.18
SwapElements Main 2.6604 ns 0.0012 ns 1.00
SwapElements PR 1.1768 ns 0.0024 ns 0.44
SingleAssignmentCns Main 1.9655 ns 0.0023 ns 1.00
SingleAssignmentCns PR 0.3821 ns 0.0003 ns 0.19
SingleAssignmentVar Main 1.8030 ns 0.0013 ns 1.00
SingleAssignmentVar PR 0.7219 ns 0.0006 ns 0.40

@EgorBo
Copy link
Member Author

EgorBo commented Jul 13, 2025

@MihuBot

@EgorBo

This comment was marked as resolved.

@jkotas
Copy link
Member

jkotas commented Jul 13, 2025

JIT should still see that the target is on the heap

Right, you would have to fix the JIT to be able to prove this in order to make this change without introducing regressions.

we might in the future, so the barrier will need to be made checked.

The barrier needs to be made checked when executed against stack allocated copy. It would be tough tradeoff to de-optimize all write barriers to enable more stack allocation.

@EgorBo
Copy link
Member Author

EgorBo commented Jul 13, 2025

Right, you would have to fix the JIT to be able to prove this in order to make this change without introducing regressions.

I believe I've already done that in this PR. The previous logic was a bit conservative - it tried to handle only knownHeapAddr + cns while in fact I think we can safely say that if we have object assignment for knownHeapAddr + anything we can use unchecked WB too (it's UB if the destination is not on the heap).

@EgorBo
Copy link
Member Author

EgorBo commented Jul 13, 2025

Although, this one can't be made unchecked by the JIT due to NoInlining:

image

And if I inline it, then StelemRef becomes too big and I was planning to inline it in the JIT (although, not sure, might be inlined in fact)

@EgorBo
Copy link
Member Author

EgorBo commented Jul 13, 2025

Ugh. another problem is that JIT never does tail-calling for helper-calls 😢

image

@EgorBo
Copy link
Member Author

EgorBo commented Jul 13, 2025

@MihuBot -arm

@EgorBo EgorBo changed the title Remove WriteBarrier FCall Inline CORINFO_HELP_ARRADDR_ST helper call, remove WriteBarrier FCall Jul 13, 2025
@EgorBo
Copy link
Member Author

EgorBo commented Jul 13, 2025

@EgorBot -amd -arm -windows_intel

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);

public class Benchmarks
{
    static object[] _strings = new string[4];

    [Benchmark]
    public void AssignString()
    {
        var arr = _strings;
        if (arr.Length >= 4)
        {
            arr[0] = "";
            arr[1] = "";
            arr[2] = "";
            arr[3] = "";
        }
    }

    [Benchmark]
    public void SwapElements()
    {
        var arr = _strings;
        (arr[1], arr[0]) = (arr[0], arr[1]);
    }

    [Benchmark]
    public void SingleAssignment()
    {
        _strings[0] = "";
    }
}

@EgorBo EgorBo force-pushed the remove-wb-fcall branch from db666a2 to 872e969 Compare July 16, 2025 01:52
@EgorBo
Copy link
Member Author

EgorBo commented Sep 3, 2025

Now that Main is .net 11, can we take a look at this?
PTAL @jkotas for non-JIT side and @jakobbotsch for JIT

@EgorBo EgorBo requested a review from jakobbotsch September 3, 2025 13:24
Comment on lines 10871 to 10878
if (strcmp(className, "TypeCast") == 0)
{
if (strcmp(methodName, "WriteBarrier") == 0)
{
result = NI_System_Runtime_CompilerServices_RuntimeHelpers_WriteBarrier;
}
}
else if (strcmp(namespaceName, "CompilerServices") == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (strcmp(className, "TypeCast") == 0)
{
if (strcmp(methodName, "WriteBarrier") == 0)
{
result = NI_System_Runtime_CompilerServices_RuntimeHelpers_WriteBarrier;
}
}
else if (strcmp(namespaceName, "CompilerServices") == 0)
if (strcmp(namespaceName, "CompilerServices") == 0)

And move the WriteBarrier into RuntimeHelpers for NAOT instead? We prefer things to be the same between NAOT and non-NAOT where possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas I could not use [Intrinsic] for RuntimeHelpers inside Runtime.Base (and presumably in Test.CoreLib) 🙁 so this was exactly why I ended up with this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was marked as outdated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ignore me, I think I fixed it 🙂

@jkotas
Copy link
Member

jkotas commented Sep 3, 2025

Closes #9159

This issue talks about removing store covariance check. I do not see it fixed by this change.

@jkotas
Copy link
Member

jkotas commented Sep 3, 2025

in case if inliner decides that it's profitable

Does this optimization ever kicks in for NAOT? Or does it only ever kick in with runtime collected PGO data?

@EgorBo
Copy link
Member Author

EgorBo commented Sep 3, 2025

Closes #9159

This issue talks about removing store covariance check. I do not see it fixed by this change.

it is fixed by this change, once it inlines, it is smart enough to handle that, e.g example from that issue:

object temp = a[i];
a[i] = a[j];        // store check optimized by jit to simple write barrier
a[j] = temp;        // store check not optimized

@EgorBo
Copy link
Member Author

EgorBo commented Sep 3, 2025

in case if inliner decides that it's profitable

Does this optimization ever kicks in for NAOT? Or does it only ever kick in with runtime collected PGO data?

this optimization relies on Inliner to decide whether to inline the routine or not, and it turns out it needs a PGO to convince itself inlining large methods like this is profitable on hot paths. So it does need a profile on NAOT too. We may consider making inliner aggressive enough without PGO for PreferSpeed on NAOT.

…timeHelpers.cs

Co-authored-by: Jan Kotas <jkotas@microsoft.com>
@jkotas
Copy link
Member

jkotas commented Sep 3, 2025

it is fixed by this change, once it inlines, it is smart enough to handle that, e.g example from that issue:

What's the code for this method with this change?

[MethodImpl(MethodImplOptions.NoInlining)]
void Swap(object[] a, int i, int j)
{
    object temp = a[i];
    a[i] = a[j];        // store check optimized by jit to simple write barrier
    a[j] = temp;        // store check not optimized
}

There are two array store invariance checks in this method. One of them is optimized before this change. Is this change enabling the second one to be optimized (the improvement tracked by #9159)? I do not see what makes it possible.

@EgorBo
Copy link
Member Author

EgorBo commented Sep 3, 2025

What's the code for this method with this change?

The codegen (win-x64) for this method is:

G_M39739_IG01:  ;; offset=0x0000
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 40
       mov      rbx, rcx
       mov      esi, r8d
						;; size=14 bbWeight=1 PerfScore 4.75
G_M39739_IG02:  ;; offset=0x000E
       mov      edi, dword ptr [rbx+0x08]
       cmp      edx, edi
       jae      SHORT G_M39739_IG09
       mov      ecx, edx
       mov      rbp, gword ptr [rbx+8*rcx+0x10]
       cmp      esi, edi
       jae      SHORT G_M39739_IG09
       mov      edx, esi
       mov      rdx, gword ptr [rbx+8*rdx+0x10]
       lea      rcx, bword ptr [rbx+8*rcx+0x10]
       call     CORINFO_HELP_ASSIGN_REF
       mov      ecx, esi
       mov      edx, edi
       cmp      rdx, rcx
       jbe      SHORT G_M39739_IG06
       lea      rcx, bword ptr [rbx+8*rcx+0x10]
       mov      rdx, qword ptr [rbx]
       mov      rdx, qword ptr [rdx+0x30]
       test     rbp, rbp
       je       SHORT G_M39739_IG07
       cmp      rdx, qword ptr [rbp]
       jne      SHORT G_M39739_IG08
						;; size=67 bbWeight=1 PerfScore 23.00
G_M39739_IG03:  ;; offset=0x0051
       mov      rdx, rbp
       call     CORINFO_HELP_ASSIGN_REF
						;; size=8 bbWeight=1 PerfScore 1.25
G_M39739_IG04:  ;; offset=0x0059
       nop      
						;; size=1 bbWeight=1 PerfScore 0.25
G_M39739_IG05:  ;; offset=0x005A
       add      rsp, 40
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       ret      
						;; size=9 bbWeight=1 PerfScore 3.25
G_M39739_IG06:  ;; offset=0x0063
       call     [System.Runtime.CompilerServices.CastHelpers:ThrowIndexOutOfRangeException()]
       int3     
						;; size=7 bbWeight=0 PerfScore 0.00
G_M39739_IG07:  ;; offset=0x006A
       xor      rax, rax
       mov      gword ptr [rcx], rax
       jmp      SHORT G_M39739_IG04
						;; size=7 bbWeight=0 PerfScore 0.00
G_M39739_IG08:  ;; offset=0x0071
       mov      r8, 0x7FF8FA732308      ; System.Object[]
       cmp      qword ptr [rbx], r8
       je       SHORT G_M39739_IG03
       mov      r8, rbp
       call     [System.Runtime.CompilerServices.CastHelpers:StelemRef_Helper(byref,ptr,System.Object)]
       jmp      SHORT G_M39739_IG04
						;; size=26 bbWeight=0 PerfScore 0.00
G_M39739_IG09:  ;; offset=0x008B
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
						;; size=6 bbWeight=0 PerfScore 0.00

; Total bytes of code 145

@EgorBo
Copy link
Member Author

EgorBo commented Sep 3, 2025

I guess it's not entirely eliminated and the if (elementType != RuntimeHelpers.GetMethodTable(obj)) check and probably some things can be better CSE'd but it makes it 2-3x faster in microbenchmarks. The element type check might require teaching JIT about GetMethodTable(array)->ElementType field

@jkotas
Copy link
Member

jkotas commented Sep 3, 2025

In order to eliminate the second invariant array store check, the JIT would have to figure out what the value came from the same array. Helper inlining won't help you with that.

I do not think that this PR is fixing #9159 that's specifically about eliminating invariant array store checks in code that is swapping elements of an array. Invariant array store checks can be very expensive depending on the type of the array and type of the item being stored into the array. It is why it is interesting to eliminate them.

@jkotas
Copy link
Member

jkotas commented Sep 3, 2025

(The idea behind this change sounds reasonable to me otherwise.)

@EgorBo
Copy link
Member Author

EgorBo commented Sep 3, 2025

I do not think that this PR is fixing #9159

You're right I just misremembered that it did fix it when I was working on this. Changed the wording.

Copy link
Member

@jakobbotsch jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

EgorBo and others added 2 commits September 5, 2025 13:21
Co-authored-by: Jakob Botsch Nielsen <Jakob.botsch.nielsen@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants