Loop cloning for Span #113575

EgorBo (Member) commented Mar 15, 2025

This PR enables loop cloning for Spans.
Closes #82946
Closes #110986
Closes #112019

Example:

static void Test(Span<int> span, int len)
{
    for (int i = 0; i < len; i++)
        span[i] = 0;
}

Current codegen:

; Assembly listing for method Test(System.Span`1[int],int) (FullOpts)
       sub      rsp, 40
       xor      eax, eax
       test     edx, edx
       jle      SHORT G_M2065_IG04
       align    [0 bytes for IG03]
G_M2065_IG03:
       cmp      eax, dword ptr [rcx+0x08]    ;; <-- bounds check each iteration
       jae      SHORT G_M2065_IG05
       mov      r8, bword ptr [rcx]
       xor      r10d, r10d
       mov      dword ptr [r8+4*rax], r10d
       inc      eax
       cmp      eax, edx
       jl       SHORT G_M2065_IG03
G_M2065_IG04:
       add      rsp, 40
       ret    
  
G_M2065_IG05:
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
; Total bytes of code 42, prolog size 4, PerfScore 38.00

New codegen:

; Assembly listing for method Test(System.Span`1[int],int) (FullOpts)
       sub      rsp, 40
       xor      eax, eax
       test     edx, edx
       jle      SHORT G_M2065_IG05
       cmp      edx, dword ptr [rcx+0x08]
       jg       SHORT G_M2065_IG06
       xor      eax, eax
       align    [15 bytes for IG04]
G_M2065_IG04:
       mov      r8, bword ptr [rcx]     ;; <-- no bounds checks (fast loop)
       xor      r10d, r10d
       mov      dword ptr [r8+rax], r10d
       add      rax, 4
       dec      edx
       jne      SHORT G_M2065_IG04
G_M2065_IG05:
       add      rsp, 40
       ret      

G_M2065_IG06:
       cmp      eax, dword ptr [rcx+0x08]  ;; slow loop (cloned)
       jae      SHORT G_M2065_IG07
       mov      r8, bword ptr [rcx]
       mov      r10d, eax
       xor      r9d, r9d
       mov      dword ptr [r8+4*r10], r9d
       inc      eax
       cmp      eax, edx
       jl       SHORT G_M2065_IG06
       jmp      SHORT G_M2065_IG05
G_M2065_IG07:
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
; Total bytes of code 87, prolog size 4, PerfScore 23.38
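
At the C# level, the new codegen is roughly equivalent to the hand-written sketch below. It is only an approximation for illustration: the class and method names are hypothetical, and the JIT performs this transformation on its internal IR rather than on source code. The fast loop is guarded by the cloning condition `len <= span.Length`; the slow loop is the original loop with its bounds checks.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class LoopCloningSketch // hypothetical name, not part of the PR
{
    public static void TestCloned(Span<int> span, int len)
    {
        if (len <= 0)
            return;

        if (len <= span.Length)
        {
            // Fast loop: the cloning condition proved that every i is in range,
            // so the per-iteration bounds check can be dropped.
            ref int first = ref MemoryMarshal.GetReference(span);
            for (int i = 0; i < len; i++)
                Unsafe.Add(ref first, i) = 0;
        }
        else
        {
            // Slow loop (the clone): the original code with bounds checks.
            // It throws IndexOutOfRangeException once i reaches span.Length.
            for (int i = 0; i < len; i++)
                span[i] = 0;
        }
    }
}
```

Both paths keep the observable behavior of the original `Test` method; only the in-range case avoids the repeated comparison against the span length.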

Diffs

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 15, 2025
@EgorBo EgorBo changed the title Loop cloning for Span (non-promoted) Loop cloning for Span Mar 15, 2025
EgorBo (Member, Author) commented Mar 16, 2025

@MihuBot

EgorBo (Member, Author) commented Mar 17, 2025

/azp list

EgorBo (Member, Author) commented Mar 17, 2025

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr pgo, runtime-coreclr pgostress, Fuzzlyn

Azure Pipelines successfully started running 5 pipeline(s).

@bencyoung-Fignum

Out of interest, for loop cloning, if (len > span.Length), would it be less code or more efficient to run the cloned loop up to span.Length and only then switch to the unoptimized version? Or, because that path is likely to throw anyway, is it better to just do the simplest thing?

EgorBo (Member, Author) commented Mar 17, 2025

> Out of interest, for loop cloning, if (len > span.Length), would it be less code or more efficient to run the cloned loop up to span.Length and only then switch to the unoptimized version? Or, because that path is likely to throw anyway, is it better to just do the simplest thing?

Yep, the current impl is easier to implement and more generic: for arrays we also need to check the array instance for null, and there are other kinds of cloning conditions. For example, if we have a virtual call inside the loop, we can add an additional cloning condition for the most popular type under that virtual call (PGO).
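
For comparison, the alternative raised in the question (run a check-free loop up to span.Length, then fall back) could be sketched by hand as follows. This is not what the PR implements; the names are hypothetical and the sketch only makes the trade-off concrete.

```csharp
using System;

static class PrefixThenFallbackSketch // hypothetical name, not part of the PR
{
    public static void ZeroPrefixThenFallback(Span<int> span, int len)
    {
        if (len <= 0)
            return;

        // Number of iterations that are guaranteed to be in range.
        int safe = Math.Min(len, span.Length);

        // Prefix loop: bounded by prefix.Length, a shape for which the JIT
        // can typically elide the bounds check.
        Span<int> prefix = span.Slice(0, safe);
        for (int i = 0; i < prefix.Length; i++)
            prefix[i] = 0;

        // Remainder: only reached when len > span.Length, in which case the
        // first iteration throws IndexOutOfRangeException.
        for (int i = safe; i < len; i++)
            span[i] = 0;
    }
}
```

This shape needs extra logic to compute the split point and only works when the sole cloning condition is the length comparison, which fits the point above that a single guarded clone is simpler and handles other condition kinds uniformly.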

@bencyoung-Fignum

Thanks for the info. So would all "likely" optimizations go in the optimized version of the loop and none of them in the fallback, or could you have some combinations, e.g. potentially multiple cloned loops with different assumptions? I guess you can assume the fallback is always the fully unoptimized version, since you assume there will be a failure at some point.

EgorBo (Member, Author) commented Mar 17, 2025

> Thanks for the info. So would all "likely" optimizations go in the optimized version of the loop and none of them in the fallback, or could you have some combinations, e.g. potentially multiple cloned loops with different assumptions? I guess you can assume the fallback is always the fully unoptimized version, since you assume there will be a failure at some point.

It depends. Normally, yes: the fallback is not expected to be hit unless the code relies on an out-of-bounds exception. That is not the case for virtual calls, though: we clone loops that contain them, but the fallback may still be invoked (when some other type arrives). We discussed this recently in #113579 (comment)
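
A rough source-level picture of a type-based cloning condition is sketched below. It is an assumption-laden illustration: `Resource` and `DisposeMany` are hypothetical names, and in the real JIT the "most popular" type comes from PGO data at run time rather than being written into the source.

```csharp
using System;

sealed class Resource : IDisposable // hypothetical type standing in for the PGO-observed one
{
    public int DisposeCount;
    public void Dispose() => DisposeCount++;
}

static class TypeTestCloningSketch // hypothetical name
{
    public static void DisposeMany(IDisposable d, int n)
    {
        if (d is Resource r)
        {
            // Fast loop: the type test is hoisted out of the loop, so the call
            // is direct and can be inlined.
            for (int i = 0; i < n; i++)
                r.Dispose();
        }
        else
        {
            // Fallback loop: an ordinary interface call. Unlike the bounds-check
            // fallback, this path is legitimately taken whenever another type
            // (or null) arrives.
            for (int i = 0; i < n; i++)
                d.Dispose();
        }
    }
}
```

Because `Resource` is sealed, the `is` test here behaves like an exact-type check, which is the closest source-level analogue of the guard described in the comment.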

EgorBo (Member, Author) commented Mar 18, 2025

/azp run runtime-coreclr jitstress, runtime-coreclr pgo, runtime-coreclr pgostress

EgorBo (Member, Author) commented Mar 18, 2025

@MihuBot

Azure Pipelines successfully started running 3 pipeline(s).

@EgorBo EgorBo marked this pull request as ready for review March 18, 2025 01:35
@Copilot Copilot bot review requested due to automatic review settings March 18, 2025 01:35

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

EgorBo (Member, Author) commented Mar 18, 2025

@AndyAyersMS @BruceForstall @dotnet/jit-contrib PTAL

Surprisingly, it was not difficult; my changes are mostly cosmetic (with asserts). Basically, if we have a LCL_VAR length, we don't need to dereference the array object: either it was already dereferenced when this local was created, or it's a local span that doesn't need any dereference.

Diffs look sane to me. The TP impact is ~0.2% on average, with a huge outlier in libraries_tests.run. However, the same happens today for existing array cloning (for reference, here are the diffs for Main with loop cloning disabled: diffs). The diffs are PerfScore improvements; they're better if we mark the cloned loop (the slow one) as cold (today, we mark it with 0.01 weight).

Outerloop failures are not related.

BruceForstall (Member) left a comment

You've overloaded the existing "jagged" array implementation with a case to support Span. Is this the cleanest way to express this? Would it be better to introduce a new LC_OPT(LcSpan) "type" of optimization (and maybe a LC_Span type that parallels LC_Array, etc.)?

Can Span participate in "jagged" arrays? E.g., for a[x][y][z], can a be a Span, a[x] be a span, a[x][y] be an array?

EgorBo (Member, Author) commented Mar 19, 2025

@BruceForstall I've addressed your feedback. The impl is 2x bigger now, but I agree that it looks better. Diffs

assert(isIncreasingLoop || iterInfo->IsDecreasingLoop());
if (!isIncreasingLoop && !iterInfo->IsDecreasingLoop())
{
// Normally, we reject weird-looking loops in optIsLoopClonable, but it's not the case

EgorBo (Member, Author) commented Mar 19, 2025

A small pre-existing issue: it can be reproduced on Main (hits an assert in Checked) via this snippet:

using System;
using System.Runtime.CompilerServices;
using System.Threading; // needed for Thread.Sleep

class Program : IDisposable
{
    public static void Main()
    {
        for (int i = 0; i < 1200; i++)
        {
            try
            {
                Test(new int[100000000], 44, new Program());
                Thread.Sleep(16);
            }
            catch
            {
            }
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Test(int[] arr, int x, IDisposable d)
    {
        // Deliberately odd loop: the test is `i < x` but the iterator decrements.
        for (int i = 0; i < x; i--)
        {
            d.Dispose();
            Console.WriteLine(arr[i]);
        }
    }

    public void Dispose()
    {
    }
}

EgorBo (Member, Author) commented Mar 19, 2025

@EgorBot -amd -arm -profiler

using System; // for Span<T>
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Bench
{
    byte[] _arr1 = new byte[2000];
    byte[] _arr2 = new byte[2000];

    [Benchmark]
    [Arguments(1000)]
    public void CopyN(int elems)
    {
        Span<byte> span1 = _arr1;
        Span<byte> span2 = _arr2;

        for (int i = 0; i < elems; i++)
            span1[i] = span2[i];
    }


    [Benchmark]
    [Arguments(1000)]
    public void ReversedIter(int elems)
    {
        Span<byte> span = _arr1;

        // Reversed iteration
        for (int i = span.Length - 1; i >= 0; i--)
            span[i] = 42;
    }
}
