Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Experiment] Clone blocks with bounds checks #112595

Merged
merged 29 commits into from
Feb 18, 2025
Merged

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Feb 15, 2025

Closes #112524
Closes #109983

For an arbitrary block

arr[i + 1] = x;
arr[i + 3] = y;
arr[i + 5] = z;
arr[i + 6] = w;

In order to remove bounds checks, JIT can clone the whole thing to have fast and slow paths under "block clonning conditions":

if (i >= 0 && i < arr.Length - 6)
{
    // fast path
    arr[i + 1] = x; // no bounds check
    arr[i + 3] = y; // no bounds check
    arr[i + 5] = z; // no bounds check
    arr[i + 6] = w; // no bounds check
}
else
{
    // slow path
    arr[i + 1] = x;
    arr[i + 3] = y;
    arr[i + 5] = w;
    arr[i + 6] = w;
}

It works not only for stores, but for any expressions with bounds checks, e.g. arr[1] + arr[2] + arr[3]

Codegen example:

void Test(int[] arr)
{
    arr[0] = 10;
    arr[1] = 11;
    arr[2] = 12;
    arr[3] = 13;
}

Current codegen (4 bounds checks):

; Assembly listing for method Proga:Test(int[]):this (FullOpts)
    stp     fp, lr, [sp, #-0x10]!
    mov     fp, sp
    ldr     w0, [x1, #0x08]

    ;; arr[0] = 10
    cbz     w0, G_M65333_IG04
    mov     w2, #10
    str     w2, [x1, #0x10]

    ;; arr[1] = 11
    cmp     w0, #1
    bls     G_M65333_IG04
    mov     w2, #11
    str     w2, [x1, #0x14]

    ;; arr[2] = 12
    cmp     w0, #2
    bls     G_M65333_IG04
    mov     w2, #12
    str     w2, [x1, #0x18]

    ;; arr[3] = 13
    cmp     w0, #3
    bls     G_M65333_IG04
    mov     w0, #13
    str     w0, [x1, #0x1C]

    ldp     fp, lr, [sp], #0x10
    ret     lr
G_M65333_IG04:
    bl      CORINFO_HELP_RNGCHKFAIL
    brk     #0
; Total bytes of code 88

New codegen (single SIMD store in the fast path!):

; Assembly listing for method Proga:Test(int[]) (FullOpts)
    stp     fp, lr, [sp, #-0x10]!
    mov     fp, sp
    ldr     w1, [x0, #0x08]
    sxtw    w2, w1
    cmp     w1, #3
    ble     G_M50025_IG05
    ldr     q16, [@RWD00]
    str     q16, [x0, #0x10] ;; <-- single Vector128 store!
G_M50025_IG04:
    ldp     fp, lr, [sp], #0x10
    ret     lr

    ;; Slow path (cold block):
G_M50025_IG05:
    cbz     w2, G_M50025_IG06
    mov     w2, #10
    str     w2, [x0, #0x10]
    cmp     w1, #1
    bls     G_M50025_IG06
    mov     w2, #11
    str     w2, [x0, #0x14]
    cmp     w1, #2
    bls     G_M50025_IG06
    mov     w2, #12
    str     w2, [x0, #0x18]
    cmp     w1, #3
    bls     G_M50025_IG06
    mov     w1, #13
    str     w1, [x0, #0x1C]
    b       G_M50025_IG04
G_M50025_IG06:
    bl      CORINFO_HELP_RNGCHKFAIL
    brk     #0
RWD00  	dq	0000000B0000000Ah, 0000000D0000000Ch
; Total bytes of code 112

@EgorBo

This comment was marked as resolved.

@EgorBo
Copy link
Member Author

EgorBo commented Feb 16, 2025

This PR also sort of closes #109983

image

For this case we don't have to clone the blocks actually, but it's better than nothing 🙂

@EgorBo
Copy link
Member Author

EgorBo commented Feb 16, 2025

@EgorBot -amd -arm --filter benchmarkLU

@EgorBo

This comment was marked as resolved.

@EgorBo
Copy link
Member Author

EgorBo commented Feb 17, 2025

Actually, this should be good as is (as a start). Diffs - obviously, a size regression, but mostly a clean PerfScore improvement:

image
A random dotnet/performance benchmarks I ran (since it appeared in the diffs):
image

The PR scans all GT_BOUNDS_CHECK nodes in a block, then groups them by "base indexVN + lengthVN" in a hash table (sort of Dictionary<(baseVN, lenVN), BoundsCheck>), takes a group with the most bounds checks, calculates their Statement range, splits the block at the very first bounds check in the group (so the length and the index of the first bounds check are spilled by gtSplitTree if they have side effects) and clones the rest. Then repeats the same in case if we have more groups.

I decided to perform this optimization after the range check phase - it allows me to handle only those bounds checks nobody was able to handle before me. A bit unfortunate phase ordering issue is that CSE sometimes decides to perform CSE for the index tree (because it's used in the bounds check and the actual array access) and when my algorithm drops the bounds check node, we're left with a redundant local:

image

This is main source of the regressions - it'd be nice to have some late forward sub to clean these up. Also, this issue is solved with JitOptRepeat, perhaps, my phase could request an extra iteration of that in the future (not today as JitOptRepeat has issues).

PTAL @jakobbotsch @AndyAyersMS @dotnet/jit-contrib

@EgorBo EgorBo marked this pull request as ready for review February 17, 2025 09:32
EgorBo and others added 5 commits February 17, 2025 13:53
Copy link
Member

@jakobbotsch jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM now. Cool opt 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JIT: Block clonning for subsequent bounds checks Neighboring bounds checks aren't collapsed
3 participants