Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT perf: Span.Slice(0, ...) emits unnecessary 64-bit extensions #69200

Closed
GrabYourPitchforks opened this issue May 11, 2022 · 2 comments · Fixed by #69247
Closed

JIT perf: Span.Slice(0, ...) emits unnecessary 64-bit extensions #69200

GrabYourPitchforks opened this issue May 11, 2022 · 2 comments · Fixed by #69247
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@GrabYourPitchforks
Copy link
Member

When calling Span<T>.Slice(offset, length), the internal implementation 64-bit extends offset and length, then compares their sum against span.Length. This is an optimization that allows us to get away with a single comparison.

public class C {
    public static void M(ReadOnlySpan<byte> span, int offset, int length) {
        DoIt(span.Slice(offset, length));
    }
    
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void DoIt(ReadOnlySpan<byte> span)
    {
    }
}
C.M(System.ReadOnlySpan`1<Byte>, Int32, Int32)
    L0000: sub rsp, 0x38
    L0004: xor eax, eax
    L0006: mov [rsp+0x28], rax
    L000b: mov eax, edx  ; rax := 64-bit zero-extension of offset
    L000d: mov edx, r8d  ; rdx := 64-bit zero-extension of length 
    L0010: add rdx, rax  ; rdx := sum of offset and length
    L0013: mov r9d, [rcx+8]  ; r9 := 64-bit zero-extension of span.Length
    L0017: cmp rdx, r9  ; bounds check
    L001a: ja short L003c
    L001c: mov rcx, [rcx]
    L001f: add rcx, rax
    L0022: mov [rsp+0x28], rcx
    L0027: mov [rsp+0x30], r8d
    L002c: lea rcx, [rsp+0x28]
    L0031: call 0x00007ffbf1df0020
    L0036: nop
    L0037: add rsp, 0x38
    L003b: ret
    L003c: call 0x00007ffbe86e5778
    L0041: int3

However, when start = 0 is a known constant, JIT still emits the 64-bit extension before performing the comparison. This is unnecessary because a comparison of two zero-extended values will yield the same result as a comparison of the original values. It also burns a register.

    public static void M(ReadOnlySpan<byte> span, int length) {
        DoIt(span.Slice(0, length));
    }
C.M(System.ReadOnlySpan`1<Byte>, Int32)
    L0000: sub rsp, 0x38
    L0004: xor eax, eax
    L0006: mov [rsp+0x28], rax
    L000b: mov eax, edx  ; rax := unnecessary 64-bit zero-extension of length
    L000d: mov r8d, [rcx+8]  ; r8 := unnecessary 64-bit zero-extension of span.Length
    L0011: cmp rax, r8  ; bounds check
    L0014: ja short L0032
    L0016: mov rcx, [rcx]
    L0019: mov [rsp+0x28], rcx
    L001e: mov [rsp+0x30], edx
    L0022: lea rcx, [rsp+0x28]
    L0027: call 0x00007ffbf1e10020
    L002c: nop
    L002d: add rsp, 0x38
    L0031: ret
    L0032: call 0x00007ffbe86e5778
    L0037: int3

In the above example, I would have expected the three annotated instructions to be collapsed into the single instruction cmp edx, dword ptr [rcx + 8].

@GrabYourPitchforks GrabYourPitchforks added tenet-performance Performance related issue area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels May 11, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 11, 2022
@ghost
Copy link

ghost commented May 11, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

When calling Span<T>.Slice(offset, length), the internal implementation 64-bit extends offset and length, then compares their sum against span.Length. This is an optimization that allows us to get away with a single comparison.

public class C {
    public static void M(ReadOnlySpan<byte> span, int offset, int length) {
        DoIt(span.Slice(offset, length));
    }
    
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void DoIt(ReadOnlySpan<byte> span)
    {
    }
}
C.M(System.ReadOnlySpan`1<Byte>, Int32, Int32)
    L0000: sub rsp, 0x38
    L0004: xor eax, eax
    L0006: mov [rsp+0x28], rax
    L000b: mov eax, edx  ; rax := 64-bit zero-extension of offset
    L000d: mov edx, r8d  ; rdx := 64-bit zero-extension of length 
    L0010: add rdx, rax  ; rdx := sum of offset and length
    L0013: mov r9d, [rcx+8]  ; r9 := 64-bit zero-extension of span.Length
    L0017: cmp rdx, r9  ; bounds check
    L001a: ja short L003c
    L001c: mov rcx, [rcx]
    L001f: add rcx, rax
    L0022: mov [rsp+0x28], rcx
    L0027: mov [rsp+0x30], r8d
    L002c: lea rcx, [rsp+0x28]
    L0031: call 0x00007ffbf1df0020
    L0036: nop
    L0037: add rsp, 0x38
    L003b: ret
    L003c: call 0x00007ffbe86e5778
    L0041: int3

However, when start = 0 is a known constant, JIT still emits the 64-bit extension before performing the comparison. This is unnecessary because a comparison of two zero-extended values will yield the same result as a comparison of the original values. It also burns a register.

    public static void M(ReadOnlySpan<byte> span, int length) {
        DoIt(span.Slice(0, length));
    }
C.M(System.ReadOnlySpan`1<Byte>, Int32)
    L0000: sub rsp, 0x38
    L0004: xor eax, eax
    L0006: mov [rsp+0x28], rax
    L000b: mov eax, edx  ; rax := unnecessary 64-bit zero-extension of length
    L000d: mov r8d, [rcx+8]  ; r8 := unnecessary 64-bit zero-extension of span.Length
    L0011: cmp rax, r8  ; bounds check
    L0014: ja short L0032
    L0016: mov rcx, [rcx]
    L0019: mov [rsp+0x28], rcx
    L001e: mov [rsp+0x30], edx
    L0022: lea rcx, [rsp+0x28]
    L0027: call 0x00007ffbf1e10020
    L002c: nop
    L002d: add rsp, 0x38
    L0031: ret
    L0032: call 0x00007ffbe86e5778
    L0037: int3

In the above example, I would have expected the three annotated instructions to be collapsed into the single instruction cmp edx, dword ptr [rcx + 8].

Author: GrabYourPitchforks
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr

Milestone: -

@jakobbotsch
Copy link
Member

Yes, this looks quite easy. We end up with

               [000021] N---G+-N-U-                         └──▌  LE        int   
               [000015] -----+---U-                            ├──▌  CAST      long <- ulong <- uint
               [000003] -----+-----                              └──▌  LCL_VAR   int    V01 arg1         
               [000020] ----G+---U-                            └──▌  CAST      long <- ulong <- uint
               [000092] n---G+-----                               └──▌  IND       int   
               [000094] -----+-----                                  └──▌  ADD       byref 
               [000019] -----+-----                                     ├──▌  LCL_VAR   byref  V00 arg0         
               [000093] -----+-----                                     └──▌  CNS_INT   long   8 field offset Fseq[_length]

in morph.
I'll see if I can make some time for this.

@jakobbotsch jakobbotsch self-assigned this May 11, 2022
@jakobbotsch jakobbotsch removed the untriaged New issue has not been triaged by the area owner label May 11, 2022
jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue May 12, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 12, 2022
@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone May 20, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 23, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jun 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants