Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Widen self-updates as part of IV widening #106131

Closed
wants to merge 1 commit into from

Conversation

jakobbotsch
Copy link
Member

@jakobbotsch jakobbotsch commented Aug 8, 2024

For certain Intel CPUs it can be a perf regression to keep an IV update as a 32-bit operation when the IV is later used as a 64-bit register, despite the fact that the operation should come with explicit zeroing of the upper bits.

This PR starts detecting and widening some self-updates as part of IV widening when we can prove that doing the self-update as a 64 bit operation is safe (i.e. ends up with the same zeroed upper bits).

Fix #104655
Fix System.Collections.IndexerSet.List regression in #99315

Some code size regressions expected since the encoding for the 64 bit operations is larger. Some new strength reductions expected due to the improved logic in AddRecMayOverflow.

Example:

[MethodImpl(MethodImplOptions.NoInlining)]
public List<int> List()
{
    var list = _list;
    for (int i = 0; i < list.Count; i++)
        list[i] = default;
    return list;
}
@@ -1,59 +1,59 @@
 ; Assembly listing for method Program:List():System.Collections.Generic.List`1[int]:this (FullOpts)
 ; Emitting BLENDED_CODE for X64 with AVX - Windows
 ; FullOpts code
 ; optimized code
 ; rsp based frame
 ; fully interruptible
 ; No PGO data
 ; 0 inlinees with PGO data; 1 single block inlinees; 1 inlinees without PGO data
 ; Final local variable assignments
 ;
 ;  V00 this         [V00,T04] (  3,  3   )     ref  ->  rcx         this class-hnd single-def <Program>
 ;  V01 loc0         [V01,T02] (  6, 15   )     ref  ->  rax         class-hnd single-def <System.Collections.Generic.List`1[int]>
 ;* V02 loc1         [V02,T05] (  0,  0   )     int  ->  zero-ref   
 ;  V03 OutArgs      [V03    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;  V04 tmp1         [V04,T01] (  3, 24   )     ref  ->   r8         "arr expr"
 ;  V05 cse0         [V05,T03] (  4, 10   )     int  ->  rdx         "CSE #01: aggressive"
 ;  V06 rat0         [V06,T00] (  7, 25   )    long  ->  rcx         "Widened IV V02"
 ;
 ; Lcl frame size = 40
 
 G_M48712_IG01:  ;; offset=0x0000
        sub      rsp, 40
 						;; size=4 bbWeight=1 PerfScore 0.25
 G_M48712_IG02:  ;; offset=0x0004
        mov      rax, gword ptr [rcx+0x08]
        xor      ecx, ecx
        mov      edx, dword ptr [rax+0x10]
        test     edx, edx
        jle      SHORT G_M48712_IG04
        align    [15 bytes for IG03]
 						;; size=28 bbWeight=1 PerfScore 5.75
 G_M48712_IG03:  ;; offset=0x0020
        cmp      ecx, edx
        jae      SHORT G_M48712_IG05
        mov      r8, gword ptr [rax+0x08]
        cmp      ecx, dword ptr [r8+0x08]
        jae      SHORT G_M48712_IG06
        xor      r10d, r10d
        mov      dword ptr [r8+4*rcx+0x10], r10d
        inc      dword ptr [rax+0x14]
-       inc      ecx
+       inc      rcx
        cmp      ecx, edx
        jl       SHORT G_M48712_IG03
-						;; size=31 bbWeight=4 PerfScore 52.00
-G_M48712_IG04:  ;; offset=0x003F
+						;; size=32 bbWeight=4 PerfScore 52.00
+G_M48712_IG04:  ;; offset=0x0040
        add      rsp, 40
        ret      
 						;; size=5 bbWeight=1 PerfScore 1.25
-G_M48712_IG05:  ;; offset=0x0044
+G_M48712_IG05:  ;; offset=0x0045
        call     [System.ThrowHelper:ThrowArgumentOutOfRange_IndexMustBeLessException()]
        int3     
 						;; size=7 bbWeight=0 PerfScore 0.00
-G_M48712_IG06:  ;; offset=0x004B
+G_M48712_IG06:  ;; offset=0x004C
        call     CORINFO_HELP_RNGCHKFAIL
        int3     
 						;; size=6 bbWeight=0 PerfScore 0.00
 
-; Total bytes of code 81, prolog size 4, PerfScore 59.25, instruction count 24, allocated bytes for code 81 (MethodHash=a57541b7) for method Program:List():System.Collections.Generic.List`1[int]:this (FullOpts)
+; Total bytes of code 82, prolog size 4, PerfScore 59.25, instruction count 24, allocated bytes for code 82 (MethodHash=a57541b7) for method Program:List():System.Collections.Generic.List`1[int]:this (FullOpts)

For certain Intel CPUs it can be a perf regression to keep an IV update
as a 32-bit operation when the IV is later used as a 64-bit register,
despite the fact that the operation should come with explicit zeroing of
the upper bits.

This PR starts detecting and widening some self-updates as part of IV
widening when we can prove that doing the self-update as a 64 bit
operation is safe (i.e. ends up with the same zeroed upper bits).
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 8, 2024
@jakobbotsch

This comment was marked as resolved.

@EgorBot

This comment was marked as resolved.

@EgorBot

This comment was marked as resolved.

@jakobbotsch
Copy link
Member Author

@EgorBot -intel --disasm

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using BenchmarkDotNet.Attributes;

public class Benchmark
{
    private List<int> _list = new List<int>(Enumerable.Range(0, 512));

    [Benchmark]
    public List<int> ListIndexerSet()
    {
        var list = _list;
        for (int i = 0; i < list.Count; i++)
            list[i] = default;
        return list;
    }
}

@EgorBot
Copy link

EgorBot commented Aug 8, 2024

Benchmark results on Intel
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores
  Job-RIXUYF : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-GHQSGJ : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Method Toolchain Mean Error Ratio Code Size
ListIndexerSet Main 1.086 μs 0.0040 μs 1.00 75 B
ListIndexerSet PR 1.085 μs 0.0003 μs 1.00 76 B

BDN_Artifacts.zip

@jakobbotsch
Copy link
Member Author

Hmm no, that didn't help here.. Given the results of #104655 it really seems like there is something micro architectural going on here and that the slow down might not be related to the widening at all.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

for loop performance regression
2 participants