JIT: Widen self-updates as part of IV widening #106131

jakobbotsch · 2024-08-08T13:42:02Z

For certain Intel CPUs it can be a perf regression to keep an IV update as a 32-bit operation when the IV is later used as a 64-bit register, despite the fact that the operation should come with explicit zeroing of the upper bits.

This PR starts detecting and widening some self-updates as part of IV widening when we can prove that doing the self-update as a 64 bit operation is safe (i.e. ends up with the same zeroed upper bits).

Fix #104655
Fix System.Collections.IndexerSet.List regression in #99315

Some code size regressions expected since the encoding for the 64 bit operations is larger. Some new strength reductions expected due to the improved logic in AddRecMayOverflow.

Example:

[MethodImpl(MethodImplOptions.NoInlining)]
public List<int> List()
{
    var list = _list;
    for (int i = 0; i < list.Count; i++)
        list[i] = default;
    return list;
}

@@ -1,59 +1,59 @@
 ; Assembly listing for method Program:List():System.Collections.Generic.List`1[int]:this (FullOpts)
 ; Emitting BLENDED_CODE for X64 with AVX - Windows
 ; FullOpts code
 ; optimized code
 ; rsp based frame
 ; fully interruptible
 ; No PGO data
 ; 0 inlinees with PGO data; 1 single block inlinees; 1 inlinees without PGO data
 ; Final local variable assignments
 ;
 ;  V00 this         [V00,T04] (  3,  3   )     ref  ->  rcx         this class-hnd single-def <Program>
 ;  V01 loc0         [V01,T02] (  6, 15   )     ref  ->  rax         class-hnd single-def <System.Collections.Generic.List`1[int]>
 ;* V02 loc1         [V02,T05] (  0,  0   )     int  ->  zero-ref   
 ;  V03 OutArgs      [V03    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;  V04 tmp1         [V04,T01] (  3, 24   )     ref  ->   r8         "arr expr"
 ;  V05 cse0         [V05,T03] (  4, 10   )     int  ->  rdx         "CSE #01: aggressive"
 ;  V06 rat0         [V06,T00] (  7, 25   )    long  ->  rcx         "Widened IV V02"
 ;
 ; Lcl frame size = 40
 
 G_M48712_IG01:  ;; offset=0x0000
        sub      rsp, 40
 						;; size=4 bbWeight=1 PerfScore 0.25
 G_M48712_IG02:  ;; offset=0x0004
        mov      rax, gword ptr [rcx+0x08]
        xor      ecx, ecx
        mov      edx, dword ptr [rax+0x10]
        test     edx, edx
        jle      SHORT G_M48712_IG04
        align    [15 bytes for IG03]
 						;; size=28 bbWeight=1 PerfScore 5.75
 G_M48712_IG03:  ;; offset=0x0020
        cmp      ecx, edx
        jae      SHORT G_M48712_IG05
        mov      r8, gword ptr [rax+0x08]
        cmp      ecx, dword ptr [r8+0x08]
        jae      SHORT G_M48712_IG06
        xor      r10d, r10d
        mov      dword ptr [r8+4*rcx+0x10], r10d
        inc      dword ptr [rax+0x14]
-       inc      ecx
+       inc      rcx
        cmp      ecx, edx
        jl       SHORT G_M48712_IG03
-						;; size=31 bbWeight=4 PerfScore 52.00
-G_M48712_IG04:  ;; offset=0x003F
+						;; size=32 bbWeight=4 PerfScore 52.00
+G_M48712_IG04:  ;; offset=0x0040
        add      rsp, 40
        ret      
 						;; size=5 bbWeight=1 PerfScore 1.25
-G_M48712_IG05:  ;; offset=0x0044
+G_M48712_IG05:  ;; offset=0x0045
        call     [System.ThrowHelper:ThrowArgumentOutOfRange_IndexMustBeLessException()]
        int3     
 						;; size=7 bbWeight=0 PerfScore 0.00
-G_M48712_IG06:  ;; offset=0x004B
+G_M48712_IG06:  ;; offset=0x004C
        call     CORINFO_HELP_RNGCHKFAIL
        int3     
 						;; size=6 bbWeight=0 PerfScore 0.00
 
-; Total bytes of code 81, prolog size 4, PerfScore 59.25, instruction count 24, allocated bytes for code 81 (MethodHash=a57541b7) for method Program:List():System.Collections.Generic.List`1[int]:this (FullOpts)
+; Total bytes of code 82, prolog size 4, PerfScore 59.25, instruction count 24, allocated bytes for code 82 (MethodHash=a57541b7) for method Program:List():System.Collections.Generic.List`1[int]:this (FullOpts)

For certain Intel CPUs it can be a perf regression to keep an IV update as a 32-bit operation when the IV is later used as a 64-bit register, despite the fact that the operation should come with explicit zeroing of the upper bits. This PR starts detecting and widening some self-updates as part of IV widening when we can prove that doing the self-update as a 64 bit operation is safe (i.e. ends up with the same zeroed upper bits).

jakobbotsch · 2024-08-08T14:07:58Z

@EgorBot -intel --disasm

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using BenchmarkDotNet.Attributes;

public class Benchmark
{
    private List<int> _list = new List<int>(Enumerable.Range(0, 512));

    [Benchmark]
    public List<int> ListIndexerSet()
    {
        var list = _list;
        for (int i = 0; i < list.Count; i++)
            list[i] = default;
        return list;
    }
}

EgorBot · 2024-08-08T14:26:34Z

Benchmark results on Intel

BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores
  Job-RIXUYF : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-GHQSGJ : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Method	Toolchain	Mean	Error	Ratio	Code Size
ListIndexerSet	Main	1.086 μs	0.0040 μs	1.00	75 B
ListIndexerSet	PR	1.085 μs	0.0003 μs	1.00	76 B

BDN_Artifacts.zip

jakobbotsch · 2024-08-08T14:28:31Z

Hmm no, that didn't help here.. Given the results of #104655 it really seems like there is something micro architectural going on here and that the slow down might not be related to the widening at all.

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 8, 2024

dotnet-policy-service bot assigned jakobbotsch Aug 8, 2024

This comment was marked as resolved.

Sign in to view

jakobbotsch closed this Aug 8, 2024

This was referenced Aug 8, 2024

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

[NativeAOT-iOS] MacCatalyst fails to start/log nativeaot.SmokeTests tests #105804

Open

github-actions bot locked and limited conversation to collaborators Sep 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: Widen self-updates as part of IV widening #106131

JIT: Widen self-updates as part of IV widening #106131

jakobbotsch commented Aug 8, 2024 •

edited

Loading

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

jakobbotsch commented Aug 8, 2024

EgorBot commented Aug 8, 2024

jakobbotsch commented Aug 8, 2024

JIT: Widen self-updates as part of IV widening #106131

JIT: Widen self-updates as part of IV widening #106131

Conversation

jakobbotsch commented Aug 8, 2024 • edited Loading

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

jakobbotsch commented Aug 8, 2024

EgorBot commented Aug 8, 2024

jakobbotsch commented Aug 8, 2024

jakobbotsch commented Aug 8, 2024 •

edited

Loading