Extend CPU capabilities detection for osx-arm64 (#62832) #62958

neon-sunset · 2021-12-17T15:59:19Z

This change adds support for detecting CPU capabilities on targets such as osx-arm64, which do not have getauxval(AT_HWCAP). It queries sysctl instead, as recommended in https://developer.apple.com/documentation/apple-silicon/addressing-architectural-differences-in-your-macos-code.
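For readers unfamiliar with the sysctl approach, here is a minimal sketch of what such a probe could look like in C. It is not the code from this PR: the helper names and the CAP_* flags are hypothetical, and the three sysctl key names are my assumption of what macOS exposes (the FEAT_* style keys reportedly only on 12.0 and later). On non-Apple targets the helper simply reports "not available".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Hypothetical capability bits, for illustration only. */
enum {
    CAP_CRC32   = 1u << 0,
    CAP_ATOMICS = 1u << 1,
    CAP_SHA256  = 1u << 2
};

/* Returns true only if the named sysctl key exists and holds a non-zero value.
   A missing key (older OS, other platform) is treated as "not supported". */
static bool sysctl_flag_set(const char *name)
{
#if defined(__APPLE__)
    int64_t value = 0;            /* zero-initialized so a 4-byte result still reads correctly */
    size_t size = sizeof(value);
    return sysctlbyname(name, &value, &size, NULL, 0) == 0 && value != 0;
#else
    (void)name;
    return false;
#endif
}

/* Key names below are assumptions, not taken from the PR. */
static unsigned detect_arm64_caps(void)
{
    unsigned caps = 0;
    if (sysctl_flag_set("hw.optional.armv8_crc32"))      caps |= CAP_CRC32;
    if (sysctl_flag_set("hw.optional.armv8_1_atomics"))  caps |= CAP_ATOMICS;
    if (sysctl_flag_set("hw.optional.arm.FEAT_SHA256"))  caps |= CAP_SHA256;
    return caps;
}
```

Treating a failed lookup as "unsupported" keeps the probe safe on OS versions that do not publish a given key.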

In addition, we opportunistically assume that the environment does not support the scenario in which the execution of DC ZVA memory zeroing instructions can be trapped by the kernel or host system. However, if https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/codeman.cpp#L1485 completely overrides this flag, then the change is unnecessary. Please let me know what the best way to handle it is.

Otherwise, on M1, DC ZVA zeroing completely saturates data bandwidth which makes it fully efficient.
References: https://www.realworldtech.com/forum/?threadid=192122&curpostid=192321 and https://twitter.com/polydron/status/1458890243336138771?s=21

dnfadmin · 2021-12-17T15:59:31Z

All CLA requirements met.

neon-sunset · 2021-12-20T14:05:23Z

/azp run runtime

azure-pipelines · 2021-12-20T14:05:28Z

Commenter does not have sufficient privileges for PR 62958 in repo dotnet/runtime

neon-sunset · 2021-12-22T11:50:53Z

Hm, it seems like the bot didn't tag area owners. CC @janvorli just in case :)

janvorli · 2022-01-03T15:02:34Z

cc: @dotnet/jit-contrib, this stuff is managed by the JIT people.

EgorBo · 2022-01-03T15:13:47Z

Looks good to me, except for Dczva, which I am not familiar with, so I'm waiting for approval from @echesakovMSFT, who used it in #46609.

A small, unimportant note: this PR exposes these intrinsics only on osx >=12.0; 11.x didn't expose them like this in sysctl.

neon-sunset · 2022-01-03T17:19:35Z

@EgorBo if you still have an M1 device with 12.0/12.0.1, could you post the output of sysctl -a | grep machdep.cpu?

echesakov · 2022-01-03T18:32:38Z

@neon-sunset Do you know why the logic in https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/codeman.cpp#L1485 does not set the flag?

I looked more carefully at the Arm spec, and it might be that DC ZVA uses a different block size when zeroing the memory. If that is the case, we cannot just enable it like that.
The reason is that the JIT assumes that DC ZVA will zero 64 bytes per instruction here:

runtime/src/coreclr/jit/codegencommon.cpp, line 5759 in 6c50d9f:

compiler->compOpportunisticallyDependsOn(InstructionSet_Dczva))

I can debug this on an M1 device later today.

Update: given that M1 uses 128 bytes cacheline, I am afraid that this is the case:

sysctl -a | grep hw.cacheline
hw.cachelinesize: 128

EgorBo · 2022-01-03T19:19:02Z

> @EgorBo if you still have an M1 device with 12.0/12.0.1, could you post the output of sysctl -a | grep machdep.cpu?

machdep.cpu.brand_string: Apple M1
machdep.cpu.core_count: 8
machdep.cpu.cores_per_package: 8
machdep.cpu.logical_per_package: 8
machdep.cpu.thread_count: 8

> Update: given that M1 uses 128 bytes cacheline

as well as most other arm hw I'd guess?

echesakov · 2022-01-03T19:34:40Z

> > Update: given that M1 uses 128 bytes cacheline
>
> as well as most other arm hw I'd guess?

I don't know if most of them use 128 bytes cacheline.
I always thought it's 64 bytes. For example, Cortex-A77

EgorBo · 2022-01-03T19:40:40Z

> I don't know if most of them use 128 bytes cacheline. I always thought it's 64 bytes. For example, Cortex-A77

.NET assumes it's 128 by default on arm, e.g.

runtime/src/libraries/System.Private.CoreLib/src/Internal/Padding.cs, line 13 in 3847790:

internal const int CACHE_LINE_SIZE = 128;

(used against "false sharing")

neon-sunset · 2022-01-03T21:05:49Z

An update regarding DC ZVA. After re-reading the documentation, it appears that the zeroed block size is not constrained or limited to the cache line size.

Excerpt from ARM documentation on the register which indicates the block size and availability:

DCZID_EL0, Data Cache Zero ID register
...
Indicates the block size that is written with byte values of 0 by the DC ZVA (Data Cache Zero by Address) System instruction....
--- DZP, bit [4] ---
Data Zero Prohibited. This field indicates whether use of DC ZVA instructions is permitted or prohibited.
...
--- BS, bits [3:0] ---
Log2 of the block size in words. The maximum size supported is 2KB (value == 9).
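To make the field layout in the excerpt concrete, here is a small sketch in C that decodes a raw DCZID_EL0 value into a block size in bytes. The function name is hypothetical, and the sketch assumes 4-byte words, which is consistent with the excerpt's own maximum (BS == 9 giving 2KB). Reading the register itself is left out; the raw value is passed in.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decodes a raw DCZID_EL0 value.
   Returns the DC ZVA block size in bytes, or 0 when the DZP bit
   says the instruction is prohibited. */
static uint32_t dczva_block_bytes(uint64_t dczid)
{
    bool prohibited = ((dczid >> 4) & 1u) != 0;   /* DZP, bit [4] */
    uint32_t bs = (uint32_t)(dczid & 0xFu);       /* BS, bits [3:0]: log2 of size in words */

    if (prohibited) {
        return 0;
    }
    return 4u << bs;   /* 4 bytes per word, 2^BS words per block */
}
```

Under this reading, a register value of 4 corresponds to a 64-byte block and a value of 5 to a 128-byte block.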

After further research, I found an interesting piece of code that performs the heuristics to check the behaviour of DC ZVA:
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/230/diffs#d279bd8f1d8c70fc71e225327b9a92bf44666f4d

On my device (M1 Pro macOS 12.1) it does return 64 (bytes) as the zeroed block size.

echesakov · 2022-01-03T21:11:37Z

@neon-sunset Thanks for following up. I believe we don't need to do anything for DC ZVA in this PR: if GetDataCacheZeroIDReg() returns 4 on macOS, then the corresponding flag will be set.

neon-sunset · 2022-01-03T21:24:37Z

@echesakovMSFT Thanks, that's what I was asking about in the initial post :) Still, it was interesting to look into. Will revert the DC ZVA part.

echesakov · 2022-01-03T21:30:02Z

I double checked this.

Consider the following C# program:

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

namespace StackZeroing
{
    [StructLayout(LayoutKind.Sequential, Size = 1024)]
    struct LargeStruct
    {
    }

    class Program
    {
        [MethodImpl(MethodImplOptions.NoInlining)]
        static void KeepAlive(ref LargeStruct ls)
        {
        }

        static void Main(string[] args)
        {
            LargeStruct ls = default(LargeStruct);
            KeepAlive(ref ls);
        }
    }
}

On Apple M1 the JIT will generate:

$CORE_ROOT/corerun StackZeroing.dll
; Assembly listing for method StackZeroing.Program:Main(System.String[])
; Emitting BLENDED_CODE for generic ARM64 CPU - MacOS
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;* V00 arg0         [V00    ] (  0,  0   )     ref  ->  zero-ref    class-hnd single-def
;  V01 loc0         [V01    ] (  1,  1   )  struct (1024) [fp+10H]   do-not-enreg[XS] must-init addr-exposed ld-addr-op single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 1024

G_M10532_IG01:              ;; offset=0000H
        00000000          sub     sp, sp, #0x410
        00000000          stp     fp, lr, [sp]
        00000000          mov     fp, sp
        00000000          movi    v16.16b, #0x00
        00000000          add     x9, fp, #80
        00000000          add     x10, fp, #976
        00000000          stp     q16, q16, [x9,#-64]
        00000000          stp     q16, q16, [x9,#-32]
        00000000          bfm     x9, xzr, #0, #5
        00000000          dczva   x9
        00000000          add     x9, x9, #64
        00000000          cmp     x9, x10
        00000000          blo     pc-16 (-4 instructions)
        00000000          stp     q16, q16, [x10]
        00000000          stp     q16, q16, [x10,#32]
						;; bbWeight=1    PerfScore 11.50
G_M10532_IG02:              ;; offset=003CH
        00000000          add     x0, fp, #16	// [V01 loc0]
        00000000          bl      StackZeroing.Program:KeepAlive(byref)
						;; bbWeight=1    PerfScore 1.50
G_M10532_IG03:              ;; offset=0044H
        00000000          ldp     fp, lr, [sp]
        00000000          add     sp, sp, #0x410
        00000000          ret     lr
						;; bbWeight=1    PerfScore 2.50

; Total bytes of code 80, prolog size 60, PerfScore 23.50, instruction count 20, allocated bytes for code 80 (MethodHash=826bd6db) for method StackZeroing.Program:Main(System.String[])
; ============================================================
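For anyone who wants to see why the 64-byte assumption matters in the prolog above, here is a rough C model of that zeroing sequence. It is only a sketch: DC ZVA is emulated with memset over the naturally aligned block containing the address, offsets into a buffer stand in for real addresses, and all function names are hypothetical. The struct starts at offset 144, which mimics fp+16 with a 128-byte-aligned fp.

```c
#include <stdint.h>
#include <string.h>

/* Emulates DC ZVA: zeroes the naturally aligned block containing addr.
   block must be a power of two. */
static void dc_zva_model(uint8_t *mem, size_t addr, size_t block)
{
    size_t start = addr & ~(block - 1);
    memset(mem + start, 0, block);
}

/* Mirrors the prolog: base plays the role of fp+16, size is the struct size. */
static void zero_like_prolog(uint8_t *mem, size_t base, size_t size, size_t block)
{
    size_t x9 = base + 64;           /* add x9, fp, #80 */
    size_t x10 = base + size - 64;   /* add x10, fp, #976 */

    memset(mem + base, 0, 64);       /* two leading stp q16, q16 stores */
    x9 &= ~(size_t)63;               /* bfm x9, xzr, #0, #5: align down to 64 */
    do {
        dc_zva_model(mem, x9, block);   /* dczva x9 */
        x9 += 64;                       /* the JIT always steps by 64 */
    } while (x9 < x10);                 /* cmp x9, x10 / blo */
    memset(mem + x10, 0, 64);        /* two trailing stp q16, q16 stores */
}

/* Runs the model on a 0xAA-filled buffer and returns 1 only if exactly
   the struct bytes, and nothing else, ended up zeroed. */
static int prolog_zeroes_exactly(size_t block)
{
    enum { BASE = 144, SIZE = 1024, TOTAL = 2048 };
    static uint8_t mem[TOTAL];

    memset(mem, 0xAA, TOTAL);
    zero_like_prolog(mem, BASE, SIZE, block);
    for (size_t i = 0; i < TOTAL; i++) {
        int inside = (i >= BASE) && (i < BASE + SIZE);
        if ((mem[i] == 0) != inside) {
            return 0;
        }
    }
    return 1;
}
```

In this model a 64-byte block zeroes exactly the struct, while a 128-byte block also wipes the 16 bytes just below it, which in the real frame would be the saved fp/lr pair.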

echesakov

LGTM

echesakov · 2022-01-04T01:13:04Z

@neon-sunset Thank you for your contribution!

Extend CPU capabilities detection for osx-arm64 (dotnet#62832)

fdf28da

ghost added the area-PAL-coreclr and community-contribution labels Dec 17, 2021

neon-sunset mentioned this pull request Dec 17, 2021

System.Runtime.Intrinsics.Arm.Sha256 APIs are unavailable on Apple Silicon #62832

Closed

neon-sunset closed this Dec 20, 2021

neon-sunset reopened this Dec 20, 2021

runfoapp bot mentioned this pull request Dec 20, 2021

RegexKnownPatternTests.TerminationInNonBacktrackingVsBackTracking Failures on Linux ARM32 #62873

Closed

EgorBo approved these changes Jan 3, 2022

echesakov approved these changes Jan 3, 2022

echesakov self-requested a review January 3, 2022 18:29

Revert uncoditional enable for dczva on osx-arm64

001ed99

echesakov approved these changes Jan 3, 2022

echesakov merged commit 3580ba7 into dotnet:main Jan 4, 2022

neon-sunset deleted the 62832-fix-osx-arm64-capabilities-detection branch January 5, 2022 06:24

ghost locked as resolved and limited conversation to collaborators Mar 3, 2022
