Extend CPU capabilities detection for osx-arm64 (#62832) #62958
/azp run runtime |
Commenter does not have sufficient privileges for PR 62958 in repo dotnet/runtime |
Hm, it seems like the bot didn't tag area owners. CC @janvorli just in case :) |
cc: @dotnet/jit-contrib, this stuff is managed by the JIT people. |
Looks good to me except A small unimportant note: this PR exposes these intrinsics only on osx >=12.0, 11.x didn't expose them like this in |
@EgorBo if you still have an M1 device with 12.0/12.0.1, could you post the output of |
@neon-sunset Do you know why the logic in https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/codeman.cpp#L1485 does not set the flag? I looked more carefully at the Arm spec and it might be that runtime/src/coreclr/jit/codegencommon.cpp Line 5759 in 6c50d9f
I can debug this on an M1 device later today. Update: given that M1 uses a 128-byte cache line, I am afraid that this is the case:
|
as well as most other arm hw I'd guess? |
I don't know if most of them use a 128-byte cache line. |
.NET assumes it's 128 by default on arm, e.g.
|
An update regarding Excerpt from ARM documentation on the register which indicates the block size and availability:
After further research, I found an interesting piece of code that performs the heuristics to check the behaviour of On my device (M1 Pro macOS 12.1) it does return |
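For readers following the register discussion above, here is a minimal C sketch of how the value of DCZID_EL0 is usually decoded. It relies on the architectural layout (bits [3:0] hold log2 of the block size in 4-byte words, bit 4 is the "zeroing prohibited" flag). The names `dczva_info`, `decode_dczid` and `read_dczid_el0` are made up for illustration and are not taken from the runtime sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: decoded view of DCZID_EL0. */
typedef struct {
    bool     prohibited;   /* DZP, bit 4: DC ZVA must not be used */
    unsigned block_bytes;  /* bytes zeroed by one DC ZVA */
} dczva_info;

/* Decode a raw DCZID_EL0 value. BS (bits [3:0]) is log2 of the block
   size in 4-byte words, so the size in bytes is 4 << BS. */
dczva_info decode_dczid(uint64_t dczid)
{
    dczva_info info;
    info.prohibited  = ((dczid >> 4) & 1u) != 0;
    info.block_bytes = 4u << (dczid & 0xFu);
    return info;
}

#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
/* Read the register from user mode; only compiled for AArch64 targets. */
uint64_t read_dczid_el0(void)
{
    uint64_t value;
    __asm__ volatile("mrs %0, dczid_el0" : "=r"(value));
    return value;
}
#endif
```

On an AArch64 machine, `decode_dczid(read_dczid_el0())` gives the block size and the prohibited flag the hardware reports. It does not reveal whether a hypervisor or kernel would trap the instruction, which is the separate concern raised in this thread.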
@neon-sunset Thanks for following up. I believe we don't need to do anything for |
@echesakovMSFT Thanks, that's what I was asking about in the initial post :) Still, it was interesting to look into. Will revert the |
I double checked this. Consider the following C# program:
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
namespace StackZeroing
{
[StructLayout(LayoutKind.Sequential, Size = 1024)]
struct LargeStruct
{
}
class Program
{
[MethodImpl(MethodImplOptions.NoInlining)]
static void KeepAlive(ref LargeStruct ls)
{
}
static void Main(string[] args)
{
LargeStruct ls = default(LargeStruct);
KeepAlive(ref ls);
}
}
}
On Apple M1 the JIT will generate:
$CORE_ROOT/corerun StackZeroing.dll
; Assembly listing for method StackZeroing.Program:Main(System.String[])
; Emitting BLENDED_CODE for generic ARM64 CPU - MacOS
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;* V00 arg0 [V00 ] ( 0, 0 ) ref -> zero-ref class-hnd single-def
; V01 loc0 [V01 ] ( 1, 1 ) struct (1024) [fp+10H] do-not-enreg[XS] must-init addr-exposed ld-addr-op single-def
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace"
;
; Lcl frame size = 1024
G_M10532_IG01: ;; offset=0000H
00000000 sub sp, sp, #0x410
00000000 stp fp, lr, [sp]
00000000 mov fp, sp
00000000 movi v16.16b, #0x00
00000000 add x9, fp, #80
00000000 add x10, fp, #976
00000000 stp q16, q16, [x9,#-64]
00000000 stp q16, q16, [x9,#-32]
00000000 bfm x9, xzr, #0, #5
00000000 dczva x9
00000000 add x9, x9, #64
00000000 cmp x9, x10
00000000 blo pc-16 (-4 instructions)
00000000 stp q16, q16, [x10]
00000000 stp q16, q16, [x10,#32]
;; bbWeight=1 PerfScore 11.50
G_M10532_IG02: ;; offset=003CH
00000000 add x0, fp, #16 // [V01 loc0]
00000000 bl StackZeroing.Program:KeepAlive(byref)
;; bbWeight=1 PerfScore 1.50
G_M10532_IG03: ;; offset=0044H
00000000 ldp fp, lr, [sp]
00000000 add sp, sp, #0x410
00000000 ret lr
;; bbWeight=1 PerfScore 2.50
; Total bytes of code 80, prolog size 60, PerfScore 23.50, instruction count 20, allocated bytes for code 80 (MethodHash=826bd6db) for method StackZeroing.Program:Main(System.String[])
; ============================================================ |
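The shape of the prolog above (a 64-byte head store, an aligned block loop, and a 64-byte tail store) can be sketched in C roughly as follows. This is an illustration only, not the JIT's implementation: `memset` stands in for both the `stp q16, q16` pairs and `dczva`, the 64-byte block size is assumed from the `add x9, x9, #64` step in the listing, and the names `zero_region` and `zero_block` are hypothetical. Unlike the generated code, the sketch checks the loop condition before the first block so that it also behaves for short regions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK = 64 };  /* assumed DC ZVA block size, as in the listing */

/* Stand-in for DC ZVA: zero one naturally aligned block. */
static void zero_block(unsigned char *aligned)
{
    memset(aligned, 0, BLOCK);
}

/* Zero [start, start + len) with unaligned head/tail stores and an
   aligned block loop in between. */
void zero_region(unsigned char *start, size_t len)
{
    if (len < BLOCK) {               /* too small for the block strategy */
        memset(start, 0, len);
        return;
    }

    unsigned char *tail = start + (len - BLOCK);

    memset(start, 0, BLOCK);         /* head: like the first two stp pairs */

    /* Round (start + BLOCK) down to a block boundary, like the bfm
       instruction that clears the low six bits of x9. */
    uintptr_t aligned = ((uintptr_t)start + BLOCK) & ~(uintptr_t)(BLOCK - 1);
    unsigned char *cur = start + (aligned - (uintptr_t)start);

    while (cur < tail) {             /* dczva / add / cmp / blo loop */
        zero_block(cur);
        cur += BLOCK;
    }

    memset(tail, 0, BLOCK);          /* tail: like the last two stp pairs */
}
```

The head and tail stores may overlap the blocks zeroed by the loop. That overlap is what lets the loop work on aligned addresses without any per-byte fixups at either end.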
LGTM
@neon-sunset Thank you for your contribution! |
Fixes #62832
This change adds support for detection of CPU capabilities on targets like osx-arm64 which do not have getauxval(AT_HWCAP), as recommended in https://developer.apple.com/documentation/apple-silicon/addressing-architectural-differences-in-your-macos-code.
In addition, we also opportunistically assume that the environment does not support the scenario in which the execution of DC ZVA memory zeroing instructions can be trapped by the kernel or host system. However, if https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/codeman.cpp#L1485 completely overrides this flag, then the change is unnecessary. Please let me know what the best way to handle it is.
Otherwise, on M1, DC ZVA zeroing completely saturates data bandwidth, which makes it fully efficient.
References: https://www.realworldtech.com/forum/?threadid=192122&curpostid=192321 and https://twitter.com/polydron/status/1458890243336138771?s=21
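As an illustration of the approach the linked Apple article recommends in place of getauxval, a capability query through `sysctlbyname` could look roughly like the sketch below. The helper name `has_cpu_feature` is hypothetical, and the key names mentioned afterwards are examples of `hw.optional.*` entries, not necessarily the exact set this PR queries. On non-Apple platforms the sketch simply reports the feature as absent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Returns true if the named sysctl key exists and holds a non-zero
   value. Unknown keys and non-Apple platforms yield false. */
bool has_cpu_feature(const char *name)
{
    assert(name != NULL);
#ifdef __APPLE__
    int64_t value = 0;
    size_t size = sizeof(value);
    if (sysctlbyname(name, &value, &size, NULL, 0) != 0) {
        return false;  /* key not present on this OS version or CPU */
    }
    return value != 0;
#else
    (void)name;
    return false;
#endif
}
```

A caller would then probe keys such as `hw.optional.armv8_1_atomics` or `hw.optional.armv8_crc32` and set the corresponding capability flags. A missing key is treated the same as an unsupported feature, which keeps the probe safe on OS versions that do not expose it.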