Extend CPU capabilities detection for osx-arm64 (#62832) #62958

Merged

neon-sunset
Contributor

@neon-sunset neon-sunset commented Dec 17, 2021

Fixes #62832

This change adds CPU capability detection for targets such as osx-arm64, which do not provide getauxval(AT_HWCAP). It follows the approach recommended in https://developer.apple.com/documentation/apple-silicon/addressing-architectural-differences-in-your-macos-code.

In addition, we opportunistically assume that the environment never traps the execution of DC ZVA memory-zeroing instructions in the kernel or host system. However, if https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/codeman.cpp#L1485 completely overrides this flag, then this part of the change is unnecessary. Please let me know the best way to handle it.

Otherwise, on M1, DC ZVA zeroing completely saturates data bandwidth which makes it fully efficient.
References: https://www.realworldtech.com/forum/?threadid=192122&curpostid=192321 and https://twitter.com/polydron/status/1458890243336138771?s=21
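
For readers following along, here is a minimal sketch of the general idea: query one sysctl key per feature and fold the answers into a flag word. The flag names, the `SysctlQuery` callback, and `DetectFeatures` are hypothetical and are not the code in this PR. The three key names are the ones I understand recent macOS to expose, so treat them as assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Hypothetical flag bits for this sketch, not the runtime's real constants.
enum CpuFeature : std::uint32_t {
    kFeatureAes    = 1u << 0,
    kFeatureCrc32  = 1u << 1,
    kFeatureSha256 = 1u << 2,
};

// Returns true and stores the value if the named key could be read.
using SysctlQuery = std::function<bool(const std::string& name, std::int64_t& value)>;

#if defined(__APPLE__)
// Real query: sysctlbyname() returns non-zero for keys the OS does not know.
inline bool QuerySysctl(const std::string& name, std::int64_t& value) {
    std::int64_t raw = 0;
    std::size_t size = sizeof(raw);
    if (sysctlbyname(name.c_str(), &raw, &size, nullptr, 0) != 0) {
        return false;
    }
    value = raw;
    return true;
}
#endif

// Sets a flag only when the key exists AND reports a non-zero value.
inline std::uint32_t DetectFeatures(const SysctlQuery& query) {
    assert(query && "a query callback is required");
    struct Probe { const char* key; CpuFeature flag; };
    static const Probe kProbes[] = {
        {"hw.optional.arm.FEAT_AES",    kFeatureAes},     // assumed key name
        {"hw.optional.armv8_crc32",     kFeatureCrc32},   // assumed key name
        {"hw.optional.arm.FEAT_SHA256", kFeatureSha256},  // assumed key name
    };
    std::uint32_t flags = 0;
    for (const Probe& probe : kProbes) {
        std::int64_t value = 0;
        if (query(probe.key, value) && value != 0) {
            flags |= probe.flag;
        }
    }
    return flags;
}
```

On macOS, `DetectFeatures(QuerySysctl)` would run the real queries. Passing the query as a callback keeps the sketch testable on other platforms, and a key that the OS does not know simply leaves its flag unset.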

@ghost ghost added the community-contribution label Dec 17, 2021
@dnfadmin

dnfadmin commented Dec 17, 2021

CLA assistant check
All CLA requirements met.

@neon-sunset
Contributor Author

/azp run runtime

@azure-pipelines

Commenter does not have sufficient privileges for PR 62958 in repo dotnet/runtime

@neon-sunset
Contributor Author

Hm, it seems like the bot didn't tag area owners. CC @janvorli just in case :)

@janvorli
Member

janvorli commented Jan 3, 2022

cc: @dotnet/jit-contrib, this stuff is managed by the JIT people.

@EgorBo
Member

EgorBo commented Jan 3, 2022

Looks good to me, except for the Dczva part, which I am not familiar with, so I'm waiting for approval from @echesakovMSFT, who used it in #46609.

A small, unimportant note: this PR exposes these intrinsics only on osx >= 12.0; 11.x didn't expose them like this in sysctl.

@neon-sunset
Contributor Author

@EgorBo if you still have an M1 device with 12.0/12.0.1, could you post the output of sysctl -a | grep machdep.cpu?

@echesakov echesakov self-requested a review January 3, 2022 18:29
@echesakov
Contributor

echesakov commented Jan 3, 2022

@neon-sunset Do you know why the logic in https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/codeman.cpp#L1485 does not set the flag?

I looked more carefully at the Arm spec and it might be that DC ZVA uses a different block size when zeroing the memory. If that is the case, we cannot just enable it like that.
The reason is that the JIT assumes that DC ZVA will zero 64 bytes per instruction here:

compiler->compOpportunisticallyDependsOn(InstructionSet_Dczva))
I can debug this on an M1 device later today.

Update: given that M1 uses 128 bytes cacheline I am afraid that this is the case:

sysctl -a | grep hw.cacheline
hw.cachelinesize: 128

@EgorBo
Member

EgorBo commented Jan 3, 2022

@EgorBo if you still have an M1 device with 12.0/12.0.1, could you post the output of sysctl -a | grep machdep.cpu?

machdep.cpu.brand_string: Apple M1
machdep.cpu.core_count: 8
machdep.cpu.cores_per_package: 8
machdep.cpu.logical_per_package: 8
machdep.cpu.thread_count: 8

Update: given that M1 uses 128 bytes cacheline

as well as most other arm hw I'd guess?

@echesakov
Contributor

echesakov commented Jan 3, 2022

Update: given that M1 uses 128 bytes cacheline

as well as most other arm hw I'd guess?

I don't know if most of them use 128 bytes cacheline.
I always thought it's 64 bytes. For example, Cortex-A77

@EgorBo
Member

EgorBo commented Jan 3, 2022

I don't know if most of them use 128 bytes cacheline. I always thought it's 64 bytes. For example, Cortex-A77

.NET assumes it's 128 by default on arm, e.g.

internal const int CACHE_LINE_SIZE = 128;
(used against "false sharing")
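
To illustrate what such a constant is for, here is a rough C++ analogue of padding against false sharing. It is not the .NET implementation; `kCacheLineSize`, `PaddedCounter`, and the helper functions are hypothetical names, and 128 simply mirrors the constant quoted above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed line size, mirroring the 128-byte constant quoted above.
constexpr std::size_t kCacheLineSize = 128;

// Each counter occupies its own 128-byte slot, so two threads updating
// neighbouring counters never write to the same cache line.
struct alignas(kCacheLineSize) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLineSize,
              "alignas pads the struct up to a full line");

// Increments the counter owned by worker `index`.
inline void Increment(PaddedCounter* slots, std::size_t count, std::size_t index) {
    assert(index < count);
    slots[index].value.fetch_add(1, std::memory_order_relaxed);
}

// Distance in bytes between two adjacent slots of an array.
inline std::size_t SlotStride(const PaddedCounter* slots) {
    const auto* first  = reinterpret_cast<const unsigned char*>(&slots[0]);
    const auto* second = reinterpret_cast<const unsigned char*>(&slots[1]);
    return static_cast<std::size_t>(second - first);
}
```

Over-estimating the line size only costs some memory, while under-estimating it brings false sharing back, which is one plausible reason to pick the larger value as a default.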

@neon-sunset
Contributor Author

neon-sunset commented Jan 3, 2022

An update regarding DC ZVA. After re-reading the documentation, it appears that the zeroed block size is not tied to the cache line size.

Excerpt from ARM documentation on the register which indicates the block size and availability:

DCZID_EL0, Data Cache Zero ID register
...
Indicates the block size that is written with byte values of 0 by the DC ZVA (Data Cache Zero by Address) System instruction....
--- DZP, bit [4] ---
Data Zero Prohibited. This field indicates whether use of DC ZVA instructions is permitted or prohibited.
...
--- BS, bits [3:0] ---
Log2 of the block size in words. The maximum size supported is 2KB (value == 9).

After further research, I found an interesting piece of code that uses a heuristic to check the behaviour of DC ZVA:
https://i10git.cs.fau.de/pycodegen/pystencils/-/merge_requests/230/diffs#d279bd8f1d8c70fc71e225327b9a92bf44666f4d

On my device (M1 Pro macOS 12.1) it does return 64 (bytes) as the zeroed block size.
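
The idea behind that kind of check can be sketched as follows: fill a scratch buffer with a non-zero pattern, execute one zeroing operation in the middle, and count how many bytes became zero. This is my own simplified version, not the linked code; `ZeroOp`, `DcZva`, and `MeasureZeroedBytes` are hypothetical names, and the 2048-byte bound comes from the maximum block size in the excerpt above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// One zeroing operation applied to the block containing `addr`.
using ZeroOp = std::function<void(unsigned char* addr)>;

#if defined(__aarch64__)
// The real instruction. It traps if DCZID_EL0.DZP is set, so call it only
// when DC ZVA is known to be permitted.
inline void DcZva(unsigned char* addr) {
    __asm__ volatile("dc zva, %0" : : "r"(addr) : "memory");
}
#endif

// Returns how many bytes a single zeroOp call cleared.
// maxBlock must be a power of two and at least as large as the real block.
inline std::size_t MeasureZeroedBytes(const ZeroOp& zeroOp, std::size_t maxBlock = 2048) {
    assert(zeroOp);
    assert(maxBlock != 0 && (maxBlock & (maxBlock - 1)) == 0);

    // Three times maxBlock leaves room for alignment plus one full block.
    std::vector<unsigned char> scratch(3 * maxBlock, 0xFF);

    // Pick a target aligned to maxBlock, so the zeroed block starts exactly
    // at the target and stays inside the buffer.
    auto base = reinterpret_cast<std::uintptr_t>(scratch.data()) + maxBlock;
    base = (base + maxBlock - 1) & ~(static_cast<std::uintptr_t>(maxBlock) - 1);
    zeroOp(reinterpret_cast<unsigned char*>(base));

    std::size_t zeroed = 0;
    for (unsigned char byte : scratch) {
        if (byte == 0) {
            ++zeroed;
        }
    }
    return zeroed;
}
```

On an AArch64 machine where the instruction is permitted, `MeasureZeroedBytes(DcZva)` would report the effective block size; the callback form lets the counting logic be exercised elsewhere with a stand-in.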

@echesakov
Contributor

@neon-sunset Thanks for following up. I believe we don't need to do anything for DC ZVA in this PR - if GetDataCacheZeroIDReg() returns 4 on macOS, then the corresponding flag will be set.
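
For reference, the value 4 maps to a 64-byte block under the DCZID_EL0 layout quoted earlier in the thread (DZP in bit 4, BS in bits [3:0] as log2 of the block size in 4-byte words). A small sketch of that decoding, with `DczidInfo`, `DecodeDczid`, and `CanUse64ByteDcZva` as hypothetical names rather than runtime functions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct DczidInfo {
    bool permitted;          // false when DZP (bit 4) is set
    std::size_t blockBytes;  // bytes zeroed by one DC ZVA
};

// Decodes a raw DCZID_EL0 value.
inline DczidInfo DecodeDczid(std::uint64_t dczid) {
    const unsigned bs = static_cast<unsigned>(dczid & 0xF);  // BS, bits [3:0]
    assert(bs <= 9 && "the quoted maximum is 2KB, i.e. BS == 9");
    const bool prohibited = ((dczid >> 4) & 1) != 0;          // DZP, bit [4]
    // BS is log2 of the size in words; a word is 4 bytes.
    return DczidInfo{!prohibited, static_cast<std::size_t>(4) << bs};
}

// True only for the case the JIT relies on: permitted and exactly 64 bytes.
inline bool CanUse64ByteDcZva(std::uint64_t dczid) {
    const DczidInfo info = DecodeDczid(dczid);
    return info.permitted && info.blockBytes == 64;
}
```

A raw value of 4 therefore means "permitted, 64 bytes"; any other block size or a set DZP bit would make `CanUse64ByteDcZva` return false.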

@neon-sunset
Contributor Author

@echesakovMSFT Thanks, that's what I was asking about in the initial post :) Still, it was interesting to look into. Will revert the DC ZVA part.

@echesakov
Contributor

I double checked this.

Consider the following C# program:

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

namespace StackZeroing
{
    [StructLayout(LayoutKind.Sequential, Size = 1024)]
    struct LargeStruct
    {
    }

    class Program
    {
        [MethodImpl(MethodImplOptions.NoInlining)]
        static void KeepAlive(ref LargeStruct ls)
        {
        }

        static void Main(string[] args)
        {
            LargeStruct ls = default(LargeStruct);
            KeepAlive(ref ls);
        }
    }
}

On Apple M1 the JIT will generate:

$CORE_ROOT/corerun StackZeroing.dll
; Assembly listing for method StackZeroing.Program:Main(System.String[])
; Emitting BLENDED_CODE for generic ARM64 CPU - MacOS
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;* V00 arg0         [V00    ] (  0,  0   )     ref  ->  zero-ref    class-hnd single-def
;  V01 loc0         [V01    ] (  1,  1   )  struct (1024) [fp+10H]   do-not-enreg[XS] must-init addr-exposed ld-addr-op single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 1024

G_M10532_IG01:              ;; offset=0000H
        00000000          sub     sp, sp, #0x410
        00000000          stp     fp, lr, [sp]
        00000000          mov     fp, sp
        00000000          movi    v16.16b, #0x00
        00000000          add     x9, fp, #80
        00000000          add     x10, fp, #976
        00000000          stp     q16, q16, [x9,#-64]
        00000000          stp     q16, q16, [x9,#-32]
        00000000          bfm     x9, xzr, #0, #5
        00000000          dczva   x9
        00000000          add     x9, x9, #64
        00000000          cmp     x9, x10
        00000000          blo     pc-16 (-4 instructions)
        00000000          stp     q16, q16, [x10]
        00000000          stp     q16, q16, [x10,#32]
						;; bbWeight=1    PerfScore 11.50
G_M10532_IG02:              ;; offset=003CH
        00000000          add     x0, fp, #16	// [V01 loc0]
        00000000          bl      StackZeroing.Program:KeepAlive(byref)
						;; bbWeight=1    PerfScore 1.50
G_M10532_IG03:              ;; offset=0044H
        00000000          ldp     fp, lr, [sp]
        00000000          add     sp, sp, #0x410
        00000000          ret     lr
						;; bbWeight=1    PerfScore 2.50

; Total bytes of code 80, prolog size 60, PerfScore 23.50, instruction count 20, allocated bytes for code 80 (MethodHash=826bd6db) for method StackZeroing.Program:Main(System.String[])
; ============================================================
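
As I read the prolog above, it zeroes the first 64 bytes of the struct with ordinary stores, rounds the pointer down to a 64-byte boundary, clears whole blocks with `dczva` in a loop, and finishes the last 64 bytes with ordinary stores again. The sketch below mimics that shape in C++; `ZeroBlock` is a `memset` stand-in for the instruction, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kBlock = 64;  // block size the JIT assumes for DC ZVA

// Stand-in for `dczva`: clears one 64-byte-aligned block.
inline void ZeroBlock(unsigned char* p) {
    assert(reinterpret_cast<std::uintptr_t>(p) % kBlock == 0);
    std::memset(p, 0, kBlock);
}

// Zeroes [begin, begin + size) for size >= 128, with any alignment of begin.
inline void ZeroFrame(unsigned char* begin, std::size_t size) {
    assert(size >= 2 * kBlock);
    unsigned char* tail = begin + size - kBlock;  // x10 in the listing

    // Head: the two `stp q16, q16` stores before the loop.
    std::memset(begin, 0, kBlock);

    // x9 = begin + 64, then `bfm x9, xzr, #0, #5` clears the low six bits.
    auto addr = reinterpret_cast<std::uintptr_t>(begin + kBlock);
    addr &= ~static_cast<std::uintptr_t>(kBlock - 1);
    unsigned char* p = reinterpret_cast<unsigned char*>(addr);

    // Loop: dczva / add #64 / cmp / blo. The rounded-down pointer may
    // overlap the head, and the last block may overlap the tail.
    do {
        ZeroBlock(p);
        p += kBlock;
    } while (p < tail);

    // Tail: the two `stp q16, q16` stores after the loop.
    std::memset(tail, 0, kBlock);
}
```

The overlapping head and tail stores are what let the loop work on aligned blocks only, without ever writing outside the struct.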

Contributor

@echesakov echesakov left a comment

LGTM

@echesakov echesakov merged commit 3580ba7 into dotnet:main Jan 4, 2022
@echesakov
Contributor

@neon-sunset Thank you for your contribution!

@neon-sunset neon-sunset deleted the 62832-fix-osx-arm64-capabilities-detection branch January 5, 2022 06:24
@ghost ghost locked as resolved and limited conversation to collaborators Mar 3, 2022
Labels
area-PAL-coreclr community-contribution
Development

Successfully merging this pull request may close these issues.

System.Runtime.Intrinsics.Arm.Sha256 APIs are unavailable on Apple Silicon
5 participants