
"native" instruction set alias for AOT compilers #73246

Closed
jkotas opened this issue Aug 2, 2022 · 6 comments · Fixed by #87865
Labels: area-NativeAOT-coreclr · help wanted [up-for-grabs] Good issue for external contributors
Milestone: 8.0.0
jkotas (Member) commented Aug 2, 2022

A "native" instruction set alias would match the native architecture of the processor on which publishing happens.
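For illustration, the proposal would presumably surface through the existing IlcInstructionSet MSBuild property (the lse example later in this thread shows that property in use today; the native value is the new alias being requested, so this snippet is hypothetical):

```xml
<!-- Hypothetical .csproj usage once the alias exists -->
<PropertyGroup>
  <PublishAot>true</PublishAot>
  <!-- "native": compile for every ISA extension supported by the machine doing the publish -->
  <IlcInstructionSet>native</IlcInstructionSet>
</PropertyGroup>
```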

Context:

dotnet-issue-labeler commented

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Aug 2, 2022
@jkotas jkotas added this to the Future milestone Aug 2, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Aug 2, 2022
MichalStrehovsky (Member) commented

I think the most maintainable way might be to extract the CPU flag detection from the runtime:

```cpp
bool DetectCPUFeatures()
{
#if defined(HOST_X86) || defined(HOST_AMD64) || defined(HOST_ARM64)
#if defined(HOST_X86) || defined(HOST_AMD64)
    int cpuidInfo[4];
    const int EAX = 0;
    const int EBX = 1;
    const int ECX = 2;
    const int EDX = 3;
    __cpuid(cpuidInfo, 0x00000000);
    uint32_t maxCpuId = static_cast<uint32_t>(cpuidInfo[EAX]);
    if (maxCpuId >= 1)
    {
        __cpuid(cpuidInfo, 0x00000001);
        if (((cpuidInfo[EDX] & (1 << 25)) != 0) && ((cpuidInfo[EDX] & (1 << 26)) != 0)) // SSE & SSE2
        {
            if ((cpuidInfo[ECX] & (1 << 25)) != 0) // AESNI
            {
                g_cpuFeatures |= XArchIntrinsicConstants_Aes;
            }
            if ((cpuidInfo[ECX] & (1 << 1)) != 0) // PCLMULQDQ
            {
                g_cpuFeatures |= XArchIntrinsicConstants_Pclmulqdq;
            }
            if ((cpuidInfo[ECX] & (1 << 0)) != 0) // SSE3
            {
                g_cpuFeatures |= XArchIntrinsicConstants_Sse3;
                if ((cpuidInfo[ECX] & (1 << 9)) != 0) // SSSE3
                {
                    g_cpuFeatures |= XArchIntrinsicConstants_Ssse3;
                    if ((cpuidInfo[ECX] & (1 << 19)) != 0) // SSE4.1
                    {
                        g_cpuFeatures |= XArchIntrinsicConstants_Sse41;
                        if ((cpuidInfo[ECX] & (1 << 20)) != 0) // SSE4.2
                        {
                            g_cpuFeatures |= XArchIntrinsicConstants_Sse42;
                            if ((cpuidInfo[ECX] & (1 << 22)) != 0) // MOVBE
                            {
                                g_cpuFeatures |= XArchIntrinsicConstants_Movbe;
                            }
                            if ((cpuidInfo[ECX] & (1 << 23)) != 0) // POPCNT
                            {
                                g_cpuFeatures |= XArchIntrinsicConstants_Popcnt;
                            }
                            if (((cpuidInfo[ECX] & (1 << 27)) != 0) && ((cpuidInfo[ECX] & (1 << 28)) != 0)) // OSXSAVE & AVX
                            {
                                if (PalIsAvxEnabled() && (xmmYmmStateSupport() == 1))
                                {
                                    g_cpuFeatures |= XArchIntrinsicConstants_Avx;
                                    if ((cpuidInfo[ECX] & (1 << 12)) != 0) // FMA
                                    {
                                        g_cpuFeatures |= XArchIntrinsicConstants_Fma;
                                    }
                                    if (maxCpuId >= 0x07)
                                    {
                                        __cpuidex(cpuidInfo, 0x00000007, 0x00000000);
                                        if ((cpuidInfo[EBX] & (1 << 5)) != 0) // AVX2
                                        {
                                            g_cpuFeatures |= XArchIntrinsicConstants_Avx2;
                                            __cpuidex(cpuidInfo, 0x00000007, 0x00000001);
                                            if ((cpuidInfo[EAX] & (1 << 4)) != 0) // AVX-VNNI
                                            {
                                                g_cpuFeatures |= XArchIntrinsicConstants_AvxVnni;
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
        if (maxCpuId >= 0x07)
        {
            __cpuidex(cpuidInfo, 0x00000007, 0x00000000);
            if ((cpuidInfo[EBX] & (1 << 3)) != 0) // BMI1
            {
                g_cpuFeatures |= XArchIntrinsicConstants_Bmi1;
            }
            if ((cpuidInfo[EBX] & (1 << 8)) != 0) // BMI2
            {
                g_cpuFeatures |= XArchIntrinsicConstants_Bmi2;
            }
        }
    }
    __cpuid(cpuidInfo, 0x80000000);
    uint32_t maxCpuIdEx = static_cast<uint32_t>(cpuidInfo[EAX]);
    if (maxCpuIdEx >= 0x80000001)
    {
        __cpuid(cpuidInfo, 0x80000001);
        if ((cpuidInfo[ECX] & (1 << 5)) != 0) // LZCNT
        {
            g_cpuFeatures |= XArchIntrinsicConstants_Lzcnt;
        }
#ifdef HOST_AMD64
        // AMD has a "fast" mode for fxsave/fxrstor, which omits the saving of xmm registers. The OS will enable this mode
        // if it is supported. So if we continue to use fxsave/fxrstor, we must manually save/restore the xmm registers.
        // fxsr_opt is bit 25 of EDX
        if ((cpuidInfo[EDX] & (1 << 25)) != 0)
            g_fHasFastFxsave = true;
#endif
    }
#endif // HOST_X86 || HOST_AMD64
#if defined(HOST_ARM64)
    PAL_GetCpuCapabilityFlags(&g_cpuFeatures);
#endif
    if ((g_cpuFeatures & g_requiredCpuFeatures) != g_requiredCpuFeatures)
    {
        PalPrintFatalError("\nThe required instruction sets are not supported by the current CPU.\n");
        RhFailFast();
    }
#endif // HOST_X86 || HOST_AMD64 || HOST_ARM64
    return true;
}
#endif // !USE_PORTABLE_HELPERS
```

Into a place that can be shared with the JitInterface native library:

https://github.com/dotnet/runtime/tree/cdf21f143735b8d104c8e636a37eb068904cdd8b/src/coreclr/tools/aot/jitinterface

Then compile it into jitinterface.dll (which ships with ILC) and p/invoke into it.

We already have managed definitions of the various flags this returns, because the computed values are bitmasked against compile-time expectations burned into the produced executable, to ensure it doesn't run on machines that lack the expected CPU features.

As a stretch goal, we might try to unify this detection with what's in CoreCLR VM, but that might be too much extra scope. Extracting something that would be eligible to be placed under src/native/minipal in the repo would be a very good first step towards that.

@jkotas jkotas added the help wanted [up-for-grabs] Good issue for external contributors label Aug 3, 2022
@MichalStrehovsky MichalStrehovsky modified the milestones: Future, 8.0.0 Feb 21, 2023
JamesNK (Member) commented Feb 21, 2023

A performance hit was noticed when testing a Native AOT gRPC app on Linux ARM.

AOT vs CoreCLR: (benchmark screenshot)

Compared to a minor perf hit of AOT on Linux Intel: (benchmark screenshot)

The probable culprit is the EventSource methods that use Interlocked to increment longs:
https://github.com/grpc/grpc-dotnet/blob/0b365bf4633c9f05d0af374ed8607c046e8e74dd/src/Grpc.AspNetCore.Server/Internal/GrpcEventSource.cs#L67-L75

EgorBo (Member) commented Feb 21, 2023

@JamesNK that makes sense: NativeAOT uses ARMv8.0 as the arm64 baseline, while the atomic instructions require ARMv8.1, so you need to enable the lse capability for NativeAOT, e.g. <IlcInstructionSet>lse</IlcInstructionSet>, or

--application.buildArguments \"/p:IlcInstructionSet=lse\"

for crank.

I think a while ago we discussed a named instruction set for Azure (to include the baseline instructions).

JamesNK (Member) commented Feb 21, 2023

Yes, that fixed it.

Before: 239,492 RPS
After: 849,835 RPS

Also, using Interlocked only when required, via grpc/grpc-dotnet#2052, will improve performance in the benchmark.

omariom (Contributor) commented Mar 8, 2023

@JamesNK If the effect is this large, then maybe the hottest counters should be placed on their own cache lines?

MichalStrehovsky added a commit to MichalStrehovsky/runtime that referenced this issue Jun 21, 2023
This allows compiling for the ISA extensions that the currently running CPU supports.

Fixes dotnet#73246.
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 21, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 20, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Aug 19, 2023