Optimize stackalloc zeroing on arm64 via STORE_BLK #121986
EgorBo merged 3 commits into dotnet:main
Conversation
@EgorBot -arm

using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class Benchmarks
{
    [Benchmark] public void Stackalloc64() => Consume(stackalloc byte[64]);
    [Benchmark] public void Stackalloc128() => Consume(stackalloc byte[128]);
    [Benchmark] public void Stackalloc256() => Consume(stackalloc byte[256]);
    [Benchmark] public void Stackalloc512() => Consume(stackalloc byte[512]);
    [Benchmark] public void Stackalloc1024() => Consume(stackalloc byte[1024]);
    [Benchmark] public void Stackalloc16384() => Consume(stackalloc byte[16384]);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<byte> x) { }
}
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
This PR optimizes stackalloc zeroing on ARM64 by enabling the same STORE_BLK optimization that already exists for X64. When the allocation size is a constant, the lowering phase now takes responsibility for clearing memory via an unrolled STORE_BLK node, allowing the backend to skip loop-based zeroing and use more efficient SIMD instructions.
Key changes:
- Enables Lower's STORE_BLK optimization for constant-sized stackalloc on ARM64
- Introduces a clearMemory local variable to track whether the backend should clear memory
- Updates register allocation and code generation to skip clearing when Lower handles it
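To make the scenario concrete, here is a small self-contained C# sketch (the type and method names are mine, not from the PR). With the default `localsinit` behavior, the constant-sized buffer below has to be zeroed before use, and that zeroing is what the STORE_BLK path on arm64 speeds up; with `[SkipLocalsInit]` (the `initMem == false` case mentioned later in the thread), no zeroing is emitted at all.

```cs
// Illustrative sketch only; names are not from the PR.
// Note: [SkipLocalsInit] requires <AllowUnsafeBlocks>true</AllowUnsafeBlocks> in the project.
using System;
using System.Runtime.CompilerServices;

public static class StackallocZeroingDemo
{
    public static byte ZeroInitialized()
    {
        // Constant size with localsinit on: the JIT must clear this buffer,
        // which is the zeroing this PR routes through an unrolled STORE_BLK on arm64.
        Span<byte> buffer = stackalloc byte[256];
        return buffer[0]; // always 0 here
    }

    [SkipLocalsInit]
    public static byte Uninitialized()
    {
        // initMem == false: no zeroing is emitted, so the contents are unspecified.
        Span<byte> buffer = stackalloc byte[256];
        return buffer[0]; // unspecified value
    }
}
```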
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/jit/lower.cpp | Extends the constant-sized LCLHEAP optimization to TARGET_ARM64 |
| src/coreclr/jit/lsraarm64.cpp | Updates register allocation to track when Lower handles memory clearing |
| src/coreclr/jit/codegenarm64.cpp | Updates code generation to skip clearing when Lower took responsibility |
The superpmi-replay asserts look related.
@jakobbotsch @dotnet/jit-contrib PTAL

So today, if the size is a constant and it's contained, that means the memory is either already cleared by GT_STORE_BLK or initMem is false. The size may not be contained if it's too big (GT_STORE_BLK is effectively limited to 4GB, while LCLHEAP accepts larger sizes).

For all sizes this seems to be a clear win (for 32 bytes and less we don't emit LCLHEAP and convert it to locals instead).
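As a companion to the sizes mentioned above, a minimal C# sketch (names are mine) contrasting the tiny case, which per the comment never becomes an LCLHEAP at all, with a larger constant size that does and whose zeroing now comes from the STORE_BLK inserted in Lower:

```cs
using System;

public static class StackallocSizeDemo
{
    public static int Tiny()
    {
        // Per the comment above: 32 bytes or less, the JIT turns this into an
        // ordinary local instead of emitting LCLHEAP, so the new path isn't involved.
        Span<byte> tiny = stackalloc byte[32];
        return tiny.Length;
    }

    public static int Larger()
    {
        // Constant size above the tiny threshold: an LCLHEAP whose zeroing is now
        // handled by the STORE_BLK inserted in Lower on arm64.
        Span<byte> larger = stackalloc byte[256];
        return larger.Length;
    }
}
```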
Co-authored-by: Jakob Botsch Nielsen <Jakob.botsch.nielsen@gmail.com>
Enable on arm64 the X64 optimization where we clear LCLHEAP via a STORE_BLK inserted in Lower.
was:
now:
Also, for larger sizes the previous logic used to emit a slow loop (e.g. 1024 bytes):
Now it will emit a call to CORINFO_HELP_MEMZERO.

Benchmarks:
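A minimal BenchmarkDotNet sketch (mine, not the EgorBot invocation above) isolating the large constant sizes that, per the description, now zero memory via a CORINFO_HELP_MEMZERO call instead of the old inline loop:

```cs
using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class LargeStackallocBenchmarks
{
    // Large constant sizes: per the PR description, zeroing these now goes
    // through the CORINFO_HELP_MEMZERO helper rather than an inline loop.
    [Benchmark] public void Stackalloc1024() => Consume(stackalloc byte[1024]);
    [Benchmark] public void Stackalloc16384() => Consume(stackalloc byte[16384]);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<byte> x) { }
}

public static class Program
{
    public static void Main(string[] args) => BenchmarkRunner.Run<LargeStackallocBenchmarks>();
}
```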