[RyuJIT] lack of escape analysis makes high GC overhead in SoA SIMD programs #10760

fiigii · 2018-07-24T20:56:48Z

According to the VTune characterization in dotnet/coreclr#18839 (comment), SoA SIMD programs have higher GC overhead than AoS and scalar programs because they allocate temporary objects.

SoA SIMD programs use VectorPacket as the primitive data type (note that VectorPacket here is a class, i.e. a reference type):

class VectorPacket256
{
    public Vector256<float> Xs;
    public Vector256<float> Ys;
    public Vector256<float> Zs;
}

Each VectorPacket operation is immutable: it returns a new VectorPacket as its result.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static VectorPacket256 operator -(VectorPacket256 left, VectorPacket256 right)
{
    return new VectorPacket256(Subtract(left.Xs, right.Xs), Subtract(left.Ys, right.Ys), Subtract(left.Zs, right.Zs));
}
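One way to observe the per-operation allocation in isolation is sketched below. It assumes .NET 7 or later (for the Vector256<float> operators); PacketClass, AllocationSketch and Measure are hypothetical names introduced for illustration and are not part of the benchmark. On a runtime without object stack allocation, the reported byte count is expected to grow with the iteration count; a runtime that does perform escape analysis may report far less.

```csharp
using System;
using System.Runtime.Intrinsics;

// Reference-type packet mirroring VectorPacket256 (constructor added for this sketch).
sealed class PacketClass
{
    public readonly Vector256<float> Xs;
    public readonly Vector256<float> Ys;
    public readonly Vector256<float> Zs;

    public PacketClass(Vector256<float> xs, Vector256<float> ys, Vector256<float> zs)
    {
        Xs = xs; Ys = ys; Zs = zs;
    }

    // Every subtraction returns a freshly constructed packet.
    public static PacketClass operator -(PacketClass left, PacketClass right)
    {
        return new PacketClass(left.Xs - right.Xs, left.Ys - right.Ys, left.Zs - right.Zs);
    }
}

static class AllocationSketch
{
    // Returns the bytes allocated on this thread during the loop and the accumulated lane-0 value.
    public static (long Bytes, float Sink) Measure(int iterations)
    {
        var a = new PacketClass(Vector256.Create(3f), Vector256.Create(2f), Vector256.Create(1f));
        var b = new PacketClass(Vector256.Create(1f), Vector256.Create(1f), Vector256.Create(1f));
        float sink = 0f;

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            PacketClass d = a - b;      // temporary result object
            sink += d.Xs.ToScalar();    // use the result so the work is not dead code
        }
        long after = GC.GetAllocatedBytesForCurrentThread();

        return (after - before, sink);
    }
}
```

Calling AllocationSketch.Measure(1_000_000) and printing Bytes gives a rough per-thread figure; it is not a substitute for the VTune characterization referenced above.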

These semantics cause many temporary object allocations. For example, there are two VectorPacket operations in the code segment below:

    private ColorPacket256 GetNaturalColor(Vector256<int> things, VectorPacket256 pos, VectorPacket256 norms, VectorPacket256 rds, Scene scene)
    {
        var colors = ColorPacket256Helper.DefaultColor;
        for (int i = 0; i < scene.Lights.Length; i++)
        {
            var lights = scene.Lights[i];
            var zero = SetZeroVector256<float>();
            var colorPacket = lights.Colors;
            VectorPacket256 ldis = lights.Positions - pos;   // VectorPacket256 operation
            VectorPacket256 livec = ldis.Normalize();        // VectorPacket256 operation
            var neatIsectDis = TestRay(new RayPacket256(pos, livec), scene);

These two lines will be compiled by RyuJIT to

vextractf128 xmm7, ymm6, 0x1		
call CORINFO_HELP_NEWSFAST  ;;; allocate the object	
vinsertf128 ymm6, ymm6, xmm7, 0x1
		
mov rcx, qword ptr [rsp+0x58]		
vmovupd ymm0, ymmword ptr [rcx+0x8]
vsubps ymm0, ymm0, ymmword ptr [rbx+0x8]	
vmovupd ymm1, ymmword ptr [rcx+0x28]	
vsubps ymm1, ymm1, ymmword ptr [rbx+0x28]		
vmovupd ymm2, ymmword ptr [rcx+0x48]	
vsubps ymm2, ymm2, ymmword ptr [rbx+0x48]	

vmovupd ymmword ptr [rax+0x8], ymm0	
vmovupd ymmword ptr [rax+0x28], ymm1	
vmovupd ymmword ptr [rax+0x48], ymm2	;;; Assigning the Subtract results to the new object

vmovupd ymm0, ymmword ptr [rax+0x8]		
vmulps ymm0, ymm0, ymmword ptr [rax+0x8]	
vmovupd ymm1, ymmword ptr [rax+0x28]	
vmulps ymm1, ymm1, ymmword ptr [rax+0x28]		
vmovupd ymm2, ymmword ptr [rax+0x48]	
mov qword ptr [rsp+0x50], rax		
vmulps ymm2, ymm2, ymmword ptr [rax+0x48]		
vaddps ymm0, ymm0, ymm1	
vaddps ymm0, ymm0, ymm2
vsqrtps ymm7, ymm0

However, the two commented blocks (the allocation and the stores into the new object) are unnecessary, and the ideal codegen could be

;;; No memory allocation for the intermediate object
mov rcx, qword ptr [rsp+0x58]		
vmovupd ymm0, ymmword ptr [rcx+0x8]
vsubps ymm0, ymm0, ymmword ptr [rbx+0x8]	
vmovupd ymm1, ymmword ptr [rcx+0x28]	
vsubps ymm1, ymm1, ymmword ptr [rbx+0x28]		
vmovupd ymm2, ymmword ptr [rcx+0x48]	
vsubps ymm2, ymm2, ymmword ptr [rbx+0x48]	
vmulps ymm0, ymm0, ymm0		
vmulps ymm1, ymm1, ymm1	
vmulps ymm2, ymm2, ymm2	
vaddps ymm0, ymm0, ymm1	
vaddps ymm0, ymm0, ymm2
vsqrtps ymm7, ymm0

So introducing escape analysis (https://github.com/dotnet/coreclr/issues/1784) and unwrapping the local VectorPacket objects would significantly reduce the GC overhead of SIMD programs.

Additionally, the current struct promotion does not work with VectorPacket either, so changing VectorPacket from a class to a struct generates many memory copies and performs worse.
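At the source level, the unwrapped form corresponds roughly to the hand-written sketch below, where the three difference vectors stay in locals and no intermediate packet exists. It assumes .NET 7 or later (Vector256<float> operators and Vector256.Sqrt); LengthSketch and DifferenceLength are hypothetical names, and the code only illustrates the shape of the ideal codegen, not how the JIT would actually perform the transformation.

```csharp
using System.Runtime.Intrinsics;

static class LengthSketch
{
    // Hypothetical helper: per-lane length of (a - b), computed without
    // materializing an intermediate packet object.
    public static Vector256<float> DifferenceLength(
        Vector256<float> ax, Vector256<float> ay, Vector256<float> az,
        Vector256<float> bx, Vector256<float> by, Vector256<float> bz)
    {
        // Corresponds to the three vsubps instructions.
        Vector256<float> dx = ax - bx;
        Vector256<float> dy = ay - by;
        Vector256<float> dz = az - bz;

        // Corresponds to the vmulps / vaddps / vsqrtps sequence.
        return Vector256.Sqrt(dx * dx + dy * dy + dz * dz);
    }
}
```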

category:cq
theme:vector-codegen
skill-level:expert
cost:large
impact:medium


fiigii · 2018-07-24T21:13:35Z

@CarolEidt @AndyAyersMS @tannergooding

jakobbotsch · 2018-07-24T21:21:32Z

Additionally, the current struct promotion also does not work with VectorPacket, so if changing VectorPacket to struct from class, that will generate so much memory copies and get worse performance.

Have you checked how this looks if the operators/functions are changed to accept arguments by-ref? I.e. with in keyword.

fiigii · 2018-07-24T21:24:43Z

Have you checked how this looks if the operators/functions are changed to accept arguments by-ref? I.e. with in keyword.

Not yet. If we want to use a struct, we do need the in keyword to pass by reference. However, the performance issue here is not related to pass-by-value.
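For reference, a struct version whose operator takes its operands by reference might look like the sketch below. It assumes .NET 7 or later and C# 7.2 or later; VectorPacketStruct is a hypothetical name, and the sketch shows only the syntax under discussion. No measurements for this variant are reported in this thread.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

// Hypothetical value-type counterpart of VectorPacket256.
readonly struct VectorPacketStruct
{
    public readonly Vector256<float> Xs;
    public readonly Vector256<float> Ys;
    public readonly Vector256<float> Zs;

    public VectorPacketStruct(Vector256<float> xs, Vector256<float> ys, Vector256<float> zs)
    {
        Xs = xs; Ys = ys; Zs = zs;
    }

    // "in" passes each 96-byte operand by readonly reference instead of copying it.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static VectorPacketStruct operator -(in VectorPacketStruct left, in VectorPacketStruct right)
    {
        return new VectorPacketStruct(left.Xs - right.Xs, left.Ys - right.Ys, left.Zs - right.Zs);
    }
}
```

The result is still returned by value, so the copies caused by missing struct promotion are not addressed by the in modifier alone.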

fiigii · 2018-07-24T22:55:43Z

Updated the disasm using the release mscorlib, which does not call System.Object..ctor. Thanks, @AndyAyersMS.

AndyAyersMS · 2018-07-24T23:12:08Z

cc @erozenfeld, who has been looking at escape analysis recently....

AndyAyersMS · 2024-12-13T19:50:27Z

We now have enabled escape analysis, but it looks like the benchmark was rewritten to use structs. Might be an interesting exercise to switch back to using classes and see what happens.

Tagging this to #104936

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 24, 2020
