[RyuJIT] lack of escape analysis makes high GC overhead in SoA SIMD programs #10760

Open
fiigii opened this issue Jul 24, 2018 · 6 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), enhancement (Product code improvement that does NOT require public API changes/additions), optimization

@fiigii
Contributor

fiigii commented Jul 24, 2018

According to the VTune characterization in dotnet/coreclr#18839 (comment), SoA SIMD programs have higher GC overhead than AoS and scalar programs because they allocate many temporary objects.

SoA SIMD programs use VectorPacket as the primitive data type. Note that VectorPacket here is a class, i.e., a reference type:

class VectorPacket256
{
    public Vector256<float> Xs;
    public Vector256<float> Ys;
    public Vector256<float> Zs;
}

Each VectorPacket operation is immutable: it returns a new VectorPacket as its result.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static VectorPacket256 operator -(VectorPacket256 left, VectorPacket256 right)
{
    return new VectorPacket256(Subtract(left.Xs, right.Xs), Subtract(left.Ys, right.Ys), Subtract(left.Zs, right.Zs));
}

These semantics cause many temporary object allocations. For example, the code segment below contains two VectorPacket operations:

    private ColorPacket256 GetNaturalColor(Vector256<int> things, VectorPacket256 pos, VectorPacket256 norms, VectorPacket256 rds, Scene scene)
    {
        var colors = ColorPacket256Helper.DefaultColor;
        for (int i = 0; i < scene.Lights.Length; i++)
        {
            var lights = scene.Lights[i];
            var zero = SetZeroVector256<float>();
            var colorPacket = lights.Colors;
            VectorPacket256 ldis = lights.Positions - pos;   // VectorPacket256 operation
            VectorPacket256 livec = ldis.Normalize();        // VectorPacket256 operation
            var neatIsectDis = TestRay(new RayPacket256(pos, livec), scene);

RyuJIT compiles these two lines to:

vextractf128 xmm7, ymm6, 0x1		
call CORINFO_HELP_NEWSFAST  ;;; allocate the object	
vinsertf128 ymm6, ymm6, xmm7, 0x1
		
mov rcx, qword ptr [rsp+0x58]		
vmovupd ymm0, ymmword ptr [rcx+0x8]
vsubps ymm0, ymm0, ymmword ptr [rbx+0x8]	
vmovupd ymm1, ymmword ptr [rcx+0x28]	
vsubps ymm1, ymm1, ymmword ptr [rbx+0x28]		
vmovupd ymm2, ymmword ptr [rcx+0x48]	
vsubps ymm2, ymm2, ymmword ptr [rbx+0x48]	

vmovupd ymmword ptr [rax+0x8], ymm0	
vmovupd ymmword ptr [rax+0x28], ymm1	
vmovupd ymmword ptr [rax+0x48], ymm2	;;; Assigning the Subtract results to the new object

vmovupd ymm0, ymmword ptr [rax+0x8]		
vmulps ymm0, ymm0, ymmword ptr [rax+0x8]	
vmovupd ymm1, ymmword ptr [rax+0x28]	
vmulps ymm1, ymm1, ymmword ptr [rax+0x28]		
vmovupd ymm2, ymmword ptr [rax+0x48]	
mov qword ptr [rsp+0x50], rax		
vmulps ymm2, ymm2, ymmword ptr [rax+0x48]		
vaddps ymm0, ymm0, ymm1	
vaddps ymm0, ymm0, ymm2
vsqrtps ymm7, ymm0

However, the two commented blocks are unnecessary; the ideal codegen could be:

;;; No memory allocation for the intermediate object
mov rcx, qword ptr [rsp+0x58]		
vmovupd ymm0, ymmword ptr [rcx+0x8]
vsubps ymm0, ymm0, ymmword ptr [rbx+0x8]	
vmovupd ymm1, ymmword ptr [rcx+0x28]	
vsubps ymm1, ymm1, ymmword ptr [rbx+0x28]		
vmovupd ymm2, ymmword ptr [rcx+0x48]	
vsubps ymm2, ymm2, ymmword ptr [rbx+0x48]	
vmulps ymm0, ymm0, ymm0		
vmulps ymm1, ymm1, ymm1	
vmulps ymm2, ymm2, ymm2	
vaddps ymm0, ymm0, ymm1	
vaddps ymm0, ymm0, ymm2
vsqrtps ymm7, ymm0

Introducing escape analysis (https://github.com/dotnet/coreclr/issues/1784) and unwrapping the local VectorPacket objects would therefore significantly reduce the GC overhead of SIMD programs.
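
To make the intended transformation concrete, here is a minimal source-level sketch of what unwrapping the temporary would amount to for the subtract-then-length pattern in the disassembly. The helper class and method names (ScalarReplacementSketch, LengthWithTemp, LengthFused) are hypothetical, and the VectorPacket256 constructor is assumed from the operator shown earlier. The JIT would do this on its internal representation rather than on source, so treat this only as an illustration.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Reference-type packet as in the issue; the constructor is assumed from the operator above.
class VectorPacket256
{
    public Vector256<float> Xs;
    public Vector256<float> Ys;
    public Vector256<float> Zs;

    public VectorPacket256(Vector256<float> xs, Vector256<float> ys, Vector256<float> zs)
    {
        Xs = xs; Ys = ys; Zs = zs;
    }
}

// Hypothetical helpers, for illustration only. Both methods require Avx.IsSupported.
static class ScalarReplacementSketch
{
    // As written in the source: the difference is materialized as a heap object.
    public static Vector256<float> LengthWithTemp(VectorPacket256 a, VectorPacket256 b)
    {
        var d = new VectorPacket256(
            Avx.Subtract(a.Xs, b.Xs), Avx.Subtract(a.Ys, b.Ys), Avx.Subtract(a.Zs, b.Zs));
        return Avx.Sqrt(Avx.Add(
            Avx.Add(Avx.Multiply(d.Xs, d.Xs), Avx.Multiply(d.Ys, d.Ys)),
            Avx.Multiply(d.Zs, d.Zs)));
    }

    // After unwrapping: the three fields of the temporary become plain locals,
    // so nothing is allocated and the values can stay in ymm registers.
    public static Vector256<float> LengthFused(VectorPacket256 a, VectorPacket256 b)
    {
        Vector256<float> dx = Avx.Subtract(a.Xs, b.Xs);
        Vector256<float> dy = Avx.Subtract(a.Ys, b.Ys);
        Vector256<float> dz = Avx.Subtract(a.Zs, b.Zs);
        return Avx.Sqrt(Avx.Add(
            Avx.Add(Avx.Multiply(dx, dx), Avx.Multiply(dy, dy)),
            Avx.Multiply(dz, dz)));
    }
}
```

The rewrite is only valid because the temporary never escapes the method; that is the property escape analysis would have to prove.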

Additionally, the current struct promotion does not work with VectorPacket either, so changing VectorPacket from a class to a struct would generate many memory copies and perform even worse.

category:cq
theme:vector-codegen
skill-level:expert
cost:large
impact:medium

@fiigii
Contributor Author

fiigii commented Jul 24, 2018

@jakobbotsch
Member

> Additionally, the current struct promotion also does not work with VectorPacket, so if changing VectorPacket to struct from class, that will generate so much memory copies and get worse performance.

Have you checked how this looks if the operators/functions are changed to accept arguments by reference, i.e., with the in keyword?

@fiigii
Contributor Author

fiigii commented Jul 24, 2018

> Have you checked how this looks if the operators/functions are changed to accept arguments by reference, i.e., with the in keyword?

Not yet. If we want to use a struct, we do need in to pass by reference. However, the performance issue here is not related to pass-by-value.
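
For reference, a struct variant with in parameters might look like the sketch below. The type name VectorPacket256S is hypothetical and the layout simply mirrors the class from the issue; this is not the benchmark's actual code. As noted above, passing by reference avoids argument copies but does not address the struct promotion problem.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical struct version of the packet (name and constructor are assumptions).
readonly struct VectorPacket256S
{
    public readonly Vector256<float> Xs;
    public readonly Vector256<float> Ys;
    public readonly Vector256<float> Zs;

    public VectorPacket256S(Vector256<float> xs, Vector256<float> ys, Vector256<float> zs)
    {
        Xs = xs; Ys = ys; Zs = zs;
    }

    // `in` passes each 96-byte operand by readonly reference instead of by value.
    // Declaring the struct readonly lets the compiler skip defensive copies.
    // Requires Avx.IsSupported at run time.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static VectorPacket256S operator -(in VectorPacket256S left, in VectorPacket256S right)
    {
        return new VectorPacket256S(
            Avx.Subtract(left.Xs, right.Xs),
            Avx.Subtract(left.Ys, right.Ys),
            Avx.Subtract(left.Zs, right.Zs));
    }
}
```

Call sites stay unchanged (a - b), since in arguments can be passed implicitly.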

@fiigii
Contributor Author

fiigii commented Jul 24, 2018

Updated the disassembly using a release build of mscorlib, which does not call System.Object..ctor. Thanks, @AndyAyersMS.

@AndyAyersMS
Member

cc @erozenfeld, who has been looking at escape analysis recently....

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 24, 2020
@AndyAyersMS
Member

We have now enabled escape analysis, but it looks like the benchmark was rewritten to use structs. It might be an interesting exercise to switch back to using classes and see what happens.

Tagging this to #104936
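
One rough way to run that experiment is to measure the bytes allocated by a class-based code path. The sketch below is a hypothetical probe, not part of the benchmark: AllocationProbe and its warm-up count are assumptions, and only GC.GetAllocatedBytesForCurrentThread is an actual runtime API. Whether the measured value drops to zero depends on the runtime version, JIT configuration, and whether the code under test has reached optimized tier.

```csharp
using System;

// Hypothetical probe, for illustration only.
static class AllocationProbe
{
    // Returns the bytes allocated on the current thread by one call to `action`.
    // The warm-up loop gives tiered compilation a chance to re-JIT the callee with
    // optimizations; tier-up runs in the background, so this is not guaranteed.
    public static long AllocatedBytesFor(Action action, int warmupIterations = 100_000)
    {
        for (int i = 0; i < warmupIterations; i++)
        {
            action();
        }

        long before = GC.GetAllocatedBytesForCurrentThread();
        action();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}
```

Passing a delegate that performs a VectorPacket256 subtraction and consumes the result would show whether the temporary is still heap-allocated. The delegate call itself can affect inlining, so a dedicated benchmark harness is likely to give more reliable numbers.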
