Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT generates suboptimal instructions for 2D vector struct #11940

Closed
ghost opened this issue Jan 31, 2019 · 9 comments
Closed

JIT generates suboptimal instructions for 2D vector struct #11940

ghost opened this issue Jan 31, 2019 · 9 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@ghost
Copy link

ghost commented Jan 31, 2019

I've been benchmarking various approaches to implement basic math struct operators for projects like FNA and MonoGame and among the results was the finding that 2D vectors were across the board slower than 3D and 4D. After some digging I isolated the tests and got the disassembly which shows that a struct with 3 float fields generates far more optimized instructions than a struct with 2 floats. This leads to the 2D vector being roughly 6x slower than the 3D vector.

The code for the entire benchmark can be found in https://github.com/nickgravelyn/math-struct-benchmark/blob/vector2-focus/Program.cs and the disassembly generated can be viewed in https://github.com/nickgravelyn/math-struct-benchmark/tree/vector2-focus#disassembly

category:cq
theme:structs
skill-level:expert
cost:medium

@ghost
Copy link
Author

ghost commented Jan 31, 2019

Additional note was that this puts RyuJIT quite a bit slower than the 32-bit legacy JIT which, by virtue of not generating SIMD instructions, seems to be quite a bit faster. In fact the 32-bit legacy JIT produced a benchmark for the 2D vector on par with the 3D vector using RyuJIT (and naturally the 3D vector on the legacy was slightly slower but not substantially).

@mikedn
Copy link
Contributor

mikedn commented Jan 31, 2019

It seems to me that this has nothing to do with SIMD, it's all scalar floating point code.

@mikedn
Copy link
Contributor

mikedn commented Jan 31, 2019

This is a variation of #4323. In the Win x64 ABI 8 byte structs are supposed to be returned in an integer register and the JIT doesn't handle this case well - it forces the struct into memory.

Also related to #11848 (the generated IR is basically the same as in https://github.com/dotnet/coreclr/issues/22079#issuecomment-456531059, but with float field types instead of int field types.)

cc @AndyAyersMS @CarolEidt

@ghost
Copy link
Author

ghost commented Jan 31, 2019

I might be totally off as I’m fairly inexperienced with native assembly and SIMD but my understanding was that things like ‘vaddss’ and the use of the ‘xmm’ registers fell under the umbrella of SIMD.

In either case the JIT is producing code that is roughly 6x slower for a smaller data type that, conceptually, is doing fewer operations. So whether it’s SIMD or not it still seems like an issue with the code generation, especially when the legacy 32-bit JIT outperforms on this same benchmark.

@mikedn
Copy link
Contributor

mikedn commented Jan 31, 2019

I might be totally off as I’m fairly inexperienced with native assembly and SIMD but my understanding was that things like ‘vaddss’ and the use of the ‘xmm’ registers fell under the umbrella of SIMD.

vaddss is scalar addition - Add Single precision Scalar. The packed versions (vaddps/vaddpd) are SIMD. XMM registers are used for both normal scalar floating point operation and SIMD operations.

In either case the JIT is producing code that is roughly 6x slower for a smaller data type that, conceptually, is doing fewer operations. So whether it’s SIMD or not it still seems like an issue with the code generation, especially when the legacy 32-bit JIT outperforms on this same benchmark.

Yes, certainly. Note that the issue is specific to 64-bit. 32-bit (x86) RyuJIT produces code that's similar to the legacy 32-bit JIT.

@ghost ghost changed the title JIT generates suboptimal SIMD instructions for 2D vector struct JIT generates suboptimal instructions for 2D vector struct Jan 31, 2019
@ghost
Copy link
Author

ghost commented Jan 31, 2019

Ah thanks for clarifying. I’ve renamed the issue accordingly.

@mikedn
Copy link
Contributor

mikedn commented Jan 31, 2019

The same fix mentioned in https://github.com/dotnet/coreclr/issues/22079#issuecomment-456554914 results in the expected code being generated:

G_M55889_IG01:
       C5F877               vzeroupper
       6690                 nop
G_M55889_IG02:
       C5FA100533000000     vmovss   xmm0, dword ptr [reloc @RWD00]
       C5FA100D2F000000     vmovss   xmm1, dword ptr [reloc @RWD04]
       B8E8030000           mov      eax, 0x3E8
       EB1A                 jmp      SHORT G_M55889_IG04
G_M55889_IG03:
       C5FA580524000000     vaddss   xmm0, dword ptr [reloc @RWD08]
       C5F2580D20000000     vaddss   xmm1, dword ptr [reloc @RWD12]
       8BC2                 mov      eax, edx
G_M55889_IG04:
       8D50FF               lea      edx, [rax-1]
       85C0                 test     eax, eax
       7FE7                 jg       SHORT G_M55889_IG03
G_M55889_IG05:
       C3                   ret

Though I think I'd prefer not generating that LCL_FLD in the first place, rather than attempting to deal with it after the fact.

Or, generate LCL_FLD but don't force the variable in memory. That would be necessary to get good codegen when the method is not inlined. In that case the 2 float fields could be extracted from the integer register using something like:

movd xmm0, eax
shr rax, 32
movd xmm1, eax

@AndyAyersMS
Copy link
Member

We may get to this for 3.0, but marking as Future.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@CarolEidt CarolEidt self-assigned this Oct 23, 2020
@CarolEidt CarolEidt modified the milestones: Future, 5.0.0 Oct 23, 2020
@CarolEidt
Copy link
Contributor

This was fixed in 5.0 by improvements for enregistering small structs.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

4 participants