Improve struct promotion for 256-bit SIMD fields #19663

fiigii · 2018-08-24T21:24:09Z

This PR improves struct promotion to unwrap more 256-bit SIMD fields, which makes PacketTracer benchmark 31% faster with #19662

Performance data (rendering a 2k image)

Execution time	Windows	Linux
PacketTracer (class)	1.20s	1.35s
PacketTracer (struct)	0.83s	0.93s
Performance Gains	31%	31%

The data collected on

Intel Core i9 7900X (Skylake-X) @ 3.3GHz, HT on, Turbo on, 16GB DDR4 2666MHz
Windows 10 and Ubuntu 16.04

VTune characterization (module level)

Windows

Linux

The most obvious module-level change is the runtime (GC) overhead gets reduced (~33% -> ~11%) and managed code also gets better path-length (code size).

VTune characterization (managed code)

Windows

Linux

Overall, managed code gets improvement by the better code size, but there still are some inefficient codgen that I will continue to investigate and open other issues to discuss (mainly related to https://github.com/dotnet/coreclr/issues/16619)

VTune characterization (CoreCLR runtime)

Windows

Linux

fiigii · 2018-08-24T21:25:22Z

@CarolEidt @AndyAyersMS @tannergooding @mikedn PTAL

tannergooding · 2018-08-24T21:28:08Z

Do you have some assembly diffs you can share?

tannergooding · 2018-08-24T21:30:21Z

src/jit/lclvars.cpp

-        const int MaxOffset = MAX_NumOfFieldsInPromotableStruct * XMM_REGSIZE_BYTES;
+        // This will allow promotion of 4 Vector<T> fields on AVX2 or Vector256<T> on AVX, 
+        // or 8 Vector<T>/Vector128<T> fields on SSE2.
+        const int MaxOffset = MAX_NumOfFieldsInPromotableStruct * YMM_REGSIZE_BYTES;


How does this impact machines without AVX support?

I did not detect any impact from the Vector3 benchmark.

fiigii · 2018-08-24T21:30:32Z

I have run jit-diff on this change, and it shows no any difference in corelib/tests/frameworks.

Although jit-diff uses crossgen that does not work with SIMD code, we can say this change has no impact on the current scalar code base.

I also measured RayTracer (Vector3 benchmark), which has no execution time regression.

tannergooding · 2018-08-24T21:33:17Z

Although jit-diff uses crossgen that does not work with SIMD code

Have you also tried to get the pmi diffs? CC. @AndyAyersMS

fiigii · 2018-08-24T21:51:13Z

Have you also tried to get the pmi diffs?

Will try later, but there seems no managed code with more than 4 SIMD16 or 2 SIMD32 struct fields.

CarolEidt

LGTM

CarolEidt · 2018-08-25T00:36:07Z

@dotnet/jit-contrib - I'd like to have another JIT dev weigh in on this.
@fiigii - just to be conservative I'd like to see the pmi diffs.

tannergooding · 2018-08-25T01:39:20Z

Will try later, but there seems no managed code with more than 4 SIMD16 or 2 SIMD32 struct fields.

It might be useful/interesting to create a simple 5x4 matrix struct and see what the codegen diff looks like.

Just because CoreFX doesn't have any code that leverages it, doesn't mean other libraries don't (and we don't want to accidentally regress them).

4creators · 2018-08-25T12:01:28Z

@dotnet-bot test Ubuntu arm Cross Checked Innerloop Build and Test
@dotnet-bot test Ubuntu arm Cross Checked no_tiered_compilation_innerloop Build and Test

AndyAyersMS · 2018-08-25T16:02:40Z

Because we generally only promote structs with primitive typed fields it's hard to get suitably large field offsets for structs with small numbers of fields. Aside from SIMD it would require a fixed field or an explicit layout. And I would guess we don't have very many of these cases floating around in the framework code (otherwise we might have spotted #19149 sooner).

So you should try PMI across the test suite, but even there jit-diffs won't look as broadly as one might hope.

We could also try an SPMI run on desktop I suppose.

fiigii · 2018-08-27T17:55:07Z

@CarolEidt @AndyAyersMS @tannergooding I have run pmi diff, no difference

Analyzing diffs...
PMI Diffs for assemblies in D:\workspace\coreclr\bin\tests\Windows_NT.x64.Release for  default jit
Summary:
(Lower is better)
Total bytes of diff: 0 (NaN of base)
0 total files with size differences (0 improved, 0 regressed), 0 unchanged.
0 total methods with size differences (0 improved, 0 regressed), 0 unchanged.
Completed analysis in 2.96s

tannergooding · 2018-08-27T18:00:37Z

@fiigii, was this just for CoreCLR or did you also try the PMI diffs for the tests, CoreFX, and various benchmarks we have?

fiigii · 2018-08-27T18:19:52Z

Yes, I ran pmi diff on corelib/tests/frameworks/benchmarks (no diff from all of them). How to run jit-diff on CoreFX?

AndyAyersMS · 2018-08-27T18:36:36Z

@fiigii can you post the last line of the analysis, showing how many methods were examined, eg something like

3074 total methods with size differences (48 improved, 3026 regressed), 221530 unchanged.

because from the above it looks like things ran too fast and maybe didn't look at any methods at all.

AndyAyersMS · 2018-08-27T18:44:03Z

Or maybe you already did? And the number is zero? It should be ~380K.

To run PMI in its most general mode, make sure you've built the tests, and then do something like this (note the -f):

jit-diff diff --pmi --base --base_root ... --diff -f --test_root D:\workspace\coreclr\bin\tests\Windows_NT.x64.Release

The summary should start with something like:

PMI Diffs for System.Private.CoreLib.dll, framework assemblies, assemblies in d:\repos\coreclr\bin\tests\Windows_NT.x64.Release for x64 default jit

fiigii · 2018-08-27T18:46:23Z

@AndyAyersMS thanks for the guides, will re-run to make sure.

fiigii · 2018-08-27T21:24:11Z

@AndyAyersMS @CarolEidt @tannergooding I re-ran the PMI diff, it showed some difference (improvement). Corssgen diff still has no any diff.
The above PMI diff #19663 (comment) has some build errors, sorry for the mistake.

The new PMI diff result should be correct.

Corelib (no diff):

PS D:\workspace\coreclr-struct> jit-diff diff --diff --base --base_root D:\workspace\coreclr --pmi
Using --output D:\workspace\coreclr-struct\bin\diffs
Using --base D:\workspace\coreclr\bin\Product\Windows_NT.x64.Checked
Using --diff D:\workspace\coreclr-struct\bin\Product\Windows_NT.x64.Checked
Using --arch x64
Using --core_root D:\workspace\coreclr-struct\bin\tests\Windows_NT.x64.Checked\Tests\Core_Root

Warning: it is best practice to use a Release build for --core_root, --crossgen, and --test_root.

No assemblies specified; defaulting to corelib
Beginning PMI Diffs for System.Private.CoreLib.dll
\ Finished 1/1 Base 1/1 Diff [154.1 sec]
Completed PMI Diffs for System.Private.CoreLib.dll in 154.11s
Diffs (if any) can be viewed by comparing: D:\workspace\coreclr-struct\bin\diffs\dasmset_4\base D:\workspace\coreclr-str
uct\bin\diffs\dasmset_4\diff
Analyzing diffs...
PMI Diffs for System.Private.CoreLib.dll for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: 0 (0.00% of base)
0 total files with size differences (0 improved, 0 regressed), 1 unchanged.
0 total methods with size differences (0 improved, 0 regressed), 22188 unchanged.
Completed analysis in 1.67s

Tests (improvement):

PS D:\workspace\coreclr-struct> jit-diff diff --diff --base --base_root D:\workspace\coreclr --tests --pmi
Using --output D:\workspace\coreclr-struct\bin\diffs
Using --base D:\workspace\coreclr\bin\Product\Windows_NT.x64.Checked
Using --diff D:\workspace\coreclr-struct\bin\Product\Windows_NT.x64.Checked
Using --arch x64
Using --core_root D:\workspace\coreclr-struct\bin\tests\Windows_NT.x64.Checked\Tests\Core_Root
Using --test_root D:\workspace\coreclr-struct\bin\tests\Windows_NT.x64.Checked

Warning: it is best practice to use a Release build for --core_root, --crossgen, and --test_root.

Beginning PMI Diffs for assemblies in D:\workspace\coreclr-struct\bin\tests\Windows_NT.x64.Checked
\ Finished 3040/3040 Base 3040/3040 Diff [4090.8 sec]
Completed PMI Diffs for assemblies in D:\workspace\coreclr-struct\bin\tests\Windows_NT.x64.Checked in 4091.97s
Diffs (if any) can be viewed by comparing: D:\workspace\coreclr-struct\bin\diffs\dasmset_3\base D:\workspace\coreclr-str
uct\bin\diffs\dasmset_3\diff
Analyzing diffs...
Found 57 files with textual diffs.
PMI Diffs for assemblies in D:\workspace\coreclr-struct\bin\tests\Windows_NT.x64.Checked for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -16826 (-0.02% of base)
    diff is an improvement.
Top file improvements by size (bytes):
        -314 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualOrderedScalar_ro\CompareEqualOrderedScalar_ro.dasm (-0.15% of
 base)
        -314 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualScalar_ro\CompareEqualScalar_ro.dasm (-0.15% of base)
        -314 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualUnorderedScalar_ro\CompareEqualUnorderedScalar_ro.dasm (-0.15
% of base)
        -314 : JIT\HardwareIntrinsics\X86\Sse2\CompareGreaterThanOrderedScalar_ro\CompareGreaterThanOrderedScalar_ro.das
m (-0.15% of base)
        -314 : JIT\HardwareIntrinsics\X86\Sse2\CompareGreaterThanOrEqualOrderedScalar_ro\CompareGreaterThanOrEqualOrdere
dScalar_ro.dasm (-0.15% of base)
57 total files with size differences (57 improved, 0 regressed), 2983 unchanged.
Top method regressions by size (bytes):
          24 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualOrderedScalar_ro\CompareEqualOrderedScalar_ro.dasm - IntelHar
dwareIntrinsicTest.Program:PrintError(struct,ref,ref,ref) (56 methods)
          24 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualScalar_ro\CompareEqualScalar_ro.dasm - IntelHardwareIntrinsic
Test.Program:PrintError(struct,ref,ref,ref) (56 methods)
          24 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualUnorderedScalar_ro\CompareEqualUnorderedScalar_ro.dasm - Inte
lHardwareIntrinsicTest.Program:PrintError(struct,ref,ref,ref) (56 methods)
          24 : JIT\HardwareIntrinsics\X86\Sse2\CompareGreaterThanOrderedScalar_ro\CompareGreaterThanOrderedScalar_ro.das
m - IntelHardwareIntrinsicTest.Program:PrintError(struct,ref,ref,ref) (56 methods)
          24 : JIT\HardwareIntrinsics\X86\Sse2\CompareGreaterThanOrEqualOrderedScalar_ro\CompareGreaterThanOrEqualOrdere
dScalar_ro.dasm - IntelHardwareIntrinsicTest.Program:PrintError(struct,ref,ref,ref) (56 methods)
Top method improvements by size (bytes):
        -136 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualOrderedScalar_ro\CompareEqualOrderedScalar_ro.dasm - IntelHar
dwareIntrinsicTest.TestTableSse2`2[Vector`1,Int64][System.Numerics.Vector`1[System.Single],System.Int64]:CheckUnpack(ref
):bool:this (3 methods)
        -136 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualScalar_ro\CompareEqualScalar_ro.dasm - IntelHardwareIntrinsic
Test.TestTableSse2`2[Vector`1,Int64][System.Numerics.Vector`1[System.Single],System.Int64]:CheckUnpack(ref):bool:this (3
 methods)
        -136 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualUnorderedScalar_ro\CompareEqualUnorderedScalar_ro.dasm - Inte
lHardwareIntrinsicTest.TestTableSse2`2[Vector`1,Int64][System.Numerics.Vector`1[System.Single],System.Int64]:CheckUnpack
(ref):bool:this (3 methods)
        -136 : JIT\HardwareIntrinsics\X86\Sse2\CompareGreaterThanOrderedScalar_ro\CompareGreaterThanOrderedScalar_ro.das
m - IntelHardwareIntrinsicTest.TestTableSse2`2[Vector`1,Int64][System.Numerics.Vector`1[System.Single],System.Int64]:Che
ckUnpack(ref):bool:this (3 methods)
        -136 : JIT\HardwareIntrinsics\X86\Sse2\CompareGreaterThanOrEqualOrderedScalar_ro\CompareGreaterThanOrEqualOrdere
dScalar_ro.dasm - IntelHardwareIntrinsicTest.TestTableSse2`2[Vector`1,Int64][System.Numerics.Vector`1[System.Single],Sys
tem.Int64]:CheckUnpack(ref):bool:this (3 methods)
388 total methods with size differences (282 improved, 106 regressed), 226916 unchanged.
Completed analysis in 48.04s

Frameworks (improvement only)

PS D:\workspace\coreclr-struct> jit-diff diff --diff --base --base_root D:\workspace\coreclr --pmi -f
Using --output D:\workspace\coreclr-struct\bin\diffs
Using --base D:\workspace\coreclr\bin\Product\Windows_NT.x64.Checked
Using --diff D:\workspace\coreclr-struct\bin\Product\Windows_NT.x64.Checked
Using --arch x64
Using --core_root D:\workspace\coreclr-struct\bin\tests\Windows_NT.x64.Checked\Tests\Core_Root

Warning: it is best practice to use a Release build for --core_root, --crossgen, and --test_root.

Beginning PMI Diffs for System.Private.CoreLib.dll, framework assemblies
Warning: can't find framework assembly D:\workspace\coreclr-struct\bin\tests\Windows_NT.x64.Checked\Tests\Core_Root\xuni
t.runner.utility.dotnet.dll
\ Finished 129/129 Base 129/129 Diff [628.0 sec]
Completed PMI Diffs for System.Private.CoreLib.dll, framework assemblies in 628.04s
Diffs (if any) can be viewed by comparing: D:\workspace\coreclr-struct\bin\diffs\dasmset_5\base D:\workspace\coreclr-str
uct\bin\diffs\dasmset_5\diff
Analyzing diffs...
Found 1 files with textual diffs.
PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -59 (0.00% of base)
    diff is an improvement.
Top file improvements by size (bytes):
         -59 : System.Reflection.Metadata.dasm (-0.01% of base)
1 total files with size differences (1 improved, 0 regressed), 128 unchanged.
Top method improvements by size (bytes):
         -24 : System.Reflection.Metadata.dasm - System.Reflection.Metadata.Ecma335.CustomAttributeDecoder`1[Vector`1][S
ystem.Numerics.Vector`1[System.Single]]:DecodeFixedArgumentType(byref,bool):struct:this
         -19 : System.Reflection.Metadata.dasm - System.Reflection.Metadata.Ecma335.CustomAttributeDecoder`1[Vector`1][S
ystem.Numerics.Vector`1[System.Single]]:DecodeArrayArgument(byref,struct):struct:this
         -16 : System.Reflection.Metadata.dasm - System.Reflection.Metadata.Ecma335.CustomAttributeDecoder`1[Vector`1][S
ystem.Numerics.Vector`1[System.Single]]:DecodeNamedArgumentType(byref,bool):struct:this
3 total methods with size differences (3 improved, 0 regressed), 224600 unchanged.
Completed analysis in 13.88s

fiigii · 2018-08-27T21:51:10Z

The small regressions in the above test pmi diff are mainly from expanding call CORINFO_HELP_MEMSET to vmovupd sequence.
I think this can be eliminated by https://github.com/dotnet/coreclr/issues/16619 in some senarios.

-mov      rbp, rax
-lea      rcx, bword ptr [rbp+8]
-lea      rdx, bword ptr [rsp+C0H]
-lea      rdx, bword ptr [rsp+C0H]
-mov      r8d, 128
-call     CORINFO_HELP_MEMCPY
+lea      r8, bword ptr [rax+8]
+vmovupd  ymm0, ymmword ptr[rsp+C0H]
+vmovupd  ymmword ptr[r8], ymm0
+vmovupd  ymm0, ymmword ptr[rsp+E0H]
+vmovupd  ymmword ptr[r8+32], ymm0
+vmovupd  ymm0, ymmword ptr[rsp+100H]
+vmovupd  ymmword ptr[r8+64], ymm0
+vmovupd  ymm0, ymmword ptr[rsp+120H]
+vmovupd  ymmword ptr[r8+96], ymm0

fiigii · 2018-08-27T23:58:14Z

BTW, PacketTracer benchmark #19662 gets 16.26% code size shrink.

PMI Diffs for PacketTracer.dll for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -6936 (-16.26% of base)
    diff is an improvement.
Top file improvements by size (bytes):
       -6936 : PacketTracer.dasm (-16.26% of base)
1 total files with size differences (1 improved, 0 regressed), 0 unchanged.
Top method improvements by size (bytes):
       -1775 : PacketTracer.dasm - Packet256Tracer:GetNaturalColor(struct,byref,byref,byref,ref):struct:this
       -1222 : PacketTracer.dasm - Camera:Create(struct,struct):ref
       -1028 : PacketTracer.dasm - Packet256Tracer:Shade(byref,ref,ref,int):struct:this
        -821 : PacketTracer.dasm - Packet256Tracer:GetPoints(struct,struct,ref):struct:this
        -508 : PacketTracer.dasm - SpherePacket256:Intersect(ref):struct:this
27 total methods with size differences (27 improved, 0 regressed), 182 unchanged.

tannergooding · 2018-08-28T00:13:16Z

I got the following for CoreCLR:

CoreCLR x64 VEX
	Total bytes of diff: 0 (0.00% of base)
	0 total files with size differences (0 improved, 0 regressed), 1 unchanged.
	0 total methods with size differences (0 improved, 0 regressed), 17075 unchanged.
CoreCLR x64 VEX PMI
	Total bytes of diff: 0 (0.00% of base)
	0 total files with size differences (0 improved, 0 regressed), 1 unchanged.
	0 total methods with size differences (0 improved, 0 regressed), 17677 unchanged.

CoreCLR x64 NoVEX
	Total bytes of diff: 0 (0.00% of base)
	0 total files with size differences (0 improved, 0 regressed), 1 unchanged.
	0 total methods with size differences (0 improved, 0 regressed), 17075 unchanged.
CoreCLR x64 NoVEX PMI
	Total bytes of diff: 0 (0.00% of base)
	0 total files with size differences (0 improved, 0 regressed), 1 unchanged.
	0 total methods with size differences (0 improved, 0 regressed), 17677 unchanged.

tannergooding · 2018-08-28T00:13:23Z

I got the following for Framework:

Framework x64 VEX
	Total bytes of diff: 0 (0.00% of base)
	0 total files with size differences (0 improved, 0 regressed), 129 unchanged.
	0 total methods with size differences (0 improved, 0 regressed), 142056 unchanged.
Framework x64 VEX PMI
	Total bytes of diff: -59 (0.00% of base)
		diff is an improvement.
	Top file improvements by size (bytes):
			 -59 : System.Reflection.Metadata.dasm (-0.01% of base)
	1 total files with size differences (1 improved, 0 regressed), 128 unchanged.
	Top method improvements by size (bytes):
			 -24 : System.Reflection.Metadata.dasm - CustomAttributeDecoder`1:DecodeFixedArgumentType(byref,bool):struct:this (5 methods)
			 -19 : System.Reflection.Metadata.dasm - CustomAttributeDecoder`1:DecodeArrayArgument(byref,struct):struct:this (5 methods)
			 -16 : System.Reflection.Metadata.dasm - CustomAttributeDecoder`1:DecodeNamedArgumentType(byref,bool):struct:this (5 methods)
	3 total methods with size differences (3 improved, 0 regressed), 192869 unchanged.

Framework x64 NoVEX
	Total bytes of diff: 0 (0.00% of base)
	0 total files with size differences (0 improved, 0 regressed), 129 unchanged.
	0 total methods with size differences (0 improved, 0 regressed), 142056 unchanged.
Framework x64 NoVEX PMI
	Total bytes of diff: 0 (0.00% of base)
	0 total files with size differences (0 improved, 0 regressed), 129 unchanged.
	0 total methods with size differences (0 improved, 0 regressed), 192872 unchanged.

tannergooding · 2018-08-28T00:13:29Z

I got the following for Tests:

Tests x64 VEX
	Total bytes of diff: 0 (0.00% of base)
	0 total files with size differences (0 improved, 0 regressed), 2731 unchanged.
	0 total methods with size differences (0 improved, 0 regressed), 151675 unchanged.
Tests x64 VEX PMI
	Total bytes of diff: -16667 (-0.02% of base)
		diff is an improvement.
	Top file improvements by size (bytes):
			-311 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualOrderedScalar_ro\CompareEqualOrderedScalar_ro.dasm (-0.15% of base)
			-311 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualScalar_ro\CompareEqualScalar_ro.dasm (-0.15% of base)
			-311 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualUnorderedScalar_ro\CompareEqualUnorderedScalar_ro.dasm (-0.15% of base)
			-311 : JIT\HardwareIntrinsics\X86\Sse2\CompareGreaterThanOrderedScalar_ro\CompareGreaterThanOrderedScalar_ro.dasm (-0.15% of base)
			-311 : JIT\HardwareIntrinsics\X86\Sse2\CompareGreaterThanOrEqualOrderedScalar_ro\CompareGreaterThanOrEqualOrderedScalar_ro.dasm (-0.15% of base)
	57 total files with size differences (57 improved, 0 regressed), 2942 unchanged.
	Top method regressions by size (bytes):
			  27 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualOrderedScalar_ro\CompareEqualOrderedScalar_ro.dasm - Program:PrintError(struct,ref,ref,ref) (56 methods)
			  27 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualScalar_ro\CompareEqualScalar_ro.dasm - Program:PrintError(struct,ref,ref,ref) (56 methods)
			  27 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualUnorderedScalar_ro\CompareEqualUnorderedScalar_ro.dasm - Program:PrintError(struct,ref,ref,ref) (56 methods)
			  27 : JIT\HardwareIntrinsics\X86\Sse2\CompareGreaterThanOrderedScalar_ro\CompareGreaterThanOrderedScalar_ro.dasm - Program:PrintError(struct,ref,ref,ref) (56 methods)
			  27 : JIT\HardwareIntrinsics\X86\Sse2\CompareGreaterThanOrEqualOrderedScalar_ro\CompareGreaterThanOrEqualOrderedScalar_ro.dasm - Program:PrintError(struct,ref,ref,ref) (56 methods)
	Top method improvements by size (bytes):
			-136 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualOrderedScalar_ro\CompareEqualOrderedScalar_ro.dasm - TestTableSse2`2:CheckUnpack(ref):bool:this (12 methods)
			-136 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualScalar_ro\CompareEqualScalar_ro.dasm - TestTableSse2`2:CheckUnpack(ref):bool:this (12 methods)
			-136 : JIT\HardwareIntrinsics\X86\Sse2\CompareEqualUnorderedScalar_ro\CompareEqualUnorderedScalar_ro.dasm - TestTableSse2`2:CheckUnpack(ref):bool:this (12 methods)
			-136 : JIT\HardwareIntrinsics\X86\Sse2\CompareGreaterThanOrderedScalar_ro\CompareGreaterThanOrderedScalar_ro.dasm - TestTableSse2`2:CheckUnpack(ref):bool:this (12 methods)
			-136 : JIT\HardwareIntrinsics\X86\Sse2\CompareGreaterThanOrEqualOrderedScalar_ro\CompareGreaterThanOrEqualOrderedScalar_ro.dasm - TestTableSse2`2:CheckUnpack(ref):bool:this (12 methods)
	388 total methods with size differences (282 improved, 106 regressed), 192448 unchanged.

Tests x64 NoVEX
	Total bytes of diff: 0 (0.00% of base)
	0 total files with size differences (0 improved, 0 regressed), 2731 unchanged.
	0 total methods with size differences (0 improved, 0 regressed), 151675 unchanged.
Tests x64 NoVEX PMI
	Total bytes of diff: 0 (0.00% of base)
	0 total files with size differences (0 improved, 0 regressed), 2998 unchanged.
	0 total methods with size differences (0 improved, 0 regressed), 192708 unchanged.
	3 files had text diffs but not size diffs.
	JIT\HardwareIntrinsics\X86\Fma_Vector256\Fma_ro\Fma_ro.dasm had 204 diffs
	JIT\HardwareIntrinsics\X86\Avx\Avx_ro\Avx_ro.dasm had 34 diffs
	JIT\HardwareIntrinsics\X86\Avx2\Avx2_ro\Avx2_ro.dasm had 34 diffs

tannergooding · 2018-08-28T00:34:14Z

I got the following for Benchmarks:

Benchmarks x64 VEX
	Total bytes of diff: 0 (0.00% of base)
	0 total files with size differences (0 improved, 0 regressed), 82 unchanged.
	0 total methods with size differences (0 improved, 0 regressed), 1778 unchanged.
Benchmarks x64 VEX PMI
	Total bytes of diff: 0 (0.00% of base)
	0 total files with size differences (0 improved, 0 regressed), 82 unchanged.
	0 total methods with size differences (0 improved, 0 regressed), 1878 unchanged.

Benchmarks x64 NoVEX
	Total bytes of diff: 0 (0.00% of base)
	0 total files with size differences (0 improved, 0 regressed), 82 unchanged.
	0 total methods with size differences (0 improved, 0 regressed), 1778 unchanged.
Benchmarks x64 NoVEX PMI
	Total bytes of diff: 0 (0.00% of base)
	0 total files with size differences (0 improved, 0 regressed), 82 unchanged.
	0 total methods with size differences (0 improved, 0 regressed), 1878 unchanged.

tannergooding · 2018-08-28T02:53:05Z

The files that had diffs:
framework-x64-vex-pmi.zip
tests-x64-novex-pmi.zip
tests-x64-vex-pmi.7z.zip -- Actually a 7z file (otherwise it was 36MB, rather than 6MB)

fiigii · 2018-08-28T17:13:40Z

@AndyAyersMS @CarolEidt Does the data look good to you?

CarolEidt · 2018-08-28T17:39:09Z

The results look good, and as expected. x86 diffs might be nice, but I don't think they're necessary.
@tannergooding you haven't yet approved - is there something else you'd like to see?
@AndyAyersMS - do you have any remaining concerns?
(I've already approved)

AndyAyersMS · 2018-08-28T17:47:39Z

No, no concerns.

tannergooding · 2018-08-28T17:54:30Z

No, just wanted to make sure we had any regressions covered.

tannergooding · 2018-08-28T17:55:25Z

x86 diffs might be nice, but I don't think they're necessary.

I'm working on getting x86 diffs as well, and so far they look much the same as the x64 diffs.

tannergooding · 2018-08-29T01:13:43Z

@CarolEidt, @AndyAyers, @fiigii. Should we get diffs again with TieredJitting disabled? (I just spent way too long debugging another issue, only to find out it wasnt working because TieredJitting disabled that optimization).

AndyAyersMS · 2018-08-29T01:19:55Z

PMI (when run via jit-dasm-pmi, which in turn is run via jit-diff) disables tiered jitting already.

If you run PMI directly via corerun then you might need to set some env vars first.

tannergooding · 2018-08-29T01:27:23Z

Good to know. (I was using COMPlus_JitDisasm for my other case, so I had to explicitly set the env variable).

tannergooding · 2018-08-29T16:31:31Z

@CarolEidt, @AndyAyersMS. I'm merging this, since we've all signed off already.

fiigii mentioned this pull request Aug 24, 2018

Using struct for VectorPacket in PacketTracer benchmark #19662

Merged

tannergooding reviewed Aug 24, 2018

View reviewed changes

Improve struct promotion for 256-bit SIMD fields

acb9402

fiigii force-pushed the struct-prom branch from 61ef87f to acb9402 Compare August 24, 2018 21:49

CarolEidt approved these changes Aug 25, 2018

View reviewed changes

AndyAyersMS approved these changes Aug 28, 2018

View reviewed changes

tannergooding approved these changes Aug 28, 2018

View reviewed changes

tannergooding merged commit f99ca8c into dotnet:master Aug 29, 2018

fiigii deleted the struct-prom branch August 30, 2018 20:41

fiigii mentioned this pull request Oct 4, 2018

Document describing upcoming object stack allocation work. #20251

Merged

fiigii mentioned this pull request Jan 16, 2019

Add SoA raytracer as a CQ test for Intel hardware intrinsic #18839

Merged

AartBluestoke mentioned this pull request Jan 13, 2020

Optimistic allocation of objects on the stack dotnet/runtime#1661

Closed

Improve struct promotion for 256-bit SIMD fields #19663

Improve struct promotion for 256-bit SIMD fields #19663

Conversation

fiigii commented Aug 24, 2018

Performance data (rendering a 2k image)

VTune characterization (module level)

Windows

Linux

VTune characterization (managed code)

Windows

Linux

VTune characterization (CoreCLR runtime)

Windows

Linux

fiigii commented Aug 24, 2018

tannergooding commented Aug 24, 2018

tannergooding Aug 24, 2018

Choose a reason for hiding this comment

fiigii Aug 24, 2018

Choose a reason for hiding this comment

fiigii commented Aug 24, 2018

tannergooding commented Aug 24, 2018

fiigii commented Aug 24, 2018 • edited Loading

CarolEidt left a comment

Choose a reason for hiding this comment

CarolEidt commented Aug 25, 2018

tannergooding commented Aug 25, 2018 • edited Loading

4creators commented Aug 25, 2018

AndyAyersMS commented Aug 25, 2018

fiigii commented Aug 27, 2018

tannergooding commented Aug 27, 2018

fiigii commented Aug 27, 2018 • edited Loading

AndyAyersMS commented Aug 27, 2018

AndyAyersMS commented Aug 27, 2018

fiigii commented Aug 27, 2018

fiigii commented Aug 27, 2018

fiigii commented Aug 27, 2018 • edited Loading

fiigii commented Aug 27, 2018

tannergooding commented Aug 28, 2018

tannergooding commented Aug 28, 2018 • edited Loading

tannergooding commented Aug 28, 2018 • edited Loading

tannergooding commented Aug 28, 2018

tannergooding commented Aug 28, 2018

fiigii commented Aug 28, 2018

CarolEidt commented Aug 28, 2018

AndyAyersMS commented Aug 28, 2018

tannergooding commented Aug 28, 2018

tannergooding commented Aug 28, 2018

tannergooding commented Aug 29, 2018

AndyAyersMS commented Aug 29, 2018

tannergooding commented Aug 29, 2018

tannergooding commented Aug 29, 2018

fiigii commented Aug 24, 2018 •

edited

Loading

tannergooding commented Aug 25, 2018 •

edited

Loading

fiigii commented Aug 27, 2018 •

edited

Loading

fiigii commented Aug 27, 2018 •

edited

Loading

tannergooding commented Aug 28, 2018 •

edited

Loading

tannergooding commented Aug 28, 2018 •

edited

Loading