Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Measure loop alignment's performance impact on Microbenchmarks #44051

Closed
kunalspathak opened this issue Oct 30, 2020 · 10 comments
Closed

Measure loop alignment's performance impact on Microbenchmarks #44051

kunalspathak opened this issue Oct 30, 2020 · 10 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@kunalspathak
Copy link
Member

As mentioned in #43227, one of the task we have identified for stabilizing the performance is performing loop alignment for hot loops. Since the alignment can be very sensitive for the performance, this issue tracks the investigation work we will be doing to measure the impact of loop alignment work on Microbenchmarks. All the findings will be tracked in this issue.

Currently, we are tracking the benchmarks that our performance team has identified to have effect of alignment. The list of issues can be seen at https://github.com/drewScoggins/performance-2/issues?q=is%3Aopen+is%3Aissue+label%3AAlignment

@kunalspathak kunalspathak self-assigned this Oct 30, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Oct 30, 2020
@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@kunalspathak kunalspathak added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed untriaged New issue has not been triaged by the area owner labels Oct 30, 2020
@kunalspathak
Copy link
Member Author

All measurements done on:

Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores

QuickSortSpan

The benchmark did regressed after my 32B method alignment changes as I pointed in DrewScoggins/performance-2#2319 (comment). I thought that the loop alignment changes would help but because of #43713 we don’t record the loops in the benchmarks in optLoopTable and they need to be present for loop alignment to kick in. So for now, there is no action to be taken. Once we address #43713, we can check this benchmark.

IniArray

This benchmark regressed after my 32B method alignment changes as I pointed out in DrewScoggins/performance-2#2267 (comment). I did tested my loop alignment changes on this benchmark and it gives us 2X speed improvement as I wrote in DrewScoggins/performance-2#2267 (comment).

SequenceCompareTo

This benchmark shows regression after my 32B method alignment changes but it is in the bimodal level. I have given my analysis in DrewScoggins/performance-2#2286 (comment), but there is not much actionable at this point. I didn’t get chance to verify my loop alignment changes on this one yet.

LoopReturn

This benchmark shows 20% win after my loop alignment changes. I have shared some data in DrewScoggins/performance-2#2674 (comment).

@kunalspathak
Copy link
Member Author

All measurements done on:

Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores

CSieve

This benchmark shows around 6% improvement with loop alignment changes:

Method Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
CSieve 4.851 ms 0.0099 ms 0.0088 ms 4.849 ms 4.840 ms 4.871 ms - - - 8.02 KB
Assembly code G_M3098_IG04 and G_M3098_IG08
; =========================== 32B boundary ===========================
                                                ;; bbWeight=4    PerfScore 2.00
G_M3098_IG04:              ;; offset=0022H
 00007ffb`a5116b42        4D63C8               movsxd   r9, r8d
 00007ffb`a5116b45        42C644081001         mov      byte  ptr [rax+r9+16], 1
 00007ffb`a5116b4b        41FFC0               inc      r8d
 00007ffb`a5116b4e        4181F8FE1F0000       cmp      r8d, 0x1FFE
 00007ffb`a5116b55        7EEB                 jle      SHORT G_M3098_IG04
                                                ;; bbWeight=16    PerfScore 44.00
G_M3098_IG05:              ;; offset=0037H
 00007ffb`a5116b57        41B802000000         mov      r8d, 2
                                                ;; bbWeight=4    PerfScore 1.00
G_M3098_IG06:              ;; offset=003DH
 00007ffb`a5116b5d        4D63C8               movsxd   r9, r8d
; =========================== 32B boundary ===========================
 00007ffb`a5116b60        42807C081000         cmp      byte  ptr [rax+r9+16], 0
 00007ffb`a5116b66        7424                 je       SHORT G_M3098_IG10
                                                ;; bbWeight=16    PerfScore 52.00
G_M3098_IG07:              ;; offset=0048H
 00007ffb`a5116b68        478D0C00             lea      r9d, [r8+r8]
 00007ffb`a5116b6c        4181F9FE1F0000       cmp      r9d, 0x1FFE
 00007ffb`a5116b73        7F15                 jg       SHORT G_M3098_IG09
                                                ;; bbWeight=8    PerfScore 14.00
G_M3098_IG08:              ;; offset=0055H
 00007ffb`a5116b75        4D63D1               movsxd   r10, r9d
 00007ffb`a5116b78        42C644101000         mov      byte  ptr [rax+r10+16], 0
 00007ffb`a5116b7e        4503C8               add      r9d, r8d
; =========================== 32B boundary ===========================
 00007ffb`a5116b81        4181F9FE1F0000       cmp      r9d, 0x1FFE
 00007ffb`a5116b88        7EEB                 jle      SHORT G_M3098_IG08

Method Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
CSieve 4.503 ms 0.0091 ms 0.0076 ms 4.500 ms 4.491 ms 4.516 ms - - - 8.02 KB
Assembly code G_M3098_IG04 and G_M3098_IG08
; =========================== 32B boundary ===========================
 00007ffb`a50f6f62                             align
 00007ffb`a50f6f62                             align
 00007ffb`a50f6f62                             align
                                                ;; bbWeight=4    PerfScore 5.00
G_M3098_IG04:              ;; offset=0022H
 00007ffb`a50f6f62        4D63C8               movsxd   r9, r8d
 00007ffb`a50f6f65        42C644081001         mov      byte  ptr [rax+r9+16], 1
 00007ffb`a50f6f6b        41FFC0               inc      r8d
 00007ffb`a50f6f6e        4181F8FE1F0000       cmp      r8d, 0x1FFE
 00007ffb`a50f6f75        7EEB                 jle      SHORT G_M3098_IG04
                                                ;; bbWeight=16    PerfScore 44.00
G_M3098_IG05:              ;; offset=0037H
 00007ffb`a50f6f77        41B802000000         mov      r8d, 2
                                                ;; bbWeight=4    PerfScore 1.00
G_M3098_IG06:              ;; offset=003DH
 00007ffb`a50f6f7d        4D63C8               movsxd   r9, r8d
; =========================== 32B boundary ===========================
 00007ffb`a50f6f80        42807C081000         cmp      byte  ptr [rax+r9+16], 0
 00007ffb`a50f6f86        7443                 je       SHORT G_M3098_IG10
                                                ;; bbWeight=16    PerfScore 52.00
G_M3098_IG07:              ;; offset=0048H
 00007ffb`a50f6f88        478D0C00             lea      r9d, [r8+r8]
 00007ffb`a50f6f8c        4181F9FE1F0000       cmp      r9d, 0x1FFE
 00007ffb`a50f6f93        7F34                 jg       SHORT G_M3098_IG09
 00007ffb`a50f6f95        6666660F1F840000000000 align
; =========================== 32B boundary ===========================
 00007ffb`a50f6fa0                             align
 00007ffb`a50f6fa0                             align
                                                ;; bbWeight=8    PerfScore 20.00
G_M3098_IG08:              ;; offset=0060H
 00007ffb`a50f6fa0        4D63D1               movsxd   r10, r9d
 00007ffb`a50f6fa3        42C644101000         mov      byte  ptr [rax+r10+16], 0
 00007ffb`a50f6fa9        4503C8               add      r9d, r8d
 00007ffb`a50f6fac        4181F9FE1F0000       cmp      r9d, 0x1FFE
 00007ffb`a50f6fb3        7EEB                 jle      SHORT G_M3098_IG08

HeapSort

I don't see much changes in this benchmark, however I do see loops getting aligned which makes me wonder if there is some data alignment that is going wrong.

Below is section of assembly code of Benchstone.BenchI.HeapSort:Inner before/after loop alignment changes.

Assembly code G_M60435_IG04 (before loop alignment)
G_M60435_IG04:              ;; offset=003EH
 00007ff8`d10e6ebe        453BD0               cmp      r10d, r8d
; =========================== 32B boundary ===========================
 00007ff8`d10e6ec1        0F83F3000000         jae      G_M60435_IG13
 00007ff8`d10e6ec7        4963F2               movsxd   rsi, r10d
 00007ff8`d10e6eca        8B74B110             mov      esi, dword ptr [rcx+4*rsi+16]
 00007ff8`d10e6ece        413BF1               cmp      esi, r9d
 00007ff8`d10e6ed1        7D26                 jge      SHORT G_M60435_IG05
 00007ff8`d10e6ed3        453BD8               cmp      r11d, r8d
 00007ff8`d10e6ed6        0F83DE000000         jae      G_M60435_IG13
 00007ff8`d10e6edc        4D63DB               movsxd   r11, r11d
 00007ff8`d10e6edf        4289749910           mov      dword ptr [rcx+4*r11+16], esi
; =========================== 32B boundary ===========================
 00007ff8`d10e6ee4        458BDA               mov      r11d, r10d
 00007ff8`d10e6ee7        458BD3               mov      r10d, r11d
 00007ff8`d10e6eea        41C1EA1F             shr      r10d, 31
 00007ff8`d10e6eee        4503D3               add      r10d, r11d
 00007ff8`d10e6ef1        41D1FA               sar      r10d, 1
 00007ff8`d10e6ef4        4585D2               test     r10d, r10d
 00007ff8`d10e6ef7        7FC5                 jg       SHORT G_M60435_IG04
                                                ;; bbWeight=16    PerfScore 188.00

Assembly code G_M60435_IG04 (before loop alignment)

G_M60435_IG04:              ;; offset=0040H
 00007ffb`a7d77060        453BD0               cmp      r10d, r8d
 00007ffb`a7d77063        0F8316010000         jae      G_M60435_IG13
 00007ffb`a7d77069        4963F2               movsxd   rsi, r10d
 00007ffb`a7d7706c        8B74B110             mov      esi, dword ptr [rcx+4*rsi+16]
 00007ffb`a7d77070        413BF1               cmp      esi, r9d
 00007ffb`a7d77073        7D26                 jge      SHORT G_M60435_IG05
 00007ffb`a7d77075        453BD8               cmp      r11d, r8d
 00007ffb`a7d77078        0F8301010000         jae      G_M60435_IG13
 00007ffb`a7d7707e        4D63DB               movsxd   r11, r11d
; =========================== 32B boundary ===========================
 00007ffb`a7d77081        4289749910           mov      dword ptr [rcx+4*r11+16], esi
 00007ffb`a7d77086        458BDA               mov      r11d, r10d
 00007ffb`a7d77089        458BD3               mov      r10d, r11d
 00007ffb`a7d7708c        41C1EA1F             shr      r10d, 31
 00007ffb`a7d77090        4503D3               add      r10d, r11d
 00007ffb`a7d77093        41D1FA               sar      r10d, 1
 00007ffb`a7d77096        4585D2               test     r10d, r10d
 00007ffb`a7d77099        7FC5                 jg       SHORT G_M60435_IG04

MulMatrix

I see slight improvement of approx. 1% (min/max improves as well).

Before:

Method Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
MulMatrix 330.7 ms 0.96 ms 0.80 ms 330.8 ms 329.9 ms 332.7 ms - - - 73.95 KB

After:

Method Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
MulMatrix 327.4 ms 0.96 ms 0.85 ms 327.6 ms 325.8 ms 328.5 ms - - - 73.95 KB

Below is some portion of assembly code for Benchstone.BenchI.MulMatrix:Inner.

Assembly code after loop alignment
G_M2371_IG05:              ;; offset=0075H
 00007ffb`a6d76dd5        4D63D9               movsxd   r11, r9d
 00007ffb`a6d76dd8        4E8B54D910           mov      r10, gword ptr [rcx+8*r11+16]
 00007ffb`a6d76ddd        498BF2               mov      rsi, r10
; ............................... 32B boundary ...............................
 00007ffb`a6d76de0        8B5E08               mov      ebx, dword ptr [rsi+8]
 00007ffb`a6d76de3        3BC3                 cmp      eax, ebx
 00007ffb`a6d76de5        0F83EF0A0000         jae      G_M2371_IG98
 00007ffb`a6d76deb        44894CBE10           mov      dword ptr [rsi+4*rdi+16], r9d
 00007ffb`a6d76df0        4A8B5CDA10           mov      rbx, gword ptr [rdx+8*r11+16]
 00007ffb`a6d76df5        8B7308               mov      esi, dword ptr [rbx+8]
 00007ffb`a6d76df8        3BC6                 cmp      eax, esi
 00007ffb`a6d76dfa        0F83DA0A0000         jae      G_M2371_IG98
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jae: 0) 32B boundary ...............................
 00007ffb`a6d76e00        896CBB10             mov      dword ptr [rbx+4*rdi+16], ebp
 00007ffb`a6d76e04        4F8B5CD810           mov      r11, gword ptr [r8+8*r11+16]
 00007ffb`a6d76e09        8BF5                 mov      esi, ebp
 00007ffb`a6d76e0b        410374BA10           add      esi, dword ptr [r10+4*rdi+16]
 00007ffb`a6d76e10        413B4308             cmp      eax, dword ptr [r11+8]
 00007ffb`a6d76e14        0F83C00A0000         jae      G_M2371_IG98
 00007ffb`a6d76e1a        418974BB10           mov      dword ptr [r11+4*rdi+16], esi
 00007ffb`a6d76e1f        41FFC1               inc      r9d
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (inc: 2) 32B boundary ...............................
 00007ffb`a6d76e22        4183F94B             cmp      r9d, 75
 00007ffb`a6d76e26        7CAD                 jl       SHORT G_M2371_IG05
 ...
 ...
 G_M2371_IG65:              ;; offset=07BDH
 00007ffb`a6d7751d        4C63D0               movsxd   r10, eax
; ............................... 32B boundary ...............................
 00007ffb`a6d77520        4F8B54D010           mov      r10, gword ptr [r8+8*r10+16]
 00007ffb`a6d77525        413B7A08             cmp      edi, dword ptr [r10+8]
 00007ffb`a6d77529        0F83AB030000         jae      G_M2371_IG98
 00007ffb`a6d7752f        4863F7               movsxd   rsi, edi
 00007ffb`a6d77532        4D8D54B210           lea      r10, bword ptr [r10+4*rsi+16]
 00007ffb`a6d77537        418B32               mov      esi, dword ptr [r10]
 00007ffb`a6d7753a        4863D8               movsxd   rbx, eax
 00007ffb`a6d7753d        488B5CD910           mov      rbx, gword ptr [rcx+8*rbx+16]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 2) 32B boundary ...............................
 00007ffb`a6d77542        443B5B08             cmp      r11d, dword ptr [rbx+8]
 00007ffb`a6d77546        0F838E030000         jae      G_M2371_IG98
 00007ffb`a6d7754c        4963EB               movsxd   rbp, r11d
 00007ffb`a6d7754f        8B5CAB10             mov      ebx, dword ptr [rbx+4*rbp+16]
 00007ffb`a6d77553        498BE9               mov      rbp, r9
 00007ffb`a6d77556        3B7D08               cmp      edi, dword ptr [rbp+8]
 00007ffb`a6d77559        0F837B030000         jae      G_M2371_IG98
 00007ffb`a6d7755f        4C63F7               movsxd   r14, edi
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (movsxd: 2) 32B boundary ...............................
 00007ffb`a6d77562        420FAF5CB510         imul     ebx, dword ptr [rbp+4*r14+16]
 00007ffb`a6d77568        03F3                 add      esi, ebx
 00007ffb`a6d7756a        418932               mov      dword ptr [r10], esi
 00007ffb`a6d7756d        FFC0                 inc      eax
 00007ffb`a6d7756f        83F84B               cmp      eax, 75
 00007ffb`a6d77572        7CA9                 jl       SHORT G_M2371_IG65
Assembly code after loop alignment

00007ffb`a6d870f9        0F1F8000000000       align
; ............................... 32B boundary ...............................
 00007ffb`a6d87100                             align
 00007ffb`a6d87100                             align
                                                ;; bbWeight=4    PerfScore 51.00
G_M2371_IG05:              ;; offset=0080H
 00007ffb`a6d87100        4D63D9               movsxd   r11, r9d
 00007ffb`a6d87103        4E8B54D910           mov      r10, gword ptr [rcx+8*r11+16]
 00007ffb`a6d87108        498BF2               mov      rsi, r10
 00007ffb`a6d8710b        8B5E08               mov      ebx, dword ptr [rsi+8]
 00007ffb`a6d8710e        3BC3                 cmp      eax, ebx
 00007ffb`a6d87110        0F833C0D0000         jae      G_M2371_IG98
 00007ffb`a6d87116        44894CBE10           mov      dword ptr [rsi+4*rdi+16], r9d
 00007ffb`a6d8711b        4A8B5CDA10           mov      rbx, gword ptr [rdx+8*r11+16]
; ............................... 32B boundary ...............................
 00007ffb`a6d87120        8B7308               mov      esi, dword ptr [rbx+8]
 00007ffb`a6d87123        3BC6                 cmp      eax, esi
 00007ffb`a6d87125        0F83270D0000         jae      G_M2371_IG98
 00007ffb`a6d8712b        896CBB10             mov      dword ptr [rbx+4*rdi+16], ebp
 00007ffb`a6d8712f        4F8B5CD810           mov      r11, gword ptr [r8+8*r11+16]
 00007ffb`a6d87134        8BF5                 mov      esi, ebp
 00007ffb`a6d87136        410374BA10           add      esi, dword ptr [r10+4*rdi+16]
 00007ffb`a6d8713b        413B4308             cmp      eax, dword ptr [r11+8]
 00007ffb`a6d8713f        0F830D0D0000         jae      G_M2371_IG98
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jae: 5) 32B boundary ...............................
 00007ffb`a6d87145        418974BB10           mov      dword ptr [r11+4*rdi+16], esi
 00007ffb`a6d8714a        41FFC1               inc      r9d
 00007ffb`a6d8714d        4183F94B             cmp      r9d, 75
 00007ffb`a6d87151        7CAD                 jl       SHORT G_M2371_IG05
 ...
 ...

 00007ffb`a6d87878        0F1F840000000000     align
; ............................... 32B boundary ...............................
 00007ffb`a6d87880                             align
 00007ffb`a6d87880                             align
                                                ;; bbWeight=16    PerfScore 172.00
G_M2371_IG65:              ;; offset=0800H
 00007ffb`a6d87880        4C63D0               movsxd   r10, eax
 00007ffb`a6d87883        4F8B54D010           mov      r10, gword ptr [r8+8*r10+16]
 00007ffb`a6d87888        413B7A08             cmp      edi, dword ptr [r10+8]
 00007ffb`a6d8788c        0F8350040000         jae      G_M2371_IG98
 00007ffb`a6d87892        4863F7               movsxd   rsi, edi
 00007ffb`a6d87895        4D8D54B210           lea      r10, bword ptr [r10+4*rsi+16]
 00007ffb`a6d8789a        418B32               mov      esi, dword ptr [r10]
 00007ffb`a6d8789d        4863D8               movsxd   rbx, eax
; ............................... 32B boundary ...............................
 00007ffb`a6d878a0        488B5CD910           mov      rbx, gword ptr [rcx+8*rbx+16]
 00007ffb`a6d878a5        443B5B08             cmp      r11d, dword ptr [rbx+8]
 00007ffb`a6d878a9        0F8333040000         jae      G_M2371_IG98
 00007ffb`a6d878af        4963EB               movsxd   rbp, r11d
 00007ffb`a6d878b2        8B5CAB10             mov      ebx, dword ptr [rbx+4*rbp+16]
 00007ffb`a6d878b6        498BE9               mov      rbp, r9
 00007ffb`a6d878b9        3B7D08               cmp      edi, dword ptr [rbp+8]
 00007ffb`a6d878bc        0F8320040000         jae      G_M2371_IG98
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jae: 2) 32B boundary ...............................
 00007ffb`a6d878c2        4C63F7               movsxd   r14, edi
 00007ffb`a6d878c5        420FAF5CB510         imul     ebx, dword ptr [rbp+4*r14+16]
 00007ffb`a6d878cb        03F3                 add      esi, ebx
 00007ffb`a6d878cd        418932               mov      dword ptr [r10], esi
 00007ffb`a6d878d0        FFC0                 inc      eax
 00007ffb`a6d878d2        83F84B               cmp      eax, 75
 00007ffb`a6d878d5        7CA9                 jl       SHORT G_M2371_IG65

Base64EncodeInPlace

I noticed that the benchmark code is aligned at 32B boundary. The loop inside the benchmark is big and it doesn't meet the current threshold of 96 bytes (3 * 32B chunks) to get aligned. However, looking at the bimodal behavior of test, it looks like this benchmark is hitting data alignment issue as I called out in DrewScoggins/performance-2#2291 (comment).

IndexerCheckPathLength

My method alignment changes affected this benchmark because the loop was earlier in one chunk of 32B and it got pushed to be in two chunks instead. As described here, DrewScoggins/performance-2#2290 (comment), the loop alignment should have got back the performance, but I am not sure if it helps because of call inside a loop. Other theory is it can be impact by data alignment as well because of allocation of string _s1 although I haven't tried to randomize alignment.

benchSparseMult

The loop alignment changes has no effect on loops because it is already aligned correctly. You can see the perf numbers in DrewScoggins/performance-2#2271 (comment). That makes me wonder if this is another memory alignment related issue that depends on the alignment of arrays created for the benchmark

System.Collections.IterateForEach.ImmutableArray

I am still investigating this one. As seen in my comments in DrewScoggins/performance-2#2285 (comment), the loop alignment won't help here because the loop is already aligned correctly.

System.Collections.ContainsTrue.Span

As seen in data I shared at DrewScoggins/performance-2#2284 (comment) , I see negligible improvement in this benchmark but was hoping to see some more. However, as Andy calls out:

One could argue a more representative test would search for the elements in some random order; in that case mispredicts for those inner compares might be a bigger factor as HW could not as easily learn their behavior.

System.Collections.ContainsKeyFalse<Int32, Int32>.ImmutableDictionary

This one surprisingly sees regression as I pointed in DrewScoggins/performance-2#2281 (comment) . Andy made a good point if it is suffering from JCC erratum. I am doing some experiments to test this out.

@kunalspathak
Copy link
Member Author

@dotnet/jit-contrib , @adamsitnik

@adamsitnik
Copy link
Member

@kunalspathak thanks for the great analysis and being so transparent with all the data!

I think that it would be valuable to run all the benchmarks we have with and without your loop alignment changes. Using the ResultsComparer could give us a good overview of how many benchmarks would improve and regress in total. The tool also allows getting top X best|worst differences so it could help us to find other benchmarks affected by the alignment.

git clone https://github.com/dotnet/performance.git
cd performance
py .\scripts\benchmarks_ci.py -f net6.0 --filter * --bdn-arguments "--artifacts C:\results\without" --corerun $pathToCoreRunWithoutLoopAlignmentChanges
py .\scripts\benchmarks_ci.py -f net6.0 --filter * --bdn-arguments "--artifacts C:\results\with" --corerun $pathToCoreRunWithLoopAlignmentChanges
cd src\tools\ResultsComparer\
dotnet run -c Release --base "C:\results\without" --diff "C:\results\with" --threshold 3% --noise 1ns

Please be warned, running all of them takes around 5-6 hours ;)

@kunalspathak
Copy link
Member Author

I think that it would be valuable to run all the benchmarks we have with and without your loop alignment changes.

Thanks a lot @adamsitnik for the suggestion. I ran the benchmarks for various heuristics that I was trying and the ResultComparer helped me in giving insights about the approach I was taking for loop alignment. I will share my results at appropriate time in one of the related issue.

@kunalspathak
Copy link
Member Author

kunalspathak commented Nov 5, 2020

System.Collections.IndexerSet.Span(Size: 512)

This benchmark improved by approx. 36%.

Faster base/diff Base Median (ns) Diff Median (ns) Modality
System.Collections.IndexerSet.Span(Size: 512) 1.56 244.56 156.80

The benchmark appears to be bimodal as seen below, and data alignment could also be the contributor to the bimodality. But on my machine, I saw consistent improvement after loop alignment changes.

image

Looking at the disassembly, the improvements were coming out because of alignment of loop G_M6697_IG06.

Assembly code before loop alignment
; Assembly listing for method System.Collections.IndexerSet`1[Int32][System.Int32]:Span():int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; Lcl frame size = 0

G_M6697_IG01:              ;; offset=0000H
						;; bbWeight=1    PerfScore 0.00
G_M6697_IG02:              ;; offset=0000H
 00007ffb`a51132a0        488B4110             mov      rax, gword ptr [rcx+16]
 00007ffb`a51132a4        4885C0               test     rax, rax
 00007ffb`a51132a7        7506                 jne      SHORT G_M6697_IG04
						;; bbWeight=1    PerfScore 3.25
G_M6697_IG03:              ;; offset=0009H
 00007ffb`a51132a9        33D2                 xor      rdx, rdx
 00007ffb`a51132ab        33C9                 xor      ecx, ecx
 00007ffb`a51132ad        EB07                 jmp      SHORT G_M6697_IG05
						;; bbWeight=0.50 PerfScore 1.25
G_M6697_IG04:              ;; offset=000FH
 00007ffb`a51132af        488D5010             lea      rdx, bword ptr [rax+16]
 00007ffb`a51132b3        8B4808               mov      ecx, dword ptr [rax+8]
						;; bbWeight=0.50 PerfScore 1.25
G_M6697_IG05:              ;; offset=0016H
 00007ffb`a51132b6        33C0                 xor      eax, eax
 00007ffb`a51132b8        85C9                 test     ecx, ecx
 00007ffb`a51132ba        7E10                 jle      SHORT G_M6697_IG07
						;; bbWeight=1    PerfScore 1.50
G_M6697_IG06:              ;; offset=001CH
 00007ffb`a51132bc        4C63C0               movsxd   r8, eax
 00007ffb`a51132bf        4533C9               xor      r9d, r9d
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (xor: 2) 32B boundary ...............................
 00007ffb`a51132c2        46890C82             mov      dword ptr [rdx+4*r8], r9d
 00007ffb`a51132c6        FFC0                 inc      eax
 00007ffb`a51132c8        3BC1                 cmp      eax, ecx
 00007ffb`a51132ca        7CF0                 jl       SHORT G_M6697_IG06
						;; bbWeight=4    PerfScore 12.00
G_M6697_IG07:              ;; offset=002CH
 00007ffb`a51132cc        33C0                 xor      eax, eax
						;; bbWeight=1    PerfScore 0.25
G_M6697_IG08:              ;; offset=002EH
 00007ffb`a51132ce        C3                   ret      
						;; bbWeight=1    PerfScore 1.00

; Total bytes of code 47, prolog size 0, PerfScore 25.20, instruction count 19 (MethodHash=9a77e5d6) for method System.Collections.IndexerSet`1[Int32][System.Int32]:Span():int:this
; ============================================================
Assembly code after loop alignment
; Assembly listing for method System.Collections.IndexerSet`1[Int32][System.Int32]:Span():int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
; Lcl frame size = 0

G_M6697_IG01:              ;; offset=0000H
						;; bbWeight=1    PerfScore 0.00
G_M6697_IG02:              ;; offset=0000H
 00007ffb`a5133580        488B4110             mov      rax, gword ptr [rcx+16]
 00007ffb`a5133584        4885C0               test     rax, rax
 00007ffb`a5133587        7506                 jne      SHORT G_M6697_IG04
						;; bbWeight=1    PerfScore 3.25
G_M6697_IG03:              ;; offset=0009H
 00007ffb`a5133589        33D2                 xor      rdx, rdx
 00007ffb`a513358b        33C9                 xor      ecx, ecx
 00007ffb`a513358d        EB07                 jmp      SHORT G_M6697_IG05
						;; bbWeight=0.50 PerfScore 1.25
G_M6697_IG04:              ;; offset=000FH
 00007ffb`a513358f        488D5010             lea      rdx, bword ptr [rax+16]
 00007ffb`a5133593        8B4808               mov      ecx, dword ptr [rax+8]
						;; bbWeight=0.50 PerfScore 1.25
G_M6697_IG05:              ;; offset=0016H
 00007ffb`a5133596        33C0                 xor      eax, eax
 00007ffb`a5133598        85C9                 test     ecx, ecx
 00007ffb`a513359a        7E1F                 jle      SHORT G_M6697_IG07
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 4 bytes, loopsize= 16 bytes, minBlocksNeeded= 1, extraBytesNotInLoop= 16, alignmentBoundary= 32B in (Span)
 00007ffb`a513359c        0F1F4000             align    
; ............................... 32B boundary ...............................
						;; bbWeight=1    PerfScore 1.75
G_M6697_IG06:              ;; offset=0020H
 00007ffb`a51335a0        4C63C0               movsxd   r8, eax
 00007ffb`a51335a3        4533C9               xor      r9d, r9d
 00007ffb`a51335a6        46890C82             mov      dword ptr [rdx+4*r8], r9d
 00007ffb`a51335aa        FFC0                 inc      eax
 00007ffb`a51335ac        3BC1                 cmp      eax, ecx
 00007ffb`a51335ae        7CF0                 jl       SHORT G_M6697_IG06
						;; bbWeight=4    PerfScore 12.00
G_M6697_IG07:              ;; offset=0030H
 00007ffb`a51335b0        33C0                 xor      eax, eax
						;; bbWeight=1    PerfScore 0.25
G_M6697_IG08:              ;; offset=0032H
 00007ffb`a51335b2        C3                   ret      
						;; bbWeight=1    PerfScore 1.00

; Total bytes of code 51, prolog size 0, PerfScore 26.95, instruction count 20 (MethodHash=9a77e5d6) for method System.Collections.IndexerSet`1[Int32][System.Int32]:Span():int:this
; ============================================================

@kunalspathak
Copy link
Member Author

kunalspathak commented Nov 6, 2020

PerfLabTests.GetMember

Before loop alignment:

Method Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
GetField 1.567 ms 0.0119 ms 0.0100 ms 1.564 ms 1.559 ms 1.592 ms - - - 39 B
GetMethod1 1.656 ms 0.0183 ms 0.0153 ms 1.649 ms 1.640 ms 1.690 ms - - - 6 B
GetMethod2 3.231 ms 0.0220 ms 0.0195 ms 3.229 ms 3.207 ms 3.268 ms - - - 22 B
GetMethod3 4.855 ms 0.0207 ms 0.0173 ms 4.858 ms 4.827 ms 4.880 ms - - - 42 B
GetMethod4 6.441 ms 0.0579 ms 0.0514 ms 6.419 ms 6.386 ms 6.561 ms - - - 91 B
GetMethod5 8.012 ms 0.0216 ms 0.0191 ms 8.009 ms 7.980 ms 8.047 ms - - - 147 B
GetMethod10 16.261 ms 0.1269 ms 0.1187 ms 16.238 ms 16.111 ms 16.481 ms - - - 585 B
GetMethod12 19.472 ms 0.0900 ms 0.0841 ms 19.462 ms 19.321 ms 19.614 ms - - - 876 B
GetMethod15 24.587 ms 0.0980 ms 0.0765 ms 24.584 ms 24.478 ms 24.746 ms - - - 1313 B
GetMethod20 32.961 ms 0.1547 ms 0.1292 ms 32.917 ms 32.803 ms 33.171 ms - - - 2498 B

After loop alignment:

Method Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
GetField 1.556 ms 0.0167 ms 0.0139 ms 1.549 ms 1.544 ms 1.581 ms - - - 39 B
GetMethod1 1.607 ms 0.0169 ms 0.0149 ms 1.600 ms 1.591 ms 1.635 ms - - - 6 B
GetMethod2 3.222 ms 0.0400 ms 0.0375 ms 3.195 ms 3.183 ms 3.279 ms - - - 22 B
GetMethod3 4.818 ms 0.0332 ms 0.0310 ms 4.809 ms 4.776 ms 4.870 ms - - - 42 B
GetMethod4 6.553 ms 0.0909 ms 0.0850 ms 6.508 ms 6.477 ms 6.722 ms - - - 91 B
GetMethod5 8.206 ms 0.0499 ms 0.0442 ms 8.195 ms 8.144 ms 8.281 ms - - - 147 B
GetMethod10 16.335 ms 0.1014 ms 0.0847 ms 16.300 ms 16.240 ms 16.499 ms - - - 585 B
GetMethod12 20.360 ms 0.9777 ms 1.1259 ms 19.515 ms 19.392 ms 21.965 ms - - - 876 B
GetMethod15 26.117 ms 1.3502 ms 1.5549 ms 25.461 ms 24.483 ms 28.243 ms - - - 1313 B
GetMethod20 32.646 ms 0.2417 ms 0.1887 ms 32.599 ms 32.394 ms 32.991 ms - - - 2186 B

It is surprising that the Allocated is different between 2 runs and not sure why. I tried dumping disassembly of all the methods to see which methods got loop aligned, and there were not much benchmark related.

Methods where alignment was done
; alignment= 2 bytes, loopsize= 75 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 21, alignmentBoundary= 32B in (TryInsert)
; alignment= 2 bytes, loopsize= 48 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 16, alignmentBoundary= 32B in (MoveNext)
; alignment= 7 bytes, loopsize= 21 bytes, minBlocksNeeded= 1, extraBytesNotInLoop= 11, alignmentBoundary= 32B in (Contains)
; alignment= 1 bytes, loopsize= 95 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 1, alignmentBoundary= 32B in (CopyTo)
; alignment= 3 bytes, loopsize= 43 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 21, alignmentBoundary= 32B in (MoveNext)
; alignment= 2 bytes, loopsize= 67 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 29, alignmentBoundary= 32B in (WidenAsciiToUtf16)
; alignment= 2 bytes, loopsize= 53 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 11, alignmentBoundary= 32B in (ToArray)
; alignment= 3 bytes, loopsize= 43 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 21, alignmentBoundary= 32B in (MoveNext)
; alignment= 1 bytes, loopsize= 122 bytes, minBlocksNeeded= 4, extraBytesNotInLoop= 6, alignmentBoundary= 16B in (Add)
; alignment= 7 bytes, loopsize= 61 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 3, alignmentBoundary= 32B in (GenerateRandomId)
; alignment= 3 bytes, loopsize= 90 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 6, alignmentBoundary= 16B in (ReportIfAny)
; alignment= 12 bytes, loopsize= 19 bytes, minBlocksNeeded= 1, extraBytesNotInLoop= 13, alignmentBoundary= 32B in (OverheadActionNoUnroll)
; alignment= 12 bytes, loopsize= 19 bytes, minBlocksNeeded= 1, extraBytesNotInLoop= 13, alignmentBoundary= 32B in (WorkloadActionNoUnroll)
; alignment= 3 bytes, loopsize= 80 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 16, alignmentBoundary= 32B in (Variance)
; alignment= 1 bytes, loopsize= 95 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 1, alignmentBoundary= 32B in (SumWithoutOutliers)
; alignment= 6 bytes, loopsize= 64 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 0, alignmentBoundary= 32B in (InverseTwoTailStudent)
; alignment= 2 bytes, loopsize= 91 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 5, alignmentBoundary= 16B in (Print)
; alignment= 1 bytes, loopsize= 95 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 1, alignmentBoundary= 32B in (CopyTo)
; alignment= 7 bytes, loopsize= 56 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 8, alignmentBoundary= 16B in (ToArray)

Same thing with Perf_Vector3.LengthSquaredBenchmark.

@adamsitnik
Copy link
Member

It is surprising that the Allocated is different between 2 runs and not sure why

In theory, it should never happen, but it can be caused by the Tiered JIT background thread allocating memory like in dotnet/BenchmarkDotNet#1542

As long as it's rare you can ignore it (we would need to use a memory profiler and see what exactly is being allocated in both runs)

@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone Mar 10, 2021
@kunalspathak
Copy link
Member Author

Most of the analysis is done and "loop alignment" feature will be ON in .NET 6.0.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 20, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Archived in project
Development

No branches or pull requests

4 participants