Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inlined GC Polls for call to methods with SuppressGCTransitionAttribute #39111

Merged
merged 2 commits into from
Jul 14, 2020

Conversation

erozenfeld
Copy link
Member

  • Emit inlined GC Polls for methods with SuppressGCTransitionAttribute
    when possible and when optimizing.

  • Emit only one GC poll per basic block.

  • Move insertion of GC polls to a new phase fgInsertGCPolls that runs after
    most optimizations so that we don't insert unnecessary GC polls.

  • I plan to delete fgCreateGCPolls phase that was previously used to insert
    GC polls for platforms that don't support hijacking in a subsequent PR.
    We currently don't support such platforms.

  • Fix fgCreateGCPoll to be able to insert inlined GC polls for BBJ_NONE and
    BBJ_THROW basic blocks.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 10, 2020
@erozenfeld
Copy link
Member Author

erozenfeld commented Jul 10, 2020

For this benchmark:

using System;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace GCPolls
{
    public class Program
    {
        [DllImport("kernel32.dll")]
        [SuppressGCTransitionAttribute]
        extern static int GetTickCount();

        [Benchmark]
        public int GetTickCountManaged() => GetTickCount();

        static void Main(string[] args)
        {
            BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
        }
    }
}

the results are

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.900 (1909/November2018Update/19H2)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.6.20318.15
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.30506, CoreFX 5.0.20.30506), X64 RyuJIT
  Job-NDPIOL : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-LCBKIK : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT


|              Method |        Job |                                                                            Toolchain |     Mean |     Error |    StdDev | Ratio |
|-------------------- |----------- |------------------------------------------------------------------------------------- |---------:|----------:|----------:|------:|
| GetTickCountManaged | Job-NDPIOL | \runtime1\artifacts\tests\coreclr\Windows_NT.x64.release\Tests\Core_Root\CoreRun.exe | 2.080 ns | 0.0261 ns | 0.0218 ns |  0.63 |
| GetTickCountManaged | Job-LCBKIK | \runtime3\artifacts\tests\coreclr\Windows_NT.x64.release\Tests\Core_Root\CoreRun.exe | 3.307 ns | 0.0141 ns | 0.0110 ns |  1.00 |

@erozenfeld
Copy link
Member Author

Assembly diff for the benchmark:

G_M34627_IG01:
       55                   push     rbp
       4883EC20             sub      rsp, 32
       488D6C2420           lea      rbp, [rsp+20H]                           
G_M34627_IG02:
-      E801672A5F           call     CORINFO_HELP_POLL_GC
+      48B800C6641AF87F0000 mov      rax, 0x7FF81A64C600
+      833800               cmp      dword ptr [rax], 0
+      750C                 jne      SHORT G_M34627_IG05
G_M34627_IG03:
       E8BE7296FF           call     GCPolls.Program:GetTickCount():int
       90                   nop
G_M34627_IG04:
       488D6500             lea      rsp, [rbp]
       5D                   pop      rbp
       C3                   ret
+G_M34627_IG05:
+      E806682A5F           call     CORINFO_HELP_POLL_GC
+      EBED                 jmp      SHORT G_M34627_IG03

-; Total bytes of code 27, prolog size 10, PerfScore 8.70, (MethodHash=118078bc) for method GCPolls.Program:GetTickCountManaged():int:this
+; Total bytes of code 44, prolog size 10, PerfScore 12.65, (MethodHash=118078bc) for method GCPolls.Program:GetTickCountManaged():int:this

@erozenfeld
Copy link
Member Author

I prototyped switching Math.Pow be a QCall with [SuppressGCTransition] and verified that for this example:

using System;
using System.Threading;
using System.Diagnostics;

class Test
{
    static void GCThread()
    {
        for (;;)
        {
            var sw = new System.Diagnostics.Stopwatch();
            sw.Start();
            GC.Collect();
            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
            Thread.Sleep(100);
        }
    }

    static double Work(double x)
    {
        return
        Math.Pow(x,2.2)+Math.Pow(x,4.3)+Math.Pow(x,8.8)+
        Math.Pow(x,3.2)+Math.Pow(x,5.3)+Math.Pow(x,7.8)+
        Math.Pow(x,4.2)+Math.Pow(x,6.3)+Math.Pow(x,6.8)+
        Math.Pow(x,5.2)+Math.Pow(x,7.3)+Math.Pow(x,5.8)+
        Math.Pow(x,6.2)+Math.Pow(x,8.3)+Math.Pow(x,4.8)+
        Math.Pow(x,7.2)+Math.Pow(x,9.3)+Math.Pow(x,3.8);
    }

    static void Main()
    {
        new Thread(GCThread).Start();

        for (;;)
            Work(48929.47982);
    }
}

we insert one inlined GC poll and GC pause is under 1 ms.

@erozenfeld
Copy link
Member Author

x64 PMI framework diffs:

PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies for  default jit
Summary of Code Size diffs:
(Lower is better)
Total bytes of diff: 1786 (0.00% of base)
    diff is a regression.
Top file regressions (bytes):
         944 : System.Console.dasm (1.99% of base)
         409 : System.Private.CoreLib.dasm (0.01% of base)
         134 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (0.00% of base)
          86 : runincontext.dasm (1.02% of base)
          53 : xunit.console.dasm (0.06% of base)
          41 : Microsoft.Extensions.Http.dasm (0.09% of base)
          34 : System.Net.HttpListener.dasm (0.01% of base)
          21 : xunit.execution.dotnet.dasm (0.01% of base)
          17 : System.Diagnostics.TraceSource.dasm (0.04% of base)
          17 : System.Drawing.Primitives.dasm (0.05% of base)
          17 : xunit.performance.execution.dasm (0.11% of base)
          13 : System.IO.FileSystem.dasm (0.01% of base)
12 total files with Code Size differences (0 improved, 12 regressed), 252 unchanged.
Top method regressions (bytes):
         180 (16.30% of base) : System.Private.CoreLib.dasm - ProcessorIdCache:ProcessorNumberSpeedCheck():bool (2 methods)
         137 ( 6.58% of base) : System.Console.dasm - ConsolePal:MoveBufferArea(int,int,int,int,int,int,ushort,int,int)
          80 ( 6.00% of base) : System.Console.dasm - ConsolePal:SetWindowSize(int,int)
          78 (17.53% of base) : System.Console.dasm - ConsolePal:GetBufferInfo(bool,byref):CONSOLE_SCREEN_BUFFER_INFO
          55 (11.80% of base) : System.Console.dasm - ConsolePal:get_KeyAvailable():bool
          53 ( 2.10% of base) : xunit.console.dasm - ConsoleRunner:RunProject(XunitProject,bool,Nullable`1,Nullable`1,Nullable`1,bool,bool,Nullable`1,bool,bool,bool):int:this
          48 ( 1.85% of base) : runincontext.dasm - TestRunner:DoWorkStress():int:this
          47 (14.97% of base) : System.Console.dasm - ConsolePal:set_CursorSize(int)
          47 (26.11% of base) : System.Console.dasm - ConsolePal:set_CursorVisible(bool)
          40 (11.30% of base) : System.Console.dasm - ConsolePal:GetStandardFile(int,int,bool):Stream (2 methods)
          38 (25.50% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - QPCTime:.ctor():this
          38 ( 3.27% of base) : runincontext.dasm - TestRunner:ExecuteAndUnload(List`1,byref,byref):int:this
          34 ( 8.65% of base) : System.Console.dasm - ConsolePal:SetCursorPosition(int,int)
          28 ( 6.47% of base) : System.Console.dasm - ConsolePal:SetBufferSize(int,int)
          27 (27.84% of base) : System.Private.CoreLib.dasm - Stopwatch:StartNew():Stopwatch
          26 (14.21% of base) : System.Console.dasm - ConsolePal:get_TreatControlCAsInput():bool
          25 ( 4.98% of base) : System.Console.dasm - ConsolePal:SetWindowPosition(int,int)
          24 ( 2.09% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - SymbolReader:GetPhysicalFileFromServer(String,String,String,Predicate`1):bool:this
          24 ( 1.34% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - ZippedETLWriter:WriteArchive(int):bool:this
          24 ( 1.54% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - ZippedETLWriter:PrepForWrite():List`1:this
Top method regressions (percentages):
          22 (75.86% of base) : Microsoft.Extensions.Http.dasm - ValueStopwatch:StartNew():ValueStopwatch
          17 (58.62% of base) : System.Private.CoreLib.dasm - Stopwatch:GetTimestamp():long
          17 (54.84% of base) : System.Console.dasm - ConsolePal:IsInputRedirectedCore():bool
          17 (54.84% of base) : System.Console.dasm - ConsolePal:IsOutputRedirectedCore():bool
          17 (54.84% of base) : System.Console.dasm - ConsolePal:IsErrorRedirectedCore():bool
          17 (53.12% of base) : System.Console.dasm - ConsolePal:get_InputHandle():long
          17 (53.12% of base) : System.Console.dasm - ConsolePal:get_OutputHandle():long
          17 (53.12% of base) : System.Console.dasm - ConsolePal:get_ErrorHandle():long
          17 (50.00% of base) : System.Private.CoreLib.dasm - Stopwatch:QueryPerformanceCounter():long
          17 (38.64% of base) : System.Console.dasm - Console:get_LargestWindowWidth():int
          17 (38.64% of base) : System.Console.dasm - Console:get_LargestWindowHeight():int
          17 (37.78% of base) : System.Drawing.Primitives.dasm - KnownColorTable:GetSystemColorArgb(int):int
          17 (34.69% of base) : System.Console.dasm - ConsolePal:get_LargestWindowWidth():int
          17 (34.69% of base) : System.Console.dasm - ConsolePal:get_LargestWindowHeight():int
          17 (34.69% of base) : System.Private.CoreLib.dasm - ILStubClass:IL_STUB_PInvoke(int):long
          17 (31.48% of base) : xunit.performance.execution.dasm - BenchmarkEventSource:GetTimestamp():double:this
          17 (29.31% of base) : System.Private.CoreLib.dasm - Stopwatch:Start():this
          17 (29.31% of base) : System.Private.CoreLib.dasm - Stopwatch:Restart():this
          17 (28.81% of base) : System.Net.HttpListener.dasm - DefaultKeepAliveTracker:OnDataReceived():this
          17 (28.81% of base) : System.Net.HttpListener.dasm - DefaultKeepAliveTracker:OnDataSent():this
62 total methods with Code Size differences (0 improved, 62 regressed), 243333 unchanged.

@erozenfeld
Copy link
Member Author

@briansull @AndyAyersMS PTAL
/cc @jkotas @AaronRobinsonMSFT @VSadov @dotnet/jit-contrib

@erozenfeld
Copy link
Member Author

erozenfeld commented Jul 10, 2020

Fixes #13582.
Unblocks #13820.

@EgorBo
Copy link
Member

EgorBo commented Jul 10, 2020

Just tried your branch locally:

[DllImport("kernel32.dll")]
[SuppressGCTransition]
public static extern int GetCurrentProcessId();

[DllImport("kernel32.dll")]
[SuppressGCTransition]
public static extern int GetCurrentThreadId();


static int Foo(bool c1, bool c2)
{
    if (c1)
        return GetCurrentProcessId();
    if (c2)
        return GetCurrentThreadId();

    return 42;
}

Asm:

; Method ConsoleApp212.Program:Foo(bool,bool):int
G_M46591_IG01:
       55                   push     rbp
       4883EC20             sub      rsp, 32
       488D6C2420           lea      rbp, [rsp+20H]
						;; bbWeight=1    PerfScore 1.75

G_M46591_IG02:
       84C9                 test     cl, cl
       741B                 je       SHORT G_M46591_IG06
						;; bbWeight=1    PerfScore 1.25

G_M46591_IG03:
       48B810760AB2FD7F0000 mov      rax, 0x7FFDB20A7610
       833800               cmp      dword ptr [rax], 0
       7536                 jne      SHORT G_M46591_IG11
						;; bbWeight=0.50 PerfScore 1.63

G_M46591_IG04:
       E8BAEBFFFF           call     ConsoleApp212.Program:GetCurrentProcessId():int
       90                   nop      
						;; bbWeight=0.50 PerfScore 0.63

G_M46591_IG05:
       488D6500             lea      rsp, [rbp]
       5D                   pop      rbp
       C3                   ret      
						;; bbWeight=0.50 PerfScore 1.00

G_M46591_IG06:
       84D2                 test     dl, dl
       741B                 je       SHORT G_M46591_IG09
       48B810760AB2FD7F0000 mov      rax, 0x7FFDB20A7610
       833800               cmp      dword ptr [rax], 0
       751E                 jne      SHORT G_M46591_IG12
						;; bbWeight=0.50 PerfScore 2.25

G_M46591_IG07:
       E8A7EBFFFF           call     ConsoleApp212.Program:GetCurrentThreadId():int
       90                   nop      
						;; bbWeight=0.50 PerfScore 0.63

G_M46591_IG08:
       488D6500             lea      rsp, [rbp]
       5D                   pop      rbp
       C3                   ret      
						;; bbWeight=0.50 PerfScore 1.00

G_M46591_IG09:
       B82A000000           mov      eax, 42
						;; bbWeight=0.50 PerfScore 0.13

G_M46591_IG10:
       488D6500             lea      rsp, [rbp]
       5D                   pop      rbp
       C3                   ret      
						;; bbWeight=0.50 PerfScore 1.00

G_M46591_IG11:
       E838ED925F           call     CORINFO_HELP_POLL_GC
       EBC3                 jmp      SHORT G_M46591_IG04
						;; bbWeight=0    PerfScore 0.00

G_M46591_IG12:
       E831ED925F           call     CORINFO_HELP_POLL_GC
       EBDB                 jmp      SHORT G_M46591_IG07
						;; bbWeight=0    PerfScore 0.00
; Total bytes of code: 97

is it possible to re-use CORINFO_HELP_POLL_GC call bb (the code emitted two)?

@erozenfeld
Copy link
Member Author

I don't think it's possible. The basic blocks have different jmp targets.

@EgorBo
Copy link
Member

EgorBo commented Jul 10, 2020

I don't think it's possible. The basic blocks have different jmp targets.

Oops, didn't notice it, thanks

@EgorBo
Copy link
Member

EgorBo commented Jul 10, 2020

a minor optimization 🙂
if predecessors (bbs) have GC_POLL -> don't insert one more, e.g.

static int Foo()
{
    if (GetCurrentProcessId() == -1) // inserts GCPoll
        return GetCurrentThreadId(); // also inserts GCPoll but predecessor already have one.
    return -1;
}

(If you can rely on BBF_HAS_SUPPRESSGC_CALL flag)

@erozenfeld
Copy link
Member Author

BBF_HAS_SUPPRESSGC_CALL was added to speed up the jit when not optimizing. It can't be relied upon when optimizing.
What you are suggesting is doable but I'd like to keep it simple for now.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. Left a few suggestions for you to consider.

src/coreclr/src/jit/flowgraph.cpp Outdated Show resolved Hide resolved
src/coreclr/src/jit/flowgraph.cpp Show resolved Hide resolved
src/coreclr/src/jit/flowgraph.cpp Outdated Show resolved Hide resolved
src/coreclr/src/jit/flowgraph.cpp Outdated Show resolved Hide resolved
@AaronRobinsonMSFT
Copy link
Member

AaronRobinsonMSFT commented Jul 11, 2020

@erozenfeld I added some tests for the JIT here when I originally add SuppressGCTransition. I wonder if it would be a good idea to add some additional cases with this updated work.

@erozenfeld
Copy link
Member Author

@AaronRobinsonMSFT Do you have any particular suggestions for additional tests?

@AaronRobinsonMSFT
Copy link
Member

@erozenfeld Not specifically. The tests I wrote were added because the behavior of properties in loops and in predicates was causing issues with my implementation. Judging from your changes it seems that inserted polls are cleaner and less impacted by specific code patterns. I defer to you and the JIT team to decide if any additional are warranted.

* Emit inlined GC Polls for methods with SuppressGCTransitionAttribute
  when possible and when optimizing.

* Emit only one GC poll per basic block.

* Move insertion of GC polls to a new phase `fgInsertGCPolls` that runs after
  most optimizations so that we don't insert unnecessary GC polls.

* I plan to delete `fgCreateGCPolls` phase that was previously used to insert
  GC polls for platforms that don't support hijacking in a subsequent PR.
  We currently don't support such platforms.

* Fix `fgCreateGCPoll` to be able to insert inlined GC polls for `BBJ_NONE` and
`BBJ_THROW` basic blocks.
#endif // DEBUG

// we don't want to split the single return block
pollType = GCPOLL_CALL;
Copy link
Member

@VSadov VSadov Jul 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious about this case. We will not inline the poll here. It this common?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't ever have a call to a method with SuppressGCTransitionAttribute in genReturnBB so we shouldn't see this case.

I copied these lines from fgCreateGCPolls, which was used for the GC polls for platforms that don't support hijacking. I'd like to leave these lines just in case we reuse this method for inserting polls for something else in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Thanks!

@@ -9249,7 +9249,7 @@ void Compiler::optRemoveRedundantZeroInits()
Statement* next = stmt->GetNextStmt();
for (GenTree* tree = stmt->GetTreeList(); tree != nullptr; tree = tree->gtNext)
{
if (((tree->gtFlags & GTF_CALL) != 0) && (!tree->IsCall() || !tree->AsCall()->IsSuppressGCTransition()))
if (((tree->gtFlags & GTF_CALL) != 0))
{
hasGCSafePoint = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have made this optimization more conservative.
This local 'hasGCSafePoint ' probably should now be renamed 'hasCall' .

There is also a method called Compiler::IsGcSafePoint that can be used to determine if a call is a GC safe point.
It is currently only used by fgMorph to set the BBF_GC_SAFE_POINT flag on a BasicBlock

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional. This code needs to know if the node is a potential GC-safe point. Since now we insert GC Polls after this optimization, we have to treat calls to methods with SuppressGCTransitionAttribute as potential GC-safe points.

Compiler::IsGcSafePoint will return true if the node is definitely a GC-safe point and will return false if the node may or may not be a GC-safe point so it shouldn't be used here.

@erozenfeld
Copy link
Member Author

@AndyAyersMS I addressed your feedback and also pushed a fix for x86 tail calls via helpers (in a separate commit). Can you please take another look?

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@karelz karelz added this to the 5.0.0 milestone Aug 18, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants