Force inlining CheckTimeout in RegexRunner #27262

ViktorHofer · 2018-02-20T00:26:33Z

I can't get any valuable measurements with BDN here. I tried multiple times but it seems either my benchmarks are bad or the absolute costs of the CheckTimeout native calls are to small.

@AndyAyersMS Perfview tells me that without AggresiveInlining it won't inline the method with reason "unprofitable inline". Any idea why it doesn't inline a simple boolean check here (it doesn't inline either without the OR condition). Has it do with the state of inliner here (huge switch)?

Perfview before:

Name	Exc %
system.text.regularexpressions!RegexInterpreter.Go	35.8
system.text.regularexpressions!RegexCharClass.CharInClassInternal	12.1
system.text.regularexpressions!RegexCharClass.CharInClassRecursive	7.1
system.text.regularexpressions!RegexInterpreter.Backtrack	7.0
system.text.regularexpressions!RegexInterpreter.Stringmatch	6.1
system.text.regularexpressions!RegexInterpreter.SetOperator	5.4
system.text.regularexpressions!RegexInterpreter.FindFirstChar	5.3
system.text.regularexpressions!RegexInterpreter.Forwardcharnext	3.5
system.text.regularexpressions!RegexRunner.Scan	3.1
system.text.regularexpressions!RegexRunner.CheckTimeout	2.9
system.text.regularexpressions!RegexInterpreter.Goto	2.5

Perfview after:

Name	Exc %
system.text.regularexpressions!RegexInterpreter.Go	35.3
system.text.regularexpressions!RegexCharClass.CharInClassInternal	13.8
system.text.regularexpressions!RegexCharClass.CharInClassRecursive	8.0
system.text.regularexpressions!RegexInterpreter.Backtrack	7.3
system.text.regularexpressions!RegexInterpreter.Stringmatch	6.0
system.text.regularexpressions!RegexInterpreter.SetOperator	5.3
system.text.regularexpressions!RegexInterpreter.FindFirstChar	5.0
system.text.regularexpressions!RegexInterpreter.Forwardcharnext	3.8
system.text.regularexpressions!RegexRunner.Scan	3.1
system.text.regularexpressions!RegexInterpreter.Goto	2.5

cc @danmosemsft @stephentoub

stephentoub · 2018-02-20T00:28:12Z

Any idea why it doesn't inline a simple boolean check here

Did it inline DoCheckTimeout?

ViktorHofer · 2018-02-20T00:29:47Z

No, but I expected that, as it throws. I used the version from master for checking: https://source.dot.net/#System.Text.RegularExpressions/System/Text/RegularExpressions/RegexRunner.cs,7aac6cc6e4584ff3

System.Text.RegularExpressions.RegexRunner.CheckTimeout	System.Text.RegularExpressions.RegexRunner.DoCheckTimeout	unprofitable inline

danmoseley

I'd like to know why the attribute is needed, but it seems reasonable to check it in meantime.

stephentoub · 2018-02-20T02:15:38Z

it seems reasonable to check it in meantime

Did it actually improve throughput?

ViktorHofer · 2018-02-20T02:18:02Z

I wouldn't want the PR to be merged before I can prove that the change improves throughput either by benchmarking with BDN or by looking at generated the native code.

danmoseley · 2018-02-20T02:32:57Z

I guess I was drawing the conclusion it had since before Go+Scan (the callers)+CheckTimeout were 41.8% and after were 38.4%. I guess there is no substitute for actual timing though.

AndyAyersMS · 2018-02-20T04:22:56Z

The method with the || is a bit too big for the "always inline" criteria. So it has to pass the profitability screen, and it fails to do so:

Inline candidate looks like a wrapper method.  Multiplier increased to 1.
Inline candidate has an arg that feeds a constant test.  Multiplier increased to 2.
Inline candidate callsite is boring.  Multiplier increased to 3.3.
calleeNativeSizeEstimate=327
callsiteNativeSizeEstimate=85
benefit multiplier=3.3
threshold=280
Native estimate for function size exceeds threshold for inlining 32.7 > 28 (multiplier = 3.3)

The inliner is actually overestimating the potential win here as this is not really a wrapper method and there is no arg that feeds a constant test.

I don't really see a compelling benefit from this inline either -- what is it that you saw?

danmoseley · 2018-02-20T05:07:35Z

Since in many (?most) cases _ignoreTimeout is true, maybe we can just have

        protected void CheckTimeout()
        {
            if (!_ignoreTimeout)
                 DoCheckTimeout();
        }

        private void DoCheckTimeout()
        {
              // everything else

Which will surely inline and be profitable?

ViktorHofer · 2018-02-20T13:50:03Z

Thanks Andy! I'm curious, where did you get that data log from?

You were right, the method wasn't inlined because the OR case was too big. I just refactored the code to just contain the timeout check and now it is inlined without the AggressiveInlineAttribute.

I don't really see a compelling benefit from this inline either -- what is it that you saw?

To optimize the common case (no timeout provided) it makes sense to inline the if condition to get rid off the unnecessary calls in this tight loop.

danmoseley · 2018-02-20T16:53:50Z

This change is benign enough now I don't think it needs justification by measurement before merging. But I am curious whether you see an improvement.

AndyAyersMS · 2018-02-20T17:40:18Z

The profitabilility analysis comes from setting COMPlus_JitDump=<caller> and using a CHK jit. I didn't have your exact example so I made up a simulation.

I figured it might be helpful to explain how I'd think about this case.

The assembly for this method is something like:

G_M61682_IG02:
       80790C00             cmp      byte  ptr [rcx+12], 0
       750C                 jne      SHORT G_M61682_IG03
       8B4108               mov      eax, dword ptr [rcx+8]
       FFC8                 dec      eax
       894108               mov      dword ptr [rcx+8], eax
       85C0                 test     eax, eax
       7401                 je       SHORT G_M61682_IG04

G_M61682_IG03:
       C3                   ret

G_M61682_IG04:
       48B888083A8AFD7F0000 mov      rax, 0x7FFD8A3A0888

G_M61682_IG05:
       48FFE0               rex.jmp  rax   // tail call

Assuming _ignoreTimeout is usually true, then:

When not inlined: dynamically 4 instructions call; cmp; jne; ret. All branches should be well predicted. Statically a size cost of 32 bytes plus 1 instruction per call site (roughly 6 bytes, depending).
When inlined: dynamically two instructions cmp; jne. Branch well predicted. Statically around 26 bytes per call site, plus another 32 if this method is ever called somewhere where it can't be inlined. The branched-over code is not considered rare and so will be located just after the branch. If this method is called in a loop the cold code will now sit inside the loop body and potentially blow the loop body out of any special loop caches the CPU might have. The inline leaves a residual call so the total register cross section will be the same or larger, so register allocation around the inline body will be the same or worse. Maybe a slight improvement in predictability of the jne if the value of _ignoreTimeout varies but is generally correlated with the call site.

So the call version is perhaps a tiny bit slower than the inline version. It really depends on what else is going on around the call site; the inline version could well be slower.

Likely the perf impact will be hard to measure reliably; generally we have trouble with timing based measurements once the performance difference gets to be less than 1% or so, and there a bunch of potentially confounding microarchitectural issues. The force inline per call site cost is quite a bit larger. The perf impact of code size is also hard to measure as it is very context dependent. Generally larger code will perform well in benchmarks but has the effect of slightly degrading perf of everything else, as larger code uses more of the scarce physical resources (caches, pages, etc). For framework and system code we usually assume that there is a lot of other important code in the system and so we should prefer smaller code unless we know for sure the extra code size brings important benefits.

Things in favor of forcing inlining would be:

_ignoreTimeout usually true;
provably few call sites;
call sites tend to be in high-iteration count loops with little else going on in the loop other than this call;
multiple nearby calls to this method with little else going on in between;
the value of _ignoreTimeout is known at the call site or has been tested very recently before the call.

And things against forcing inlining would be:

_ignoreTimeout varies and is not call site correlated;
large (or unknown) number of call sites;
call sites tend to be in colder code;
call sites appear in hot code but there is a lot of other computation nearby.

The jit's inliner is typically quite conservative, because:

it has to assume that there will be lots of call sites;
it typically doesn't know the either the global or the method relative hotness/coldness of a given call site (hence the boring above). This might be something we can address with better IBC and/or tiering;
it can't easily reason about "what else" might be going on nearby.

danmoseley · 2018-02-20T18:13:45Z

Thanks @AndyAyersMS for this detail. I find it fascinating that inlining a boolean check is not always a win. Certainly I would never have considered extracting a boolean check into a method for perf reasons.

ViktorHofer · 2018-02-20T18:16:48Z

Thanks a lot Andy! Great intel that you shared with us. I will probably get back to this post numerous times.

Commit migrated from dotnet/corefx@b89d83a

ViktorHofer added the area-System.Text.RegularExpressions label Feb 20, 2018

ViktorHofer self-assigned this Feb 20, 2018

danmoseley approved these changes Feb 20, 2018

View reviewed changes

Refactor CheckTimeout to enable inlining

0f691c1

ViktorHofer force-pushed the CheckTimeoutRegex branch from 4cfb8ec to 0f691c1 Compare February 20, 2018 13:53

stephentoub approved these changes Feb 20, 2018

View reviewed changes

danmoseley merged commit b89d83a into dotnet:master Feb 20, 2018

ViktorHofer deleted the CheckTimeoutRegex branch February 20, 2018 18:16

karelz added this to the 2.1.0 milestone Mar 10, 2018

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022

Refactor CheckTimeout to enable inlining (dotnet/corefx#27262)

fe6b8c5

Commit migrated from dotnet/corefx@b89d83a

Force inlining CheckTimeout in RegexRunner #27262

Force inlining CheckTimeout in RegexRunner #27262

Uh oh!

Conversation

ViktorHofer commented Feb 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stephentoub commented Feb 20, 2018

Uh oh!

ViktorHofer commented Feb 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danmoseley left a comment

Choose a reason for hiding this comment

Uh oh!

stephentoub commented Feb 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ViktorHofer commented Feb 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danmoseley commented Feb 20, 2018

Uh oh!

AndyAyersMS commented Feb 20, 2018

Uh oh!

danmoseley commented Feb 20, 2018

Uh oh!

ViktorHofer commented Feb 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danmoseley commented Feb 20, 2018

Uh oh!

AndyAyersMS commented Feb 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danmoseley commented Feb 20, 2018

Uh oh!

ViktorHofer commented Feb 20, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ViktorHofer commented Feb 20, 2018 •

edited

Loading

ViktorHofer commented Feb 20, 2018 •

edited

Loading

stephentoub commented Feb 20, 2018 •

edited

Loading

ViktorHofer commented Feb 20, 2018 •

edited

Loading

ViktorHofer commented Feb 20, 2018 •

edited

Loading

AndyAyersMS commented Feb 20, 2018 •

edited

Loading