Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

@ViktorHofer
Copy link
Member

@ViktorHofer ViktorHofer commented Feb 20, 2018

I can't get any valuable measurements with BDN here. I tried multiple times but it seems either my benchmarks are bad or the absolute costs of the CheckTimeout native calls are to small.

@AndyAyersMS Perfview tells me that without AggresiveInlining it won't inline the method with reason "unprofitable inline". Any idea why it doesn't inline a simple boolean check here (it doesn't inline either without the OR condition). Has it do with the state of inliner here (huge switch)?

Perfview before:

Name Exc %
system.text.regularexpressions!RegexInterpreter.Go 35.8
system.text.regularexpressions!RegexCharClass.CharInClassInternal 12.1
system.text.regularexpressions!RegexCharClass.CharInClassRecursive 7.1
system.text.regularexpressions!RegexInterpreter.Backtrack 7.0
system.text.regularexpressions!RegexInterpreter.Stringmatch 6.1
system.text.regularexpressions!RegexInterpreter.SetOperator 5.4
system.text.regularexpressions!RegexInterpreter.FindFirstChar 5.3
system.text.regularexpressions!RegexInterpreter.Forwardcharnext 3.5
system.text.regularexpressions!RegexRunner.Scan 3.1
system.text.regularexpressions!RegexRunner.CheckTimeout 2.9
system.text.regularexpressions!RegexInterpreter.Goto 2.5

Perfview after:

Name Exc %
system.text.regularexpressions!RegexInterpreter.Go 35.3
system.text.regularexpressions!RegexCharClass.CharInClassInternal 13.8
system.text.regularexpressions!RegexCharClass.CharInClassRecursive 8.0
system.text.regularexpressions!RegexInterpreter.Backtrack 7.3
system.text.regularexpressions!RegexInterpreter.Stringmatch 6.0
system.text.regularexpressions!RegexInterpreter.SetOperator 5.3
system.text.regularexpressions!RegexInterpreter.FindFirstChar 5.0
system.text.regularexpressions!RegexInterpreter.Forwardcharnext 3.8
system.text.regularexpressions!RegexRunner.Scan 3.1
system.text.regularexpressions!RegexInterpreter.Goto 2.5

cc @danmosemsft @stephentoub

@stephentoub
Copy link
Member

Any idea why it doesn't inline a simple boolean check here

Did it inline DoCheckTimeout?

@ViktorHofer
Copy link
Member Author

ViktorHofer commented Feb 20, 2018

No, but I expected that, as it throws. I used the version from master for checking: https://source.dot.net/#System.Text.RegularExpressions/System/Text/RegularExpressions/RegexRunner.cs,7aac6cc6e4584ff3

System.Text.RegularExpressions.RegexRunner.CheckTimeout System.Text.RegularExpressions.RegexRunner.DoCheckTimeout unprofitable inline

Copy link
Member

@danmoseley danmoseley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to know why the attribute is needed, but it seems reasonable to check it in meantime.

@stephentoub
Copy link
Member

stephentoub commented Feb 20, 2018

it seems reasonable to check it in meantime

Did it actually improve throughput?

@ViktorHofer
Copy link
Member Author

ViktorHofer commented Feb 20, 2018

I wouldn't want the PR to be merged before I can prove that the change improves throughput either by benchmarking with BDN or by looking at generated the native code.

@danmoseley
Copy link
Member

I guess I was drawing the conclusion it had since before Go+Scan (the callers)+CheckTimeout were 41.8% and after were 38.4%. I guess there is no substitute for actual timing though.

@AndyAyersMS
Copy link
Member

The method with the || is a bit too big for the "always inline" criteria. So it has to pass the profitability screen, and it fails to do so:

Inline candidate looks like a wrapper method.  Multiplier increased to 1.
Inline candidate has an arg that feeds a constant test.  Multiplier increased to 2.
Inline candidate callsite is boring.  Multiplier increased to 3.3.
calleeNativeSizeEstimate=327
callsiteNativeSizeEstimate=85
benefit multiplier=3.3
threshold=280
Native estimate for function size exceeds threshold for inlining 32.7 > 28 (multiplier = 3.3)

The inliner is actually overestimating the potential win here as this is not really a wrapper method and there is no arg that feeds a constant test.

I don't really see a compelling benefit from this inline either -- what is it that you saw?

@danmoseley
Copy link
Member

Since in many (?most) cases _ignoreTimeout is true, maybe we can just have

        protected void CheckTimeout()
        {
            if (!_ignoreTimeout)
                 DoCheckTimeout();
        }

        private void DoCheckTimeout()
        {
              // everything else

Which will surely inline and be profitable?

@ViktorHofer
Copy link
Member Author

ViktorHofer commented Feb 20, 2018

Thanks Andy! I'm curious, where did you get that data log from?

You were right, the method wasn't inlined because the OR case was too big. I just refactored the code to just contain the timeout check and now it is inlined without the AggressiveInlineAttribute.

I don't really see a compelling benefit from this inline either -- what is it that you saw?

To optimize the common case (no timeout provided) it makes sense to inline the if condition to get rid off the unnecessary calls in this tight loop.

@danmoseley
Copy link
Member

This change is benign enough now I don't think it needs justification by measurement before merging. But I am curious whether you see an improvement.

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Feb 20, 2018

The profitabilility analysis comes from setting COMPlus_JitDump=<caller> and using a CHK jit. I didn't have your exact example so I made up a simulation.

I figured it might be helpful to explain how I'd think about this case.

The assembly for this method is something like:

G_M61682_IG02:
       80790C00             cmp      byte  ptr [rcx+12], 0
       750C                 jne      SHORT G_M61682_IG03
       8B4108               mov      eax, dword ptr [rcx+8]
       FFC8                 dec      eax
       894108               mov      dword ptr [rcx+8], eax
       85C0                 test     eax, eax
       7401                 je       SHORT G_M61682_IG04

G_M61682_IG03:
       C3                   ret

G_M61682_IG04:
       48B888083A8AFD7F0000 mov      rax, 0x7FFD8A3A0888

G_M61682_IG05:
       48FFE0               rex.jmp  rax   // tail call

Assuming _ignoreTimeout is usually true, then:

  • When not inlined: dynamically 4 instructions call; cmp; jne; ret. All branches should be well predicted. Statically a size cost of 32 bytes plus 1 instruction per call site (roughly 6 bytes, depending).
  • When inlined: dynamically two instructions cmp; jne. Branch well predicted. Statically around 26 bytes per call site, plus another 32 if this method is ever called somewhere where it can't be inlined. The branched-over code is not considered rare and so will be located just after the branch. If this method is called in a loop the cold code will now sit inside the loop body and potentially blow the loop body out of any special loop caches the CPU might have. The inline leaves a residual call so the total register cross section will be the same or larger, so register allocation around the inline body will be the same or worse. Maybe a slight improvement in predictability of the jne if the value of _ignoreTimeout varies but is generally correlated with the call site.

So the call version is perhaps a tiny bit slower than the inline version. It really depends on what else is going on around the call site; the inline version could well be slower.

Likely the perf impact will be hard to measure reliably; generally we have trouble with timing based measurements once the performance difference gets to be less than 1% or so, and there a bunch of potentially confounding microarchitectural issues. The force inline per call site cost is quite a bit larger. The perf impact of code size is also hard to measure as it is very context dependent. Generally larger code will perform well in benchmarks but has the effect of slightly degrading perf of everything else, as larger code uses more of the scarce physical resources (caches, pages, etc). For framework and system code we usually assume that there is a lot of other important code in the system and so we should prefer smaller code unless we know for sure the extra code size brings important benefits.

Things in favor of forcing inlining would be:

  • _ignoreTimeout usually true;
  • provably few call sites;
  • call sites tend to be in high-iteration count loops with little else going on in the loop other than this call;
  • multiple nearby calls to this method with little else going on in between;
  • the value of _ignoreTimeout is known at the call site or has been tested very recently before the call.

And things against forcing inlining would be:

  • _ignoreTimeout varies and is not call site correlated;
  • large (or unknown) number of call sites;
  • call sites tend to be in colder code;
  • call sites appear in hot code but there is a lot of other computation nearby.

The jit's inliner is typically quite conservative, because:

  • it has to assume that there will be lots of call sites;
  • it typically doesn't know the either the global or the method relative hotness/coldness of a given call site (hence the boring above). This might be something we can address with better IBC and/or tiering;
  • it can't easily reason about "what else" might be going on nearby.

@danmoseley
Copy link
Member

Thanks @AndyAyersMS for this detail. I find it fascinating that inlining a boolean check is not always a win. Certainly I would never have considered extracting a boolean check into a method for perf reasons.

@danmoseley danmoseley merged commit b89d83a into dotnet:master Feb 20, 2018
@ViktorHofer
Copy link
Member Author

Thanks a lot Andy! Great intel that you shared with us. I will probably get back to this post numerous times.

@ViktorHofer ViktorHofer deleted the CheckTimeoutRegex branch February 20, 2018 18:16
@karelz karelz added this to the 2.1.0 milestone Mar 10, 2018
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants