[Proposal] Inline delegate invocation #11578
Comments
You might want to ask that in the coreclr repo, as method inlining is a concern for the JIT compiler, not the C# compiler.
@hypeartist Do you mean things like …
@hypeartist Some of your pain might go away with local functions, which are (presumably) able to be inlined.

```csharp
string MyFunc()
{
    // Lambda/delegate (cannot be inlined, allocates memory)
    Func<string, int> getLenLambda = str => str.Length;  // renamed so it doesn't clash with the local function

    // Local function using arrow syntax (potential new feature)
    int getLen(string str) => str.Length;

    return getLen("abc").ToString();  // return added so the sketch compiles
}
```
@orthoxerox

```csharp
public override void Check(Func<int> fn, out string result, out string expected)
{
    result = fn().ToString();
    expected = (1 + 2 + 3 + 4).ToString();
}
```

Make `fn`'s body be inlined (if it meets the inlining rules).
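To make the request concrete, here is a hypothetical call site and the specialized body the proposal is asking for; the `checker` instance and the `Check_Specialized` name are illustrative assumptions, not from the original issue:

```csharp
// Hypothetical call site:
checker.Check(() => 1 + 2 + 3 + 4, out string result, out string expected);

// In effect, the proposal asks for a specialization in which fn() has been
// replaced by the lambda's body:
public void Check_Specialized(out string result, out string expected)
{
    result = (1 + 2 + 3 + 4).ToString();   // fn() substituted with its body
    expected = (1 + 2 + 3 + 4).ToString();
}
```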
I don't think this is possible at all. Here is why (forgive me if I tell you things you already know). Inlining is only done at the JIT (or native compiler) level today and will probably always be done there. The JIT compiles the method `Check` only once. The only way to take advantage of inlining is with data that is known at compile time (technically JIT time). So if the target of `fn` is only known at runtime, it cannot be inlined. Hypothetically we could make this work if you are willing to re-JIT `Check` for every distinct `fn`, or to interpret it. Interpreting is almost always slower in general than JITing and might not be worthwhile in this case.
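A minimal illustration of that constraint, with hypothetical call sites: the JIT produces a single machine-code body for `Check`, yet every call can pass a different `fn`, so there is no single delegate body to inline.

```csharp
// One compiled Check body must serve both of these calls:
checker.Check(() => 1 + 2 + 3 + 4, out var r1, out var e1);  // fn target #1
checker.Check(() => 42, out var r2, out var e2);             // fn target #2
```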
Does this answer your question?
@AlgorithmsAreCool
I think this is possible, yes. I believe it is called data-flow analysis, and it is a powerful optimization technique typically seen in C++ compilers. Let's look at an optimizable case:

```csharp
void Main()
{
    var data = new[] { 1.0, 2.0, 5.0 /* ... */ };
    Func<double, double> scale = Math.Tanh;
    ScaleData(data, scale);
}

void ScaleData(double[] data, Func<double, double> scale)
{
    for (int i = 0; i < data.Length; i++)
        data[i] = scale(data[i]);
}
```

In this example the human eye can see that `scale` can only ever be `Math.Tanh`, so the compiler could specialize the code like this:
```csharp
void Main()
{
    var data = new[] { 1.0, 2.0, 5.0 /* ... */ };
    // Compiler removed:
    // Func<double, double> scale = Math.Tanh;
    // ScaleData(data, scale);
    // Compiler-generated call:
    ScaleData_tanh(data);
}

// Compiler-generated specialization:
void ScaleData_tanh(double[] data)
{
    for (int i = 0; i < data.Length; i++)
    {
        double x = data[i];
        data[i] = (Math.Exp(x) - Math.Exp(-x)) / (Math.Exp(x) + Math.Exp(-x));
    }
}

void ScaleData(double[] data, Func<double, double> scale)
{
    for (int i = 0; i < data.Length; i++)
        data[i] = scale(data[i]);
}
```

This would be a win because we have saved a call in a 'hot' loop. It could be done by Roslyn or by the JIT. But this optimization has a few limits and drawbacks.
So yes, it is possible to do this. But the C#/VB team typically doesn't like putting heavy optimizations into Roslyn, and the JIT probably doesn't have the time to do this complex flow analysis. The JIT has to do quick optimizations that only take a few milliseconds, to keep the program starting up quickly. This optimization case would be better suited to the CoreRT or .NET Native projects, where the compiler can take several minutes to study and tightly optimize the code. Did this answer your question?
Either that, or tiered JIT.
This optimization is being done routinely by the HotSpot JVM; the .NET JITs are just very far behind. .NET developers are not used to vcalls being optimized away effectively. From an optimization standpoint, the techniques to devirtualize vcalls, interface calls, and delegate calls are the same. The JVM monitors what classes are actually loaded; that way it often knows that a vcall can have only one possible target. The CLR could do the same thing. The JVM also re-JITs and profiles the actual call targets. It then generates a guarded pattern like the one sketched below.
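For reference, a minimal C# sketch of that guarded pattern, assuming the profiler observed `Math.Tanh` as the only target. `KnownTanh` and `Apply` are illustrative names, not real CLR machinery, and a real JIT would compare raw code pointers rather than use reflection:

```csharp
using System;
using System.Reflection;

static class GuardedCallSketch
{
    // Profiled expectation; Math.Tanh has a single overload, so GetMethod is unambiguous.
    static readonly MethodInfo KnownTanh = typeof(Math).GetMethod(nameof(Math.Tanh));

    static double Apply(Func<double, double> scale, double x)
    {
        if (scale.Method == KnownTanh)  // cheap guard against the profiled target
            return Math.Tanh(x);        // speculated fast path: direct, inlinable call
        return scale(x);                // fallback: ordinary indirect delegate invocation
    }
}
```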
The check can often be pushed outside loops (a sketch follows below). Profiling is very easy with a tiered JIT: the first tier profiles, the second tier does not. The main drawback is that a profile that is created once never changes, but that is rarely a problem, and this simple scheme already results in high gains. The JVM has a very sophisticated tiering scheme with multiple tiers and complicated rules for upgrading and downgrading the tier. I don't think that is necessary to realize gains; rather, the JVM designers must have realized how important tiering is, so they invested a lot of time to squeeze the last points of performance out of it. .NET can get started with a simpler scheme.
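Here is a minimal sketch of hoisting that guard out of the loop for the earlier `ScaleData` example; as before, `KnownTanh` and the reflection-based comparison are illustrative stand-ins for the raw code-pointer check a real JIT would emit.

```csharp
using System;
using System.Reflection;

static class HoistedGuardSketch
{
    static readonly MethodInfo KnownTanh = typeof(Math).GetMethod(nameof(Math.Tanh));

    static void ScaleData(double[] data, Func<double, double> scale)
    {
        if (scale.Method == KnownTanh)
        {
            // Guard checked once per call instead of once per element.
            for (int i = 0; i < data.Length; i++)
                data[i] = Math.Tanh(data[i]);   // devirtualized, inlinable body
        }
        else
        {
            for (int i = 0; i < data.Length; i++)
                data[i] = scale(data[i]);       // unprofiled targets take the generic loop
        }
    }
}
```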
Good point, HotSpot is a wonderful JIT indeed. Another thought: Chakra (the IE JIT) is also tiered and performs profile-guided monomorphic call optimization at runtime. Chakra can even replace an optimization profile if the execution flow changes and breaks earlier optimizations. In fact, that blog post has several optimizations that I wish .NET had.
It boggles my mind how a platform that runs on hundreds of millions of servers has such an under-invested JIT. Isn't .NET the platform that is 2nd in the number of servers it drives (where Java is 1st)?! The amount of money that is wasted on, say, 10% more servers needed due to the sub-par code quality with .NET is unfathomably high; it's equal to the TCO of more than 10 million servers. Just the amount that Microsoft themselves waste on unneeded servers for Bing and Office 365 must be more than the entire JIT team's budget. It is known how to make the JIT far better. This is not research work (to a large extent; excluding the machine-learning inlining-decisions effort that is underway right now).
Looking into it a bit, I wonder if the departure of Kevin Frei to Facebook in July 2014 has slowed the advancement of RyuJIT. He was the codegen dev lead for .NET, and RyuJIT might have been his design. Edit: see his Twitter.
RyuJIT is based on some pre-existing JIT codebase, according to Andrew Ayers in his LLVM talk. He says "it's a baroque codebase that nobody feels quite confident in". I may be slightly misremembering this wording; the actual phrasing might not have been as strong, but it was certainly in this direction. It's certainly not entirely new code. The tree-based IR design feels wrong; it's not what state-of-the-art compilers (LLVM, HotSpot) use. I don't think anyone would use such a design if writing a new JIT. It would be entirely SSA-based.
Fascinating! I'll have to find that talk.
Yeah... I so don't understand the strategy. Not sure why a new JIT was created that's hardly better than the old one. Maybe it's a stop-gap thing to be able to release CoreCLR? Maybe the old JIT was not suitable for that (licensing and patent fears?). The team has stated that they want a tiered JIT, so the stop-gap theory makes sense, where some powerful code generator is the 2nd tier (maybe LLVM). Once the 2nd tier is there, the 1st tier can be an interpreter, which is very cheap to implement. The video is here: http://llvm.org/devmtg/2015-04/ (at 2:10: "baroque internal architecture ... even the people who work on it are kind of nervous about aspects of the code").
If only Mike Pall wasn't so tired of writing JIT compilers...
This is exactly what I was thinking of! (I just lacked the right words :) )
This issue is misplaced in the Roslyn repository. The Roslyn compilers are not intended to perform optimizations, but rather to translate source code into IL. The runtime compiler is where optimizations occur (or do not occur, as in this case). You probably want to report this in the coreclr repository.
Is it possible to make this happen? (Maybe following the same rules as applied to ordinary methods.)