Skip to content

Conversation

@wilzbach
Copy link
Contributor

@wilzbach wilzbach commented May 22, 2017

As mentioned on #5367, it would be great if the benchmarking function wouldn't allow compilers to optimize its to-be-tested functions away.

There are a couple of tricks to avoid this behavior, at least:

I guess this might be a controversial change, so I am very interested to hear your opinions. What strategy to avoid the optimization do you prefer?

@wilzbach wilzbach requested review from burner and jmdavis May 22, 2017 17:38
long _ticksElapsed; // Total time that the StopWatch ran before it was stopped last.
}

private void doNotOptimizeAway(T)(auto ref T t) @trusted @nogc nothrow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to inline this in benchmark because otherwise we'll include getpid in the measurements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so do you want to stop & start the clock all the time?

sw.reset();
foreach (_; 0 .. n)
fun[i]();
doNotOptimizeAway(fun[i]());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a test where fun returns void, that trip me up some time ago

{
import core.thread : getpid;
import core.stdc.stdio;
if(getpid() == 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this calls getpid on every loop iteration? This seems a bit silly. What about e.g. benchmarking small kernels, where the cost of a getpid call is very much non-negligible (even with user-space caching)? There might also be some cache effects due to the extra function invocation (although I suppose that if this greatly affects your measurements, they are not very meaningful anyway).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space after if :)

This will be nonportable platform-dependent code, of which getpid is the most conservative.

@wilzbach for dmd I think you may simply insert asm {} and dmd will get very conservative.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the following idea for the unoptimizable test:

public __gshared void* _13f43be984760828c93138284d104611;
private void doNotOptimizeAway(T)(auto ref T t) @trusted @nogc nothrow
{
    if (&t == _13f43be984760828c93138284d104611) { ... }
}

Although undocumented and obfuscated, that global is technically public so the compiler must conservatively assume it may point anywhere within the address space, including the address of the parameter.

Works? cc @WalterBright @ibuclaw @klickverbot

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correx: the global should have type const void*

Copy link
Contributor

@dnadlinger dnadlinger Jun 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andralex: I don't think that would work. Let's say the parameter is allocated on the stack. After inlining doNotOptimizeAway, the compiler might be able to prove that the address of the stack slot doesn't escape, so the global can't be equal to it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klickverbot Yah, but the stack is a free-for-all address space and I wonder whether the compiler acknowledges that. Meaning, do compilers assume a global may point to stack-allocated data (from a previous function) even though the current stack data does not escape?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andralex: I think the compiler would be free to make the assumption that this doesn't occur in the C++ object model. As for what actually happens, I just checked with LDC and LLVM 3.9 doesn't seem to assume that the pointers can't alias on its own (without extra metadata from the frontend encoding the fact that it can't).

However, there is another, much bigger problem: Just calling your doNotOptimizeAway on the result of the computation won't guarantee that the actual computation isn't hoisted out of the benchmark loop. In fact, this is something LDC actually does on a few examples I tried.

This is why Chandler suggests clobber in addition to escapedoNotOptimizeAway on its own doesn't prevent code motion.

Copy link
Contributor

@dnadlinger dnadlinger Jun 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'm not sure how to go about implementing the latter in pure D, although I suppose if we had something like C++'s std::atomic_signal_fence, we might be able to reuse that. Scratch the part about atomic_signal_fence – whether this is guaranteed to work would again depend on the precise aliasing semantics.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klickverbot ah, clobber. We should add that too! Thanks!

@dnadlinger
Copy link
Contributor

Some inspiration for a minimal overhead solution: https://github.com/facebook/folly/blob/master/folly/Benchmark.h https://github.com/google/benchmark/blob/master/include/benchmark/benchmark_api.h

Unfortunately, a solution like this would be compiler-dependent in D as well.

@PetarKirov
Copy link
Member

To add to @klickverbot's suggestions, I highly recommend watching this talk on benchmarking. The key point is that in addition to preventing the compiler from deleting the function call altogether, we need to make sure we're not interfering the the measurements (by adding unrelated code) and we're not tempering with the compiler's optimization abilities.
Otherwise, we'll be measuring something different from the real world.

One of the techniques shown in presentation is the following:

void escape(void* p)
{
    asm volatile("" : : "g"(p) : "memory");
}

void clobber()
{
    asm volatile("" : : : "memory");
}

And as @klickverbot says, we need an analogous solution in D that works with DMD, GDC and LDC.

@JackStouffer
Copy link
Contributor

Why don't we just ask the compiler devs to add a new pragma to the language to solve this?

@wilzbach
Copy link
Contributor Author

Why don't we just ask the compiler devs to add a new pragma to the language to solve this?

CC @WalterBright @yebblies @UplinkCoder @ibuclaw

@UplinkCoder
Copy link
Member

tbh. you don't really need it for dmd :)
doing an empty switch in a function will disable the inliner.

@UplinkCoder
Copy link
Member

On a more serious note: We'd need to control somewhat which optimizations we disable,
Which is possible for DMD, but I don't know about LDC / GDC.
How much control do you guys have about the optimizer ?

@dnadlinger
Copy link
Contributor

How much control do you guys have about the optimizer ?

I'd say that's the wrong question to ask.

As demonstrated by the C++ benchmark library snippets (I can really recommend Chandler Carruth's talk for non-compiler writers interested in what's going on, by the way), it is perfectly possible to do what we need on both GCC and LLVM.

The challenge is rather to appropriately nail down the semantics for such barrier/fake use instructions such that their implications are understandable, while being minimal and portable.

@andralex
Copy link
Member

andralex commented Jun 4, 2017

We should add a module called benchmark of which doNotOptimizeAway would be a primitive. One question is whether this should go in druntime instead of phobos. It is highly platform-dependent so that would indicate druntime as the best place.

At any rate, portable interfaces for non-portable things are what the stdlib is about, so I'm in favor of adding this.

@ibuclaw
Copy link
Member

ibuclaw commented Jun 6, 2017

The barrier itself could be an intrinsic of druntime. Then doNotOptimizeAway can still remain in phobos, itself just a user of the intrinsic.

@andralex
Copy link
Member

andralex commented Jun 6, 2017

The barrier itself could be an intrinsic of druntime. Then doNotOptimizeAway can still remain in phobos, itself just a user of the intrinsic.

Just aiming for the simplest thing - i.e. make it part of druntime.

@JackStouffer
Copy link
Contributor

It seems like the only things blocking this is the getpid problem and an issue of future implementation. D doesn't support any platforms atm where the getpid call would be expensive. Secondly, since this is a private function, there's no reason this can't be swapped out when a better option presents itself.

I say we move forward with this.

@jmdavis
Copy link
Member

jmdavis commented Mar 7, 2018

Maybe this is a stupid question, but is there a reason why it wouldn't work to just mark a function that you don't want optimized away with pragma(inline, false)? From what I've seen, that prevents the function from being optimized out.

@JackStouffer
Copy link
Contributor

JackStouffer commented Mar 7, 2018

Can pragma(inline, false) be applied to lambda's? That's a very common use-case for benchmark.

@jmdavis
Copy link
Member

jmdavis commented Mar 7, 2018

Can pragma(inline, false) be applied to lambda's? That a very common use-case for benchmark.

I have no idea. I never use lambdas for benchmarking, because I always end up benchmarking stuff that's too complicated for that to make sense. Function attributes can be used on lambdas though, so it would make sense if a pragma like that can be used on them, and if it can't, then I'd say that it makes good sense to create an enhancement request for it. But if I'm not mistaken, and pragma(inline, false) actually does the job properly, it seems to me like a nice, simple solution to this problem.

@JackStouffer
Copy link
Contributor

JackStouffer commented Mar 7, 2018

Well, even if it did, the argument is that the standard library should do this for the user automatically, i.e. Phobos should make it hard for the user to get the wrong result in their benchmark.

@dlang-bot
Copy link
Contributor

dlang-bot commented Mar 11, 2018

Thanks for your pull request, @wilzbach!

Bugzilla references

Auto-close Bugzilla Severity Description
18593 enhancement std.datetime.stopwatch.benchmark shouldn't optimize away its functions

@wilzbach wilzbach force-pushed the dontoptimizeaway branch 2 times, most recently from 2410a32 to 94337d0 Compare March 11, 2018 21:44
@wilzbach
Copy link
Contributor Author

BTW there's also a very good explanation of clobber, escape and doNotOptimizeAway on StackOverflow.
However, it seems like asm allows an even easier solution:

asm
{
    call [fun + i];
    ret;
}

https://godbolt.org/g/WRFWNv (as you can see pragma(inline) doesn't work)

an issue of future implementation.

Yeah, in theory future implementation could optimize asm away, but

  • there's no overhead with this change anymore
  • `benchmark at the moment doesn't guarantee that it avoids optimization
  • dmd will never get so smart
  • if ldc or gdc ever get so smart, they could special-case this function or provide a custom API/method to avoid optimization, but as far as I see it, it's a problem of the future.

@ibuclaw
Copy link
Member

ibuclaw commented Mar 11, 2018

if ldc or gdc ever get so smart, they could special-case this function or provide a custom API/method to avoid optimization, but as far as I see it, it's a problem of the future.

@wilzbach - Luckily we have attributes for asm statements. In gdc, all asm blocks prevent optimizations unless you mark it as pure.

@dnadlinger
Copy link
Contributor

Can we make this forward to a new set of druntime intrinsics or something along the lines instead? The current PR is (obviously) x86-only for LDC, and GDC just doesn't support it. We'd like to avoid having to patch Phobos downstream as much as possible.

@jpf91
Copy link
Contributor

jpf91 commented Mar 11, 2018

Btw, on a quick look it seems all these solutions are meant to prevent dead code elimination. However, is there any solution to prevent 'result caching' for pure functions or inlined functions?

foreach (_; 0 .. n)
{
     escape(fun[i]());
     clobber();
}

Couldn't this still be rewritten to this:

auto tmp = fun[i]();
foreach (_; 0 .. n)
{
     escape(tmp);
     clobber();
}
?

@jmdavis
Copy link
Member

jmdavis commented Mar 11, 2018

However, is there any solution to prevent 'result caching' for pure functions or inlined functions?

pure is a non-issue so long as the function isn't called multiple times within the same statement. As I understand it, for better or worse, the results of a pure function are never really cached. The current statement may be adjusted such that the pure function is only called once, but once multiple statements are involved, the pure function will be called multiple times.

@jpf91
Copy link
Contributor

jpf91 commented Mar 11, 2018

pure is a non-issue so long as the function isn't called multiple times within the same statement. As I understand it, for better or worse, the results of a pure function are never really cached.

That's not true for gdc. If a function is strong pure (i.e. pure in the GCC backend sense), the GCC backend heavily optimizes calls to such functions including caching, merging calls, etc. (Of course the compiler must be able to 'see' two identical calls, but using inlining and maybe even lto compilers are quite good at that nowadays)

@jpf91
Copy link
Contributor

jpf91 commented Mar 12, 2018

Thinking about this some more, clobbering (all!) input arguments before calling the function should work for pure functions and even inlining problems. As the compiler no longer knows the exact values of arguments or whether these changed since the last iteration, it can not assume anything about the return value. So it has to always execute the complete function. (Or when inlining, the part of the function actually depending on the input value)

@ibuclaw does the memory clobber (asm volatile("" : : : "memory");) already take care of this, or do we explicitly have to 'write' the arguments in another asm block?

However, benchmark only uses functions which do not take any parameters. If these are pure (could also be inferred), then by definition any call is expected to return exactly the same value. So we'd need a different approach to defeat compiler optimizations there.

@ibuclaw
Copy link
Member

ibuclaw commented Jun 13, 2018

Why was this merged?

@n8sh
Copy link
Member

n8sh commented Jun 13, 2018

Because it was approved two weeks ago.

@n8sh
Copy link
Member

n8sh commented Jun 13, 2018

...and because I didn't read carefully enough. @ibuclaw @klickverbot My apologies.

Can we make this forward to a new set of druntime intrinsics or something along the lines instead?

With this soft opening, I mistook it as idle musing rather than a verdict on the viability of the PR. Can I get away with blaming this on cross-Atlantic differences in communication style?

@ibuclaw
Copy link
Member

ibuclaw commented Jun 13, 2018

Because it was approved two weeks ago.

I only see one approval. Two weeks and one year ago. :-)

@wilzbach wilzbach deleted the dontoptimizeaway branch August 22, 2018 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.