
Move test::black_box to std and stabilize it #1484

Closed
SimonSapin opened this issue Feb 1, 2016 · 39 comments
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.

Comments

@SimonSapin
Contributor

The test crate is not stable, and I’m not aware of plans to stabilize it soon. It’s probably better to spend some time experimenting on crates.io using harness = false in Cargo.toml (see rust-lang/cargo#2305).

I’ve extracted the test crate to eventually publish it separately and am removing usage of unstable features one by one. One of them is asm, used in:

/// A function that is opaque to the optimizer, to allow benchmarks to
/// pretend to use outputs to assist in avoiding dead-code
/// elimination.
///
/// This function is a no-op, and does not even read from `dummy`.
#[cfg(not(all(target_os = "nacl", target_arch = "le32")))]
pub fn black_box<T>(dummy: T) -> T {
    // we need to "use" the argument in some way LLVM can't
    // introspect.
    unsafe { asm!("" : : "r"(&dummy)) }
    dummy
}
#[cfg(all(target_os = "nacl", target_arch = "le32"))]
#[inline(never)]
pub fn black_box<T>(dummy: T) -> T {
    dummy
}

It’s an important part of benchmarking. Since asm! is not stable and also unlikely to be stable soon, I’d like to have black_box stabilized in std, as a building block for external test and benchmarking harnesses.

I’m not sure what module it would go into. Maybe std::time?

@sfackler
Member

sfackler commented Feb 1, 2016

std::mem might make more sense as a location, but this seems like a reasonable thing to do regardless.

@sfackler sfackler added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Feb 1, 2016
@BurntSushi
Member

I like the idea too, especially if it lets folks experiment with benchmarking in an external crate. std::mem sounds betterish than std::time I think.

Is the documentation of black_box presented here something we can guarantee? What happens on nacl?

@huonw
Member

huonw commented Feb 1, 2016

Is the documentation of black_box presented here something we can guarantee?

Probably not. E.g. a quick scan of the LLVM inline-asm docs suggests they never actually promise that no optimisations will be applied.

What happens on nacl?

I imagine black_box has absolutely no effect. I guess something like the following might work better (at some cost), although it will likely fail with LTO.

#[inline(never)]
fn not_cross_crate(_x: *mut u8) {}

pub fn black_box<T>(dummy: T) -> T {
    // An opaque, never-inlined call forces the value's address to escape.
    not_cross_crate(&dummy as *const _ as *mut u8);
    dummy
}

@alexcrichton
Member

I don't think that we should stabilize black_box as it's currently "just a hack that appears to work". We don't really have any guarantees from LLVM, nor do we really know what it does; it just empirically seems to not optimize things away. A solution like @huonw mentioned of a non-cross-crate function (e.g. something that forces codegen) would unfortunately have runtime overhead (which in theory this doesn't).

@SimonSapin
Contributor Author

Should we file an issue on LLVM about them providing some kind of black box with this guarantee? Between benchmark and timing attacks it’s not like we’re the only ones dealing with this problem.

@alexcrichton
Member

Perhaps, yeah, but it may not be too productive to do so without clearly figuring out what we want to actually do here. For example:

  • Do we just want the value to not be optimized away?
  • Do we want to prevent constant propagation?
  • Do we want LLVM to think this isn't a constant even though it is?
  • Are there other optimizations we want to prevent from happening? (we probably can't just say "don't optimize this value")

@cesarb
Contributor

cesarb commented Feb 10, 2016

The point of black_box is to behave as an "optimization barrier" for a variable. It should do two things:

  • On entry to black_box, it should force the computation of its argument. The optimizer has to assume black_box depends on its argument and anything it might possibly point to;
  • On exit from black_box, it should assume nothing about its return value. The optimizer has to assume black_box might have produced its return value from thin air. In particular, even if the argument to black_box was &T, the optimizer has to assume black_box might have returned a different &T, pointing to different contents.

Since it's a compiler barrier, a good place to look is the Linux kernel. What we want is a mixture of barrier_data and OPTIMIZER_HIDE_VAR. That is, what we want is the equivalent of the following inline assembly statement in GCC's syntax (Rust's syntax is slightly different):

__asm__ __volatile__ ("" : "=r" (var) : "0" (var) : "memory")

This statement produces no assembly (and the input and output registers are the same), but the optimizer is forced to assume:

  • This statement should not be reordered with other statements, even if they seem to be unrelated;
  • This statement might depend unpredictably on the input variable, and whatever it points to;
  • This statement might return a different unpredictable value for the input variable, or if allowed (in case of &mut T or similar) might unpredictably modify whatever it points to.
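For concreteness, here is a rough sketch of what such a barrier might look like in today's stable `asm!` syntax (an illustration of the idea above, not the official implementation; the thread predates stable `asm!`, and note that `in(reg)` on the address corresponds to the weaker `"r"(&dummy)` variant rather than the full flows-through version):

```rust
use core::arch::asm;

pub fn black_box<T>(dummy: T) -> T {
    unsafe {
        // Empty template: no instructions are emitted, but the pointer is
        // passed in a register, and without the `nomem` option LLVM must
        // assume the asm block may read or write through it.
        asm!("/* {0} */", in(reg) &dummy as *const T, options(nostack));
    }
    dummy
}
```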

@abonander

abonander commented Jul 4, 2016

It's not a no-op, and probably a naive idea, but would copying the value with ptr::read_volatile() and/or ptr::write_volatile() have the intended effect?

It definitely prevented the elision of memset() for zeroing a buffer on-drop in this case, though I used it to read the first byte in the buffer instead of the Vec object on-stack.

I would expect reading/rewriting the first byte of the value on-stack to be safe and of negligible cost while inducing the desired effect. It can be a no-op for ZSTs since they don't have any state to make assumptions about.

Addendum: this seems to produce the desired effect of preventing elision of memset(), even without #[inline(never)]: https://is.gd/T6Vbb4

Version that covers ZSTs, branch is optimized out: https://is.gd/ApKyY7

@bluss
Member

bluss commented Aug 18, 2016

@abonander Just for exploration and for future reference, the read_volatile thing is not enough.

playpen link

use std::ptr;

fn main() {
    let buf = (Vec::<u8>::with_capacity(128), Vec::<u8>::with_capacity(64));

    black_box(buf);
}

fn black_box<T>(val: T) -> T {
    let _ = unsafe { ptr::read_volatile(&val as *const T as *const u8) };
    val
}

The allocation of the second vector will be optimized out, so the optimizer is using the detailed information that only the first element of the tuple is needed. Using test::black_box, neither vec is optimized out, which is what we want.

@abonander

@bluss Copying the whole value, however, does prevent the elision of the second allocation: https://is.gd/6Oc0xR

This also doesn't need to be special-cased. However, because it does force a copy of the full value it can be slow for larger types.
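Since the playground links may rot, the whole-value copy described here can be sketched as a volatile read of the entire `T` (a reconstruction from the description above, not necessarily the exact linked code):

```rust
use std::{mem, ptr};

fn black_box<T>(val: T) -> T {
    unsafe {
        // Volatile-copy the entire value, not just its first byte, so the
        // optimizer must fully materialize `val`. The cost grows with
        // size_of::<T>(), as noted above.
        let ret = ptr::read_volatile(&val);
        // Avoid running `val`'s destructor in addition to the copy's.
        mem::forget(val);
        ret
    }
}
```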

@bluss
Member

bluss commented Aug 22, 2016

@abonander Thanks, that does work well so far

@hdevalence

Standardizing a black_box function with the behaviour @cesarb described would be very useful in a bunch of different contexts beyond just benchmarking, and it's much smaller than, say, stabilizing inline asm.

@abonander

abonander commented Nov 21, 2017

As an alternative, what happens if we pass a pointer to the value to an extern "C" function defined somewhere else, maybe in the stdlib or in some C code? LLVM can't introspect that, right, and must assume it's going to be used? But I'm guessing LTO can introspect that and optimize it out unless it's in a dylib, but LTO would probably also optimize out the current impl so I don't know.

@hdevalence

One workaround for the LTO issue is to use the same inline asm trick (inside the .c file), for instance https://github.com/cesarb/clear_on_drop/blob/master/src/hide.c

Since the LLVM clobbers-in-asm-block behaviour is what's being relied on anyways, it seems cleaner and simpler for Rust to provide a black_box function instead of wrapping the inline asm in a .c file.

@cesarb
Contributor

cesarb commented Nov 21, 2017

I did that on my clear_on_drop experiment. The trick is to use the inline asm inside the C function, since inline asm is already stable on both gcc and clang:

https://github.com/cesarb/clear_on_drop/blob/master/src/hide.rs

https://github.com/cesarb/clear_on_drop/blob/master/src/hide.c

(Pay no attention to the "fallback" attempt using atomics in that file; I've recently found out that the optimizer can still see through it. Only the inline asm and the external C function are guaranteed to work. Also, the external C function was supposed to be using c_void, as can be seen going back a few revisions, but unfortunately c_void is missing on no_std.)

@abonander

@alexcrichton has already summed up the issue with the inline-asm trick above:

[...] it's currently "just a hack that appears to work". We don't really have any guarantees from LLVM, nor do we really know what it does; it just empirically seems to not optimize things away.

I guess we could just stabilize it with a codegen test that ensures the semantics we want to guarantee?

@abonander

abonander commented Nov 21, 2017

As an additional datapoint, it appears a volatile copy of a pointer to the value also prevents optimization: https://play.rust-lang.org/?gist=953e86d736b4225ddb1d1548c679cf65&version=nightly

It's not a no-op, but it's O(1) in the size of T, which is effectively the same. I don't know if this covers every angle of black_box, but it has the additional benefit of working as-is on stable. I guess it forces LLVM to assume that the pointer might have been sent over shared memory or something.

A volatile read of the pointer appears to have the same effect (double-&) and also looks a little cleaner: https://play.rust-lang.org/?gist=4890326faa607b8185db20e714f52986&version=nightly
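Again reconstructing from the description (the playground links may rot), the double-& variant volatile-reads the pointer itself rather than the pointed-to data:

```rust
use std::ptr;

fn black_box<T>(val: T) -> T {
    unsafe {
        // Volatile-read the *pointer* to the value (note the double
        // indirection): only a pointer-sized read happens, so the cost is
        // O(1) in size_of::<T>(), yet LLVM must assume the address escapes.
        let _ = ptr::read_volatile(&(&val as *const T));
    }
    val
}
```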

@cesarb
Contributor

cesarb commented Nov 22, 2017

One can reason about the inline asm trick, as long as LLVM obeys the following basic premise: "the inline assembly instructions are opaque". That is, the compiler and optimizer know nothing about the inline assembly instruction(s), other than what is explicit in the template (inputs, outputs, clobbers).

Of course, that means that the precise input and output specifications are very important, since all the guarantees are in them. For instance, the assembly in the first comment above:

asm!("" : : "r"(&dummy))

It puts the address of dummy in a register, therefore it's guaranteed that this address is computed before that point. I'm not enough of an inline assembly wizard to be sure what this means about the contents of that address, however; I don't think this forces dummy to be computed before this point.

And since that template has no outputs, it doesn't guarantee anything about the return value. Even if it's computed before the inline asm, it might get computed again, or even inlined.

Now compare with my example above (hide.rs):

asm!("" : "=*m" (ptr) : "*0" (ptr));

The value "flows through" the inline asm, so it can't be inlined; I'm also passing the whole value to the inline assembly ("*m"), instead of just its address ("r"), so the compiler would be forced to compute it before the inline asm, and also to assume that the inline asm has modified it.

So yes, the current implementation might be "just a hack that appears to work", but with some effort choosing the correct modifiers, that doesn't have to be the case (modulo LLVM bugs, of course).

@eddyb
Member

eddyb commented Nov 22, 2017

@cesarb nitpick: in Rust you want &mut T because, in the absence of UnsafeCell, we guarantee that &T cannot be mutated through, and that could be taken advantage of later.

@abonander

While it does a little extra stuff when invoked in isolation (spilling %rax to the stack and then overwriting it), the volatile-ops-on-pointer approach produces assembly that is near-indistinguishable from the inline-asm approach when inlined.

@hdevalence

I think that whether Rust provides a black_box function with defined semantics is much more important than the details of how it happens to be implemented, whether using an inline asm block or any other method.

The black_box function can be a black box; what's important is that it's in one place and has defined semantics that people can rely on (i.e., "it is a bug if this optimization barrier stops working").

@abonander

I don't know if it helps much but read_volatile and write_volatile seem to have relatively well defined semantics, at least if you consider "whatever C11 specifies" to mean 'defined'. Inline asm seems to be more like "whatever LLVM does, specified or not".

@gnzlbg
Contributor

gnzlbg commented Feb 1, 2018

I have a PR for the Benchmarking RFC Manishearth#1 which adds:

  • mem::black_box(x): x is used by "something" with side-effects and as a consequence cannot be optimized away. The expression producing x can be optimized by the compiler.

  • mem::clobber(): at the clobber, all memory is read by "something" with side-effects, and thus any memory modifications by previous code must have been committed before the clobber. This forces the compiler to flush all pending writes to memory that might otherwise have been optimized away.

The PR contains an example of how to use these to benchmark Vec::push.

@eddyb
Member

eddyb commented Feb 1, 2018

@gnzlbg mem::clobber reminds me of sync::atomic::compiler_fence, although I'm not sure what interactions it has with the situations you're interested in (or if it's even possible to flush all writes).

@gnzlbg
Contributor

gnzlbg commented Feb 1, 2018

@eddyb @kennytm mentioned the same thing in the PR comments. I answered here: Manishearth#1 (comment)

IIUC (which I am not sure I do), compiler_fence synchronizes reads/writes across multiple threads by preventing instruction reordering across "fences" of different strengths depending on the Ordering. If the compiler can prove that no other thread reads the result of a write, it can still optimize the write away; what it can't do is reorder some statements across the fence, for example.

With clobber the current thread does a volatile read of all memory and performs a side-effecting operation based on it. The compiler can still reorder across the clobber, but all pending writes must execute before it. Also, because this all happens within a single thread no memory fences / synchronization instructions need to be emitted (e.g. data-races can't happen). Does this make sense?
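A minimal sketch of what `clobber` could look like, assuming an empty-template `asm!` block (in today's stable syntax) is an acceptable implementation; omitting the `nomem` option is what makes LLVM treat the block as touching all memory:

```rust
use core::arch::asm;

pub fn clobber() {
    unsafe {
        // No instructions are emitted, but LLVM must assume this block
        // reads and writes all memory, so pending writes have to be
        // committed before it.
        asm!("", options(nostack, preserves_flags));
    }
}
```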

@eddyb
Member

eddyb commented Feb 1, 2018

But LLVM can still reorder writes to e.g. stack variables, based on escape/alias analysis, right?
Since they can never alias the volatile read without UB.

@gnzlbg
Contributor

gnzlbg commented Feb 1, 2018

But LLVM can still reorder writes to e.g. stack variables, based on escape/alias analysis, right?

Right. The PR to the RFC specifies more precisely what I meant by clobbering memory:

Clobber is specified to guarantee that writes through pointers that have previously been black_boxed are flushed to memory. So for example in the case of benchmarking Vec::push, we first need to black_box Vec's pointer and then after pushing we flush the writes through that pointer:

fn bench_vec_push_back(bench: Bencher) -> BenchResult {
    let n = 100_000_000;
    let mut v = Vec::with_capacity(n);
    bench.iter_n(n, || {
        // Allow vector data to be clobbered:
        mem::black_box(v.as_ptr());
        v.push(42_u8);
        // Forces writes through `v.as_ptr()` to be written back to memory,
        // that is, 42 is written back to memory:
        mem::clobber();
    })
}

The Google Benchmark library that these functions are based on specifies them exactly like this, and its docs particularly insist that clobber only works for variables that have previously been black_boxed (it calls its black_box function DoNotOptimize).

@eddyb
Member

eddyb commented Feb 1, 2018

Clobber is specified to guarantee that writes through pointers that have previously been black_boxed are flushed to memory.

Ah, this makes a lot more sense, thanks!

Also, because this all happens within a single thread no memory fences / synchronization instructions need to be emitted (e.g. data-races can't happen).

I linked compiler_fence and not fence specifically because of this.
Oh and I think black_box on the pointer means LLVM has to assume the pointer might have escaped to another thread, so I'm not sure we need additional machinery here.

@clarfonthey
Contributor

clarfonthey commented Feb 2, 2018

I think that the difference between black_box and clobber is confusing and very error-prone. Perhaps the name black_box is itself too vague and a better name should be chosen?

IMO, the crux of the problem is that an entire computation may be omitted because of optimisation guarantees. So, IMO, there are two possible functions to aid this:

/// The compiler forces `x` to have a known value. This prevents optimisations across the boundaries of this function, although this does not guarantee that `x` is evaluated if the result is ultimately unused.
fn value_fence<T>(x: T) -> T;

/// The compiler forces `x` to have a known value, ensuring that all the computations required to do so are executed. After this, the value is dropped.
fn evaluate_and_drop<T>(x: T);

Note that the former is basically a read_volatile and forget, while the latter appears to be a write_volatile and forget.

@clarfonthey
Contributor

To clarify exactly:

fn value_fence<T>(x: T) -> T {
    let y = unsafe { (&x as *const T).read_volatile() };
    std::mem::forget(x);
    y
}

fn evaluate_and_drop<T>(x: T) {
    let mut y = unsafe { std::mem::uninitialized() };
    unsafe { (&mut y as *mut T).write_volatile(x); }
    drop(y); // not necessary but for clarity
}

I misremembered the API for write_volatile and no forget is required.

@gnzlbg
Contributor

gnzlbg commented Feb 2, 2018

@clarcharr Could you show how to benchmark Vec::push with them? That would help me understand them better. I posted the Vec::push example here: #1484 (comment)

@hdevalence

hdevalence commented Feb 3, 2018

Is it possible to use the inline asm approach, which is a no-op, instead of a volatile read, which isn't?

@clarfonthey
Contributor

@gnzlbg I'd probably do a vec.push(); evaluate_and_drop(vec.last())

@hdevalence I don't think that any approach will really be a no-op, but considering how the value will already be in cache when the function runs, performance penalty will only be a few cycles max.

@gnzlbg
Contributor

gnzlbg commented Feb 3, 2018

So I've tried @clarcharr's suggestion here: https://godbolt.org/g/HvZJZW and the three alternatives (test::black_box, clobber, and ptr::write_volatile) all produce slightly different code:

  • test::black_box: 28 asm instructions
  • ptr::write_volatile: 27 asm instructions
  • clobber: 26 asm instructions

Maybe someone with more asm experience than me could look at their disassemblies (shown in godbolt) and comment on what exactly they are doing. Running these through cargo bench, they all report the exact same timings for Vec::push (on my machine, 3ns per iter +/- 0).

@gnzlbg
Contributor

gnzlbg commented Feb 9, 2018

I've let this settle a bit, but I wanted to add two things:

First, ptr::{read,write}_volatile are already available in std, that is, one can already use them when writing benchmarks. The benchmarking docs do not mention that these might be useful for this purpose, but in my opinion that is a documentation issue.

The purpose of test::black_box (and similar functions, like clobber) is to disable compiler optimizations. I think that functions with this purpose should not generate any code.

That would rule out value_fence and evaluate_and_drop, at least with their proposed implementation, but both test::black_box, and google_benchmark::{black_box, clobber} pass this litmus test.

So I think it would be important to agree if this is a property that we want to preserve. Should functions whose intent is to just disable compiler optimizations be allowed to generate any code when, for example, used in isolation?

As mentioned, I think that ruling out value_fence and evaluate_and_drop is not a downside, because they can already be implemented in stable Rust without issues. OTOH the other functions being discussed require inline assembly tailored to LLVM, so we would need to expose them via std to make them available on stable, and also to make sure they keep working if we ever get another backend like Cranelift (Google Benchmark has implementations for MSVC, GCC, and other compilers, each tailored to that compiler's inline-assembly functionality, since they support different features).

@abonander

abonander commented Feb 9, 2018

I'm still a proponent of using ptr::write_volatile(); the version that performs the volatile write on a pointer to the data is O(1) in the size of the data itself, meaning its overhead is constant across all invocations and thus can be factored out of performance comparisons. I haven't done enough testing to be sure that it guarantees everything we want out of black_box(), but it seems like not too bad a place to start.

@gnzlbg
Contributor

gnzlbg commented Feb 9, 2018

I'm still a proponent of using ptr::write_volatile(); the version that performs the volatile write on a pointer to the data is O(1) in the size of the data itself, meaning its overhead is constant across all invocations and thus can be factored out of performance comparisons.

I agree with this statement. For completeness it is worth mentioning that a consequence of doing this is that the code being measured in the benchmarks using ptr::write_volatile() is not the same code that will run in your application. The assembly tests I showed above seem to corroborate this statement.

The intent of the Google Benchmark intrinsics is to allow benchmarking the exact same code that the compiler would emit in your real application. They might or might not succeed at this (it is hard, and not only up to them but also up to the benchmark writer, who must be extra careful), but that is at least what they attempt to enable. Since this is hard, these might not be the only tools people use while writing benchmarks. One cool thing about test::black_box is that it is extremely simple to use.

@gnzlbg
Contributor

gnzlbg commented Mar 12, 2018

I've opened a PR for this here: #2360

@Centril
Contributor

Centril commented Oct 7, 2018

Closing in favor of #2360.
