Move test::black_box to std and stabilize it #1484
I like the idea too, especially if it lets folks experiment with benchmarking in an external crate. Is the documentation of inline asm precise enough to guarantee that such a `black_box` will never be optimized away?
Probably not. E.g. a quick scan of the LLVM inline-asm docs suggests that they never promise to refrain from any optimisations.
I imagine:

```rust
#[inline(never)]
fn not_cross_crate(x: *mut u8) {}

pub fn black_box<T>(dummy: T) -> T {
    not_cross_crate(&dummy as *const _ as *mut u8);
    dummy
}
```
I don't think that we should stabilize `black_box` without first nailing down its guarantees.
Should we file an issue on LLVM about providing some kind of black box with this guarantee? Between benchmarks and timing attacks, it's not like we're the only ones dealing with this problem.
Perhaps, yeah, but it may not be too productive to do so without clearly figuring out what we actually want to do here. For example: …
The point of `black_box` is to act as a compiler barrier. Since it's a compiler barrier, a good place to look is the Linux kernel. What we want is a mixture of the kernel's `barrier()` macro and a per-value barrier along the lines of `asm volatile("" : "+r"(value))`.

This statement produces no assembly (and the input and output registers are the same), but the optimizer is forced to assume that the asm block both reads and writes `value`.
It's not a no-op, and probably a naive idea, but would copying the value with `ptr::read_volatile` help? It definitely prevented the elision of the computation in my tests. I would expect reading/rewriting the first byte of the value on-stack to be safe and of negligible cost while inducing the desired effect. It can be a no-op for ZSTs, since they don't have any state to make assumptions about.

Addendum: this seems to produce the desired effect of preventing elision of the computation. Version that covers ZSTs, branch is optimized out: https://is.gd/ApKyY7
@abonander Just for exploration and for future reference, the `read_volatile` thing is not enough:

```rust
use std::ptr;

fn main() {
    let buf = (Vec::<u8>::with_capacity(128), Vec::<u8>::with_capacity(64));
    black_box(buf);
}

fn black_box<T>(val: T) -> T {
    let _ = unsafe { ptr::read_volatile(&val as *const T as *const u8) };
    val
}
```

The allocation of the second vector will be optimized out, so the compiler is using detailed information that only the first element of the tuple is needed.
@bluss Copying the whole value, however, does prevent the elision of the second allocation: https://is.gd/6Oc0xR This also doesn't need to be special-cased. However, because it forces a copy of the full value, it can be slow for larger types.
@abonander Thanks, that does work well so far.
Standardizing a `black_box` in `std` sounds worthwhile.
As an alternative, what happens if we pass a pointer to the value to an `extern "C"` function that the compiler can't see into?
One workaround for the LTO issue is to use the same inline asm trick (inside the C function). Since the LLVM clobbers-in-asm-block behaviour is what's being relied on anyway, it seems cleaner and simpler for Rust to provide a `black_box` directly.
I did that in my clear_on_drop experiment. The trick is to use the inline asm inside the C function, since inline asm is already stable on both gcc and clang:

https://github.com/cesarb/clear_on_drop/blob/master/src/hide.rs
https://github.com/cesarb/clear_on_drop/blob/master/src/hide.c

(Pay no attention to the "fallback" attempt using atomics in that file; I've recently found out that the optimizer can still see through it. Only the inline asm and the external C function are guaranteed to work. Also, the external C function was supposed to be using c_void, as can be seen going back a few revisions, but unfortunately c_void is missing on no_std.)
@alexcrichton has already summed up the issue with the inline-asm trick above: the LLVM docs never actually guarantee that the asm block won't be optimized. I guess we could just stabilize it with a codegen test that ensures the semantics we want to guarantee?
As an additional datapoint, it appears a volatile copy of a pointer to the value also prevents optimization: https://play.rust-lang.org/?gist=953e86d736b4225ddb1d1548c679cf65&version=nightly

It's not a no-op, but it's O(1) in the size of `T`. A volatile read of the pointer appears to have the same effect (double-`&`) and also looks a little cleaner: https://play.rust-lang.org/?gist=4890326faa607b8185db20e714f52986&version=nightly
One can reason about the inline asm trick, as long as LLVM obeys the following basic premise: "the inline assembly instructions are opaque". That is, the compiler and optimizer know nothing about the inline assembly instruction(s) other than what is explicit in the template (inputs, outputs, clobbers). Of course, that means that the precise input and output specifications are very important, since all the guarantees are in them.

For instance, the assembly in the first comment above (`asm!("" : : "r"(&dummy))`): it puts the address of `dummy` in a register, so the compiler must assume the value exists in memory, but nothing more. And since that template has no outputs, it doesn't guarantee anything about the return value. Even if it's computed before the inline asm, it might get computed again, or even inlined.

Now compare with my example above (hide.rs): the value "flows through" the inline asm, so it can't be inlined; I'm also passing the whole value to the inline assembly ("*m"), instead of just its address ("r"), so the compiler is forced to compute it before the inline asm, and also to assume that the inline asm has modified it.

So yes, the current implementation might be "just a hack that appears to work", but with some effort choosing the correct modifiers, that doesn't have to be the case (modulo LLVM bugs, of course).
@cesarb nitpick: in Rust you want …
While it does a little extra stuff when invoked in isolation (spilling the value to the stack), …
I think that whether Rust provides a `black_box` is a question worth separating from how it is implemented. …
I don't know if it helps much, but …
I have a PR for the Benchmarking RFC (Manishearth#1) which adds:

- `mem::black_box`
- `mem::clobber`

The PR contains an example of how to use these to benchmark `Vec` push operations.
@gnzlbg …
@eddyb @kennytm mentioned the same thing in the PR comments. I answered here: Manishearth#1 (comment). IIUC (which I am not sure I do), `black_box` alone only forces the compiler to assume its argument is read; with `clobber` it must additionally assume that escaped memory has been written.
But LLVM can still reorder writes to e.g. stack variables, based on escape/alias analysis, right?
Right. The PR to the RFC spells out a bit more what I meant by clobbering memory: `clobber` is specified to guarantee that writes through pointers that have previously been escaped with `black_box` are flushed to memory:

```rust
fn bench_vec_push_back(bench: Bencher) -> BenchResult {
    let n = 100_000_000;
    let mut v = Vec::with_capacity(n);
    bench.iter_n(n, || {
        // Allow vector data to be clobbered:
        mem::black_box(v.as_ptr());
        v.push(42_u8);
        // Forces writes through `v.as_ptr()` to be written back to memory,
        // that is, 42 is written back to memory:
        mem::clobber();
    })
}
```

The Google Benchmark library that these functions are based on specifies them exactly like this, and its docs particularly insist that clobber only works for variables that have previously been escaped via `black_box`.
Ah, this makes a lot more sense, thanks!
I linked …
I think that the difference between the two functions matters here. IMO, the crux of the problem is that an entire computation may be omitted because of optimisation guarantees. So, IMO, there are two possible functions to aid this: one that forces a value to be treated as used, and one that forces memory to be treated as written.

Note that the former is basically a `black_box`, and the latter a `clobber`.
To clarify exactly: I misremembered the API for …
@clarcharr Could you show how to benchmark `Vec::push` with that approach?
Is it possible to use the inline asm approach, which is a no-op, instead of a volatile read, which isn't?
@gnzlbg I'd probably do a …

@hdevalence I don't think that any approach will really be a no-op, but considering that the value will already be in cache when the function runs, the performance penalty will only be a few cycles max.
So I've tried @clarcharr's suggestion here: https://godbolt.org/g/HvZJZW and compared the three alternatives discussed above.

Maybe someone with more LLVM expertise can weigh in.
I've let this settle a bit, but I wanted to add two things.

First, the purpose of these functions is to disable particular compiler optimizations without otherwise perturbing the generated code. That would rule out the volatile-read implementations, since they are not no-ops.

So I think it would be important to agree on whether this is a property that we want to preserve: should functions whose intent is to just disable compiler optimizations be allowed to generate any code when, for example, used in isolation? As mentioned, I think that ruling out the volatile-read approaches would follow from that.
I'm still a proponent of using the inline asm approach.
I agree with this statement. For completeness, it is worth mentioning that a consequence of doing this is that the code being measured in the benchmarks would differ from the code the compiler emits in a real application.

The Google Benchmark intrinsics' intent is to allow benchmarking the exact same code that the compiler would have emitted in your real application. They might or might not succeed at this (this is hard, and it is not only up to them, but also up to the benchmark writer, who must be extra careful), but that is at least what they attempt to enable.

Since doing this is hard, these might not be the only tools that people will have while writing benchmarks, though. One cool thing about …
I've opened a PR for this here: #2360
Closing in favor of #2360.
Original issue description:

The `test` crate is not stable, and I'm not aware of plans to stabilize it soon. It's probably better to spend some time experimenting on crates.io using `harness = false` in `Cargo.toml` (see rust-lang/cargo#2305).

I've extracted the test crate to eventually publish it separately and am removing usage of unstable features one by one. One of them is `asm`, used in the implementation of `test::black_box`. It's an important part of benchmarking. Since `asm!` is not stable and also unlikely to be stable soon, I'd like to have `black_box` stabilized in `std`, as a building block for external test and benchmarking harnesses.

I'm not sure what module it would go into. Maybe `std::time`?