Should Rust assume that there are unknown threads of execution? #215

Open
gnzlbg opened this issue Oct 18, 2019 · 45 comments
Labels
A-abstract-machine Topic: concerning the abstract machine in general (as opposed to any specific part of it) C-open-question Category: An open question that we should revisit

Comments

@gnzlbg
Contributor

gnzlbg commented Oct 18, 2019

@Amanieu provided the following example here, where the Channel should be placed in inter-process shared memory:

use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

// Single-producer single-consumer channel with a capacity of 1
pub struct Channel<T: Copy> {
    control: AtomicBool,
    data: UnsafeCell<T>,
}

// SAFETY: access to `data` is synchronized through the `control` flag.
unsafe impl<T: Copy + Send> Sync for Channel<T> {}

// I'm keeping the code simple here by having both read() and write()
// available on Channel. A real implementation would split the ends into
// a ReadChannel and WriteChannel to ensure only a single producer/consumer.
impl<T: Copy> Channel<T> {
    pub fn read(&self) -> T {
        // Wait for the channel to be full
        while !self.control.load(Ordering::Acquire) {}
        
        // Read the data
        let data = unsafe { *self.data.get() };
        
        // Mark the channel as empty
        self.control.store(false, Ordering::Release);
        data
    }

    pub fn write(&self, data: T) {
        // Wait for the channel to be empty
        while self.control.load(Ordering::Acquire) {}
        
        // Write the data
        unsafe {
            *self.data.get() = data;
        }
        
        // Mark the channel as full
        self.control.store(true, Ordering::Release);
    }
}
  • If Rust proves that the Rust process is single-threaded, is it a sound optimization to replace the atomic loads/stores of the AtomicBool with non-atomic memory accesses? If that optimization is sound, a different process attempting to access the bool introduces a data race, and the behavior is undefined.

  • What assumptions does LLVM make?

Note: replacing the atomic loads and stores with volatile atomic loads and stores would ensure that the example above is correct independently of the answer to these questions.
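
For concreteness, here is a minimal same-process usage sketch of the Channel above (hypothetical demo code; it assumes the Sync impl shown with the struct and that this code lives in the same module, while the scenario under discussion instead maps the Channel into memory shared between processes):

fn demo() {
    let chan = Channel {
        control: AtomicBool::new(false),
        data: UnsafeCell::new(0i32),
    };
    // Scoped threads let both ends borrow the same channel.
    std::thread::scope(|s| {
        s.spawn(|| chan.write(42)); // producer
        assert_eq!(chan.read(), 42); // consumer
    });
}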

@Amanieu
Member

Amanieu commented Oct 18, 2019

The illusion that Rust has a global view of the system goes out of the window the moment it makes an FFI call. This happens long before main is even called, in the libstd initialization code. And even then, there is the possibility of injecting arbitrary code into a process through LD_PRELOAD.

Even without these, Rust cannot make any assumptions about what an FFI call which returns a shared memory region might do. For all Rust knows the other end of the shared memory region could be C code within the same process (add a #[repr(C)] to Channel if it makes you feel better).

@Diggsey

Diggsey commented Oct 18, 2019

On windows, there's also CreateRemoteThread which you can use to create a thread in another process.

@HadrienG2

HadrienG2 commented Oct 18, 2019

@Amanieu Rust already assumes that FFI calls cannot do certain things. For example, we assume that an FFI function won't randomly read and write our entire address space, as tolerating this behavior would make almost any optimization of Rust memory accesses externally observable, and therefore invalid under an as-if rule of optimization legality. This is in spite of the fact that a function implemented in, say, x86 assembly could perfectly well do such a thing.

I think that assuming a global view of concurrency in the application would be consistent with this prior decision of disallowing certain behavior from FFI entities for the sake of enabling more optimizations on the Rust side.

Conversely, I think that not assuming such a global view of concurrency would greatly harm rustc's future ability to optimize concurrency primitives such as atomics, by effectively making atomics as pessimistic as volatile ("assume that all reads and writes are externally observable") as soon as there is any remote possibility of memory sharing with the outside world. Which happens as soon as a pointer to a memory region has ever been sent to opaque FFI for any purpose, even if that was just a printf call.

As an alternative, we could have both maximal optimization capabilities in Rust code and enable FFI concurrency / interprocess communication by having both non-volatile atomics (for intra-process / pure Rust communication) and volatile atomics (for FFI threads, MMIO and interprocess communication). This is the direction which I would personally advocate.

As an aside, there is prior art for differentiating intra-process and inter-process synchronization at the operating system level. For example, this is the main semantic difference between Windows's Mutex and CriticalSection APIs (which in turn allows CriticalSection to be much simpler and faster in the common case than Mutex).

@Amanieu
Member

Amanieu commented Oct 18, 2019

Which happens as soon as a pointer to a memory region has ever been sent to opaque FFI for any purpose, even if that was just a printf call.

Note that LLVM has an attribute specifically for this, nocapture, which indicates that a pointer passed into the function is not used after the function has returned. There is also the noalias attribute which, when applied to the return value, indicates that the returned pointer is unique (i.e. like a pointer returned from malloc).

Also consider that thread creation itself is an FFI call, where the thread closure is passed in as an argument. For this FFI call, any memory referred to by the thread closure must clearly be externally observable, yet there is no way for rustc to know this just from the signature of the call.

My point is that, unless specifically marked with attributes, any memory whose address is passed to or from an FFI call should be assumed to be externally accessible. And indeed this is how LLVM currently works.

@gnzlbg
Contributor Author

gnzlbg commented Oct 18, 2019

@Amanieu

Sorry for not being more specific; your observation that unknown function calls can do many things that prevent the compiler from applying optimizations is correct. To be more specific: another potential way of specifying a shared memory region would be to declare it as an extern static variable:

#![no_core] #![no_std] #![no_main]

extern "C" {
    static FOO: Channel<i32>;
}

#[no_mangle] fn main() { ... }

Do you think it is legal for a Rust compiler to optimize the atomic accesses of the channel to non-atomic ones if it can prove, from just looking at the Rust program, that it has not been possible for other threads of execution to be created?

@Amanieu
Member

Amanieu commented Oct 18, 2019

In your code example the channel is declared in an extern "C" block. This means that it may be accessed from outside the current compilation unit, which in turn means that the compiler must conservatively assume that it could be accessed from anywhere, even another thread.

Do you think it is legal for a Rust compiler to optimize the atomic accesses of the channel to non-atomic ones if it can prove, from just looking at the Rust program, that it has not been possible for other threads of execution to be created?

How can the compiler even prove that no threads are created? The only case where this optimization is valid is on platforms that support neither threading nor signals, e.g. emscripten.

@gnzlbg
Contributor Author

gnzlbg commented Oct 18, 2019

which in turn means that the compiler must conservatively assume that it could be accessed from anywhere, even another thread.

If such other threads exist. If they don't, then that does not hold.

How can the compiler even prove that no threads are created?

You mention that creating a thread requires an unknown function call. If the program contains no unknown function calls, then no threads are created by that program.

@bjorn3
Member

bjorn3 commented Oct 18, 2019

If the program contains no unknown function calls, then no threads are created by that program.

All programs without unknown function calls always either exit without doing anything but consuming CPU, or they crash. Either way they are not very useful. Even something as simple as printing something to stdout requires an unknown function call to libc. Even if libc is visible to LTO, it still requires inline asm.

@gnzlbg
Contributor Author

gnzlbg commented Oct 18, 2019

All programs without unknown function calls always either exit without doing anything but consuming CPU, or they crash.

#![no_core] #![no_std] #![no_main] #![no_entry] ...

extern "C" {
    static FOO: Channel<i32>;
}

#[no_mangle] 
fn main() {
    FOO.write(0);
    loop {
        if FOO.read() == 42 {
            println!("Hello world");
        }
    }
}

That program calls unknown functions, but it doesn't do so before interfacing with shared memory, so it could not have spawned a second thread that modifies FOO to set it to 42.

Can the compiler optimize that program to fn main() { loop {} }?

@Amanieu
Member

Amanieu commented Oct 18, 2019

No, because external C code may run before main (via static constructors, LD_PRELOAD, etc.), and this code may modify the publicly visible Channel.

@gnzlbg
Contributor Author

gnzlbg commented Oct 18, 2019

No, because external C code may run before main (via static constructors, LD_PRELOAD, etc.), and this code may modify the publicly visible Channel.

So you are saying that this code can spawn threads that remain active after main starts, and such code can modify Channel after main initializes it?

How is this code represented in the abstract machine? (E.g., how should miri implement what this code does?)

@Lokathor
Contributor

It seems natural to assume that any unknown code that runs before the Rust main (via any mechanism) can do anything at all that such code is able to do, including spawning threads.

@HadrienG2

Compiler abstract machines have somewhat supernatural characteristics though, as that is needed to permit code optimization.

@bjorn3
Member

bjorn3 commented Oct 18, 2019

So you are saying that this code can spawn threads that remain active after main starts, and such code can modify Channel after main initializes it?

In this case an unknown function did get called before calling main. The function _start from libc is an unknown function which is the actual caller of main.

@gnzlbg
Contributor Author

gnzlbg commented Oct 19, 2019

@bjorn3 I've expanded the ... from the first line of the example to include #![no_entry], which should "fix" that.

@Lokathor
Contributor

Then, yes, with a correct linker script and some sort of embedded environment you could get Rust code that starts executing immediately, with no previous code having run.

@Lokathor
Contributor

Ignoring all sorts of problems like "there would be no println" and "your code didn't initialize the channel" and all that.

@Amanieu
Member

Amanieu commented Oct 19, 2019

Sure, but Rust as a language doesn't know that main (or rather _start since this is #[no_entry]) is only called once at the start of the program before any other code. All it knows is that there is a main function which is public and which may be called by external code. For all it knows, main could be called concurrently from multiple threads.

@gnzlbg
Contributor Author

gnzlbg commented Oct 20, 2019

Sure, but Rust as a language doesn't know that main (or rather _start since this is #[no_entry]) is only called once at the start of the program before any other code. All it knows is that there is a main function which is public and which may be called by external code. For all it knows, main could be called concurrently from multiple threads.

Well, that's the question this issue asks: does it?

Can the Rust abstract machine assume that main is where the program starts and is called by a single thread of execution and therefore no other concurrent threads might have been created before it?

Or does the Rust abstract machine somehow model that other threads of execution might have been created before main and have spawned unknown code? What can this unknown code then do? How does miri model this? What can or cannot unsafe code authors assume about what such code does? (This is the UCG after all.)

@Amanieu makes the point that, in practice, such code exists, but we already optimize under the assumption that there are many things that such code is not allowed to do. So if we allow such code to exist, are any of the optimizations that we are already doing unsound? (E.g., can that code escape a pointer to the main stack and mutate atomics on that stack behind the thread's back? Etc.)

@Amanieu
Member

Amanieu commented Oct 20, 2019

I don't know about the Rust abstract machine, but miri does not support FFI except for a small whitelisted subset, and therefore avoids this issue (it also doesn't support threads).

@Amanieu makes the point that, in practice, such code exists, but we already optimize under the assumption that there are many things that such code is not allowed to do. So if we allow such code to exist, are any of the optimizations that we are already doing unsound? (E.g., can that code escape a pointer to the main stack and mutate atomics on that stack behind the thread's back? Etc.)

The main assumption used by the optimizer (LLVM) is that external code cannot modify an "allocated object" (each local variable is allocated on the stack with a separate alloca LLVM instruction, but a struct is a single alloca) unless the address of that object is "leaked" to the external code.
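
A sketch of that leak rule (opaque is a hypothetical external function, used here only for illustration):

extern "C" {
    fn opaque(p: *mut u32); // hypothetical external function
}

fn private() -> u32 {
    let x = 1u32; // address never escapes: external code cannot access x
    x + 1         // may be constant-folded to 2
}

fn leaked() -> u32 {
    let mut x = 1u32;
    unsafe { opaque(&mut x as *mut u32) }; // address escapes ("leaks") here
    x + 1 // x must be reloaded: opaque() may have stored to it
}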

@gnzlbg
Contributor Author

gnzlbg commented Oct 20, 2019

I don't know about the Rust abstract machine, but miri does not support FFI except for a small whitelisted subset, and therefore avoids this issue (it also doesn't support threads).

Since you were talking about x86_64-unknown-linux-gnu above, and miri supports running Rust code that starts with main for that target, aiming to detect all undefined behavior in such Rust code, miri needs to emulate what the code before main does, or at least what it is allowed to do.

If you are arguing that such code is allowed to spawn threads that modify Rust memory, then either miri needs to do the same (*) or miri cannot support such a target.

(*) EDIT: e.g. erroring on any non-atomic memory access to memory that such "before-main" threads of execution are allowed to access.

The main assumption used by the optimizer (LLVM) is that external code cannot modify an "allocated object" (each local variable is allocated on the stack with a separate alloca LLVM instruction, but a struct is a single alloca) unless the address of that object is "leaked" to the external code.

IIUC under your model, the threads of execution before main can access the main-thread stack (they need to start that thread somehow so they could own its handle), which would mean that all variables in that thread have been "leaked" to external code. Unless that's somehow restricted, that would make a lot of optimizations that we are currently performing unsound.

@RalfJung
Member

My point is that, unless specifically marked with attributes, any memory whose address is passed to or from an FFI call should be assumed to be externally accessible. And indeed this is how LLVM currently works.

Agreed.

Compiler abstract machines have somewhat supernatural characteristics though, as that is needed to permit code optimization.

They'd better not; we want to prove things about them. ;)
But probably some would call this or this "supernatural".


Can the Rust abstract machine assume that main is where the program starts and is called by a single thread of execution and therefore no other concurrent threads might have been created before it?

Or does the Rust abstract machine somehow model that other threads of execution might have been created before main and have spawned unknown code? What can this unknown code then do? How does miri model this? What can or cannot unsafe code authors assume about what such code does? (This is the UCG after all.)

I feel we talked about this on Zulip but it is probably a good idea to write this down somewhere a bit more permanently...

The way I view this discussion, what y'all are talking about is the initial state of the Rust Abstract Machine (R-AM). We could say that the compiler may assume that the program given by the user is started in the R-AM with empty memory and no other thread. That would permit some of the optimizations that were mentioned here. However, it seems more reasonable (and indeed it better matches what compilers do in practice) to say that the code compiled by the compiler must work for any initial state of the R-AM. More precisely, for any such initial state, if that execution of the R-AM does not enter the "UB" error state, then the optimized program must behave the same way.

So, there can already be some memory allocated, there can already be other threads running, and so on. Formally, this assumption gives us, I think, what we want: rustc cannot just blindly remove atomicity from the example in the OP just because main spawns no thread; it has to actually prove that the memory in which the channel is stored is private to the program. At the same time, if the other threads in the initial state are "bad" and cause UB by introducing data races, the compiler is allowed to ignore that case. IOW, unsynchronized access to static mut is still okay (if you only access it from one thread); but if the access is synchronized the compiler may not remove the synchronization.

This also means there's not really anything for Miri to model. Miri just picks a particular initial state when running the program. And likewise does the user when running the real program -- normally there is no other thread so what Miri does is fine. If the user runs the program in another initial state, say with another thread in parallel, they'd have to also run that thread in Miri (assuming Miri supports threads one day). But that's no different from, say, having to test Miri with the actual content of the file system if that's what you want to test.
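
A small sketch of the distinction drawn in the last paragraph (assumed example, not from the thread):

use std::sync::atomic::{AtomicU32, Ordering};

static mut UNSYNC: u32 = 0;
static SYNC: AtomicU32 = AtomicU32::new(0);

fn bump() {
    // If an unknown thread raced on UNSYNC, that execution would be UB,
    // so the compiler may optimize as if no such thread exists.
    unsafe { UNSYNC += 1 };
    // A concurrent access to SYNC is a legitimate, UB-free execution of
    // the R-AM, so this atomic operation may not be weakened.
    SYNC.fetch_add(1, Ordering::Relaxed);
}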

@RalfJung RalfJung added C-open-question Category: An open question that we should revisit A-abstract-machine Topic: concerning the abstract machine in general (as opposed to any specific part of it) labels Oct 21, 2019
@RalfJung
Member

To follow up on what I said earlier -- actually, I realized this morning that the usual way this is handled is by considering "contextual refinement" instead of "refinement" when defining compiler correctness. That's a strict generalization of this "initial state" business.

"refinement" means "every behavior exhibited by the target program is also a possible behavior of the source program". We have to state it this way because of non-determinism; in a deterministic language we could equivalently say "the source and the target program behave the same way". (Defining "behavior" and its equality here is subtle but orthogonal -- it amounts to something about the traces of externally visible events.)

"contextual refinement" means "pick some context (a program with a hole), and then the context filled with the compiled program must refine the context filled with the source program".

In other words, refinement is a "whole-program notion", one that assumes total knowledge of everything that happens. Contextual refinement is a modular concept, one that assumes we are just compiling part of the whole program here and no matter with which context this gets put together eventually, the compiled program must behave the same as (or more precisely, refine) the source program.

This is what typical compilers have to do anyway; all we need to be careful about is, when compiling main, not to assume that the context is empty. Basically we pretend that we do not know that main is the entry point. (And some might say it really isn't, start_fn is -- but the same goes there; during compilation we pretend that this is just another normal function.)

@gnzlbg
Contributor Author

gnzlbg commented Oct 21, 2019

The dumb example I have in mind is:

use std::sync::atomic::{AtomicU32, Ordering};

fn foo() -> u32 {
    let x = AtomicU32::new(0);
    x.store(42, Ordering::SeqCst);
    x.load(Ordering::SeqCst)
}
fn main() {
    assert_eq!(foo(), 42); // Is this always true?
}

If, when the abstract machine is initialized, there are other threads of execution, and those threads have pointers to or own the main thread's stack, then: could they write to foo::x and modify its value concurrently, e.g., between the store and the load, so that foo does not always return 42?

What if x is a static instead of a local and:

fn foo() -> (u32, usize) {
    static x: AtomicU32 = AtomicU32::new(0);
    x.store(42, Ordering::SeqCst);
    (x.load(Ordering::SeqCst), &x as *const _ as usize)
}
fn main() {
    assert_eq!(foo().0, 42); // Is this always true?
}

? (those other threads before main could call foo and obtain the address of the static)

@bjorn3
Member

bjorn3 commented Oct 21, 2019

I think we should forbid accessing stack allocations which don't have their address leaked (as LLVM already assumes), but otherwise assume other threads of execution. That would make the above example guaranteed not to panic, but when the AtomicU32 is in a static, it may panic.

@Amanieu
Member

Amanieu commented Oct 21, 2019

@gnzlbg In the first case the address of x on the stack is never leaked outside the function, so the compiler is allowed to optimize foo to just 42. The key rule is that you are only allowed to access an object through a pointer that is derived from the address of that object (i.e. the address originally came from &x at some point in the past), otherwise behavior is undefined.

In the second case the address is indeed leaked (I'm assuming you mean pub fn foo since you mention other threads calling it), therefore the compiler must assume other threads may be concurrently modifying the atomic.

@RalfJung
Member

RalfJung commented Oct 21, 2019

Agreed with @Amanieu. This matches the "contextual refinement" definition:

  • For the first case, the location is private to foo, there is no way the context could access it. (Remember that the Rust Abstract Machine does not specify stack allocations as being adjacent or following a stack discipline or anything like that; there is hardly any difference between stack and heap allocations. Hence the address of x is unpredictable, and any attempt to guess it will have many possible executions with UB, meaning the compiler can ignore them.)

  • For the second case, there is a context that breaks this: spawn two threads. One thread calls main(). The other thread calls foo(), takes the 2nd return value, casts it to &AtomicU32 and stores a 23 in it. This is a UB-free whole program where foo can return 23 (see the sketch after this list).

    That said, 42 is still always an allowed return value, so the optimization that removes the comparison in assert! and replaces it by true is still allowed!
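
A hypothetical sketch of that breaking context, reusing foo and main from the example above (harness code assumed for illustration, not part of the original comment):

use std::sync::atomic::{AtomicU32, Ordering};

fn context() {
    let t = std::thread::spawn(|| main()); // one thread calls main()
    let (_, addr) = foo(); // the other thread calls foo() ...
    let leaked = unsafe { &*(addr as *const AtomicU32) };
    leaked.store(23, Ordering::SeqCst); // ... and stores 23 through the leaked address
    t.join().unwrap();
}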

@gnzlbg
Contributor Author

gnzlbg commented Oct 21, 2019

I want Rust to optimize my Rust programs under the assumption that the code before main doesn't do any of this by default.

@RalfJung
Member

Do you have an even remotely realistic motivating example for that wish?

We could certainly add a flag or so to basically let the compiler assume, when compiling a binary, that the context is empty. But that's not usually how compilers work, it seems.

@gnzlbg
Contributor Author

gnzlbg commented Oct 21, 2019

Do you have an even remotely realistic motivating example for that wish?

I mostly use two tier 1 targets (x86_64-apple-darwin and x86_64-unknown-linux-gnu), and in neither does the code before main spawn extra threads that concurrently mess with my programs' statics for fun, so I don't see why I should pay the cost of that maybe happening when it does not happen on the platforms I care about. I also don't know of any platform where this happens, btw.

@hanna-kruppe

The question is, what cost? What's a remotely realistic program (fragment) where useful (read: significant positive impact on some measure of codegen quality) optimization can be performed with this assumption, that could not (with reasonable effort) be performed without this assumption?

(It's also not true that the targets you mention can't have code before main that spawns extra threads, since those platforms support life-before-main even if the entry point is Rust. You may wish to exclude it from valid uses of your programs but if you mean that, then say that.)

@gnzlbg
Contributor Author

gnzlbg commented Oct 22, 2019

@rkruppe

(It's also not true that the targets you mention can't have code before main that spawns extra threads, since those platforms support life-before-main even if the entry point is Rust. You may wish to exclude it from valid uses of your programs but if you mean that, then say that.)

Yes, I want to exclude that code from my programs.

The question is, what cost? What's a remotely realistic program (fragment) where useful (read: significant positive impact on some measure of codegen quality) optimization can be performed with this assumption, that could not (with reasonable effort) be performed without this assumption?

What do you mean by "reasonable effort"? If I'm writing a single-threaded program I want the compiler to be able to optimize away atomic memory accesses. I don't want to have to pick a completely different set of dependencies that do not use atomics instead.

Part of Rust's design has been to limit what life before main can do. This proposal, however, appears to suggest introducing complex life before main in the language which, among other things, is able to spawn threads that interact with the Rust program, or maybe even something like: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=4b63a3b131423e3964816577756ed8c5 , which does not match what, e.g., the current Rust Design FAQ says (sect. "1.4 There is no life before or after main").

If your program requires, for correctness, the assumption that life before main is complex, you can currently still write a correct program with a reasonable amount of effort even without the language making that assumption (e.g. by using volatile or atomic volatile instead of plain loads/stores or atomics), disabling compiler optimizations only for those memory accesses that need the stronger guarantees.

OTOH, if we allow complex life before main, it is not clear to me how we could let single-threaded programs state that life before main is not complex, such that this information can be used for optimizations, while still making sure that, e.g., code that uses atomics for inter-process communication under that assumption gets diagnosed, to prevent programmer errors. I don't think a global knob would be a good idea for this. It also feels to me that those using multi-threading or inter-process shared memory should be able to do so, paying whatever is necessary for it, but that it would be wrong for single-threaded code (or multi-threaded code that does not use IPC) to have to pay for it.

Maybe this issue could be rephrased as: "How complex is life before main allowed to be?" (until now the goal was that there is absolutely no life before main).

@hanna-kruppe

Rust the language does not provide any access to life-before-main features, since it is not considered a good feature. It is, incidentally and at the user's own peril, possible to use it by wielding low-level non-portable facilities such as #[link_section] in a certain way, but this is frowned upon. However, since essentially all C++ code uses life-before-main and Rust needs to be able to interact with C++ libraries, it has never been a remotely reasonable position that Rust code is wholly incompatible with code running before the Rust entry point.

It is especially extreme to suggest that some vague subset of plausible uses of life-before-main (say, starting a logger thread) ought to lead to UB when used in a program where the entry point happens to be written in Rust.

What do you mean by "reasonable effort"? If I'm writing a single-threaded program I want the compiler to be able to optimize away atomic memory accesses. I don't want to have to pick a completely different set of dependencies that do not use atomics instead.

I was talking about reasonable effort on the part of the optimizer, not the programmer.

To put it bluntly, you are writing paragraphs upon paragraphs predicated on these hypothetical optimizations mattering, but so far there is not a shred of evidence that they actually matter. Optimizations, and UB reserved for optimization purposes, need to be worthwhile, both to justify the spec and implementation effort and to outweigh the costs of imposing more UB on users. There are several a priori reasons to doubt that "assume an initial R-AM state with no other threads running" would have any measurable benefit in practice:

  • It only applies to the program slice from entry to the first point where the optimizer's ability to rule out spawning of further threads fails, which in many programs is not very long. This obviously includes FFI calls, but also indirect calls (function pointers and trait object methods) and even non-inlined direct calls (unless this notion of "does not spawn threads" is reified into a function attribute, and even then it depends on LTO).
  • Removing uncontended atomics is by itself rarely a significant win. It could hypothetically have some ripple effects (e.g., LLVM's nosync attribute will in the future play into propagating dereferenceability information) but it still seems unlikely to me to be an enabler for big wins for many Rust programs.
  • In a lot of Rust code, synchronization is relatively rare unless the code actively uses parallelism (in which case the synchronization can't be removed, period). In particular, statics with interior mutability are quite rare. I find it difficult to believe that operating on these is anywhere close to being hot enough for the optimizations you discuss mattering.
  • There are a large number of tricks already available for optimization, and in many scenarios they make "assume no other threads at entry" redundant. For example, in a binary crate with LTO (which is kinda necessary for the optimizations you want anyway), the vast majority of functions and statics will get internal linkage meaning they are only accessible from code in the CGU. If pointers to them do not escape either, that already suffices for a lot of optimizations.

The way to solve this, as suggested before, is to showcase a somewhat realistic program, a plausible sequence of compiler optimizations of that program which require assuming that no other threads existed on entry, and hard numbers showing that the program is significantly faster/smaller/etc. after the optimizations. The last part is key. It is easy to conjure up examples where the final assembly looks nicer, but what matters is whether it (for example) actually runs faster.

@gnzlbg
Contributor Author

gnzlbg commented Oct 22, 2019

In a lot of Rust code, synchronization is relatively rare unless the code actively uses parallelism (in which case the synchronization can't be removed, period). In particular, statics with interior mutability are quite rare. I find it difficult to believe that operating on these is anywhere close to being hot enough for the optimizations you discuss mattering.

I don't think this is true. Precisely because Rust does not have life before main, a lot of Rust code needs to initialize statics on first use (e.g. if the initialization logic is not supported by const fn), and doing that requires synchronization. Single-threaded programs using these libraries pay for this synchronization even though there is nothing to synchronize, because these programs are single-threaded.

One example of a static in the standard library that gets used by hot code and that cannot currently be optimized for single-threaded programs is the target-feature cache used by is_x86_feature_detected!.

In a single-threaded program, its usage expands to something like this (schematically; is_uninitialized, initialize and test_bit stand in for the real cache-handling helpers):

static CACHE: AtomicU32 = ...;
fn main() {
    for _ in 0.. { // some hot computation loop
        let mut cache = CACHE.load(Ordering::Acquire);
        if cache.is_uninitialized() {
            cache.initialize();
            CACHE.store(cache, Ordering::Release);
        }
        // dispatch to the avx implementation if available
        if test_bit(cache, "avx") {
            avx_function()
        } else {
            fallback_function()
        }
    }
}

Under the model that external threads which modify Rust statics could have been created before main, the target-feature cache needs to be re-loaded at every loop iteration, since its value can change. If no external threads that can concurrently modify the static can be spawned before main, or if it would be a data race for such threads to modify the static (https://godbolt.org/z/FXRc46), the load can be hoisted out of the loop, which enables further optimizations.
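
Schematically, the hoisted form that optimization would produce (a self-contained sketch; detect_features, the bit layout, and the loop bound are hypothetical stand-ins):

use std::sync::atomic::{AtomicU32, Ordering};

static CACHE: AtomicU32 = AtomicU32::new(0); // 0 = uninitialized

fn detect_features() -> u32 { 0b11 } // hypothetical: bit 0 = initialized, bit 1 = "avx"
fn avx_function() {}
fn fallback_function() {}

fn main() {
    // Load and initialize once, outside the loop. This hoisting is only
    // sound if no unknown thread can store to CACHE concurrently.
    let mut cache = CACHE.load(Ordering::Acquire);
    if cache == 0 {
        cache = detect_features();
        CACHE.store(cache, Ordering::Release);
    }
    let has_avx = cache & 0b10 != 0;
    for _ in 0..1_000_000 { // some hot computation loop
        if has_avx { avx_function() } else { fallback_function() }
    }
}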

@Lokathor
Contributor

Static initialization is generally performed by the OS, I thought, as part of its general contract with an executable to have a sane setup before the program gets started. Or, in an embedded context, well you probably have some sort of asm script that copies some stuff around or something like that.

@comex

comex commented Oct 22, 2019

@gnzlbg

That sounds like an extremely fragile optimization. It would only work if:

  • The hot loop is inlined into main; this is probably not the case in most 'real' programs.
  • There are no FFI calls whatsoever (direct or indirect) before or within the loop... unless you also want to go around whitelisting FFI calls as being unable to spawn threads.
  • Within the loop, there are no writes to unknown locations or calls to unknown functions which (in the compiler's view) could end up modifying CACHE.
  • The program does not itself use threads. Obvious, but...

We know that in reality, once the cache is initialized it cannot be de-initialized. If only we had a way to communicate those semantics to the compiler, they would justify performing the same optimization in arbitrary functions, regardless of threading, regardless of FFI calls or unknown writes. That would be a much more robust approach (a concrete write-once sketch follows after the list below).

Potential avenues include:

  • The compiler should already be able to assume that statics, like stack variables, will only be accessed by code that obtained their address 'legitimately'. That includes:

    • Code within the crate;
    • If building a library and the static is exported, clients of that library;
    • Any code which was handed a pointer to the variable by code that itself obtained the address legitimately. For instance, if something calls some_ffi_func(&CACHE), the compiler obviously has to assume that FFI code could access CACHE from that point forward.

    But it does not include code that, say, finds the address of CACHE from the symbol table despite it not being exported.

    In theory, if CACHE is private to the crate and all code that accesses it resembles your example, the compiler could perform a global analysis to conclude that it cannot be modified after being marked initialized. However, these kinds of global analyses are likely too expensive to perform in practice.

  • Unordered allows loop hoisting, and more generally has most of the same optimization potential as non-atomic accesses. If Rust supported Unordered, using it instead of Acquire would make your optimization legal without needing to assume single-threadedness. However, the optimization would still break in the presence of unknown writes or calls to unknown functions. (Therefore, as much as I'm eager for good use cases for Unordered, I'm hesitant to conclude that this represents one.)

  • C has the GCC extension __attribute__((const)) (LLVM equivalent readnone), which creates an unsafe assumption that a given function will always return the same value for a given set of arguments; thus it can always be hoisted regardless of the presence of unknown writes. (This is similar to __attribute__((pure)), but pure allows the function to use global variables and data accessible from there to determine its return value, so it can only be hoisted if the loop doesn't write to anything that could possibly be accessible from the function.)

    Here is an example of how a call to such a function can be hoisted out of a loop, even if the function's implementation is not pure: https://gcc.godbolt.org/z/R7syIh

    Unfortunately, this only works reliably if the function is not inlined, because after inlining the purity assumption is not retained in the IR. Even with that limitation, though, it may be practically useful in at least some cases, so it would be nice if we added a way to make rustc mark functions readnone.

    Regardless, it's certainly theoretically possible to improve this optimization to work better with inlining, and your single-threaded-assumption-based optimization is also purely theoretical at this point.

  • If you really want something that works today, there's always asm. ;) asm blocks with no volatile and no side effects can also be hoisted – in fact, LLVM internally treats asm blocks like function calls – but they don't result in an assembly-level call instruction, so they're more or less zero-overhead. Of course, using asm is a total hack, since the objective of "tell the optimizer this code can be hoisted out of a loop" shouldn't require architecture-dependent code. But in this particular example, is_x86_feature_detected is architecture-dependent anyway...
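
As a concrete note on the write-once idea mentioned at the top of this comment: std::sync::OnceLock (a std API stabilized well after this discussion) expresses "initialized at most once" at the library level, though it does not by itself license the optimizer to hoist across unknown calls. A sketch, with detect() as a hypothetical stand-in for real feature detection:

use std::sync::OnceLock;

static CACHE: OnceLock<u32> = OnceLock::new();

fn detect() -> u32 {
    0 // hypothetical stand-in for real feature detection
}

fn features() -> u32 {
    // Runs detect() at most once; afterwards this is a load plus a branch.
    *CACHE.get_or_init(detect)
}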

@glaebhoerl

I don't know how relevant this is here, but GNU libstdc++ has an optimization where its shared_ptr uses non-atomic operations if it determines that the program cannot possibly have created a second thread.

@RalfJung
Member

RalfJung commented Nov 5, 2019

@gnzlbg

Part of Rust's design has been to limit what life before main can do. This proposal, however, appears to suggest introducing complex life before main in the language which, among other things, is able to spawn threads that interact with the Rust program, or maybe even something like: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=4b63a3b131423e3964816577756ed8c5 , which does not match what, e.g., the current Rust Design FAQ says (sect. "1.4 There is no life before or after main").

That FAQ also says things like "match must be exhaustive". So should we stop supporting FFI with languages that do not make such requirements?

That is a FAQ for Rust itself, but we don't get to make the rules for how Rust interacts with the outside world. If I had my way, we'd not support FFI with C or C++ as those are messy and complicated. We'd not have padding in unions (as in, all their bytes get copied when passed to another function) and everything could be nice. Heck, we'd not even have unions, they have basically no use in pure Rust. But alas, we can't just wish the rest of the world away. We support interacting with C code where unions lose some of their bytes when passed by-value, and we likewise support interacting with C code that has life-before-main.

I am honestly quite flabbergasted that you out of all people suggest to severely restrict the interoperability of Rust with other languages like that.

@glaebhoerl

I don't know how relevant this is here, but GNU libstdc++ has an optimization where its shared_ptr uses non-atomic operations if it determines that the program cannot possibly have created a second thread.

I consider that a "poor man's Rc". Even if you write entirely sequential code, if you load any library that spawns a thread (and many do -- once you write UI applications, there are tons of threads for GUI and audio and IPC and whatnot), your code has to pay the cost of thread-safe ref-counting.

@comex

We know that in reality, once the cache is initialized it cannot be de-initialized. If only we had a way to communicate those semantics to the compiler, they would justify performing the same optimization in arbitrary functions, regardless of threading, regardless of FFI calls or unknown writes. That would be a much more robust approach.

Good observation! Even in sequential code, this optimization @gnzlbg asks for is far from trivial, as literally any uncontrolled write anywhere in the program could break it. Having a compiler-understood notion of "write-once variables" would be very useful here.

@Ixrec
Contributor

Ixrec commented Nov 5, 2019

Heck, we'd not even have unions, they have basically no use in pure Rust.

Not super relevant to this thread, but since this is a weirdly persistent misconception: There are significant pure-Rust use cases for unions. As the RFC puts it: "A native union mechanism would also simplify Rust implementations of space-efficient or cache-efficient structures relying on value representation, such as machine-word-sized unions using the least-significant bits of aligned pointers to distinguish cases."

(on topic, I basically agree with the (consensus?) view that the status quo of "there is other code out there, but it can only mess with memory that I explicitly give it references to" should remain, and the optimizations suggested so far can and should be enabled in other, more robust ways)

@gnzlbg
Contributor Author

gnzlbg commented Nov 6, 2019

@RalfJung

So should we stop supporting FFI with languages that do not make such requirements?

I really have no idea what you are talking about. Feels like you are building up a huge strawman that's kind of off-topic for this thread (What does FFI have to do with what Rust programs are allowed to do before main starts?), and kind of debunking it yourself.

I am honestly quite flabbergasted that you out of all people suggest to severely restrict the interoperability of Rust with other languages like that.

If that is what I did, that makes two of us. But I don't recall talking about this anywhere in this thread.

Having a compiler-understood notion of "write-once variables" would be very useful here.

👍

@bjorn3
Member

bjorn3 commented Dec 15, 2023

The target specification can specify that the target doesn't support threads: https://github.com/rust-lang/rust/blob/4d1bd0db7f489b22c6d8aa2385937a95412c015b/compiler/rustc_target/src/spec/mod.rs#L2090 This will cause LLVM to replace all atomics with non-atomic loads and stores and to assume that global variables don't get changed from another thread. The eBPF, UEFI and msp430 targets enable this option. And for wasm there is a special case that enables it when the atomics feature is disabled: https://github.com/rust-lang/rust/blob/4d1bd0db7f489b22c6d8aa2385937a95412c015b/compiler/rustc_codegen_llvm/src/back/write.rs#L205-L207

@RalfJung
Member

Interesting.

But I guess even on those targets we should assume that there might have been "something" that happens before main, so e.g. every mutable static cannot be assumed to still have its initial value.

@Lokathor
Contributor

Something happening before main is exactly what gives those statics their initial value on many platforms; otherwise they'd just be blank.

@RalfJung
Member

That's different, that's just a detail of how we set up the real machine to match the AM state.

This is about other stuff happening, things rustc didn't ask the platform to do.

@gonzalobg

The target specification can specify that the target doesn't support threads: https://github.com/rust-lang/rust/blob/4d1bd0db7f489b22c6d8aa2385937a95412c015b/compiler/rustc_target/src/spec/mod.rs#L2090 This will cause LLVM to replace all atomics with non-atomic loads and stores and to assume that global variables don't get changed from another thread. The eBPF, UEFI and msp430 targets enable this option. And for wasm there is a special case that enables it when the atomics feature is disabled: https://github.com/rust-lang/rust/blob/4d1bd0db7f489b22c6d8aa2385937a95412c015b/compiler/rustc_codegen_llvm/src/back/write.rs#L205-L207

@bjorn3 I believe that if LLVM's ThreadModel::Single is defined as "this program performs no inter-thread synchronization", then the optimizations it currently performs wouldn't be incompatible with the unknown threads of execution being discussed here. It would however mean that a program compiled with ThreadModel::Single would be unable not only to spawn threads and synchronize with them, but also to synchronize with other threads outside the program, e.g., in a different process. To me that sounds "fair". Unfortunately, I could not find any docs whatsoever about ThreadModel::Single, and there are other thorny questions, e.g., what about signal handlers. Those might not be super relevant for WASM, but would be relevant for embedded "single-threaded" targets that might only have one thread, but might still have signal handlers.


BTW, reading my past self's comments on this issue (5 years ago!) makes me ashamed. As I've expressed on Zulip since, I think the whole "unknown threads of execution beyond main" idea is a reasonable way to reduce the number of use cases for volatile, and it provides a both simpler and more performant programming model for inter-process communication. With this proposal, message passing across processes or across threads of the same process has the same programming model and exhibits in practice the same performance, and this comes only at the cost of preventing "theoretical" optimizations that are impossible in practice for any real-world application. I do not believe it gets any better than this.
