Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undefined behavior from poll_fn + futures::join! / std::future::join! #3780

Closed
jplatte opened this issue Jul 31, 2024 · 15 comments
Closed

Undefined behavior from poll_fn + futures::join! / std::future::join! #3780

jplatte opened this issue Jul 31, 2024 · 15 comments

Comments

@jplatte
Copy link
Contributor

jplatte commented Jul 31, 2024

Recently, when working on a library for observables, I found new miri test failures (I had been getting them on a specific test before, but didn't investigate after ensuring it wasn't my own UB). This lead to a bug report to tokio where Alice Ryhl showed that it didn't actually require tokio::spawn (which I had failed to get rid of in minimizing the issue), and the same issue was reproducible with join! + futures_executor.

I have now reduced the example much further, to this (works with the nightly std::future::join!, or futures::join!; playground):

#![feature(future_join)]

use std::{
    future::{join, poll_fn, Future},
    pin::pin,
    sync::Arc,
    task::{Context, Poll, Wake},
};

fn main() {
    let mut fut = pin!(async {
        let mut thing = "hello".to_owned();
        let fut1 = async move {
            poll_fn(|_| {
                thing += ", world";
                Poll::<()>::Pending
            })
            .await;
        };

        let fut2 = async {};

        join!(fut1, fut2).await;
    });

    let waker = Arc::new(DummyWaker).into();
    let mut ctx = Context::from_waker(&waker);
    _ = fut.as_mut().poll(&mut ctx);
    _ = fut.as_mut().poll(&mut ctx);
}

struct DummyWaker;

impl Wake for DummyWaker {
    fn wake(self: Arc<Self>) {}
}

Alice suggested that this may be an instance of rust-lang/rust#63818, is that correct? I was under the impression that the compiler for now doesn't implement alias restrictions for !Unpin types, but maybe this only applies to the LLVM backend, not miri?

@jplatte jplatte changed the title Undefined behavior from futures::join! / std::future::join! Undefined behavior from poll_fn + futures::join! / std::future::join! Jul 31, 2024
@RalfJung
Copy link
Member

RalfJung commented Aug 1, 2024

Miri honors !Unpin, so there must be something more subtle going on here.

Please include the error when reporting a bug. Here it is:

error: Undefined Behavior: trying to retag from <3188> for SharedReadWrite permission at alloc1123[0x30], but that tag does not exist in the borrow stack for this location
  --> src/main.rs:15:17
   |
15 |                 thing += ", world";
   |                 ^^^^^
   |                 |
   |                 trying to retag from <3188> for SharedReadWrite permission at alloc1123[0x30], but that tag does not exist in the borrow stack for this location
   |                 this error occurs as part of two-phase retag at alloc1123[0x30..0x48]
   |
   = help: this indicates a potential bug in the program: it performed an invalid operation, but the Stacked Borrows rules it violated are still experimental
   = help: see https://github.com/rust-lang/unsafe-code-guidelines/blob/master/wip/stacked-borrows.md for further information
help: <3188> was created by a Unique retag at offsets [0x30..0x48]
  --> src/main.rs:14:13
   |
14 | /             poll_fn(|_| {
15 | |                 thing += ", world";
16 | |                 Poll::<()>::Pending
17 | |             })
18 | |             .await;
   | |__________________^
help: <3188> was later invalidated at offsets [0x30..0x38] by a read access
  --> src/main.rs:23:9
   |
23 |         join!(fut1, fut2).await;
   |         ^^^^^^^^^^^^^^^^^
   = note: BACKTRACE (of the first span):
   = note: inside closure at src/main.rs:15:17: 15:22
   = note: inside `<std::future::PollFn<{closure@src/main.rs:14:21: 14:24}> as std::future::Future>::poll` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/poll_fn.rs:151:9: 151:57
note: inside closure
  --> src/main.rs:18:14
   |
18 |             .await;
   |              ^^^^^
   = note: inside `<std::future::join::MaybeDone<{async block@src/main.rs:13:20: 13:30}> as std::future::Future>::poll` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/join.rs:183:38: 183:68
note: inside closure
  --> src/main.rs:23:9
   |
23 |         join!(fut1, fut2).await;
   |         ^^^^^^^^^^^^^^^^^
   = note: inside `<std::future::PollFn<{closure@/playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/join.rs:110:21: 110:30}> as std::future::Future>::poll` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/poll_fn.rs:151:9: 151:57
note: inside closure
  --> src/main.rs:23:9
   |
23 |         join!(fut1, fut2).await;
   |         ^^^^^^^^^^^^^^^^^
note: inside closure
  --> src/main.rs:23:27
   |
23 |         join!(fut1, fut2).await;
   |                           ^^^^^
note: inside `main`
  --> src/main.rs:29:9
   |
29 |     _ = fut.as_mut().poll(&mut ctx);
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   = note: this error originates in the macro `join_internal` which comes from the expansion of the macro `join` (in Nightly builds, run with -Z macro-backtrace for more info)

@Darksonn
Copy link

Darksonn commented Aug 1, 2024

In the original bug report to Tokio, the UB was happening even though tokio::spawn was used instead of join!. So it may not be the join! macro's fault.

@taiki-e
Copy link
Member

taiki-e commented Aug 1, 2024

It seems adding move to poll_fn silences SB violation error: playground

@taiki-e
Copy link
Member

taiki-e commented Aug 1, 2024

adding move to poll_fn silences SB violation error

Or making thing &mut _: playground
(This reminds me of https://internals.rust-lang.org/t/surprising-soundness-trouble-around-pollfn/17484)

Btw, it is also possible to reproduce the error with a function, such as futures::future::join(), rather than a macro: playground

@oli-obk
Copy link
Contributor

oli-obk commented Aug 1, 2024

Minimized:

use std::{
    future::Future,
    pin::Pin,
    sync::Arc,
    task::{Context, Poll, Wake},
};

struct ThingAdder<'a> {
    thing: &'a mut String,
}

impl Future for ThingAdder<'_> {
    type Output = ();

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
        unsafe {
            *self.get_unchecked_mut().thing += ", world";
        }
        Poll::Pending
    }
}

fn main() {
    let mut thing = "hello".to_owned();
    let fut = async move { ThingAdder { thing: &mut thing }.await };

    let mut fut = MaybeDone::Future(fut);
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };

    let waker = Arc::new(DummyWaker).into();
    let mut ctx = Context::from_waker(&waker);
    assert_eq!(fut.as_mut().poll(&mut ctx), Poll::Pending);
    assert_eq!(fut.as_mut().poll(&mut ctx), Poll::Pending);
}

struct DummyWaker;

impl Wake for DummyWaker {
    fn wake(self: Arc<Self>) {}
}

pub enum MaybeDone<F: Future> {
    Future(F),
    Done,
}
impl<F: Future<Output = ()>> Future for MaybeDone<F> {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        unsafe {
            match *self.as_mut().get_unchecked_mut() {
                MaybeDone::Future(ref mut f) => Pin::new_unchecked(f).poll(cx),
                MaybeDone::Done => unreachable!(),
            }
        }
    }
}

Removing the MaybeDone::Done variant will also hide the UB

edit: was able to get rid of poll_fn, too

@RalfJung
Copy link
Member

RalfJung commented Aug 2, 2024

The memory that the conflict occurs on is the fut itself. Do I understand correctly that async move means we are moving thing into the future, and then creating a reference to it? So the future has (at least) two fields, a String and a ThingAdder pointing to that string.

The match *self.as_mut().get_unchecked_mut() will read the discriminant of the MaybeDone, which has been niche-optimized to be stored inside the String. This read overlaps with where the ThingAdder points to, thus invalidating the mutable reference. Finally, the mutable reference is used again, causing UB.

In other words, this is the "mutable reference aliasing" version of why UnsafeCell blocks niches: we also need to block niches for fields that have self-referential references pointing to them. So the fix here is to change async lowering to use UnsafePinned and then have UnsafePinned block niches the same way UnsafeCell blocks niches.

@RalfJung
Copy link
Member

RalfJung commented Aug 2, 2024

Since Miri is behaving as intended here, I'm going to close the issue in favor of rust-lang/rust#63818 and rust-lang/rust#125735. I don't see a way that we can work around this akin to the !Unpin hack; this fundamentally needs the layout of these futures to be different from what it is today.

Thanks for bringing this up, and for the very helpful minimization! In retrospect it is somewhat surprising that it took us so long to discover this, given that we already knew about the exact same issue with shared references and UnsafeCell...

Cc @rust-lang/opsem, just an FYI :)

@jplatte
Copy link
Contributor Author

jplatte commented Aug 2, 2024

In retrospect it is somewhat surprising that it took us so long to discover this

I'm just as surprised! 😄

Given that the &mut method I'm trying to call in the real code that made me look into this is on my own type (not String), is there something I can do hide its niches without using unsafe, as a temporary measure to unblock me from using miri to detect other potential unsoundness?

@RalfJung
Copy link
Member

RalfJung commented Aug 2, 2024

If you only need &mut access, you could wrap the thing the mutable reference points to in an UnsafeCell. 😂 UnsafeCell::get_mut is safe, after all.

@Darksonn
Copy link

Darksonn commented Aug 2, 2024

Is there anything Tokio should be doing to avoid UB with niche-optimized enums? We have the MaybeDone enum:

pin_project! {
    /// A future that may have completed.
    #[derive(Debug)]
    #[project = MaybeDoneProj]
    #[project_replace = MaybeDoneProjReplace]
    pub enum MaybeDone<Fut: Future> {
        /// A not-yet-completed future.
        Future { #[pin] future: Fut },
        /// The output of the completed future.
        Done { output: Fut::Output },
        /// The empty variant after the result of a [`MaybeDone`] has been
        /// taken using the [`take_output`](MaybeDone::take_output) method.
        Gone,
    }
}

@RalfJung
Copy link
Member

RalfJung commented Aug 2, 2024

I don't know if you should be doing anything, since it seems unlikely that this will lead to actual miscompilations. Though maybe @comex is able to trick LLVM into generating bad code here again. ;)

What you could do it mark that enum with a repr, which AFAIK prevents it from being niche-optimized.

The best thing to do would be to find someone who can implement UnsafePinned. :)

hawkw added a commit to hawkw/mycelium that referenced this issue Aug 2, 2024
This reproduces a potential UB where a `task::Cell` gets niche-optimized
in the `Joined` case. This is based on the test added to Tokio in
tokio-rs/tokio#6744
hawkw added a commit to hawkw/mycelium that referenced this issue Aug 3, 2024
hawkw added a commit to hawkw/mycelium that referenced this issue Aug 3, 2024
This reproduces a potential UB where a `task::Cell` gets niche-optimized
in the `Joined` case. This is based on the test added to Tokio in
tokio-rs/tokio#6744
hawkw added a commit to hawkw/mycelium that referenced this issue Aug 3, 2024
See [this comment][1]. Adding `#[repr(C)]` to this type suppresses niche
optimization, resolving potential unsoundness that could occur when the
compiler decides to niche-optimize a self-referential future type. In
practice this probably doesn't actually happen, but we should make sure
it never does.

[1]: rust-lang/miri#3780 (comment)
hawkw added a commit to hawkw/mycelium that referenced this issue Aug 3, 2024
This reproduces a potential UB where a `task::Cell` gets niche-optimized
in the `Joined` case. This is based on the test added to Tokio in
tokio-rs/tokio#6744
hawkw added a commit to hawkw/mycelium that referenced this issue Aug 3, 2024
See [this comment][1]. Adding `#[repr(C)]` to this type suppresses niche
optimization, resolving potential unsoundness that could occur when the
compiler decides to niche-optimize a self-referential future type. In
practice this probably doesn't actually happen, but we should make sure
it never does.

[1]: rust-lang/miri#3780 (comment)
@taiki-e
Copy link
Member

taiki-e commented Aug 20, 2024

What you could do it mark that enum with a repr, which AFAIK prevents it from being niche-optimized.

While this is adequate as a workaround for the specific case here, it does not appear to be a reasonable workaround for this problem as a whole. This is because the same issue can be caused by using Option instead of a custom enum like MaybeDone.

https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=63ab1f219beadfb593aac4ec5e0952d7

@RalfJung
Copy link
Member

Yeah I did not mean to imply that this is a reasonable work-around for this case, I just pointed out the possibility.

I don't think there is a good work-around, this fundamentally needs rustc layout computation to change -- either by special-casing generators, or by implementing rust-lang/rust#125735.

@taiki-e
Copy link
Member

taiki-e commented Aug 20, 2024

I don't think there is a good work-around, this fundamentally needs rustc layout computation to change -- either by special-casing generators, or by implementing rust-lang/rust#125735.

I thought wrapping a coroutine with a MaybeUninit inside a future returned by an async block/async function (like crossbeam-rs/crossbeam#834) as a potential workaround.

In previous async lowing I think it was enough to change the GenFuture field from T to MaybeUninit<T>, but I'm not sure what should be done since such a type is not used in recent async lowing.

@RalfJung
Copy link
Member

Here's a proper work-around: rust-lang/rust#129313. It increases the size of some futures, though...

bors added a commit to rust-lang-ci/rust that referenced this issue Sep 4, 2024
…r-errors

Supress niches in coroutines to avoid aliasing violations

As mentioned [here](rust-lang#63818 (comment)), using niches in fields of coroutines that are referenced by other fields is unsound: the discriminant accesses violate the aliasing requirements of the reference pointing to the relevant field. This issue causes [Miri errors in practice](rust-lang/miri#3780).

The "obvious" fix for this is to suppress niches in coroutines. That's what this PR does. However, we have several tests explicitly ensuring that we *do* use niches in coroutines. So I see two options:
- We guard this behavior behind a `-Z` flag (that Miri will set by default). There is no known case of these aliasing violations causing miscompilations. But absence of evidence is not evidence of absence...
- (What this PR does right now.) We temporarily adjust the coroutine layout logic and the associated tests until the proper fix lands. The "proper fix" here is to wrap fields that other fields can point to in [`UnsafePinned`](rust-lang#125735) and make `UnsafePinned` suppress niches; that would then still permit using niches of *other* fields (those that never get borrowed). However, I know that coroutine sizes are already a problem, so I am not sure if this temporary size regression is acceptable.

`@compiler-errors` any opinion? Also who else should be Cc'd here?
bors added a commit to rust-lang-ci/rust that referenced this issue Sep 6, 2024
…r-errors

Supress niches in coroutines to avoid aliasing violations

As mentioned [here](rust-lang#63818 (comment)), using niches in fields of coroutines that are referenced by other fields is unsound: the discriminant accesses violate the aliasing requirements of the reference pointing to the relevant field. This issue causes [Miri errors in practice](rust-lang/miri#3780).

The "obvious" fix for this is to suppress niches in coroutines. That's what this PR does. However, we have several tests explicitly ensuring that we *do* use niches in coroutines. So I see two options:
- We guard this behavior behind a `-Z` flag (that Miri will set by default). There is no known case of these aliasing violations causing miscompilations. But absence of evidence is not evidence of absence...
- (What this PR does right now.) We temporarily adjust the coroutine layout logic and the associated tests until the proper fix lands. The "proper fix" here is to wrap fields that other fields can point to in [`UnsafePinned`](rust-lang#125735) and make `UnsafePinned` suppress niches; that would then still permit using niches of *other* fields (those that never get borrowed). However, I know that coroutine sizes are already a problem, so I am not sure if this temporary size regression is acceptable.

`@compiler-errors` any opinion? Also who else should be Cc'd here?
bors added a commit to rust-lang-ci/rust that referenced this issue Sep 7, 2024
…r-errors

Supress niches in coroutines to avoid aliasing violations

As mentioned [here](rust-lang#63818 (comment)), using niches in fields of coroutines that are referenced by other fields is unsound: the discriminant accesses violate the aliasing requirements of the reference pointing to the relevant field. This issue causes [Miri errors in practice](rust-lang/miri#3780).

The "obvious" fix for this is to suppress niches in coroutines. That's what this PR does. However, we have several tests explicitly ensuring that we *do* use niches in coroutines. So I see two options:
- We guard this behavior behind a `-Z` flag (that Miri will set by default). There is no known case of these aliasing violations causing miscompilations. But absence of evidence is not evidence of absence...
- (What this PR does right now.) We temporarily adjust the coroutine layout logic and the associated tests until the proper fix lands. The "proper fix" here is to wrap fields that other fields can point to in [`UnsafePinned`](rust-lang#125735) and make `UnsafePinned` suppress niches; that would then still permit using niches of *other* fields (those that never get borrowed). However, I know that coroutine sizes are already a problem, so I am not sure if this temporary size regression is acceptable.

`@compiler-errors` any opinion? Also who else should be Cc'd here?
bors added a commit to rust-lang-ci/rust that referenced this issue Sep 8, 2024
…r-errors

Supress niches in coroutines to avoid aliasing violations

As mentioned [here](rust-lang#63818 (comment)), using niches in fields of coroutines that are referenced by other fields is unsound: the discriminant accesses violate the aliasing requirements of the reference pointing to the relevant field. This issue causes [Miri errors in practice](rust-lang/miri#3780).

The "obvious" fix for this is to suppress niches in coroutines. That's what this PR does. However, we have several tests explicitly ensuring that we *do* use niches in coroutines. So I see two options:
- We guard this behavior behind a `-Z` flag (that Miri will set by default). There is no known case of these aliasing violations causing miscompilations. But absence of evidence is not evidence of absence...
- (What this PR does right now.) We temporarily adjust the coroutine layout logic and the associated tests until the proper fix lands. The "proper fix" here is to wrap fields that other fields can point to in [`UnsafePinned`](rust-lang#125735) and make `UnsafePinned` suppress niches; that would then still permit using niches of *other* fields (those that never get borrowed). However, I know that coroutine sizes are already a problem, so I am not sure if this temporary size regression is acceptable.

`@compiler-errors` any opinion? Also who else should be Cc'd here?
RalfJung pushed a commit to RalfJung/miri that referenced this issue Sep 10, 2024
Supress niches in coroutines to avoid aliasing violations

As mentioned [here](rust-lang/rust#63818 (comment)), using niches in fields of coroutines that are referenced by other fields is unsound: the discriminant accesses violate the aliasing requirements of the reference pointing to the relevant field. This issue causes [Miri errors in practice](rust-lang#3780).

The "obvious" fix for this is to suppress niches in coroutines. That's what this PR does. However, we have several tests explicitly ensuring that we *do* use niches in coroutines. So I see two options:
- We guard this behavior behind a `-Z` flag (that Miri will set by default). There is no known case of these aliasing violations causing miscompilations. But absence of evidence is not evidence of absence...
- (What this PR does right now.) We temporarily adjust the coroutine layout logic and the associated tests until the proper fix lands. The "proper fix" here is to wrap fields that other fields can point to in [`UnsafePinned`](rust-lang/rust#125735) and make `UnsafePinned` suppress niches; that would then still permit using niches of *other* fields (those that never get borrowed). However, I know that coroutine sizes are already a problem, so I am not sure if this temporary size regression is acceptable.

`@compiler-errors` any opinion? Also who else should be Cc'd here?
lnicola pushed a commit to lnicola/rust-analyzer that referenced this issue Sep 25, 2024
Supress niches in coroutines to avoid aliasing violations

As mentioned [here](rust-lang/rust#63818 (comment)), using niches in fields of coroutines that are referenced by other fields is unsound: the discriminant accesses violate the aliasing requirements of the reference pointing to the relevant field. This issue causes [Miri errors in practice](rust-lang/miri#3780).

The "obvious" fix for this is to suppress niches in coroutines. That's what this PR does. However, we have several tests explicitly ensuring that we *do* use niches in coroutines. So I see two options:
- We guard this behavior behind a `-Z` flag (that Miri will set by default). There is no known case of these aliasing violations causing miscompilations. But absence of evidence is not evidence of absence...
- (What this PR does right now.) We temporarily adjust the coroutine layout logic and the associated tests until the proper fix lands. The "proper fix" here is to wrap fields that other fields can point to in [`UnsafePinned`](rust-lang/rust#125735) and make `UnsafePinned` suppress niches; that would then still permit using niches of *other* fields (those that never get borrowed). However, I know that coroutine sizes are already a problem, so I am not sure if this temporary size regression is acceptable.

`@compiler-errors` any opinion? Also who else should be Cc'd here?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants