Fix a soundness issue with the component model and async #6509

Merged · 5 commits · Jun 2, 2023

Conversation

alexcrichton (Member)

This commit addresses #6493 by fixing a soundness issue with the async
implementation of the component model. This issue has been present since
async support was first added to the component model and doesn't
represent a recent regression. The underlying problem is that one of the
base assumptions of the trap handling code is that there's only a single
activation in TLS that needs to be pushed/popped when a stack is switched
(e.g. a fiber is switched to or from). In the case of the component model
there might be two activations: one for an invocation of a component
function and then a second for an invocation of a `realloc` function to
return results back to wasm (e.g. in the case an imported function
returns a list).
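
To make that scenario concrete, here is a sketch of an embedding where the
double-activation shape arises. The component and its `fetch` import /
`run` export are hypothetical, and the wiring is an approximation of
wasmtime's component `Linker` API rather than code from this PR; treat
signatures as illustrative:

```rust
use wasmtime::component::{Component, Linker};
use wasmtime::{Engine, Result, Store};

// Hypothetical WIT for the component being embedded:
//
//   import fetch: func() -> list<u8>;
//   export run: func();
//
// Assumes `engine` was built with `Config::async_support(true)`.
async fn demo(engine: &Engine, component: &Component) -> Result<()> {
    let mut linker = Linker::new(engine);

    // Host import returning a list: lowering the result back into the
    // guest calls the guest's `cabi_realloc` export, pushing a second
    // activation while the first is still live.
    linker.root().func_wrap_async("fetch", |_store, (): ()| {
        Box::new(async { Ok((vec![1u8, 2, 3],)) })
    })?;

    let mut store = Store::new(engine, ());
    let instance = linker.instantiate_async(&mut store, component).await?;
    let run = instance.get_typed_func::<(), ()>(&mut store, "run")?;

    // Activation #1 is pushed for `run`; activation #2 appears when the
    // `fetch` result is lowered via `cabi_realloc` on the same fiber.
    // A fiber switch at that moment is the case the old single-activation
    // bookkeeping mishandled.
    run.call_async(&mut store, ()).await?;
    run.post_return_async(&mut store).await?;
    Ok(())
}
```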

This problem is fixed by changing how TLS is managed in the presence of
fibers. Previously, when a fiber was suspended, it would pop a single
activation from the top of the stack and save it to be pushed when the
fiber was resumed. This had the benefit of maintaining the entire linked
list of activations for the current thread, but ran into the problem
above: it doesn't handle a fiber with multiple activations on it. Now,
instead of pushing/popping a single activation, the entire linked list of
activations is tracked for a particular fiber and stored within the fiber
itself. In this manner resuming a fiber will push all of its activations
onto the current thread, and suspending a fiber will pop all activations
for the fiber (and store them as a new linked list in the fiber's state
itself).
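
To illustrate the shape of that bookkeeping, here is a minimal,
self-contained model of the scheme. The `Activation`/`FiberState` types
and the splicing logic are illustrative, not Wasmtime's actual internals:

```rust
use std::cell::Cell;
use std::ptr;

// One entry in the thread's linked list of activations; `prev` points at
// the activation below this one on the stack.
struct Activation {
    prev: *mut Activation,
}

thread_local! {
    // Head of the current thread's activation list (the TLS state).
    static TOP: Cell<*mut Activation> = Cell::new(ptr::null_mut());
}

// A suspended fiber owns the *entire* chain of activations that were live
// on it, not just the topmost one.
struct FiberState {
    saved: *mut Activation,     // head of the fiber's saved chain
    prior_top: *mut Activation, // thread's head when the fiber was resumed
}

// Resuming pushes all of the fiber's activations onto the thread by
// splicing its chain in front of the thread's current list. Keeping the
// lists connected matters: walking from TOP must still visit every live
// activation (e.g. for GC).
fn resume(fiber: &mut FiberState) {
    TOP.with(|top| {
        let old_top = top.get();
        if !fiber.saved.is_null() {
            unsafe {
                // Find the bottom of the fiber's chain and link it to the
                // thread's existing head.
                let mut tail = fiber.saved;
                while !(*tail).prev.is_null() {
                    tail = (*tail).prev;
                }
                (*tail).prev = old_top;
            }
            top.set(fiber.saved);
        }
        fiber.prior_top = old_top;
        fiber.saved = ptr::null_mut();
    });
}

// Suspending pops everything above the remembered `prior_top` and stores
// that chain back in the fiber, restoring the thread's previous head.
fn suspend(fiber: &mut FiberState) {
    TOP.with(|top| {
        let head = top.replace(fiber.prior_top);
        fiber.saved = if head == fiber.prior_top {
            ptr::null_mut() // the fiber had no live activations
        } else {
            unsafe {
                // Detach the fiber's chain where it was spliced in.
                let mut tail = head;
                while (*tail).prev != fiber.prior_top {
                    tail = (*tail).prev;
                }
                (*tail).prev = ptr::null_mut();
            }
            head
        };
        fiber.prior_top = ptr::null_mut();
    });
}
```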

The end result is that all activations on a fiber should now be managed
correctly, regardless of how many there are. The main downside of this
commit is that fiber suspension and resumption are more complicated, but
the hope is that fiber suspension typically corresponds with I/O not
being ready or similar, so the cost of the extra TLS operations isn't
significant compared to the I/O overhead.

Closes bytecodealliance#6493

Currently the test helper uses tokio's `spawn_blocking`, but that reuses
threads in its thread pool. Instead, spawn a dedicated thread and perform
a single poll on that thread to force lots of fresh threads to be used
and ideally stress TLS management further (this is the
`execute_across_threads` change shown in the review diff below).
This commit adds a guard to Wasmtime's async support to double-check
that, when a call to `poll` is finished, the currently active TLS
activation pointer does not point into the stack that is being switched
off of. This is a defense-in-depth measure to prevent stale pointers from
sticking around in TLS. That staleness is what causes
bytecodealliance#6493: it can result in unsoundness, though it currently
manifests as a crash.
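
The shape of that guard might look something like the following. This is
a hedged sketch of the idea, not the actual code added in this PR:

```rust
/// Illustrative check: after `poll` returns and we are about to switch
/// off a fiber, the thread's current TLS activation pointer must not lie
/// within that fiber's stack. If it does, a stale activation was left in
/// TLS (the bug behind bytecodealliance#6493).
fn assert_tls_off_fiber_stack(
    tls_activation: *const u8,
    stack_base: *const u8,
    stack_len: usize,
) {
    let start = stack_base as usize;
    let end = start + stack_len;
    let addr = tls_activation as usize;
    debug_assert!(
        addr < start || addr >= end,
        "TLS activation still points into the fiber stack being switched off",
    );
}
```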
@alexcrichton alexcrichton requested a review from fitzgen June 2, 2023 16:03
@alexcrichton alexcrichton requested a review from a team as a code owner June 2, 2023 16:03
@alexcrichton (Member Author)

I'll note that I had a number of false starts in attempting to solve this issue. First I thought that this only needed a push/pop around calls to `cabi_realloc` in the component model, but that wasn't correct because that only works for import-based calls to realloc, not export-based calls. I then wrote a test to expose the flaw with that strategy.

Next I thought that it would be possible to not actually maintain the entire linked list of activations and instead have a fiber have its own linked list which replaces the current thread's linked list when it resumes and swaps it back out when it suspends. This however does not work because a complete list of activations is required to correctly perform GC, so I wrote a test to expose that bug.

Finally I settled on this solution, which is at least sufficient for all the above scenarios. I'm not 100% sure it's the end-all-be-all, so extra care in review would be appreciated!
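
The GC constraint above is the key point: stack-root scanning walks the
thread's activation list from the top, so any activation living only in a
fiber's private, swapped-out list would be invisible to it. Sketching
that walk with the illustrative `Activation` type from the model earlier
in this conversation:

```rust
// Illustrative: GC must visit every live activation, so walking from the
// thread's TOP only finds them all if resuming a fiber splices its chain
// into the thread's list rather than replacing the list wholesale.
unsafe fn scan_stack_roots(mut act: *const Activation) {
    while !act.is_null() {
        // ... trace the wasm frames belonging to this activation ...
        act = (*act).prev;
    }
}
```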

@github-actions bot added the `wasmtime:api` label (Related to the API of the `wasmtime` crate itself) Jun 2, 2023

github-actions bot commented Jun 2, 2023

Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

  • peterhuene: wasmtime:api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.


@fitzgen fitzgen (Member) left a comment

Mostly LGTM, but I have an idea around making this slightly stronger-typed, and also some naming nitpicks (take these with a grain of salt; I'm not 100% sure why I hate "TLS" here so much, maybe you can think of something better though, since I am woefully lacking in terms of actual alternatives).

Comment on lines +415 to 434
```diff
 /// This will execute the `future` provided to completion and each invocation of
 /// `poll` for the future will be executed on a separate thread.
 pub async fn execute_across_threads<F>(future: F) -> F::Output
 where
     F: Future + Send + 'static,
     F::Output: Send,
 {
-    let future = PollOnce::new(Box::pin(future)).await;
-
-    tokio::task::spawn_blocking(move || tokio::runtime::Handle::current().block_on(future))
-        .await
-        .expect("shouldn't panic")
+    let mut future = Box::pin(future);
+    loop {
+        let once = PollOnce::new(future);
+        let handle = tokio::runtime::Handle::current();
+        let result = std::thread::spawn(move || handle.block_on(once))
+            .join()
+            .unwrap();
+        match result {
+            Ok(val) => break val,
+            Err(f) => future = f,
+        }
+    }
 }
```
Very nice
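
For context, `PollOnce` isn't shown in the diff; the way the loop uses it
implies an adapter that polls the inner future exactly once and either
yields its output or hands the future back for another round on a fresh
thread. A minimal sketch consistent with that usage (not necessarily the
test's exact definition):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Polls the wrapped future exactly once: resolves to `Ok(output)` if it
// completed, or `Err(future)` to give the future back so it can be
// polled again elsewhere.
struct PollOnce<F>(Option<Pin<Box<F>>>);

impl<F> PollOnce<F> {
    fn new(future: Pin<Box<F>>) -> PollOnce<F> {
        PollOnce(Some(future))
    }
}

impl<F: Future> Future for PollOnce<F> {
    type Output = Result<F::Output, Pin<Box<F>>>;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // `PollOnce` only holds a `Pin<Box<F>>`, which is `Unpin`.
        let this = self.get_mut();
        let mut future = this.0.take().expect("polled after completion");
        match future.as_mut().poll(cx) {
            Poll::Ready(val) => Poll::Ready(Ok(val)),
            Poll::Pending => Poll::Ready(Err(future)),
        }
    }
}
```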

Outdated review threads (resolved) on: crates/runtime/src/traphandlers.rs, crates/wasmtime/src/store.rs
@alexcrichton (Member Author)

I definitely agree with the separation of types and re-wording of things, done now 👍

@fitzgen fitzgen (Member) left a comment

Thanks! LGTM!

@fitzgen fitzgen added this pull request to the merge queue Jun 2, 2023
Merged via the queue into bytecodealliance:main with commit 550a16f Jun 2, 2023
@alexcrichton alexcrichton deleted the component-async-bug branch June 12, 2023 18:21