-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Access Violation in .NET application (with wasmtime-dotnet
) when using current Wasmtime build
#5340
Comments
Wasmtime.Dotnet
) when using current Wasmtime buildwasmtime-dotnet
) when using current Wasmtime build
Thanks for this! I've read over the source and I don't have a Windows machine on-hand to reproduce with but I've been trying to debug by correlating what's happening in the compiled artifact using the addresses from the backtrace you have. Unfortunately though the image I'm compiling locally seems to have functions at different offsets so I can't get things to line up. Reading over the code though it looks like this has threading involved, possible GC interactions with .NET, and additionally the reproduction you mention is nondeterministic. @peterhuene would you be able to take a look at this? I've encountered similar-ish issues with the Go bindings historically and it was an often an error in the bindings and how the bindings interacted with the GC so I'm curious if anything jumps out at you here. |
Hi @alexcrichton thanks for your reply! That's a good point about the GC, which also applies to the repro project. I modified the code of the repro project to ensure the Wasmtime objects ( The access violation still reproduces for me with these changes. I also tried putting break points in the various Thanks! |
@kpreisser thanks for the report and the additional details. I'll investigate this immediately and follow up when I have new information to share. |
The AV is a result of a wasm frame overflowing the stack, where the frame size exceeds the guard page size (thus it tries to write to unmapped memory). It is passing the overflow check in the prologue. I haven't looked too closely at the cranelift issue, but I suspect it's related (digging into it now). |
Oh! I noticed the stack size increase which now that I think about it I think that may be larger than the default 1MB stack size on Windows? The stack safety in Wasmtime relies on the embedder ensuring there's at least that much wasm stack available so if the new thread doesn't have 2MB+ of wasm stack then I think this could happen. |
It repros with the default wasm stack size too (that is indeed a misconfiguration without also increasing the stack size when creating the thread). |
Ah ok makes sense! This is perhaps a case where Cranelift could help out where we could insert stack probes in addition to the stack check we have today where the probes could guarantee a segfault to deterministically exit. I don't think we'll want to catch this as a stack overflow trap but reporting this with a better error message I think could be useful. |
I stand corrected again. It is still reproing with the default wasm stack size; the AV is inconsistent (probably getting some mapped writable memory beyond the guard page on some runs). I'm digging into why it's not triggering the overflow trap. |
Cranelift has an |
Thank you for the investigation and updates! Regarding the stack size, note that the access violation also occurs if I set the thread stack size to 8 MiB ( Thanks! |
On Windows we need stackprobes anyways to grow the stack. As I understand it on Windows only a small part is committed by default and directly below it is a guard page which if accessed will cause the stack to grow a bit. If you access even a single page below the guard page you will get an AV even though it is still in the part of the memory reserved for the stack. cg_clif also had access violations on windows in some cases which it fixed by enabling inline stack probes. cc https://github.com/bjorn3/rustc_codegen_cranelift/issues/1261#issuecomment-1219307630 |
@bjorn3 thanks for the reminder as I've willfully purged my Windows knowledge these past few years! This is indeed what is happening here; the frame is larger than the commit size in the PE header (4KiB), so it's not triggering a fault on the guard page for Windows to commit more stack pages. We should definitely be forcing a probe on Windows for any frame larger than 4 KiB at a minimum. |
Also, I will note that the default thread stack size for a 64-bit .NET core apphost is 1.5 MiB and at the fault there was still plenty of wasm stack space left (hence why it succeeded the prologue overflow check), it just didn't probe to get Windows to commit more pages. |
Thank you! When I modify the repro to allocate (and immediately free) 6 MiB on the stack in the thread when it is started, e.g. with the following code (note that .NET by default initializes stack memory to zero): private void RunWasmInitializerThread()
{
void AllocStack()
{
Span<byte> bytes = stackalloc byte[6 * 1024 * 1024];
}
AllocStack();
// ... then I'm indeed no longer able to reproduce the access violation. |
FYI, I believe function 311 in the module to be the faulting function as it has thousands of locals (lol: |
Here's a simplified wat that reproduces the problem. It has a function with a frame size of 80192 bytes and that consistently gets
By eliding the host compatibility check in the engine and enabling stack probing:
I think we can confidently say that it's due to a missing probe for these large frame allocations. |
If I understand this correctly, this could be a problem in our application (maybe also Linux) since we don't know how many remaining space is available in the stack when a wasm function is called (because we might call them from events (callbacks) raised e.g. by external plugins or user script code, where an arbitrary number of frames may already be on the stack), so we can't ensure that there is at least the specified max wasm stack size available in the current thread's stack. Does this mean this would generally require stack probing to be enabled in Cranelift also on non-Windows OSes to be on the safe side (to ensure large wasm frames don't accidentally access memory that doesn't belong to the current thread's stack)? |
The bug here for Wasmtime is that on Windows we need to always enable stack probes since it's part of how the system works and we can't escape from that. Stack probes are optional for other hosts but can be useful for detecting situations where not enough native stack was allocated by the host. For designing an embedding what you'll generally want to do is determine some watermark for how much stack host code will consume (both before wasm and as part of a host call called by wasm) and how much stack wasm can consume. You'll then want to make sure that threads executing wasm have access to the sum of those two numbers. If you get the numbers wrong then the application should be guaranteed to safely abort with stack probes enabled, but without stack probes wasm runs a risk of jumping over the guard page. |
👍
Ok, thank you! However, as described we don't really have a way to always guarantee the host doesn't use more than a specific amount of stack, because we don't have complete control over the host code/stack. I gather this means that we will need to enable stack probing also on other OSes in our application, to ensure wasm frames don't have a risk of jumping over the guard page, so that the application is safely aborted in such a case. |
Not yet, no, and you can't do that today due to the issue @jameysharp mentioned above. @peterhuene were you going to work on a patch to enable that? Personally I think it's reasonable to unconditionally enable this for all supported platforms since stack probes shouldn't have much of a perf impact anyway. This is only implemented for x86_64 though and other platforms don't have inline stack probes implemented yet and outlined probing support would need more changes in Wasmtime we may not yet be ready for. |
Yes, I'll create a patch to enable it unconditionally (for x86_64). |
This commit unconditionally enables stack probing for x86_64 targets. On Windows, stack probing is always required because of the way Windows commits stack pages (via guard page access). Fixes bytecodealliance#5340.
I've put #5350 up and verified that it addresses the crash you're seeing with your module. |
Thanks a lot for fixing this! Regarding the stack probing, for us it would be nice to also have inline probing for AArch64 (#4846). |
* wasmtime: enable stack probing for x86_64 targets. This commit unconditionally enables stack probing for x86_64 targets. On Windows, stack probing is always required because of the way Windows commits stack pages (via guard page access). Fixes #5340. * Remove SIMD types from test case.
Indeed! I've opened #5353 to add support for aarch64 |
Hi,
we are using
Wasmtime
viawasmtime-dotnet
in a .NET 6.0 application, where we load a 8 MB WASM file and execute its_start
function. However, this causes an access violation on Windows inwasmtime_func_call
when using Wasmtime from commit ff5abfd.The crash started to happen toegher with issue #5328 (with commit fc62d4a), which is why I also reported it there, but it could be that this was just a trigger and the actual root case is a different one. It could also be that once #5328 is fixed, the access violation is no longer reproducible.
I can reproduce the access violation with the following steps on Windows 10 x64 Build 19044, and on Windows Server 2019:
x86_64-pc-windows-msvc
) as release build:cargo build --release --manifest-path crates/c-api/Cargo.toml
repro-wasmtime-access-violation
, which contains a commit that adds the repro project on top of the currentwasmtime-dotnet
code.(The code contains a lot of fields/variables and some methods that are not used, but that was as best as I could get it while still reproducing the crash.)
cd
into the repo'sReproWasmtimeAccessViolation
folder, then rundotnet build ReproWasmtimeAccessViolation.csproj
.cd bin\Debug\net6.0
, then copywasmtime.dll
andwasmtime.pdb
from the build result into this directory.dotnet-codabix-wasm-build.wasm
also into this directory.ReproWasmtimeAccessViolation.exe
.After some seconds, the program will exit if the access violation occured (and in Windows Error Reporting, there will be an error entry in the Application log ("Application Error").
If it didn't occur, the program will print
"wasmtime_func_call completed! This should not be displayed if the Access Violation occured."
, and you can useCtrl
+C
to exit.Unfortunately, the crash doesn't alway occur, it might also depend on the hardware or other factors. You may have to try a few times if the crash doesn't occur, or try again at a later time.
When attaching Visual Studio 2022 as debugger (using "native code" mode), it will show the following:
Call Stack:
Additional notes:
Environments where I can reproduce the crash:
Thank you!
The text was updated successfully, but these errors were encountered: