[wasm-mt] Sampling thread requests resume for a thread with !(suspend_count > 0) #72857
Tagging subscribers to 'arch-wasm': @lewing
OK, this is actually due to the thread suspend changes in 6726fae. In that PR, we added logic to run a second phase of STW on full coop suspend. The problem is that when hybrid suspend (where two-phase STW was developed) was designed, the assumptions were different.
In contrast, when we added the second suspend phase for full coop on WASM: when we decide to defer the main thread and then run the second phase, we again run the "cordial" suspend, which again requests suspension of every thread. For self-suspended threads (i.e. well-behaved coop threads) this ends up incrementing their suspend count a second time. Then when we run "resume" it only decrements the suspend count once, so the count drifts upward on every STW.
I'm not sure what the solution is yet. I'm pretty certain this is not a bug in the hybrid-suspend idea of two-phase suspend; that is, we haven't been sitting on a bug for 4 years since 2f6379d. This is purely a bug in the recent code that attempted to add a second phase to full coop on WASM.
Not sure what I was thinking in 6726fae. Coop suspend is really
Each call to begin_suspend_request_suspension_cordially may increment the suspend count, but in STW we only resume each thread once.

In 6726fae we added a second phase of STW to full coop on WebAssembly in order to suspend the browser thread after all the worker threads have been suspended, to avoid deadlocks that rely on the main thread continuing to process async work on behalf of the workers before they reach a safepoint.

The problem is that for worker threads we could end up calling begin_suspend_request_suspension_cordially twice. If the thread self-suspends after the first call, the second call will increment the suspend count. As a result, when we restart the world, the thread will decrement its suspend count but still stay suspended. Worse, on the _next_ STW we will increment the suspend count two more times and decrement it once on the next restart, and so on. Eventually the thread will overflow the suspend counter and we will assert `!(suspend_count > 0)`.

Also change `THREAD_SUSPEND_COUNT_MAX` to `0x7F` (from `0xFF`): the suspend count is signed, so the roll-over from 127 to -128 is where we should assert.

Fixes dotnet#72857
…r; fix merge (#73305)

Grab bag of threading fixes:

1. Remove the coop two-phase transition (partially revert 6726fae). This was based on a misunderstanding of how Emscripten works: when the main thread is blocked in a concurrency primitive like `sem_wait`, it is still processing queued calls from other threads, so there is no need to first suspend the worker threads and then suspend the main thread. The implementation of two-phase suspend had a bug where it would suspend worker threads twice, making the suspend count increase by 2. Since resume only decremented the count by 1, this led to a suspend count overflow. Fixes #72857
2. Once the diagnostic server attaches to the runtime, switch it to GC Safe mode when it returns to JavaScript. That is, while the diagnostic server is reacting to messages in the JS event loop, it is considered suspended by the runtime. When it calls into C, switch to GC Unsafe (which may block if there's a STW happening). Add thread state transitions when we come back to C, and when we wait.
3. Mark the wasm diagnostic server thread as "no sample; no gc", which means we don't consider it for STW when there's a GC or a sample profiler active. This is how we treat utility threads (including the non-wasm diagnostic server thread) on other platforms.
4. Fix a bad signature for `cwraps.mono_wasm_event_pipe_enable` due to a mistake in a previous merge.
5. Add a new `browser-threads` sample.

---

* [coop] Don't call begin_suspend_request_suspension_cordially twice
* improve thread state machine assertion messages: include thread states and thread ids where available
* fix typo
* Revert "[coop] Don't call begin_suspend_request_suspension_cordially twice" (reverts commit 92f52ab7ed1cfaa1a4f66e869a8d9404e066f1b2)
* [threads] Revert coop two-phase STW: remove mono_threads_platform_stw_defer_initial_suspend. The motivation for it in 6726fae was unfounded. There is no need to suspend the main browser thread after the other threads: suspension on wasm uses `sem_wait`, which on Emscripten on the main thread is implemented using a busy wait (`__timedwait_cp`) that processes queued calls. So even if we suspend the main thread first, it will still allow other threads in GC Safe to make progress if they're using syscalls.
* Switch the diagnostic server to GC Safe when returning to JS; set NO_GC flag. The diagnostic server worker spends most of its time in the JS event loop waiting for messages. After we attach to the runtime, we need to switch to GC Safe mode because the diagnostic server may not ever reach a safepoint (for example, if no more DS events arrive). Conversely, when we call from JS into the C diagnostic server, we need to enter GC Unsafe mode (and potentially safepoint). Also mark the diagnostic server thread with the NO_GC flag: this thread does not manipulate managed objects, so it doesn't need to stop for GC STW.
* cwraps: fix bad signature for mono_wasm_event_pipe_enable (mistake from a previous merge)
* Add new browser-threads sample
* exclude the browser-threads sample unless wasm threads are enabled
* Update browser-threads sample
* Update src/mono/mono/component/diagnostics_server.c
* Update src/mono/mono/utils/mono-threads-state-machine.c

Co-authored-by: Katelyn Gadd <kg@luminance.org>
Seen here: #72275 (comment)

Repro (probably): build the runtime with `/p:WasmEnableThreads=true`; build and run the `browser-mt-eventpipe` sample.