[wasm-mt] Sampling thread requests resume for a thread with !(suspend_count > 0) #72857
Tagging subscribers to 'arch-wasm': @lewing
OK, this is actually due to the thread suspend changes in 6726fae. In that PR, we added logic to run a second phase of STW on full coop suspend. The problem is that when hybrid suspend (where two-phase STW was developed) was designed, the assumptions were different.
In contrast, when we added the second suspend phase for full coop on WASM: when we decide to defer the main thread and then run the second phase, we again run the "cordial" suspend, which again requests suspension of every thread. For self-suspended threads (i.e. well-behaved coop threads) this ends up incrementing their suspend count a second time. Then when we run "resume" it only decrements the suspend count once, so the count drifts upward on every STW.
I'm not sure what the solution is yet. I'm pretty certain this is not a bug in the hybrid-suspend idea of two-phase suspend; that is, we haven't been sitting on a bug for 4 years since 2f6379d. This is purely a bug in the recent code that attempted to add a second phase to full coop on WASM.
Not sure what I was thinking in 6726fae. Coop suspend is really
Each call to begin_suspend_request_suspension_cordially may increment the suspend count, but in STW we only resume each thread once.

In 6726fae we added a second phase of STW to full coop on WebAssembly in order to suspend the browser thread after all the worker threads have been suspended, to avoid deadlocks that rely on the main thread continuing to process async work on behalf of the workers before they reach a safepoint.

The problem is that for worker threads we could end up calling begin_suspend_request_suspension_cordially twice. If the thread self-suspends after the first call, the second call will increment the suspend count. As a result, when we restart the world, the thread will decrement its suspend count but still stay suspended. Worse, on the _next_ STW we will increment the suspend count two more times and decrement it once on the next restart, and so on. Eventually the thread will overflow the suspend counter and we will assert `!(suspend_count > 0)`.

Also change `THREAD_SUSPEND_COUNT_MAX` to `0x7F` (from `0xFF`): the suspend count is signed, so the roll-over from 127 to -128 is where we should assert.

Fixes dotnet#72857
…r; fix merge (#73305)

Grab bag of threading fixes:

1. Remove the coop two-phase transition (partially revert 6726fae). This was based on a misunderstanding of how Emscripten works: when the main thread is blocked in a concurrency primitive like `sem_wait`, it is still processing queued calls from other threads, so there is no need to first suspend the worker threads and then suspend the main thread. The implementation of two-phase suspend had a bug where it would suspend worker threads twice, making the suspend count increase by 2. Since resume only decremented the count by 1, this led to a suspend count overflow. Fixes #72857
2. Once the diagnostic server attaches to the runtime, switch it to GC Safe mode when it returns to JavaScript. That is, while the diagnostic server is reacting to messages in the JS event loop, it is considered suspended by the runtime. When it calls into C, switch to GC Unsafe (which may block if there's a STW happening). Add thread state transitions when we come back to C, and when we wait.
3. Mark the wasm diagnostic server thread as "no sample; no gc", which means we don't consider it for STW when there's a GC or a sample profiler active. This is how we treat utility threads (including the non-wasm diagnostic server thread) on other platforms.
4. Fix a bad signature for `cwraps.mono_wasm_event_pipe_enable` due to a mistake in a previous merge.
5. Add a new `browser-threads` sample.

---

* [coop] Don't call begin_suspend_request_suspension_cordially twice
* improve thread state machine assertion messages: include thread states and thread ids where available
* fix typo
* Revert "[coop] Don't call begin_suspend_request_suspension_cordially twice" (reverts commit 92f52ab7ed1cfaa1a4f66e869a8d9404e066f1b2)
* [threads] Revert coop two-phase STW: remove mono_threads_platform_stw_defer_initial_suspend. The motivation for it in 6726fae was unfounded. There is no need to suspend the main browser thread after the other threads: suspension on wasm uses `sem_wait`, which on Emscripten on the main thread is implemented using a busy wait (`__timedwait_cp`) that processes queued calls. So even if we suspend the main thread first, it will still allow other threads in GC Safe to make progress if they're using syscalls.
* Switch the diagnostic server to GC Safe when returning to JS; set NO_GC flag. The diagnostic server worker spends most of its time in the JS event loop waiting for messages. After we attach to the runtime, we need to switch to GC Safe mode because the diagnostic server may not ever reach a safepoint (for example, if no more DS events arrive). Conversely, when we call from JS into the C diagnostic server, we need to enter GC Unsafe mode (and potentially safepoint). Also mark the diagnostic server thread with the NO_GC flag: this thread does not manipulate managed objects, so it doesn't need to stop for GC STW.
* cwraps: fix bad signature for mono_wasm_event_pipe_enable (mistake from a previous merge)
* Add new browser-threads sample
* exclude the browser-threads sample unless wasm threads are enabled
* Update browser-threads sample
* Update src/mono/mono/component/diagnostics_server.c
* Update src/mono/mono/utils/mono-threads-state-machine.c

Co-authored-by: Katelyn Gadd <kg@luminance.org>
Seen here: #72275 (comment)

Repro (probably): build the runtime with `/p:WasmEnableThreads=true`; build and run the `browser-mt-eventpipe` sample.