Can we relax Atomics.wait to allow waiting on the main thread? #177

Open
juj opened this issue Apr 13, 2021 · 23 comments

@juj

juj commented Apr 13, 2021

In multithreaded applications, there is the .wait() primitive that allows one to synchronously wait on Worker threads. Main thread is disallowed from .wait()ing, since that is a blocking operation. The effect of this is that main thread is not able to participate in shared memory state like Workers are. In order to remedy this, there is the .waitAsync() primitive that is intended to fill that gap for the main thread.

In summary, there are four ways to interact with a lock or a CAS variable:

  1. try-lock and abort (i.e. poll CAS once or a few times, and do something else/yield if not successful, try again later)
  2. infinite try-lock (busy spin until successful)
  3. Atomics.wait (sleep the calling thread until next CAS attempt can be made)
  4. Atomics.waitAsync() (enqueue an event for when the next CAS attempt can be made)
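As a rough illustration of options 1) and 2), here is a minimal Rust sketch of a one-word lock driven by compare-and-swap. The names `SpinLock`, `try_lock` and `lock_spin` are made up for this example and do not come from Emscripten or any other toolchain. Options 3) and 4) need a futex-style wait primitive and are left out.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};

/// Illustrative one-word lock: `false` means free, `true` means held.
pub struct SpinLock {
    held: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { held: AtomicBool::new(false) }
    }

    /// Option 1: poll the CAS a bounded number of times, then give up
    /// so the caller can do something else and retry later.
    pub fn try_lock(&self, attempts: u32) -> bool {
        for _ in 0..attempts {
            if self
                .held
                .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                return true;
            }
            hint::spin_loop();
        }
        false
    }

    /// Option 2: busy-spin until the CAS succeeds. This burns CPU for
    /// as long as another thread holds the lock.
    pub fn lock_spin(&self) {
        while self
            .held
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            hint::spin_loop();
        }
    }

    /// Release the lock so the next CAS attempt can succeed.
    pub fn unlock(&self) {
        self.held.store(false, Ordering::Release);
    }
}
```

With zero contention both paths succeed on the first CAS, which is the common case described in this issue. The cost of `lock_spin` only shows up when another thread actually holds the lock.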

Emscripten currently implements the pthread API for multithreading. Unfortunately, that API does not lend itself to using 4) .waitAsync() above, but it can express 1)-3). The main thread is prevented from doing 3), so it is left with 1) and 2). In many applications' problem spaces, option 1) is not meaningful, which leaves option 2) as the only way to proceed.

In order to fix the issue that pthreads does not allow one to express .waitAsync()s, we have been looking at both extending the pthreads API with a new pthread_mutex_lock_async() function family and creating a new "web-style" Wasm Workers multithreading API that would be designed from the ground up with the offered web primitives in mind.

In #176 we are discussing some of the technicalities of Atomics.waitAsync() that have come up that have prevented its adoption.

However, it is looking very clear that even if/when pthread_mutex_lock_async(), Wasm Workers and #176 are resolved, there will still exist a lot of code that cannot meaningfully be recast in an Atomics.waitAsync() pattern, and that code will need to continue to busy-spin on its locks (option 2) above). In most scenarios where the main threads of these applications busy-spin on locks, the contention is zero most of the time (if not practically always), so the lock is practically always obtained very quickly. Or they might have scenarios where there can be a lot of contention, but the contention is expected to be very short-lived (a multithreaded malloc() or filesystem access being prime examples).

So these applications busy-spin, but they must do so with a downside: currently the main thread is prevented from .wait()ing for a lock, no matter how short-lived the expected wait time would be.

This restriction, however well-intentioned as a nudge for developers to write their code to be .waitAsync()-based, seems to be hurting instead: rather than saving performance and responsiveness, programs need to resort to busy-spin-waiting and potentially consume more battery - the opposite of what was intended.

That raises the question for conversation: would it be possible to lift the restriction that Atomics.wait() cannot wait on the main thread?

The wait would be blocking, but the same application hang watchdog timers would apply. I.e. wait for 10 seconds and the slow script dialog would come up.

Or maybe the max wait period on the main thread would be reduced, to e.g. 1 second or 0.5 seconds, or similar (if it helps implementing a slow script watchdog in some browsers?)
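To make the capped-wait idea concrete, here is a small native Rust sketch using `std::sync::Condvar`. It is only an analogy for what a browser could do with Atomics.wait() on the main thread. The `Flag` type and the `cap` parameter are hypothetical, and the cap value is whatever the embedder would choose.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Illustrative flag that one thread sets and another sleeps on.
pub struct Flag {
    ready: Mutex<bool>,
    cv: Condvar,
}

impl Flag {
    pub fn new() -> Self {
        Flag { ready: Mutex::new(false), cv: Condvar::new() }
    }

    /// Mark the flag as set and wake every sleeping waiter.
    pub fn set(&self) {
        *self.ready.lock().unwrap() = true;
        self.cv.notify_all();
    }

    /// Sleep (without spinning) until the flag is set, but never longer
    /// than `cap`, even if the caller requested a longer wait.
    /// Returns `true` if the flag was observed as set.
    pub fn wait_capped(&self, requested: Duration, cap: Duration) -> bool {
        let timeout = requested.min(cap);
        let guard = self.ready.lock().unwrap();
        let (guard, _timed_out) = self
            .cv
            .wait_timeout_while(guard, timeout, |ready| !*ready)
            .unwrap();
        *guard
    }
}
```

A caller that gets `false` back knows the wait hit the cap, which is the point at which a browser could surface a "this site is misbehaving" diagnostic instead of silently spinning.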

What this would win is that applications that do need to busy-spin on the main thread would be able to actually save battery while doing so, instead of consuming more excess cycles.

The worry that enabling wait on the main thread would invite more use of blocking on the main thread does not seem correct. Applications already need to wait on the main thread for some uses - malloc being a prime example - and it could happen either with a proper sleep construct in place, or without.

If there existed support for waiting on the main thread, browser DevTools would actually be able to detect and highlight this spent time specifically, and be able to show in profilers and DevTools timelines where such sleeps took place. Now without such support, those wait times are lost in an application-specific busy loop.

Also if it was possible to wait on the main thread, the browser could be more aware when it is intervening, and the slow script dialog would be able to highlight that this hang is due to a multithreaded programming hang, which would directly hint towards a programming error, and the programmer to look into their shared data structures usage. In the current state since there is no main thread wait support, when these programming errors come up, people may be unaware of which direction to look at first.

To summarize, prohibiting Atomics.wait() on the main thread seems harmful, since in the problem spaces that need it, those sleeps get replaced with busy for(;;) loops instead. We would rather give the main thread a breather, be able to detect and highlight in DevTools where synchronization related waits occur, and improve battery usage.

What do you think?

@RReverser
Member

Is this more or less continuation of #106? If so, might be better to reopen that one since it already had lots of interesting discussion around lifting this restriction.

Anyway, one more option that wasn't described here is emscripten-core/emscripten#9910 (but applied more broadly) - that is, using Asyncify + Atomics.waitAsync to simulate blocking behaviour in a more efficient manner. This way, you'd never be blocking the actual main thread / UI work with either Atomics.wait or a spin-loop, yet you'd be able to preserve backward-compatible behaviour from Wasm code's point of view.

Sure, today Asyncify has code size overhead, but this should go away with future proposals like coroutines or other experiments.

Note: I agree just allowing Atomics.wait, or even duplicating current Chrome's behaviour (#174) to other browsers would make backward compat the easiest, but I'm worried that going down this route opens potential for abuse and will make it impossible for us to forbid blocking on the main thread again in the future when a more efficient alternative exists, while using Asyncify + waitAsync gives a good middle ground.

@juj
Author

juj commented May 8, 2021

> Is this more or less continuation of #106? If so, might be better to reopen that one since it already had lots of interesting discussion around lifting this restriction.

Yes it is, but I do not have the permissions to reopen that issue.

> Anyway, one more option that wasn't described here is emscripten-core/emscripten#9910 (but applied more broadly) - that is, using Asyncify

Unfortunately Asyncify is not a production quality solution. It increases build times, drastically blows up code size, reduces runtime performance, introduces extremely hard to debug issues (re-entrancy, unexpected event execution reordering), and requires making difficult analysis about a codebase to figure out what functions need to be asyncified when automatic analysis does not find it (usually due to runtime dynamic dispatch). It is not feasible for large codebases like e.g. Unity to ship with Asyncify enabled. (I don't intend to be critical towards the development of Asyncify, I've followed Alon's amazing work on the feature very close for several years. It can work great for smaller ported projects, but it unfortunately does not scale to production)

> I'm worried that going down this route opens potential for abuse and will make it impossible for us to forbid blocking on the main thread again in the future when a more efficient alternative exists

The hypothesis behind this argument does not seem convincing. The scenario that we have does not seem like a "I'll sleep-wait on main thread" vs "I'll write 'good' asynchronous code" scenario, but rather a "I'll sleep-wait on main thread" vs "I'll busy-spin on the main thread because nothing else is feasible" scenario.

The wording "potential for abuse" reads like a concern of "people will get lazy and write 'bad practices' synchronous code" - the same rationale that one often reads in general all-sync-is-bad vs async-is-great JS API discussions. That argument somehow gets put forth in these sync-vs-async JS conversations. My claim here is that this is an incorrect hypothesis: the real problem is not a result of the developer's code structure design, or abuse, or laziness, but a seemingly fundamental result of how shared-state multithreaded solutions to common problems need to operate.

In other words, it is dominantly the problem that one is set out to solve that dictates the type of structure that the solution will need to have, and not a general "oh I just happened to design the program structure like this or that"/"I was lazy"/"it was simplest this way".

For most multithreaded shared state based problems, we do not know of any feasible program structures/algorithms that would work for this kind of "every-malloc-is-async", "every-shared-memory-access-is-async" world - and because of async contagion/propagation, it is not just the problem itself that dictates the program structure but also the enclosing application logic. If we do multithreaded async malloc or async Atomics.wait, the result will be akin to having practically every function in the JavaScript program be async. We certainly don't encourage developers to make every single JS function async, and it will not scale well to make every single function a wasm coroutine either.
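The contagion described above can be sketched in Rust, where `async` propagates up the call chain the same way. Everything here is hypothetical: `alloc_async` stands in for an allocator whose lock can only be awaited, and `make_name` and `spawn_enemy` are made-up callers. The tiny `block_on` exists only so the sketch can be run from synchronous code, and it busy-polls.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical allocator whose internal lock may only be awaited.
pub async fn alloc_async(len: usize) -> Vec<u8> {
    vec![0; len]
}

/// Allocates, so it has to become async as well...
pub async fn make_name(id: u32) -> Vec<u8> {
    let mut buf = alloc_async(8).await;
    buf[0] = id as u8;
    buf
}

/// ...and so does its caller, and that caller's caller, and so on.
pub async fn spawn_enemy(id: u32) -> usize {
    make_name(id).await.len()
}

/// Waker that does nothing; enough for futures that never suspend.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor for the synchronous boundary: polls in a loop.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

Note that the only way to get back to synchronous code in this sketch is `block_on`, which polls in a loop. That is the same busy-wait this issue is trying to avoid, just moved to a different place.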

Someone will need to come up with a new paradigm for multithreaded shared state programming and embedding such programs to bigger programs in order for Atomics.waitAsync() to take off and scale to larger applications. I doubt such a paradigm exists, and we'll see Atomics.waitAsync() only show up in small examples, but in large programs, its applicability tends to zero very quickly.

Nor does this seem to be a case of "well your codebase is a legacy one that you ported to the web instead of ground up wrote for the web": even now with the Wasm Workers API, attempting to design a new codebase from the ground up for async shared state multithreaded execution on the web falls flat very quickly.

In the absence of a sync sleep on the main thread, I foresee one of two results:

a) applications will busy-spin on the main thread to do synchronous locking, wasting CPU power, or
b) applications will push their execution onto a Worker, leaving main thread idle, but then have to fight against major impact caused by several web APIs not being available in Workers.

Emscripten has offered both mechanisms for half a decade now, and the trend we see is that b) is not taking off, largely due to the major PITA of the limitations one needs to adhere to, and hence a) has become the favorite solution.

Hence I would propose the most tenable solution to be

c) applications will Atomics.wait() on the main thread, since they know their data and thread access patterns well enough to conclude that thread contention is low enough not to hurt main thread responsiveness.

This will save battery on those occasions where a busy-spin wait turns into a 100 msec loop because a worker thread just happened to touch the same shared state structure at the same time.

> in the future when a more efficient alternative exists, while using Asyncify + waitAsync gives a good middle ground.

On paper it might read like a good temp bandaid, but like mentioned, I do not see ASYNCIFY scaling at all to production. Also, the more efficient alternative being wasm coroutines - that will not be a solution for this problem either. We cannot make all shared state accessing functions contagiously coroutines either.

@RReverser
Member

> The wording "potential for abuse" reads like a concern of "people will get lazy and write 'bad practices' synchronous code"

It's not - my concern is "people will unknowingly cross-compile code that does that in native apps to the Web and won't even realize that the same pattern of blocking the main thread is much more harmful in the browser". That's what's already happening today in Emscripten using the busy-spin loop, and I very much want to make sure we can remove that in the future to avoid blocking altogether, and a proposal to allow synchronous Atomics.wait on the main thread pushes away from chance of that happening even further.

> Emscripten has offered both mechanisms for half a decade now, and the trend we see

I'd largely explain that by the same reasons described above - many native devs aren't even aware of Web's execution model / limitations, or how their code is compiled to the Web. Emscripten does a great job papering over differences between platforms to make sure things "just work", but we can and should warn developers when their code blocks on the main thread and to promote better patterns instead. Emscripten already prints some warnings here, but we can still do more.

> Also, the more efficient alternative being wasm coroutines

Or general stack-switching, but yeah. There are also some alternatives I'm currently exploring that can work even in today's code - this area is not too explored beyond current Asyncify implementation mainly because some required APIs (like waitAsync) only finally start appearing in browsers while others are in proposal phases, so I wouldn't write off more efficient Asyncify implementation just because it hasn't happened yet. There is work being done that brings us closer in that direction.

@juj
Author

juj commented May 9, 2021

> That's what's already happening today in Emscripten using the busy-spin loop, and I very much want to make sure we can remove that in the future to avoid blocking altogether

That is exactly what I am saying with a) in the previous post. But your argument is circular, you are saying "because people are having the problem today and try to work around it, it proves we should not be fixing the problem."

It is also putting forth that incorrect hypothesis that this would be a mere "programming practices"/"legacy code" issue and that Emscripten is bad since it promotes bad practices instead of encouraging good practices. Like I tried to mention before, there is nothing that makes this look like a programming practices issue, but instead, the issue is that people have these existing multithreaded algorithms that we do not have any model to offer for on the web that would work under "good practices". The good practices do not exist - or the good practice is a mere "don't do that", without an answer of what to do then instead.

Algorithms are timeless. The fact that someone wrote an implementation of a multithreaded algorithm two decades ago does not make it bad or legacy; not being able to compile it to the web just says that web programming wants to prohibit that algorithm from being used on the web. There do not seem to be any good reasons for this.

I invite you to solve the problem of contagiously async access to shared data structures (e.g. a multithreaded malloc) with these yet-to-be-uncovered good practices, in a way that is composable to scale to large applications. None of waitAsync, asyncify, coroutines or even stack switching (if I understand that correctly) are even close to being a good solution.

> a proposal to allow synchronous Atomics.wait on the main thread pushes away from chance of that happening even further.

This line of thinking is conceding that we do not have a solution, and proposing that we should not provide a solution either, but instead keep it a research problem until maybe someone comes up with something magical that likely does not exist.

> There are also some alternatives I'm currently exploring that can work even in today's code

What are these alternatives? Can you be more specific?

> this area is not too explored beyond current Asyncify implementation mainly because some required APIs (like waitAsync) only finally start appearing in browsers

I believe it is explored to the point that we are seeing the faults and limits of waitAsync quite well, and we can see the amount that it limits being able to port access to multithreaded data structures in a way that scales.

> while others are in proposal phases

Can you be more specific?

> so I wouldn't write off more efficient Asyncify implementation just because it hasn't happened yet.

Can you be more specific on what can improve the Asyncify implementation, and what kind of performance increase can be expected?

Or thinking in terms of alternatives, if you have a choice between busyspin on the main thread vs asyncify, and the merits are:

busyspin:

  • build size does not regress,
  • build times do not regress,
  • programming model is well defined and easy to reason with,
  • 95% of the time performance is perfect when there is no contention to your data structure,
  • 4% of the time you take a 10ms wasted CPU cycles hit when there is contention,
  • 1% of the time you take a 100ms wasted CPU cycles hit when there is a lot of contention
    (or similar, developers do know and are able to profile their data structures to come up with these kinds of statistics, e.g. in multithreaded malloc case)
  • All the existing multithreaded programming literature helps one understand contention - it is a very well known and well taught paradigm, and the knowledge is widespread among developers

asyncify:

  • code size regresses by 2x or more, (or spend a lot of time doing difficult program analysis to figure out what your dynamic program flow is, or limit/cripple the ability to do dynamic program flow for programmers)
  • build times regress,
  • runtime performance regresses by 20%-500%, 100% of the time,
  • difficult-to-debug bugs arise from re-entrancy and reordered event handling that do not follow a well defined programming paradigm, requiring expert programmer intervention all over the codebase

It is not hard to see that busy-spinning is the obviously right choice for getting the objectively best-behaving application. No amount of optimizing Asyncify can perform better than a busy-spin here, even if we looked at the CPU overhead alone.
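For a sense of scale, the illustrative busy-spin profile listed above (95% with no waste, 4% with a 10 ms hit, 1% with a 100 ms hit) can be turned into an expected value. The sketch below assumes each percentage applies per lock acquisition, which is one reading of those numbers; the function and constant names are made up for this example.

```rust
/// Expected wasted milliseconds for a profile of
/// (probability, wasted milliseconds) pairs.
pub fn expected_wasted_ms(profile: &[(f64, f64)]) -> f64 {
    profile.iter().map(|&(p, ms)| p * ms).sum()
}

/// The illustrative busy-spin figures from the list above.
pub const BUSY_SPIN_PROFILE: [(f64, f64); 3] = [
    (0.95, 0.0),   // no contention, nothing wasted
    (0.04, 10.0),  // some contention, 10 ms spent spinning
    (0.01, 100.0), // heavy contention, 100 ms spent spinning
];
```

Under that reading the average comes out to 0.04 × 10 + 0.01 × 100 = 1.4 ms of spinning per acquisition, with the cost concentrated in rare events rather than paid on every call.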

> Emscripten does a great job papering over differences between platforms to make sure things "just work", but we can and should warn developers when their code blocks on the main thread and to promote better patterns instead. Emscripten already prints some warnings here, but we can still do more.

Again, trying to state that "well you should write multithreaded code that does not need to block in the first place" is like a fantasy: there is nothing that shows that we can build multithreaded applications to large scale that way. Adding a couple of warnings will not do anything, people will just post on the forums asking "how do I fix this Emscripten warning?", and we don't have anything else to say to them except to "well, rewrite all your multithreaded data structure accesses to a programming paradigm that does not even exist".

> many native devs aren't even aware of Web's execution model / limitations, or how their code is compiled to the Web.

It is not particularly relevant to consider third-party developer lack of education, when not even first-party developers are able to write scalable code under this model.

Also, if you posit that most developers will compile existing code without knowing how it is compiled to the web, then that is actually a great point against your other concern of "a proposal to allow synchronous Atomics.wait on the main thread pushes away from chance of that happening even further", since if the case was that people are just blindly compiling with what Emscripten gives to them, then it should not be a big problem for Emscripten authors to swap it from using a sync Atomics.wait() to something else when/if a better primitive comes along, and everyone will win.

@RReverser
Member

I'm on my phone on a weekend, so I won't be responding to individual questions right now, but the beginning makes me doubt that we're still talking about the same thing.

> there is nothing that makes this look like a programming practices issue, but instead, the issue is that people have these existing multithreaded algorithms

It's literally what I said in my last comment. I'm not sure why you're assuming that I'm talking about either malice or incompetence on programmer side - and then you respond to that argument - when what I've said and clarified again is that the issue has nothing to do with that and is just inherent to mapping native code to the Web execution model.

I've never said that programmers are wrong or that "Emscripten is bad" so those points seem moot for the purposes of the discussion.

@juj
Author

juj commented May 9, 2021

Thank you for clarifying that - especially during a weekend.

@kettle11

kettle11 commented Dec 1, 2023

A few years later this is still a pertinent issue.

In many cases native code is designed to wait extremely briefly on the main thread and in places where that code is ported to Wasm the only 'reasonable' solution is to spin instead of wait.

For example this spin has remained in the Rust standard library for 5+ years now: https://github.com/rust-lang/rust/blob/f45631b10f2199a7ef344a936af9eb60342c16ee/library/std/src/sys/wasm/alloc.rs#L71

The intention of preventing waits on the main thread is to discourage bad behavior but many projects (like Rust's stdlib) will instead spin in a loop, which is worse behavior!

And, at least in the Rust world, many projects will still compile with wait instructions and it only becomes clear when the project is run and crashes, due to a wait on the main thread, that something is wrong. It makes the process of porting multi-threaded Rust code far more involved because a variety of libraries must be audited and reworked.

From my perspective preventing wait on the main thread has accomplished two things:

  • Encouraged use of spinning in a tight loop
  • Significantly slowed the adoption of multi-threaded WebAssembly (via workers)

@tlively
Member

tlively commented Dec 1, 2023

Emscripten has had spinning waits on the main thread for as long as it has supported threads, too.

Unfortunately there's no sign that the Web Platform folks will let us block the main thread any time soon, and if there were a way they could prevent us from busy waiting, I'm sure they would like to do that as well.

Thankfully, a potential better solution is on the horizon. JavaScript Promise Integration (JSPI) will let us use Atomics.waitAsync to give the appearance of a synchronous wait on the main thread while actually returning to the event loop. This will cause new problems with re-entrancy that toolchains will have to solve because it allows other code to run while the C/C++/Rust/Other program expects to be blocking for real, but if we can solve those problems, at least we will be able to stop busy waiting. Here's a JSPI issue about this: WebAssembly/js-promise-integration#20

I'm surprised Rust projects try to execute wait on the main thread, though. In Emscripten wait instructions are only executed by system libraries that know to do something else on the main thread. Could the Rust situation be improved by changing how their system libraries execute wait?

@juj
Author

juj commented Dec 1, 2023

I have tried to argue for a long time that if the main thread was allowed to wait on a futex, then browsers would have an immediate mechanism to detect if the main thread is stalled, since they could easily analyze if the main thread execution is paused inside Atomics.wait.

But now, since that mechanism is not available to the main thread, programs need to implement a for(;;) busy loop in its place, which hides from the browser the semantic knowledge that "the main thread is waiting for another thread".

In the first case browsers could impose a max timeout limit on the waits on the main thread, and force them e.g. to wait max 5 or 10 seconds (the same watchdog timeout as usual - which should be plenty of time to acquire a mutex, especially if Emscripten did their mutex implementations in a way that favored the main thread as a waiter over all other threads), and if they wait for longer, the browser could easily time out the wait and diagnose "this site is misbehaving" to the user and the developer. All this in a way that was power consumption friendly.

But instead, now we are in a situation that sites routinely have CPU-burning for(;;)s in them.

Atomics.waitAsync is infeasible to use at scale (#176); unfortunately JSPI won't be able to help with that.

@kettle11

kettle11 commented Dec 9, 2023

> I'm surprised Rust projects try to execute wait on the main thread, though. In Emscripten wait instructions are only executed by system libraries that know to do something else on the main thread. Could the Rust situation be improved by changing how their system libraries execute wait?

An example I ran into the other day: the Rust pattern of parallel iterators. Basically if your code read-only iterates over a bunch of data the iterations can be split up into tasks and the work distributed across cores. The way it works is the work is split up into 'tasks' that are assigned to cores by a task manager, but in the meantime the main thread blocks.

This works great because the library is trivial to use:

fn sum_of_squares(input: &[i32]) -> i32 {
    input.par_iter() // <-- just change that!
         .map(|&i| i * i)
         .sum()
}

In the above example 'iter' is replaced with 'par_iter' and it's automatically made parallel. If you call this from the main thread it magically seems like your iteration completes X times quicker (based on how many cores you have). Technically under the hood the main thread is waiting, but it actually results in the main thread blocked for less time than it would otherwise.
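To show where the calling thread blocks, here is a standard-library-only sketch of the same computation. This is not how Rayon is implemented (Rayon uses a work-stealing thread pool). `sum_of_squares_parallel` and its `workers` parameter are made up for illustration, and the sketch assumes a native target where `std::thread` is available.

```rust
use std::thread;

/// Split `input` into roughly `workers` chunks, square-and-sum each
/// chunk on its own thread, then combine the partial sums.
pub fn sum_of_squares_parallel(input: &[i32], workers: usize) -> i32 {
    let workers = workers.max(1);
    // Round up so every element lands in a chunk; never use length 0.
    let chunk_len = ((input.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Spawn every worker first so the chunks run concurrently.
        let handles: Vec<_> = input
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().map(|&i| i * i).sum::<i32>()))
            .collect();

        // `join` blocks the calling thread until each worker finishes.
        // This is the wait that is not allowed on the browser main thread.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker panicked"))
            .sum::<i32>()
    })
}
```

The caller is blocked inside `join`, but for large inputs the total wall-clock time is expected to be shorter than running the whole loop serially on the calling thread.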

Rust's prominent game-engine Bevy doesn't use the exact library I linked but they use their own implementation of this idea all over the place to speed up a ton of loops. To avoid ever blocking the main thread the parallel iteration needs to be disabled (the main thread will be blocked more!) or significant architectural changes will be needed.

@tlively
Member

tlively commented Dec 9, 2023

Right, but I'm surprised that some very low level system library in the Rust toolchain doesn't just busy wait if it detects that it's on the main thread to prevent higher-level libraries like the one you're describing from ever observing the trap from waiting on the main thread.

@RReverser
Member

> I'm surprised that some very low level system library in the Rust toolchain doesn't just busy wait if it detects that it's on the main thread

All those libs just use stdlib primitives like std::sync::Mutex and such. As for Rust stdlib, it was brought up in the past, citing Emscripten as an example, but they were strongly against busy-waiting even as a workaround for this issue.

@RReverser
Member

RReverser commented Dec 9, 2023

At this point, it makes me wonder if it could be a post-link feature in wasm-ld or wasm-opt where it would rewrite all atomics.wait to such conditional busy-loop automatically regardless of the source language.
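What such a conditional wait might look like at the source level is sketched below in native Rust. This is not the output of wasm-ld or wasm-opt, and it is not the code of any existing library. The `Event` type is hypothetical, and the `on_main_thread` flag stands in for whatever main-thread detection a real toolchain would use.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Condvar, Mutex};

/// Illustrative one-shot event that picks its wait strategy per caller.
pub struct Event {
    set: AtomicBool,
    lock: Mutex<()>,
    cv: Condvar,
}

impl Event {
    pub fn new() -> Self {
        Event { set: AtomicBool::new(false), lock: Mutex::new(()), cv: Condvar::new() }
    }

    pub fn is_set(&self) -> bool {
        self.set.load(Ordering::Acquire)
    }

    /// Publish the event, then wake any thread sleeping in `wait`.
    pub fn signal(&self) {
        self.set.store(true, Ordering::Release);
        // Taking the lock before notifying avoids a lost wakeup: a
        // waiter either sees `set` already true or is already asleep.
        let _guard = self.lock.lock().unwrap();
        self.cv.notify_all();
    }

    /// Busy-wait when told we are on the main thread, sleep otherwise.
    pub fn wait(&self, on_main_thread: bool) {
        if on_main_thread {
            while !self.is_set() {
                hint::spin_loop();
            }
        } else {
            let mut guard = self.lock.lock().unwrap();
            while !self.is_set() {
                guard = self.cv.wait(guard).unwrap();
            }
        }
    }
}
```

The spinning branch keeps the semantics of the blocking branch, so callers do not need to know which one they got. The trade-off is the CPU burn discussed throughout this thread.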

@kettle11

kettle11 commented Dec 9, 2023

> At this point, it makes me wonder if it could be a post-link feature in wasm-ld or wasm-opt where it would rewrite all atomics.wait to such conditional busy-loop automatically regardless of the source language.

@RReverser That's an interesting idea!

@RReverser
Member

RReverser commented Dec 25, 2023

> An example I ran into the other day: the Rust pattern of parallel iterators. Basically if your code read-only iterates over a bunch of data the iterations can be split up into tasks and the work distributed across cores. The way it works is the work is split up into 'tasks' that are assigned to cores by a task manager, but in the meantime the main thread blocks.

Made a PR to Rayon meanwhile, I hope at least this usecase will be simplified if it's accepted: rayon-rs/rayon#1110

RReverser added a commit to RReverser/rayon that referenced this issue Dec 25, 2023
One of the most common complaints I've been receiving in [wasm-bindgen-rayon](https://github.com/RReverser/wasm-bindgen-rayon) that prevents people from using Rayon on the Web is the complexity of manually splitting up the code that uses Rayon into a Web Worker from code that drives the UI.

It requires custom message passing for proxying between two threads (Workers), which, admittedly, feels particularly silly when using a tool that is meant to simplify working with threads for you.

This all stems from a [Wasm limitation](WebAssembly/threads#177) that disallows `atomic.wait` on the main browser thread. In theory, it's a reasonable limitation, since blocking main thread on the web is more problematic than on other platforms as it blocks web app's UI from being responsive altogether, and because there is no limit on how long atomic wait can block. In practice, however, it causes enough issues for users that various toolchains - even Emscripten - work around this issue by spin-locking when on the main thread.

Rust / wasm-bindgen decided not to adopt the same workaround, following general Wasm limitation, which is also a fair stance for general implementation of `Mutex` and other blocking primitives, but I believe Rayon usecase is quite different. Code using parallel iterators is almost always guaranteed to run for less or, worst-case, ~same time as code using regular iterators, so it doesn't make sense to "punish" Rayon users and prevent them from being able to use parallel iterators on the main thread when it will lead to a _more_ responsive UI than using regular iterators.

This PR adds a `cfg`-conditional dependency on [wasm_sync](https://docs.rs/wasm_sync/latest/wasm_sync/) that automatically switches to allowed spin-based `Mutex` and `Condvar` when it detects it's running on the main thread, and to regular `std::sync` based implementation otherwise, thus avoiding the `atomics.wait` error. This dependency will only be added when building for `wasm32-unknown-unknown` - that is, not affecting WASI and Emscripten users - and only when building with `-C target-feature=+atomics`, so not affecting users who rely on Rayon's single-threaded fallback mode either. I hope this kind of very limited override will be acceptable as it makes it much easier to use Rayon on the web.

When this is merged, I'll be able to leverage it in wasm-bindgen-rayon and [significantly simplify](https://github.com/RReverser/wasm-bindgen-rayon/compare/main...RReverser:wasm-bindgen-rayon:wasm-sync?expand=1) demos, tests and docs by avoiding that extra Worker machinery (e.g. see `demo/wasm-worker.js` and `demo/index.js` merged into single simple JS file in the linked diff).
RReverser added a commit to RReverser/rayon that referenced this issue Jan 10, 2024
One of the most common complaints I've been receiving in [wasm-bindgen-rayon](https://github.com/RReverser/wasm-bindgen-rayon) that prevents people from using Rayon on the Web is the complexity of manually splitting up the code that uses Rayon into a Web Worker from code that drives the UI.

It requires custom message passing for proxying between two threads (Workers), which, admittedly, feels particularly silly when using a tool that is meant to simplify working with threads for you.

This all stems from a [Wasm limitation](WebAssembly/threads#177) that disallows `atomic.wait` on the main browser thread. In theory, it's a reasonable limitation, since blocking main thread on the web is more problematic than on other platforms as it blocks web app's UI from being responsive altogether, and because there is no limit on how long atomic wait can block. In practice, however, it causes enough issues for users that various toolchains - even Emscripten - work around this issue by spin-locking when on the main thread.

@syg

syg commented Jan 10, 2024

I will be proposing https://github.com/syg/proposal-atomics-microwait

@kettle11

I'm frustrated with the general consensus that allowing Atomics.wait is "impossible" and "never going to happen".

Who do we have to convince? Which browser people would block that change?

I understand the argument that it's bad when the main thread blocks (it hangs the UI) and it can be easy to deadlock (calls from workers can depend on the main thread) but here we are 8 years on from the initial discussion and the status quo is worse than allowing it.

Emscripten is busy-looping and the Rust Wasm ecosystem is a mess of hacks, busy-loops, and libraries that assume Wasm will never have threads. Busy-looping is strictly worse than waiting.

I'm in the process of porting a major Rust project (Bevy) to support 'threading' on Wasm and a huge number of compile time if-defs will be needed to special-case behavior on the web and busy-loops will be added. I will have to audit every dependency to make sure they do nothing to block on the main thread.

In retrospect the Rust ecosystem should have followed Emscripten's lead and found a way to busy-loop on the main thread in its low level primitives, but they went for a more theoretically principled approach.

To rant a little bit: this decision, and the way it's been discussed through the years, seems to reflect an almost demeaning approach towards Wasm devs. It's a bit like a parent saying "You can do your stupid thing, but I'm not going to help you". But in this case withholding help has resulted in countless engineering hours and CPU cycles wasted and an impaired Wasm ecosystem, all without actually changing behavior in the way the 'parents' wanted.

@RReverser
Member

Made a PR to Rayon meanwhile, I hope at least this usecase will be simplified if it's accepted: rayon-rs/rayon#1110

This is now released in Rayon and available via wasm-bindgen-rayon 1.2.0. @kettle11 you might want to give it a try.

@juj
Author

juj commented Mar 6, 2024

Coming back to this topic, to be frank, as more time passes, the stronger I feel that adding atomics waits to the main thread would be the right thing to do.

I am yet to see a positive result that would have followed from not supporting atomic waits on the main thread, but there are several downsides that keep piling up. Enabling atomic waits on main thread would only see positives from my perspective.

This became quite a long post, but here goes. The rationale I find is as follows:

1. Lack of sync waits on main thread is not able to coax users to develop "better" code

The main intent of disallowing atomic waits on the main thread has been to enforce the asynchronous computing manifesto on the JS main thread. I.e. to provide a programming paradigm that only supports asynchronous primitives to keep the main thread of the browser responsive. The intent is that this would bump developers towards writing good "best practices" asynchronous code that is responsive to the user.

However, the thing here is that this goal in context of sync atomic waits is an illusion, in my opinion.

We have now for almost a decade had multithreading and atomics support in Emscripten, with a wide array of developers writing code to target it. The original idea with Atomics.waitAsync() was that, with time, researchers would come up with "best practices" async wait variants of shared data synchronization algorithms that could be used to fill this need.

I have seen a lot of activity around developers adopting Emscripten pthreads and Wasm Workers, but I am yet to see good examples of codebases migrating their use of sync waits to an async wait paradigm.

The realization I have had since then is that this is not a fundamental problem of the shared synchronization algorithms themselves, but a call stack nesting problem inherent in the general sync-to-async transformation, i.e. what is on your call stack from "other code higher up" that leads to calling into an async wait path. So migrating to such async sharing code simply isn't feasible in many codebase contexts (even if a wealth of such async programming constructs were available).

A recent example (which prompted me to write down these thoughts): I have been implementing a multithreaded stop-the-world mark-and-sweep garbage collector that I am experimenting with as a GC for a compiled C# VM on the web. While developing it, I ran into instances of mutex lock constructs that deteriorate into CPU-burning spinwaits on the main thread (e.g. [1], [2]), just like in so many other multithreaded Wasm codebases that I have worked on before.

In practically all cases, I am of the opinion that these code paths would be strictly better if main thread had sync wait support. Let's take the most common scenario as an example: locks.

2. Lack of sync waits leads to worse failure modes in basic scenarios

The overwhelming majority of uses of atomic waits are to synchronize access to shared data structures, i.e. to implement locks.

These data structures can come in more or less arbitrary low level places in a multithreaded program, which prevents using Atomics.waitAsync(), since not just the multithreaded code, but also the whole call stack scope leading to the code would need to be restructured. As a result, developers do the only thing that they can: they replace the atomic waits on the main thread with spinlocks.

Developers do not do this because they are lazily writing poor code, but because the Wasm computing platform does not provide good means of recasting such code structures into async waits (I'll discuss below how the only general tool, JSPI/await, falls short).

Analyzing the example case of implementing mutex locks, we can look at three scenarios: a) there is no contention and the code is correct, b) there is contention between workers and the main thread, and c) the programmer's code is incorrect (meaning, more or less, that a worker might stall while holding a lock that the main thread needs to acquire synchronously).

c) relates to the danger that we want to avoid spreading on the web, i.e. we don't want developers with bad code shipping web pages.

With these three scenarios, we can fill the following table:

| | a) correct code, no contention | b) correct code, with contention | c) incorrect code |
|---|---|---|---|
| main thread does not support atomic.waits | minimal CPU use | excess CPU use | 100% CPU burn |
| main thread supports atomic.waits | minimal CPU use | minimal CPU use | minimal CPU use |

In the case of a) correct code and absence of contention, it does not matter much whether sync atomic.waits exist on the main thread or not. The differences occur in the other two scenarios:

b) In the absence of support for atomic.waits, in the case of contention, the main thread will need to resort to spinning the CPU hot while a worker is accessing the shared data structure. This will be just wasted CPU cycles, whereas if atomic.waits were supported, all that energy could be saved.

c) If there is a programming error, or some other fault in a Worker, and the Worker practically dies/halts while holding a lock, and the main thread then arrives to attempt to get the lock: in the absence of atomic.wait support, the user's system will temporarily (for 10-15 seconds) spin at 100% CPU usage until the browser's "page is hung" watchdog kicks in.

Even worse, today in all browsers, this watchdog stops the page execution only when interactively stopped by the user (which I think is the correct general UX). But an unattended buggy web page would result in continuously 100% spinning a CPU core until a human operator intervenes.

If main thread supported sync waits, then all this time that the page is hung could be spent with the system idle, and the watchdog could still arrive equally well to stop the page. Battery is saved, and as a bonus, the browser DevTools would be able to easily diagnose to the developer that the page hang was due to a multithreaded synchronization problem.

Today since browsers do not support sync waits, mutex locks degenerate into for(;;) ; loops that the browser cannot reason about. If these hang due to a deadlock, they will degenerate into battery-munching loops until the watchdog kicks in.
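To make the degenerate case concrete, here is a minimal sketch of the kind of spin-based mutex the main thread ends up with today. The names `tryLock`, `spinLock`, `unlock` and the 0/1 lock word encoding are illustrative assumptions, not code from Emscripten or any other toolchain; the `maxSpins` cap exists only so the example terminates.

```javascript
// Lock word states (illustrative encoding, not taken from any toolchain).
const UNLOCKED = 0;
const LOCKED = 1;

// One CAS attempt: returns true if this caller took the lock.
function tryLock(i32, index) {
  return Atomics.compareExchange(i32, index, UNLOCKED, LOCKED) === UNLOCKED;
}

// Busy-spin until the lock is taken or maxSpins attempts have failed.
// Every failed iteration is pure CPU burn; nothing here lets the thread sleep.
function spinLock(i32, index, maxSpins = Infinity) {
  let spins = 0;
  while (!tryLock(i32, index)) {
    if (++spins >= maxSpins) return { acquired: false, spins };
  }
  return { acquired: true, spins };
}

// Release the lock and wake one waiter that may be sleeping in Atomics.wait.
function unlock(i32, index) {
  Atomics.store(i32, index, UNLOCKED);
  Atomics.notify(i32, index, 1);
}

const lockWord = new Int32Array(new SharedArrayBuffer(4));
const first = spinLock(lockWord, 0);          // uncontended: succeeds at once
const second = spinLock(lockWord, 0, 100000); // holder never releases: all spins wasted
unlock(lockWord, 0);
```

The second call models scenario c): the holder never releases, so every one of the attempted iterations is wasted work that a sleeping wait would have avoided.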

I would strongly argue that canonical mutex lock code would have no categorical problems running on the main thread: the overwhelming majority of developers writing multithreaded code are familiar with and meticulous about the operations they perform in their shared section; and the above table shows that browsers would be better equipped to handle developer failures if the main thread supported sync waiting.

3. Lack of sync waits requires developers to write divergent code for the main thread vs workers

A common source of problems with the lack of sync waiting on the main thread is that it leads to developers needing to add if (IS_MAIN_THREAD) checks in their code to make sure they go down the specific main thread vs worker thread synchronization paths. This is error prone, slows down performance in tight synchronized loops (such checks require a TLS slot access inside each such inner loop), is complex to maintain (due to the inherent interaction with Wasm<->JS required to set such checks up), and is one more source of platform-specific discrepancies.

I would argue that in the past decade, controlling this aspect has caused more programming headaches and programming errors than unresponsive multithreaded code reaching production due to the use of "bad" spinlocks.

If we allowed synchronous waits on the main thread, this source of programming problems would go away, and the TLS access in each inner loop of each mutex lock in a program would be avoided.
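The divergence described in this section can be sketched as follows. This is a hedged illustration only: `isMainBrowserThread` (detecting the main thread via a global `window`) and the `canBlock` parameter are assumptions made for the example, and real toolchains perform this check differently, e.g. through a TLS slot.

```javascript
const UNLOCKED = 0;
const LOCKED = 1;

// Assumption for illustration: only the browser main thread has a global `window`.
function isMainBrowserThread() {
  return typeof window !== "undefined";
}

// Acquire the lock. `canBlock` selects between the two code paths that
// developers currently have to maintain side by side.
function lock(i32, index, canBlock = !isMainBrowserThread()) {
  while (Atomics.compareExchange(i32, index, UNLOCKED, LOCKED) !== UNLOCKED) {
    if (canBlock) {
      // Worker path: sleep until notified (returns at once if the word changed).
      Atomics.wait(i32, index, LOCKED);
    }
    // Main thread path: fall through and retry the CAS immediately (busy spin).
  }
}

// Release the lock and wake one sleeping waiter, if any.
function unlock(i32, index) {
  Atomics.store(i32, index, UNLOCKED);
  Atomics.notify(i32, index, 1);
}
```

If the main thread were allowed to wait, the `canBlock` branch and the thread detection behind it could be dropped, leaving a single path.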

4. JSPI and await fall flat as a solution

The recent "last hope" for sync waiting on the main thread for Wasm has been with JavaScript Promise integration. JSPI essentially brings the await keyword into Wasm, i.e. allowing await Atomics.waitAsync() constructs to be turned into synchronous looking constructs, which then yield out and resume afterwards.
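For reference, the await Atomics.waitAsync() pattern in plain JavaScript could look roughly like the sketch below. It assumes an engine that implements `Atomics.waitAsync`; the function name `lockAsync` and the 0/1 lock word encoding are hypothetical and chosen only for this example.

```javascript
const UNLOCKED = 0;
const LOCKED = 1;

// Acquire the lock without blocking the calling thread.
// While this function is suspended at `await`, other event handlers can run,
// so callers must tolerate re-entrancy.
async function lockAsync(i32, index) {
  while (Atomics.compareExchange(i32, index, UNLOCKED, LOCKED) !== UNLOCKED) {
    const result = Atomics.waitAsync(i32, index, LOCKED);
    // result.async is false when the word no longer held LOCKED at the time
    // of the call; in that case there is nothing to await and we retry.
    if (result.async) {
      await result.value; // resolves to "ok" once another agent calls Atomics.notify
    }
  }
}

const word = new Int32Array(new SharedArrayBuffer(4));
// Uncontended case: the CAS succeeds before the first await, so the lock word
// is already set when lockAsync returns its promise.
const acquired = lockAsync(word, 0);
```

Note that every caller up the stack has to become `async` as well, which is the call stack nesting problem described earlier.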

However, unfortunately JSPI/await are not able to solve this need for sync waiting on the web at least for some use cases.

The big problems with JSPI/await as a solution to enabling sync-like waiting, especially for real-time interactive games, are the following:

  1. The Web Canvas object (i.e. WebGL, WebGPU and Canvas2D APIs) ties its presentation model to the callback/event model of the web. That is, yielding out from an active event handler means that the Canvas will present whatever has been rendered so far. But this could lead to only a part of the complete visual rendering being shown to the user, breaking correctness of a renderer. This can be worked around by introducing an intermediate framebuffer and blitting, but that costs performance.
  2. The requestAnimationFrame() event callbacks (and part of the presentation mechanism) are scheduled on the display refresh rate. If JSPI/await are introduced into the mix, that effectively means that this synchronization of rendering with the native refresh rate of the display is lost.
  3. Other web events might fire while JSPI/await is waiting for a lock that is supposed to be synchronous. This might cause re-entrancy in the program. To guard against that, a developer then needs to manually implement a custom global event queue that blocks all other web events from progressing (or be able to analyze re-entrancy site-wide with respect to any locks in the program, possibly requiring audits inside third-party libraries). Implementing such global site-wide event queue blockers may be tricky and error-prone, especially when the developer uses middleware frameworks whose code they cannot control. In other words, implementing such re-entrancy guards may challenge the composability of JavaScript/Wasm frameworks.

The last bullet point is likely to produce extremely hard to debug bugs, even more so than getting the original shared state synchronization correct with sync atomic locks.


I think the web would be just better off with support for synchronous atomic waits on the main thread. Programming would be simpler, with fewer novel surface areas for bugs, performance would be better, and browser DevTools would be better equipped to diagnose performance and correctness problems that would stem from multithreaded synchronization issues.

Not having sync waits on the main thread does not seem to have convincingly prevented deadlocking or responsiveness problems on the web, but the opposite: needing to write novel synchronization code has generally led to new correctness and performance bugs that only manifest on the web and generally don't exist on other platforms. If web pages had sync waiting on the main thread, the behavior of badly behaving web sites would be objectively less bad: sleeping hung wins over device-is-hot-hung, especially so on mobile.

A simple way to implement a watchdog would be to have a browser turn infinite atomic waits into e.g. 30 second atomic waits, and display the slow script dialog after that. Though since such watchdogs already exist in all browsers, this would likely not amount to much of a change in existing behavior?

I would love to see enabling sync waits on the main thread, and subjecting it to the general page watchdog timer. It seems like it would solve everything?

Alternatively, we really would need to start finding new proposals that would resolve the abovementioned challenges. In some previous conversations Emscripten needs have been dismissed due to "attempting to port legacy codebases to the web". (i.e. assuming that the problem stems from not wanting/being able to refactor large amounts of legacy)

The Emgc garbage collector I mentioned above, is not a legacy codebase port, but a ground up written GC codebase that targets only Wasm/web. Even in that context it is not feasible to utilize Atomics.waitAsync() to solve the required synchronization problems that multithreaded garbage collection would need, but those CPU-hogging spinlocks seem to be needed. We need to find wait solutions that compose well so that such a GC library could readily be used as a middleware library in other projects.


@syg has proposed a microwait primitive, which would provide a sched_yield() or _mm_pause() type of primitive on the web. This is awesome, I would love to see this kind of primitive added to the Atomics API, since so far we have code experimenting with performing Atomics.wait(1 nanoseconds) types of constructs to simulate a yield, which is suspect to say the least. It would be better to have a primitive available on the web that would let code signal their intent directly. This way browsers would be able to utilize the best performing yield construct.

Syg's proposal on clampTimeoutIfCannotBlock on Atomics.wait clamping to 50ms max waits seems unnecessary, and would lead to code implementing for(;;) loops to perform multiple such waits (leading to leaking out spurious futex wakeup semantics). Allowing agents to wait as long as they need without a clampTimeoutIfCannotBlock parameter would seem superior? Or if not, we'd be interested in ideas for concrete alternatives e.g. in that Emgc project or other problems mentioned above to solve what else such code could do instead?

@syg

syg commented Mar 7, 2024

> @syg has proposed a microwait primitive, which would provide a sched_yield() or _mm_pause() type of primitive on the web.

To clarify, the microwait API as proposed is _mm_pause, not sched_yield. Crucially, _mm_pause is a CPU-level hint to relinquish some shared resources in the CPU without relinquishing the CPU itself at the OS level. sched_yield on the other hand is an OS-level API for the current thread to relinquish the CPU.
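A spin loop using such a CPU-level pause hint could be sketched as below. This assumes the microwait proposal is available under the name `Atomics.pause` (the name it was later standardized under, to the best of my knowledge); the sketch feature-detects it and falls back to a no-op otherwise. `spinLockWithPause` and the 0/1 lock word encoding are hypothetical.

```javascript
const UNLOCKED = 0;
const LOCKED = 1;

// Assumption: use Atomics.pause when the engine provides it, otherwise do nothing.
const pause =
  typeof Atomics.pause === "function" ? () => Atomics.pause() : () => {};

// Spin up to maxSpins times, hinting to the CPU between attempts.
// The hint does not give up the thread at the OS level; the loop still
// occupies the core, it just does so more politely.
function spinLockWithPause(i32, index, maxSpins) {
  for (let spins = 0; spins < maxSpins; spins++) {
    if (Atomics.compareExchange(i32, index, UNLOCKED, LOCKED) === UNLOCKED) {
      return true;
    }
    pause();
  }
  return false; // caller decides what to do next (retry later, wait, etc.)
}
```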

> Syg's proposal on clampTimeoutIfCannotBlock on Atomics.wait clamping to 50ms max waits seems unnecessary, and would lead to code implementing for(;;) loops to perform multiple such waits (leading to leaking out spurious futex wakeup semantics). Allowing agents to wait as long as they need without a clampTimeoutIfCannotBlock parameter would seem superior? Or if not, we'd be interested in ideas for concrete alternatives e.g. in that Emgc project or other problems mentioned above to solve what else such code could do instead?

Allowing the main thread agent to block is non-negotiable at this moment, so the clampTimeoutIfCannotBlock parameter is intended to be an inferior-but-better-than-nothing compromise. The point is that code today already implements for(;;) loops to perform such waits, except it pegs the CPU even more.

That said, I'm very interested to hear if clampTimeoutIfCannotBlock is actually inferior-and-worse-than-nothing for use cases. In that case, I'm happy to drop it from the proposal and focus only on microwait.

@kettle11

kettle11 commented Mar 7, 2024

> Allowing the main thread agent to block is non-negotiable at this moment

@syg Why is it non-negotiable and who are the (non) negotiating parties? I haven't seen any recent public arguments against allowing the main thread to block in quite a while.

Many years after the initial discussions we clearly have seen how it's played out for the ecosystem. It's worth discussing again.

@kettle11

kettle11 commented Mar 7, 2024

To reiterate: the primary stated purpose of not allowing the main thread to use Atomics.wait is to prevent a deadlock when a worker inadvertently waits on the main thread while the main thread is also waiting.

The reality: Emscripten and Rust, two of the biggest participants in the WebAssembly space, both are forced to busy-loop as a workaround which is strictly worse. This has been the status quo for well over 5 years now.

Given what we know now the responsible thing to do is for whoever is responsible for this decision to at least respond to the points raised in #177 (comment) . There should not be decisions that are treated as simply unquestionable, in particular when so many people have raised valid objections for years. This has been, and continues to be, harmful to the WebAssembly ecosystem.

@juj
Author

juj commented Mar 26, 2024

Thanks for all the comments.

I posted in tc39/proposal-atomics-microwait#1 a question about what the semantics of such a capped Atomics wait would be. If the { clampTimeoutIfCannotBlock: true } mechanism will allow main thread code to slice up long waits to multiple shorter ones, then I believe this should be sufficient for Emscripten needs.
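Slicing a long wait into multiple shorter ones could look roughly like this. The sketch does not use the proposed `clampTimeoutIfCannotBlock` option (its semantics were still under discussion); it emulates the clamping with the existing timeout argument of `Atomics.wait`, so it only runs where blocking is currently allowed (a Worker, or Node.js). `waitSliced` and its parameters are hypothetical names.

```javascript
// Wait until i32[index] !== expected, a notify arrives, or totalMs elapses,
// but never block for more than sliceMs in a single Atomics.wait call.
function waitSliced(i32, index, expected, totalMs, sliceMs) {
  const deadline = Date.now() + totalMs;
  for (;;) {
    const remaining = deadline - Date.now();
    if (remaining <= 0) return "timed-out";
    const result = Atomics.wait(i32, index, expected, Math.min(sliceMs, remaining));
    // "ok" (woken by notify) and "not-equal" (value already changed) end the
    // wait; a per-slice "timed-out" just starts the next slice.
    if (result !== "timed-out") return result;
  }
}
```

Between slices the caller regains control, which is where a runtime could service other work or decide to give up.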
