
[wasm] Jiterpreter multithreading support #88279

Merged: 8 commits, Aug 15, 2023

Conversation

kg (Contributor) commented Jun 30, 2023

This PR enables jiterpreter traces when threading is enabled. It makes a number of changes:

  • We pre-allocate function table entries for each type of jitted code, and fill all those slots with safe placeholder functions that can be called from any thread without crashing. For interp_entry we install the default implementation, for traces we install a function that skips the enter/monitor point without doing anything, and for do_jit_call we install a function that sets a special 'use the slow path' flag. The placeholder for traces is also responsible for ensuring that each thread gets its own compiled copy of traces in order to run them.
  • When getting a function pointer for jitted code, we atomically reserve a slot in the pre-allocated tables and install it in that slot on the current thread. Other threads will have the placeholder function in that slot, but when it's called it will automatically JIT the trace again and install it.
  • All jiterpreter opcodes were changed to use an 8-byte layout that allows us to patch them using 2-byte atomics:
    JITERPRETER_xxx [2-byte relative function pointer value] [4-byte trace index]
    This means that to run a trace we need to add the value of a global to the function pointer we loaded from memory, but the overhead of that is not too bad. The existing opcode patching model where we reuse the 4-byte storage for different values just was not possible to do in a thread-safe way given WASM's alignment constraints.
  • Code that manipulates jiterpreter opcodes now uses a struct that explicitly specifies their layout instead of explicitly doing READ32 and whatnot. This mostly was just to make it easier to update all the code that touched them when I changed the layout.
  • The jiterpreter's hit counting logic is updated to also use atomics so that only one thread will ever observe the counter hitting a threshold, and so that other threads shouldn't race to JIT code or patch an opcode. In some cases it seems like the race still happens, but since we use atomics it generates a warning but doesn't crash.
  • The jiterpreter's monitoring implementation is updated to use atomics.
  • New runtime options are added for configuring the size of the jiterpreter's tables. (For some reason, the interp_entry tables need to be REALLY big). It seems likely that in production we may hit cases where the tables are too small or too big, since there is a limit to how big the function table can get and an AOT'd binary of sufficient size may already have 100k+ table entries.
  • All jiterpreter counters are now tracked in C instead of JS and updated using atomics, which allows us to share that data (like number of traces compiled, number of bytes generated) between threads.
  • As mentioned above, we JIT a unique copy of the trace for each thread that needs to run it, so we can install it into the function pointer table for that thread. In my testing in Chrome's devtools the unique copies of the traces all have the same hash, which hopefully means that v8 is able to share the compiled code and potentially other data between them. This does mean that we will waste CPU re-jitting (this is fixable, but a lot of work) and waste memory on duplicating traces.
  • Traces now take an extra argument (sigh) that contains the address of the trace entry point opcode. This is used so that the placeholder trace function can acquire its own trace index in order to JIT itself on other threads.
  • I patched our emscripten libraries to expose a getter for the wasm memory object so that we have access to it in multi-threaded configurations. Existing methods of getting it don't work when threads are active, see How should I access the WebAssembly.Memory instance? emscripten-core/emscripten#19832
  • I did some work to try and make the jit_call and interp_entry (AOT) stuff thread safe, but it doesn't work yet, so those are still disabled.
  • I changed most of the jiterpreter atomics code to use stdatomic.h because I was having problems with mono_atomic and we are missing some of the atomic ops for 2-byte types. (I later determined that our atomics work fine, it was some other sort of bug on my end). It might be necessary to undo these changes, but we know that the wasm target should always have this header so it's at least safe to use them (and I verified the generated code has the wasm atomics I expect.)
  • Trace transfer was removed because making it work with the addition of the new ip argument to traces was nontrivial and I am not sure the optimization is that valuable anymore.
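The 8-byte opcode layout and 2-byte atomic patching described above can be sketched roughly as follows. This is a simplified illustration; the struct, field, and function names are hypothetical, not the actual ones from the PR:

```c
#include <stdatomic.h>
#include <stdint.h>

// Hypothetical layout for an 8-byte jiterpreter opcode: the 2-byte
// relative function pointer can be patched atomically while other
// threads concurrently read it.
typedef struct {
    uint16_t opcode;               // JITERPRETER_xxx
    _Atomic uint16_t relative_fn;  // added to a global base at dispatch time
    uint32_t trace_index;          // 4-byte trace index
} JiterpOpcode;

// Patch in a newly JITted trace: a 2-byte atomic store means other
// threads observe either the old placeholder value or the new one,
// never a torn write.
static void
patch_trace (JiterpOpcode *op, uint16_t new_relative_fn)
{
    atomic_store_explicit (&op->relative_fn, new_relative_fn, memory_order_release);
}

// Resolve the actual function table index by adding the global base
// to the 2-byte relative value loaded from the opcode.
static uint32_t
resolve_fn (JiterpOpcode *op, uint32_t fn_base)
{
    return fn_base + atomic_load_explicit (&op->relative_fn, memory_order_acquire);
}
```

This mirrors the trade-off described above: each dispatch pays one extra add (base plus relative value), in exchange for a patchable field that satisfies WASM's alignment constraints for atomics.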

So far I've tested this with the basic browser threading sample (fibonacci) and the threaded version of the raytracer that Pavel and Aleksey put together.
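The slot-reservation step from the list above can be sketched as a fetch-and-add on a shared counter. This is a sketch under assumptions; the names and the exhaustion behavior are illustrative, not the PR's actual implementation:

```c
#include <stdatomic.h>

// Hypothetical reservation of a slot in a pre-allocated function table.
// atomic_fetch_add guarantees each thread gets a distinct index even
// when several threads JIT code concurrently.
static _Atomic int next_free_slot;

static int
reserve_table_slot (int table_base, int table_size)
{
    int index = atomic_fetch_add_explicit (&next_free_slot, 1, memory_order_relaxed);
    if (index >= table_size)
        return -1; // table exhausted; caller would fall back to the interpreter
    return table_base + index;
}
```

Because the counter only ever moves forward, no two threads can claim the same slot, which is what makes it safe to install a thread's freshly JITted copy into its reserved entry.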

ghost commented Jun 30, 2023

Tagging subscribers to this area: @BrzVlad, @kotlarmilos
See info in area-owners.md if you want to be subscribed.

Issue Details

This PR contains initial work to make the jiterpreter thread safe. Key changes:

  • We pre-allocate function table entries for each type of jitted code, and fill all those slots with safe placeholder functions that can be called from any thread without crashing. For interp_entry we install the default implementation, for traces we install a function that skips the enter/monitor point without doing anything, and for do_jit_call we install a function that sets a special 'use the slow path' flag.
  • When getting a function pointer for jitted code, we atomically reserve a slot in the pre-allocated tables and install it in that slot on the current thread. At present other threads will never install it, but they shouldn't crash when calling the function pointer. This means the first thread to JIT the code wins and it will only be fast on that thread at present.
  • All of the jiterpreter's opcode patching logic (unless I missed some) now uses 4-byte or 8-byte atomic memory operations so that the patching can happen safely without other threads observing incomplete writes (this is necessary to prevent the 'MONITOR [index]' -> 'ENTER [fn ptr]' transition from causing crashes by being incomplete).
  • The jiterpreter's hit counting logic is updated to also use atomics so that only one thread will ever observe the counter hitting a threshold, and so that other threads won't race to JIT code or patch an opcode.
  • The jiterpreter's monitoring implementation is updated to use atomics.
  • New runtime options are added for configuring the size of the jiterpreter's tables. (For some reason, the interp_entry tables need to be REALLY big). It seems likely that in production we may hit cases where the tables are too small or too big, since there is a limit to how big the function table can get and an AOT'd binary of sufficient size may already have 100k+ table entries.

I haven't been able to test any of this with threading enabled because for some reason right now I can't get any of the samples to work right with threading turned on, presumably something bit-rotted. It all seems to work with threads disabled, at least.

Author: kg
Assignees: -
Labels:

area-Codegen-Interpreter-mono

Milestone: -

kg (Contributor, Author) commented Jul 11, 2023

There are two problems with enabling traces in threaded builds right now:

  • Some emscripten stdlib function(s), presumably atomics, explode for unaligned accesses. I'm not sure what to do about this, since it seems like they probably require 8-byte alignment and opcodes are 2-byte aligned. We might need a global lock for opcode patching, which really sucks.
  • Module.asm.memory is not a WebAssembly.Memory in threaded builds for some reason. I opened an emscripten issue to ask how I'm supposed to get at the Memory object.

@kg kg changed the title [wasm] Jiterpreter thread safety infrastructure [wasm] Jiterpreter multithreading support Jul 22, 2023
kg (Contributor, Author) commented Jul 22, 2023

Almost ready for review. Despite what the comment says, I realized it's not actually fine to use the jitcall table for do_jit_call_indirect; we need special handling for that function.

@kg kg added the arch-wasm WebAssembly architecture label Jul 22, 2023
ghost commented Jul 22, 2023

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.


Author: kg
Assignees: kg
Labels:

arch-wasm, area-Codegen-Interpreter-mono

Milestone: -

@kg kg marked this pull request as ready for review July 25, 2023 08:20
kg (Contributor, Author) commented Jul 25, 2023

I'm not sure what the right solution is for the worker interop exports/imports. I was unable to build locally until I removed all the conditional logic around them; the linker complained that it couldn't find those two functions. I'm not sure it's worth making them conditional, since their implementation is guarded and they're small.

@kg kg added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Jul 26, 2023
kg (Contributor, Author) commented Jul 26, 2023

I expect this to cause a small performance regression in single-threaded configurations, so I'm setting NO-MERGE until the perf tooling is all functioning as expected again and we will be able to spot any unexpected changes.

kg (Contributor, Author) commented Jul 27, 2023

Tried reverting the corebindings changes, but when I do, I can't run tests locally no matter what I try. Will see if it works on CI.

Checkpoint

Checkpoint

Checkpoint thread safety work

Checkpoint

Fix build

Disable traces in threaded builds due to emscripten compat issue

STASH: Debugging atomics issue

Checkpoint threading support

Allocate jiterp tables in each thread's startup

Make jiterpreter counters work cross-thread

Checkpoint

Checkpoint

Checkpoint

Checkpoint

Basic implementation of trace self-healing so all threads will eventually jit all traces

Don't force atomics anymore

Checkpoint

Cleanup

Fix jiterp elapsed time

Disable table allocation message unless stats are active

Fix build

Fix missing linker imports in threaded build
Fix do_jit_call dispatcher attempting to use not-yet-allocated JIT tables
Don't reuse the jit call wrapper table for do_jit_call dispatcher

Raise default table size

Fix typo

Comment out debug print

Fix boost_back_branch_target

Silence unhelpful clang warning

Silence another warning

Address PR feedback: Unify things around Module.getMemory

Revert corebindings changes
pavelsavara (Member) commented:

I will start reading the code in a minute, but before that, a few questions on the description.

* The jiterpreter's hit counting logic is updated to also use atomics 
* The jiterpreter's monitoring implementation is updated to use atomics.
* All jiterpreter counters are now tracked in C instead of JS and updated using atomics

Are all the atomics introduced here only in place for the MT build flavor? Is there a perf or size impact on the non-MT build?

(For some reason, the interp_entry tables need to be REALLY big). 

Could you please elaborate? I probably don't have the context to understand what this is about.

we will waste CPU re-jitting (this is fixable, but a lot of work) 

We could have a "repository" of SharedArrayBuffer blobs keyed by trace index on the main thread and broadcast updates to workers. Probably another PR.

* Traces now take an extra argument (sigh) that contains the address of the trace entry point opcode. 

You could install a different wrapper into each slot, like table[currentIndex] = placeholderFn.bind(null, currentIndex). That will let you call it without the extra argument from WASM, and the generated trace doesn't need to know about the extra arg.

* I patched our emscripten libraries to expose a getter for the wasm memory 

I still want it strongly typed :)

* I changed most of the jiterpreter atomics code to use stdatomic.h ... It might be necessary to undo these changes

Was this undone already?

* Trace transfer was removed 

What is "trace transfer"?

Review threads were opened on:

src/mono/wasm/runtime/jiterpreter.ts
src/mono/wasm/runtime/jiterpreter-trace-generator.ts
src/mono/wasm/runtime/jiterpreter-support.ts
src/mono/mono/mini/interp/jiterpreter.h
src/mono/mono/mini/interp/jiterpreter.c
src/mono/mono/mini/interp/interp.c
kg (Contributor, Author) commented Aug 10, 2023

I will start reading the code in a minute, but before that, a few questions on the description.

* The jiterpreter's hit counting logic is updated to also use atomics 
* The jiterpreter's monitoring implementation is updated to use atomics.
* All jiterpreter counters are now tracked in C instead of JS and updated using atomics

Are all the atomics introduced here only in place for the MT build flavor? Is there a perf or size impact on the non-MT build?

Right now most of the atomics are not ifdefed out, but size increase is a good reason to do it, so I'll do it.

(For some reason, the interp_entry tables need to be REALLY big). 

Could you please elaborate? I probably don't have the context to understand what this is about.

We have to pre-reserve tables for each kind of generated code. I haven't been able to figure out why, but the ones for interp_entry - transitions from native code to interpreter - have to be very big. The size I picked actually isn't big enough for System.Text.Json.Tests.

we will waste CPU re-jitting (this is fixable, but a lot of work) 

We could have a "repository" of SharedArrayBuffer blobs keyed by trace index on the main thread and broadcast updates to workers. Probably another PR.

The problem is that we need to reconstruct the correct imports on each thread, so we would need to serialize all of that somewhere while building the trace the first time and stash it. Some constants in the trace could also be thread-local, so we would need to record their locations and patch them.

* Traces now take an extra argument (sigh) that contains the address of the trace entry point opcode. 

You could install a different wrapper into each slot, like table[currentIndex] = placeholderFn.bind(null, currentIndex). That will let you call it without the extra argument from WASM, and the generated trace doesn't need to know about the extra arg.

Yeah, I considered doing this. The problem is that we would now need to either include ~4000 unique dummy functions in the .wasm file (not reasonable) or generate them at startup (also kind of unreasonable).

* I patched our emscripten libraries to expose a getter for the wasm memory 

I still want it strongly typed :)

Yeah, I need to figure this out. Thanks for the reminder.

* I changed most of the jiterpreter atomics code to use stdatomic.h ... It might be necessary to undo these changes

Was this undone already?

No. As I mentioned on discord, I decided to stick with stdatomic because we don't have mono_ versions of all the atomics I need for the types I'm using, and adding new mono_ atomics would require me to implement them on every platform. So I decided jiterpreter.c specifically will only use stdatomic.
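A minimal sketch of what stdatomic.h provides here, namely atomic operations directly on 2-byte values, which is the capability described above as missing from the mono_ atomics. The function names below are hypothetical, not from the PR:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Attempt to patch a 2-byte value only if it still holds the value we
// expect; fails if another thread patched it first.
static bool
try_patch_u16 (_Atomic uint16_t *target, uint16_t expected, uint16_t desired)
{
    return atomic_compare_exchange_strong (target, &expected, desired);
}

// Read a 2-byte value with acquire ordering so any data published
// before the matching store is visible to this thread.
static uint16_t
read_u16 (_Atomic uint16_t *target)
{
    return atomic_load_explicit (target, memory_order_acquire);
}
```

On the wasm target these compile down to the corresponding wasm atomic instructions (as verified in the PR description for the generated code).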

* Trace transfer was removed 

What is "trace transfer"?

It's an old optimization that predates the jiterp having support for backward branches. It used to be worthwhile but in my testing it doesn't actually do anything anymore, so instead of making it thread-safe I just removed it.

kg (Contributor, Author) commented Aug 11, 2023

The LibraryTests failure is in hybrid ICU; it looks like the browser ICU data doesn't match what we expect it to be.

pavelsavara (Member) commented:

/azp run runtime-wasm

azure-pipelines commented:
Azure Pipelines successfully started running 1 pipeline(s).


// Atomically update the hit count. We may lose a race here, but that's not a big
// problem since losing the race indicates that the trace entry point is probably hot.
mono_atomic_cas_i64 (&trace_info->hit_count, new_hit_count, old_hit_count);
Review comment (Member):
#ifdef DISABLE_THREADS
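The CAS-based hit-count update quoted above can be sketched in isolation. This is a simplified version with hypothetical names and an illustrative threshold; the key property is that the thread whose increment lands exactly on the threshold is the only one that observes it, so only one thread queues the trace for JIT:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define HIT_THRESHOLD 1000 // illustrative value, not the runtime's default

// Returns true only for the increment that reaches the threshold.
static bool
increment_hit_count (_Atomic int64_t *hit_count)
{
    int64_t old_count = atomic_load (hit_count);
    int64_t new_count = old_count + 1;
    // Losing this race is harmless: it just means another thread bumped
    // the counter first, which indicates the entry point is hot anyway.
    if (!atomic_compare_exchange_strong (hit_count, &old_count, new_count))
        return false;
    return new_count == HIT_THRESHOLD;
}
```

Dropping the increment on a lost CAS (rather than retrying) matches the comment in the quoted snippet: an exact count doesn't matter, only that some thread eventually crosses the threshold exactly once.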

pavelsavara (Member) commented Aug 11, 2023

(For some reason, the interp_entry tables need to be REALLY big).

We have to pre-reserve tables for each kind of generated code. I haven't been able to figure out why, but the ones for interp_entry - transitions from native code to interpreter - have to be very big. The size I picked actually isn't big enough for System.Text.Json.Tests.

Do you mean that interp needs to have many re-entry thunks? Do they have different signatures? Do they carry some values in a closure? Does the number of them grow with the number of interpreted methods? Or with the number of jiterp traces?

We could have "repository" of SharedArrayBuffer blobs

Some constants in the trace could also be thread-local

Is that a pointer to a local stack? To an instance of something that is thread specific? Could we have a TLS getter for jiterp?

  • Traces now take an extra argument

You could install different wrapper to each slot

Yeah, I considered doing this. The problem is that we would now need to either include ~4000 unique dummy functions in the .wasm file (not reasonable) or generate them at startup (also kind of unreasonable)

I was not proposing wasm methods; I was proposing to create JS wrappers during start. 4000 functions per thread is not too bad; they would each be executed just once, and therefore not JITed by the browser.

But I don't know, maybe the extra argument doesn't have a significant performance hit and it's not worth it?

adding new mono_ atomics would require me to implement them on every platform.

Why can't you just implement wasm and let the other platforms throw? As the methods are new, they would not be used elsewhere until needed.

kg (Contributor, Author) commented Aug 12, 2023

(For some reason, the interp_entry tables need to be REALLY big).

We have to pre-reserve tables for each kind of generated code. I haven't been able to figure out why, but the ones for interp_entry - transitions from native code to interpreter - have to be very big. The size I picked actually isn't big enough for System.Text.Json.Tests.

Do you mean that interp needs to have many re-entry thunks? Do they have different signatures? Do they carry some values in a closure? Does the number of them grow with the number of interpreted methods? Or with the number of jiterp traces?

Yes, each thunk is dedicated to a specific method. They have different signatures (there's one table per signature) and they close over their target method.

We could have "repository" of SharedArrayBuffer blobs

Some constants in the trace could also be thread-local

Is that a pointer to a local stack? To an instance of something that is thread specific? Could we have a TLS getter for jiterp?

Yeah, some sort of TLS getter is one potential solution here, but it would make things much slower. Another option is to use a global import for the constant and then specify a different value each time we instantiate the module. I did tests with imported globals for constants before, and while it worked, it made traces slower, so it's not enabled right now.

The thread local values are things like scratch buffers.

  • Traces now take an extra argument

You could install different wrapper to each slot

Yeah, I considered doing this. The problem is that we would now need to either include ~4000 unique dummy functions in the .wasm file (not reasonable) or generate them at startup (also kind of unreasonable).

I was not proposing wasm methods; I was proposing to create JS wrappers during start. 4000 functions per thread is not too bad; they would each be executed just once, and therefore not JITed by the browser.

But I don't know, maybe the extra argument doesn't have a significant performance hit and it's not worth it?

You can't put JS functions directly into the WASM function table right now. You have to first import them into a WASM module and then re-export them, which is what Module.addFunction does. So we would end up needing to instantiate a bunch of modules in order to create these wrappers. The cost of that might not be too bad, but it's complicated to implement and hard to measure the impact.

adding new mono_ atomics would require me to implement them on every platform.

Why can't you just implement wasm and let the other platforms throw? As the methods are new, they would not be used elsewhere until needed.

I guess I could do that, but it makes me uncomfortable :)

kg added 2 commits August 11, 2023 17:33
Adjust default table size a bit
kg (Contributor, Author) commented Aug 12, 2023

Re: earlier questions, Module.getMemory() returns a WebAssembly.Memory in both ST and MT builds as expected (with Module.getMemory().buffer being a SharedArrayBuffer in MT builds)

@kg kg removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Aug 12, 2023
@kg kg merged commit 333c2c7 into dotnet:main Aug 15, 2023
104 checks passed
@ghost ghost locked as resolved and limited conversation to collaborators Sep 14, 2023