-
Notifications
You must be signed in to change notification settings - Fork 691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wasm2c: Allow use of x86 segment register for linear memory access #2395
Conversation
Nice! Thanks for working on this. I've never heard this called "Segue" optimization, is that a well known term? Perhaps the title would be better understood if it said something like "Use x86 segment register for memory access"? I seem to remember from the NaCl days that we couldn't use the x86 segment register on x86_64... but maybe that was only for running x86_32 binaries on x86_64 OS? Is that right @dschuff? |
Ah this is a name we came up with as part of our writeup as an easy shorthand for this. I'm happy to update the title of the commit/PR to just say "x86 segments".
Sharing some details from my exploration below in case it helps, while we wait to hear from dschuff It seems that a subset of the x86 segment register survived to x86_64. x86_64 got rid of the code segment From playing with NaCl's code base, I believe NaCl32 used
So I believe the exact approach of NaCl32 was no longer compatible in x86_64 since the segments and bounds checks NaCl32 used went away. (Not relevant here, but I have tested that NaCl32 works fine if you run a x86_32 binary in an x86_64 OS as CPUs and OSes remain backward compatible.) NaCl64 in contrast uses an alternate approach relying on a combination of (1) emitting separate instructions to perform the memory-base addition (2) using guard pages (3) moving the stack & heap in a single contiguous memory region (4) code-pointer masking to implement a coarse-grain CFI to enforce sandboxing. |
Yeah, my memory is that the original plan was just to use NaCl32 all the time even on x86-64 systems with 64-bit OS (until we started needing >4G of memory), but IIRC even that didn't work on Win64 because the OS was clobbering segment registers on context switches. So we (really, "they" - this was before my time in NaCl) had to design the x86-64 variant of NaCl sooner than they would have otherwise. And as you noted, no segment registers (although in some cases IIRC it actually does use fs or gs for TLS). |
@dschuff Ah gotcha, yup, i did a quick test and it looks like the unused segment register on windows gets clobbered during system calls --- which the tests in wasm2c don't really reveal, since these are mostly pure compute. I think there maybe a solution to reset the segment register after any call to external functions (like wasi-calls which may invoke syscalls). However, if this happens when the process is context switched out as you say, maybe this won't be sufficient either. In the interrim while i figure this out, i'll update this patch to be Linux only for now. |
@sbc100 I've made the above change. Let me know your thoughts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this looks surprisingly non-intrusive!
This looks like a cool optimization! Is there an available reference/documentation from Linux and glibc that we can point to that says the register is protected for use by userspace processes (notwithstanding scheduler interrupts, syscalls, signals, the threading implementation, etc.)? I'm obviously a bit nervous about the maintenance and support burden of carrying a clang-only, Linux-only experimental feature upstream if we have a way to factor things out. Would you have interest in splitting this into the c-writer changes to enable this (which would go upstream), and creating an extensibility mechanism (maybe a command-line flag for a custom |
(@sbc100 @keithw Oh one extra note from above... I was still validating this during my last post, so I didn't share earlier... but we do seem to be seeing one workload in Firefox eliminate 75% of the Wasm overhead due to this, which makes this particularly interesting to me. This speedup tends to be somewhat workload dependent though.)
Let me look into this. In practice, we have seen that this is never used even in large applications, but i'll see if I can find any documented reference.
@keithw Sure. While my first choice would be to just have this supported behind suitable optional/experimental flags, I am happy as long as there is a practical way I can enable this. The only caveat being --- I really don't want to have to maintain any custom patches, as this would require me to keep this in sync with in-tree. Happy to try out any technique you have in mind. Could you describe it in more detail? Right now I'm not sure how I can restrict the changes to c-writer.cc only as i have to modify the Alternately, if you want to create an
Some additional context from my side: Fwiw, I am investigating how we can support other compilers and OS as well. I believe it should be possible to widen the support in future PRs. |
Wow! That is really awesome. |
Quoting from https://www.kernel.org/doc/html/next/x86/x86_64/fsgs.html
@keithw Just wanted to bump the discussion so we can continue to find a path to land this. As mentioned above, we would really like to take advantage of the 75% overhead elimination soon. |
Cool, that's helpful!
I think I see a few good paths. Path one, this is a Linux+clang+Ivy-Bridge-and-later-only experiment that we're not supporting or documenting (yet) or worrying (yet) about nailing the corner cases (e.g. saving the register across calls to imported functions). If you want the experiment to be CI-tested upstream to avoid the burden of maintaining it out-of-tree, understandable, but it would be nice not to do that by adding more preprocessor macros and modes to the current declarations and having them live there indefinitely, because that also creates a burden. Maybe a middle ground is to declare this a time-limited experiment for n months? Or, another middle ground, the wasm2c tool take a command-line flag for an external declarations file (and c-writer gets minimally changed), you ship that file in an experiments directory, and it runs in the WABT CI to prevent bitrot? Path two is, we're talking about adding OS/compiler/CPU-gen-specific intrinsics to the wasm2c output in a supported configuration (e.g. we're willing to have a CVE if this goes awry). The performance win sounds awesome, but I think we'd have to figure out the corner cases before we can document/support this. Also: if we're doing this for performance reasons, I think we'd want the benchmark results to be ones anybody can see in full form and reproduce -- you want the flexibility to keep improving the approach over time if somebody gets an idea (and you want new contributors to be on equal footing about whether their perf-motivated PRs are wins). Whatever that benchmark is, it would be nice if somebody can figure out how to match the performance in pure C. I realize GS is powerful because it can use segment-relative addressing, but maybe putting the mem0 address in a function parameter gives almost the same performance in practice, or maybe some form of LTO/PGO (suggested the last time this came up in #1805) narrows the gaps, etc. If LLVM or V8 took every optimization proposal where the proposer said it was a win for them in a particular workload, I think they would have painted themselves in a corner. :-/ Hope that gives a better sense of where I'm coming from. Path three is, sell @sbc100 on this PR and don't worry about me. |
if we may chip in, we'd love to see support for external declarations! |
Sorry for the delay. Got tied up with some other work. @keithw Sure, I can investigate some of those paths, but I should provide some more context for some low-level details before proceeding
I believe this should already work. The imported function is invoked via its external interface vs. the module's internal implementation of the function in a static function. The external interface of all functions save the current value of %gs, load the current module's memory into %gs, and then restore the original %gs once complete. This bakes in the assumption, that each module has its own memory. If a module has more than one memory, segue is not used for that module and %gs is left undisturbed. I'm hoping that this should work fine in all scenarios: when one module calls another and one of the two modules, or both modules, have a single or multiple memories, and the tests do pass. Let me know if you believe I had missed something.
Unfortunately, this does not seem to be the case. While it is true that putting the mem0 address in a function parameter may help, this would in practice only reduce one memory dereference per function, which is not really going to add up to much. Rather the performance of this optimization comes from a few different places
The 75% number comes from a Wasm sandboxed graphite font rendering workload in Firefox, and LTO is enabled in Firefox
Let me push back on this a bit. I think this is entirely reasonable argument to make this argument for a 1% to 2% optimization (indeed I made this argument here when discussing only pinned registers). However, I think the substantial performance boost here changes the equation. More concretely, if elimination of 75% of Wasm's overhead is not good enough to make the cut in compilers, then I think LLVM/V8 would be a hundredth of their current size, if not even smaller. I think the reason to consider this optimization is precisely the outsized impact it has on certain workloads on a large proportion of CPUs actually in use today. Additionally the baseline performance boost of about 5% to 11% (this translates to roughly eliminating 44.7% of overhead on SPEC 06) is pretty substantial even by itself.
Sure, let me investigate these options further. |
That works for imports from other wasm2c-generated modules, but what should we do about imports from the host?
Both of these seem like they have some downsides; maybe there's a better option C? |
Sorry for the long delay here. I had to deal with some production bugs in Firefox in March and got distracted with paper submissions and reviews. I am now back looking at this. Will go through the above thread to recapture all the context and will follow up shortly. |
@keithw For this option, I think we can document it as you mention, but we should require no other change in the host code, since no examples would use
This seems like overkill to me, especially given that this is not a feature that would be on-by-default. I propose we look at Option A, and I will include documentation and asserts for the same. |
@keithw I have pushed a version addressing the comments above. It includes documentation, an extra mode with more sanity checks which we use in testing, and a benchmark (Dhrystone) to measure the performance difference (On my machine the use of segue makes dhrystone run in about 60% of the time as without segue) . @keithw @sbc100 Could you have a look when you get the chance? |
dcf0db3
to
65ca36b
Compare
thaks |
Apologies @shravanrn, I was away at the CG meeting and I'm out sick this week. Lets try to get this landed somehow next week. |
@sbc100 ah gotcha, next week sounds good! Hope you feel better soon :) |
@sbc100 Following up on the last message. Let me know how we can go forward here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % a few comments
@@ -3725,6 +3782,9 @@ void CWriter::Write(const ExprList& exprs) { | |||
Write(StackVar(0), " = ", func, "(", | |||
ExternalInstancePtr(ModuleFieldType::Memory, memory->name), ", ", | |||
StackVar(0), ");", Newline()); | |||
if (IsSingleUnsharedMemory()) { | |||
InstallSegueBase(module_->memories[0], false /* save_old_value */); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if its worth coming up with a descriptive name for this? Segue seems a little bit like a "TM" name. I'm not sure I can come up with better name though. Maybe this is fine as is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong view on this.
Wamr seems to have shipped it with the Segue name https://github.com/bytecodealliance/wasm-micro-runtime/releases/tag/WAMR-1.2.3
I'm leaning towards leaving it as is for consistency, but can change this to Segment
if you prefer.
wasm2c/examples/segue/dhrystone.wasm
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to checkin this binary file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was following the pattern in the wasm2c/examples
folder which seems to have checked in a wat file. I figured checking in the wasm file will allow anyone without a wasi-clang compiler to still run tests. Happy to remove it if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have WASM_RT_SANITY_CHECKS
be linked the NDEBUG
macro by default? That way it would have sensible default and most folks would never need to set it explicitly?
@sbc100 I considered this and unfortunately this doesn't work. Here is my reasoning
So I think we may have to leave this as is. Please let me know if you have an alternate suggestion. |
@sbc100 I have pushed a fix to address the comments, modulo a couple of comments where I need some extra input. If it looks good, I can land it once you approve the PR. |
This PR implements the "Segue" performance optimization in Wasm2c. Segue uses the x86 segment register to perform memory accesses to Wasm's linear heap. In modern OSes, only one of the two segment registers are used (to support thread-local storage), leaving the other free for our use here. In our earlier work, we found that this speeds up wasm2c by something like 5% to 11% although there are workloads that seem to benefit far more.
Full writeup from last year here. More detailed writeup in the works.
Since that initial writeup, the Wamr team has implemented a limited form of Segue (doesn't take full advantage of all the increased flexibility that comes with segment based encodings) in their compiler. See bytecodealliance/wasm-micro-runtime#2230. In direct conversations, they have also reported seeing performance benefits between 5% to 8%. I'll note that it is far easier for Wasm2c to take full advantage of Segue than Wamr (an artifact of Wasm2c's C-to-native pipeline).
Implementation
This is fairly straightforward in Wasm2c.
memcpy
in DEFINE_LOAD and DEFINE_STORE to pass in a pointer whose type is a segment pointer (supported via named address namespaces)The rest of the changes are just boilerplate on checking for support and correctly falling back if not supported.
Testing
I've added this to the testing for the rlbox version of the CI
Restrictions
It seemed prudent to have this feature off by default since its brand new, and the performance benefits are workload dependent (Some pathological workloads do show minor performance regressions). Users must explicitly opt into this via a macro.
For simplicity, Segue will only be enabled on Wasm modules that use a single (imported or internal) memory that is not shared (i.e., does not have multiple threads accessing it). Implementing Segue on shared memories would require atomic versions of segmented memory access which may require assembly code.
Although, it is potentially possible to support this for all x86_64 compilers, I've restricted this to clang only for now due to some complexities (listed below) and lack of an immediate use case for other compilers (Firefox for now will probably enable this only on clang builds). Complexities for other compilers:
EDIT: