Convert Imports to Indirect Calls for Dynamic Linking #2523
Conversation
@sbc100, this is the corresponding work in Binaryen.
Is this showing improvements for you?
Are you running it both on the side module and the main module?
In general I think this looks a lot like what we talked about. I'd like to see a test or two added in the test/lld directory to see the effect of this on the code.
Thanks for your work on this, Kevin. I don't think we are ready to land this yet, but maybe with a few iterations and a little more investigation on our side. @kripken suggested we come up with a design doc, for example.
*/

//
// Turn indirect calls into direct calls. This will improve the runtime
This comment seems backwards, no?
I think this needs some more documentation. IIRC the cost we are trying to avoid is being forced to go through JS, even for calls to other wasm modules. In this case going through the table is an alternative form of indirection that should be faster?
Oops, sorry, will update the incorrect comment.
src/passes/pass.cpp
Outdated
createUnteePass);
registerPass("vacuum", "removes obviously unneeded code", createVacuumPass);
registerPass("ImportsToIndirectCalls",
             "turns imports into indirect calls",
How about "turn calls to imported functions into indirect calls"?
Updated it to "convert calls to imported functions into indirect calls for the main module"
#include "wasm-io.h"
#include "wasm-printing.h"
#include "wasm-validator.h"
#include "C:\git\binaryen\binaryen\src\passes/passes.h"
oops
Oops, I forgot about this. Will switch to relative paths.
  }
}

// Convert the imports to indirects before the Legalization pass to
Convert calls to imported functions into indirect calls..
Is it the legalization that is the problem or the fact that we have to go through JS at all?
Both of them slow down performance.
std::istreambuf_iterator<char>());

// index.wasm.libfile will have the following format:
// ['_glClearStencil', '_glUniformMatrix2fv', '_eglGetCurrentSurface']
This seems like it would be much simpler if the format was just one symbol per line:
_glClearStencil
_glUniformMatrix2fv
_eglGetCurrentSurface
And maybe call this a "symbols file": index.wasm.symbols
How about 'index.wasm.js_symbols' since the file contains JS symbols?
if (!module->table.exists) {
  Fatal() << "Table does not exist in module!!!";
  return;
Maybe we can just create one in this case?
Ok sure.
void visitCall(Call* curr) {
  auto name = curr->target;

  if (!name.isNull()) {
Use early return here:
if (name.isNull()) {
return;
}
ok
auto* func = module->getFunction(name);

if (func) {
  if (func->imported()) {
Use early return here:
if (!func || !func->imported()) {
return;
}
ok
src/asm2wasm.h
Outdated
passRunner.add("merge-blocks");
passRunner.add("optimize-instructions");
passRunner.add("post-emscripten");
passRunner.add("ImportsToIndirectCalls");
You probably don't want to add this here.
This was a mistake. Will remove it.
// Convert the imports to indirects before the Legalization pass to
// avoid unnecessary legalization
if (!options.passOptions.arguments["library-file"].empty()) {
Can you give this argument a better name too? How about "js-symbols"?
Sure, sounds good.
I have updated the Emscripten PR with the benchmark results. On Chrome, we see a 13x improvement while on FF, we see a 7x improvement.
We only need to run this on the main module because the previous PR emscripten-core/emscripten#9917 already ensures that we directly call an exported Wasm function rather than the JS stub. This is already faster than what this PR provides, which is to make an indirect call to an exported Wasm function.
Sure, I can always add my benchmark as a testcase to the repo.
Sounds good, can you point me to an example design doc in case you have a specific format in mind?
Thinking about this some more, I still wish we could simplify everything (make all calls indirect, for example), but I agree that adds some risk and downsides. So I am mostly ok with this approach, if @sbc100 you think this is maintainable and does not add any problems for migrating to a long-term better model for dynamic linking. I'd suggest restructuring this as follows: not running the pass from finalize; instead, it will be run from emscripten. A pass can take arguments (see e.g. …). Then on the Emscripten side, we can run this at the appropriate time. The main …
@kripken, running this pass in finalize has the added benefit of doing the imports-to-indirects conversion before legalization, which will help both reduce the size of the binary and improve runtime performance. Should we move the legalization pass out of finalize as well?
@awtcode Interesting question about the legalize pass. Yes, it seems like this should happen before legalize. In fact, it seems like moving legalize to the very end is the most efficient thing in general. I think that's worth trying, although there may be issues I don't remember right now. @sbc100 what do you think?
I did some investigation of moving legalization all the way to the end of the emcc.py pipeline. The first serious issue I hit was that if we legalize in finalize, which is early, then we find out if we need …
If we always emit getTempRet0/setTempRet0, can't meta-dce then remove them?
Yes, metadce would, so this is not going to increase fully optimized builds. It's still annoying, though, I think.
Track the beginning and end of each function, both when reading and writing.

We track expressions and functions separately, instead of having a single big map of (oldAddr) => (newAddr), because of the potentially ambiguous case of the final expression in a function: its end might be identical in offset to the end of the function. So we have two different things that map to the same offset. However, if the context is "the end of the function", then the updated address is the new end of the function, even if the function ends with a different instruction now, as the old last instruction might have moved or been optimized out. Concretely, we have getNewExprAddr and getNewFuncAddr, so we can ask to update the location of either an expression or a function, and use that contextual information.

This checks for the DIE tag in order to know what we are looking for. To be safe, if we hit an unknown tag, we halt, so that we don't silently miss things.

As the test updates show, the new things we can do thanks to this PR are to update compile unit and subprogram low_pc locations. Note btw that in the first test (dwarfdump_roundtrip_dwarfdump.bin.txt) we change 5 to 0: that is correct since that test does not write out DWARF (it intentionally has no -g), so we do not track binary locations while writing, and so we have nothing to update to (the other tests show actual updating).

Also fix the order in the python test runner code to show a diff of expected to encountered, and not the reverse, which confused me.
This only touches test code. The files are compiled with latest LLVM + https://reviews.llvm.org/D71681 in order to get more realistic DWARF content.
This adds EH instruction support for `CFGWalker`. This also implements `call` instruction handling within a try-catch; every call can possibly throw and unwind to the innermost catch block. This adds tests for the RedundantSetElimination pass, which uses `CFGWalker`.
Update high_pc values. These are interesting as they may be a relative offset compared to the low_pc. For functions we already had both a start and an end. Add such tracking for instructions as well.
This will make it easier to switch to something else for offsets in wasm binaries if we get >4GB files.
Instead of reinventing the wheel on our side, this adds ExpressionAnalyzer bindings to the C- and JS-APIs, which can be useful for generators. For example, a generator may decide to simplify a compilation step if a subexpression doesn't have any side effects, or simply skip emitting something that is likely to compile to a drop or an empty block right away.
LLVM points to the start of the function in some debug line entries - right after the size LEB of the function, which is where the locals are declared, and before any instructions.
It is convenient to have the full command when debugging fuzzing errors. The fuzzer sometimes fails before running `wasm-reduce` and being able to reproduce the command right away from the log is very handy in that case.
Instead of hackishly advancing the read position in the binary buffer, call readExpression which will do that, and also do all the debug info handling for us.
Binaryen.js now uses offset instead of byteOffset when inspecting
a memory segment, matching the arguments on memory segment
creation. Also adds inspection of the passive property.
Previously, one would specify { offset, data, passive } on creation
and get back { byteOffset, data } upon inspection. This PR unifies
both to the keys on creation while also adding the respective C-API
to retrieve passive status, which was missing.
Fixes the testcase in WebAssembly#2343 (comment). Looks like that's from Rust. Not sure why it would have an invalid abbreviation code, but perhaps the LLVM there emits DWARF differently than we've tested on so far. May be worth investigating further, but for now emit a warning, skip that element, and don't crash. Also fix valgrind warnings about Span values not being initialized, which was invalid and bad as well (wasted memory in our maps, and might have overlapped with real values), and interfered with figuring this out.
…WebAssembly#2603) Control flow structures have those in addition to the normal span of (start, end), and we need to track them too. Tracking them during reading requires us to track control flow structures while parsing, so that we can know which structure an end/else/catch refers to. We track these locations using a map on the side of instruction to its "extra" locations. That avoids increasing the size of the tracking info for the much more common non-control flow instructions. Note that there is one more 'end' location, that of the function (not referring to any instruction). I left that to a later PR to not increase this one too much.
DWARF from LLVM can refer to the first byte belonging to the function, where the size LEB is, or to the first byte after that, where the local declarations are, or the end opcode, or to one byte past that which is one byte past the bytes that belong to the function. We aren't sure why LLVM does this, but track it all for now. After this all debug line positions are identified. However, in some cases a debug line refers to one past the end of the function, which may be an LLVM bug. That location is ambiguous as it could also be the first byte of the next function (what made this discovery possible was when this happened to the last function, after which there is another section).
While line and address values of 0 should be skipped, it seems like column 0 is valid in lines emitted by LLVM.
We need to track end_sequence directly, and use either end_sequence or copy (copy emits a line without marking it as ending a sequence). After this, fib2 debug line output looks perfect.
Just some trivial fixes:
* Properly reset prologue after each line (unlike others, this flag should be reset immediately).
* Test for a function's end address first, as LLVM output appears to use 1-past-the-end-of-the-function as a location in that function, and not the next (note the first byte of the next function, which is ambiguously identical to that value, is used at least in low_pc; I'm not sure if it's used in debug lines too).
* Ignore the same address if LLVM emitted it more than once, which it does sometimes.
Pretty straightforward given all we have so far. Note that fannkuch3_manyopts has an example of a sequence of ranges of which some must be skipped while others must not, showing we handle that by skipping the bad ones and updating the remaining. That is, suppose we have a sequence of two (begin, end) spans: [(10, 20), (30, 40)]. It's possible (10, 20) maps in the new binary to (110, 120), while (30, 40) was eliminated by the optimizer and we have nothing valid to map it to. In that case we emit [(110, 120)].
…WebAssembly#2613) Chrome is currently decoding the segment indices as signed numbers, so some ranges of indices greater than 63 do not work. As a temporary workaround, limit the number of segments produced by MemoryPacking to 63 when bulk-memory is enabled.
# Conflicts:
#	src/wasm/wasm-debug.cpp
#	test/passes/fannkuch3.bin.txt
#	test/passes/fannkuch3_manyopts.bin.txt
Not required anymore.
Inter-module calls in dynamic linking currently go through JS stubs. Convert these calls to Wasm indirect calls for the following reasons:
The Emscripten portion of the work is done in this PR emscripten-core/emscripten#10003