Skip to content

Conversation

@awtcode
Copy link

@awtcode awtcode commented Dec 11, 2019

Inter-module calls in dynamic linking currently go thru JS stubs. Convert the calls to Wasm Indirect calls for the following reasons:

  1. Faster performance
  2. Skip legalization
  3. Remove JS stubs (To be done in a later PR)

The Emscripten portion of the work is done in this PR emscripten-core/emscripten#10003

@awtcode
Copy link
Author

awtcode commented Dec 11, 2019

@sbc100 , this is the corresponding work in Binaryen.

Copy link
Member

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this showing improvements for you?

Are you running it both on the side module and the main module?

In general I think this looks a lot like what we talked about. I'd like to see a test or two added in the test/lld directory to see the effect of this on the code.

Thanks for your work on the Kevin. I don't think we are ready to land this yet but maybe with a few iterations and a little more investigation on our side. @kripken suggested we come up with a design doc for example.

*/

//
// Turn indirect calls into direct calls. This will improve the runtime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comments seem backwards, no?

I think needs some more documentation here. IIRC the cost we trying to avoid is being forced to go through JS, even for calls to other wasm modules. In this case going through the table is an alternative form of indirection that should be faster?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops sorry, will update the wrong comment.

createUnteePass);
registerPass("vacuum", "removes obviously unneeded code", createVacuumPass);
registerPass("ImportsToIndirectCalls",
"turns imports into indirect calls",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "turn calls to imported functions into indirect calls"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated it to "convert calls to imported functions into indirect calls for the main module"

#include "wasm-io.h"
#include "wasm-printing.h"
#include "wasm-validator.h"
#include "C:\git\binaryen\binaryen\src\passes/passes.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I forgot about this. Will switch to relative paths.

}
}

// Convert the imports to indirects before the Legalization pass to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convert calls to imported functions into indirect calls..

Is it the legalization that is the problem or the fact that we have to go through JS at all?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of them slow down the performance.

std::istreambuf_iterator<char>());

// index.wasm.libfile will have the following format:
// ['_glClearStencil', '_glUniformMatrix2fv', '_eglGetCurrentSurface']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it would much simpler if the format was just no symbol per line:

_glClearStencil
_glUniformMatrix2fv
_eglGetCurrentSurface

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any maybe calls this a "symbols file" .. index.wasm.symbols

Copy link
Author

@awtcode awtcode Dec 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about 'index.wasm.js_symbols' since the file contains JS symbols?


if (!module->table.exists) {
Fatal() << "Table does not exist in module!!!";
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can just create on in this case?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok sure.

void visitCall(Call* curr) {
auto name = curr->target;

if (!name.isNull()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use early return here:

if (name.isNull()) {
  return;
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

auto* func = module->getFunction(name);

if (func) {
if (func->imported()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use early return here:

if (func || func->imported()) {
  return;
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

src/asm2wasm.h Outdated
passRunner.add("merge-blocks");
passRunner.add("optimize-instructions");
passRunner.add("post-emscripten");
passRunner.add("ImportsToIndirectCalls");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably don't want to add this here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a mistake. Will remove it.


// Convert the imports to indirects before the Legalization pass to
// avoid unnecessary legalization
if (!options.passOptions.arguments["library-file"].empty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give this argument a better name too? How about "js-symbols"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, sounds good.

@awtcode
Copy link
Author

awtcode commented Dec 12, 2019

Is this showing improvements for you?

I have updated the Emscripten PR with the benchmark results. On Chrome, we see a 13x improvement while on FF, we see a 7x improvement.

Are you running it both on the side module and the main module?

We only need to run this on the main module because the previous PR emscripten-core/emscripten#9917 already ensures that we directly call an exported Wasm function rather than the JS stub. This is already faster than what this PR provides, which is to make an indirect call to an exported Wasm function.

In general I think this looks a lot like what we talked about. I'd like to see a test or two added in the test/lld directory to see the effect of this on the code.

Sure, I can always add my benchmark as a testcase to the repo.

Thanks for your work on the Kevin. I don't think we are ready to land this yet but maybe with a few iterations and a little more investigation on our side. @kripken suggested we come up with a design doc for example.

Sounds good, can you point me to an example design doc in case you have a specific format in mind?

@kripken
Copy link
Member

kripken commented Dec 16, 2019

Thinking about this some more, I still wish we could simplify everything - make all calls indirect, for example - but I agree that adds some risk and downsides. So I am mostly ok with this approach, if @sbc100 you think this is maintainable and does not add any problems for migrating to a long-term better model for dynamic linking.

I'd suggest restructuring this as follows: Not run the pass from finalize, and instead, it will be run from emscripten. A pass can take arguments (see e.g. ExtractFunction and Asyncify for examples), so it could have an ignoreList param that has imports to ignore (which would be JS imports).

Then on the Emscripten side, we can run this at the appropriate time. The main BINARYEN_PASSES infrastructure is what drives the asyncify pass, for example, and this might make sense there. Note that this happens after the JS compiler runs, so we don't need to run it twice - the single invocation can provide us the list of JS imports, and we can stash it in an internal setting.

@awtcode
Copy link
Author

awtcode commented Dec 17, 2019

I'd suggest restructuring this as follows: Not run the pass from finalize, and instead, it will be run from emscripten. A pass can take arguments (see e.g. ExtractFunction and Asyncify for examples), so it could have an ignoreList param that has imports to ignore (which would be JS imports).

@kripken , running this pass in finalize has the added benefit of doing the imports to indirects conversion before legalization and that will help in both reducing the size of the binary as well as runtime performance. Should we move the legalization pass out of finalize as well?

@kripken
Copy link
Member

kripken commented Dec 20, 2019

@awtcode Interesting question about the legalize pass. Yes, it seems like this should happen before legalize. In fact it seems like moving legalize to the very end is the most efficient thing in general. I think that's worth trying, although there may be issues I don't remember right now. @sbc100 what do you think?

@kripken
Copy link
Member

kripken commented Jan 10, 2020

I did some investigating of moving legalization all the way to the end of the emcc.py pipeline. The first serious issue I hit was that if we legalize in finalize, which is early, then we find out if we need getTempRet0/setTempRet0 at that time, and can tell the JS compiler to emit it or not. If we legalize late, we'd be forced to always emit those, and rely on optimizations to remove them like metadce. That feels like it might be annoying enough that it's not worth the change, curious what others think.

@sbc100
Copy link
Member

sbc100 commented Jan 10, 2020

If we always emit getTempRet0/setTempRet0 can't meta-dce then remove them?

@kripken
Copy link
Member

kripken commented Jan 13, 2020

Yes, metadce would, so this is not going to increase fully optimized builds. It's still annoying though I think.

cadcode and others added 4 commits January 16, 2020 18:26
Track the beginning and end of each function, both when reading
and writing.

We track expressions and functions separately, instead of having a single
big map of (oldAddr) => (newAddr) because of the potentially ambiguous case
of the final expression in a function: it's end might be identical in offset
to the end of the function. So we have two different things that map to the
same offset. However, if the context is "the end of the function" then the
updated address is the new end of the function, even if the function ends
with a different instruction now, as the old last instruction might have
moved or been optimized out. Concretely, we have getNewExprAddr
and getNewFuncAddr, so we can ask to update the location of either
an expression or a function, and use that contextual information.

This checks for the DIE tag in order to know what we are looking for.
To be safe, if we hit an unknown tag, we halt, so that we don't silently
miss things.

As the test updates show, the new things we can do thanks to this
PR are to update compile unit and subprogram low_pc locations.
Note btw that in the first test (dwarfdump_roundtrip_dwarfdump.bin.txt)
we change 5 to 0: that is correct since that test does not write out
DWARF (it intentionally has no -g), so we do not track binary
locations while writing, and so we have nothing to update to (the
other tests show actual updating).

Also fix the order in the python test runner code to show a diff
of expected to encountered, and not the reverse, which confused
me.
kripken and others added 19 commits January 23, 2020 17:14
This only touches test code. The files are compiled with
latest LLVM + https://reviews.llvm.org/D71681 in order
to get more realistic DWARF content.
This adds EH instruction support for `CFGWalker`. This also implements
`call` instruction handling within a try-catch; every call can possibly
throw and unwind to the innermost catch block.

This adds tests for RedundantSetElimination pass, which uses
`CFGWalker`.
Update high_pc values. These are interesting as they
may be a relative offset compared to the low_pc.

For functions we already had both a start and an end. Add
such tracking for instructions as well.
This will make it easier to switch to something else for
offsets in wasm binaries if we get >4GB files.
Instead of reinventing the wheel on our side, this adds ExpressionAnalyzer
bindings to the C- and JS-APIs, which can be useful for generators. For
example, a generator may decide to simplify a compilation step if a
subexpression doesn't have any side effects, or simply skip emitting
something that is likely to compile to a drop or an empty block right away.
LLVM points to the start of the function in some debug line
entries - right after the size LEB of the function, which is
where the locals are declared, and before any instructions.
It is convenient to have the full command when debugging fuzzing errors.
The fuzzer sometimes fails before running `wasm-reduce` and being able
to reproduce the command right away from the log is very handy in that
case.
Instead of hackishly advancing the read position in the
binary buffer, call readExpression which will do that, and
also do all the debug info handling for us.
Binaryen.js now uses offset instead of byteOffset when inspecting
a memory segment, matching the arguments on memory segment
creation. Also adds inspection of the passive property.

Previously, one would specify { offset, data, passive } on creation
and get back { byteOffset, data } upon inspection. This PR unifies
both to the keys on creation while also adding the respective C-API
to retrieve passive status, which was missing.
Fixes the testcase in WebAssembly#2343 (comment)

Looks like that's from Rust. Not sure why it would have an invalid
abbreviation code, but perhaps the LLVM there emits dwarf differently
than we've tested on so far. May be worth investigating further, but
for now emit a warning, skip that element, and don't crash.

Also fix valgrind warnings about Span values not being initialized,
which was invalid and bad as well (wasted memory in our maps,
and might have overlapped with real values), and interfered with
figuring this out.
…WebAssembly#2603)

Control flow structures have those in addition to the normal span of
(start, end), and we need to track them too.

Tracking them during reading requires us to track control flow
structures while parsing, so that we can know to which structure
an end/else/catch refers to.

We track these locations using a map on the side of instruction
to its "extra" locations. That avoids increasing the size of the
tracking info for the much more common non-control flow
instructions.

Note that there is one more 'end' location, that of the function
(not referring to any instruction). I left that to a later PR to
not increase this one too much.
DWARF from LLVM can refer to the first byte belonging to the function,
where the size LEB is, or to the first byte after that, where the local
declarations are, or the end opcode, or to one byte past that which is
one byte past the bytes that belong to the function. We aren't sure why
LLVM does this, but track it all for now.

After this all debug line positions are identified. However,
in some cases a debug line refers to one past the end of the
function, which may be an LLVM bug. That location is ambiguous
as it could also be the first byte of the next function (what
made this discovery possible was when this happened to the
last function, after which there is another section).
While line and address values of 0 should be skipped, it
seems like column 0 are valid lines emitted by LLVM.
We need to track end_sequence directly, and use either
end_sequence or copy (copy emits a line without marking
it as ending a sequence).

After this, fib2 debug line output looks perfect.
Just some trivial fixes:

* Properly reset prologue after each line (unlike others, this
flag should be reset immediately).

* Test for a function's end address first, as LLVM output appears to
use 1-past-the-end-of-the-function as a location in that function,
and not the next (note the first byte of the next function, which is
ambiguously identical to that value, is used at least in low_pc;
I'm not sure if it's used in debug lines too).

* Ignore the same address if LLVM emitted it more than once, which
it does sometimes.
Pretty straightforward given all we have so far.

Note that fannkuch3_manyopts has an example of
a sequence of ranges of which some must be skipped
while others must not, showing we handle that by
skipping the bad ones and updating the remaining. That
is, if that we have a sequence of two (begin, end) spans

[(10, 20),
 (30, 40)]

It's possible (10, 20) maps in the new binary to (110, 120)
while (30, 40) was eliminated by the optimizer and we have
nothing valid to map it to. In that case we emit

[(110, 120)]
…Assembly#2613)

Chrome is currently decoding the segment indices as signed numbers, so
some ranges of indices greater than 63 do not work. As a temporary
workaround, limit the number of segments produced by MemoryPacking to
63 when bulk-memory is enabled.
# Conflicts:
#	src/wasm/wasm-debug.cpp
#	test/passes/fannkuch3.bin.txt
#	test/passes/fannkuch3_manyopts.bin.txt
@awtcode awtcode closed this Dec 27, 2020
@awtcode
Copy link
Author

awtcode commented Dec 27, 2020

Not required anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants