Register machine `wasmi` execution engine #729
Codecov Report
@@ Coverage Diff @@
## master #729 +/- ##
==========================================
+ Coverage 79.42% 81.07% +1.65%
==========================================
Files 105 270 +165
Lines 9075 23217 +14142
==========================================
+ Hits 7208 18824 +11616
- Misses 1867 4393 +2526
... and 1 file with indirect coverage changes
These are (probably) more efficient than their respective ReturnMany and ReturnNezMany counterparts because they store the returned registers inline.
This also tests the new Return2 and Return3 instructions.
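To illustrate what storing the returned registers inline could mean, here is a minimal sketch. The `Register` type, the `ReturnInstr` enum layout and the separate `register_list` are assumptions made for this example and are not taken from the actual `wasmi` bytecode definition.

```rust
/// Hypothetical register index type (an assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Register(u16);

/// Hypothetical return instructions; the real encoding may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReturnInstr {
    /// Returns two values; both registers are stored inline.
    Return2([Register; 2]),
    /// Returns three values; all registers are stored inline.
    Return3([Register; 3]),
    /// Returns `len` values whose registers are stored out of line,
    /// starting at index `start` of a separate register list.
    ReturnMany { start: usize, len: usize },
}

/// Collects the registers that hold the returned values.
fn returned_registers(instr: ReturnInstr, register_list: &[Register]) -> Vec<Register> {
    match instr {
        // Inline variants need no lookup outside the instruction itself.
        ReturnInstr::Return2(regs) => regs.to_vec(),
        ReturnInstr::Return3(regs) => regs.to_vec(),
        // The out-of-line variant requires an extra indirection.
        ReturnInstr::ReturnMany { start, len } => register_list[start..start + len].to_vec(),
    }
}
```

In this sketch the inline variants avoid the extra lookup that `ReturnMany` needs, which is the kind of saving the commit message hints at.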
All call instructions now uniformly require their parameters to be placed in contiguous register spans. In some cases this necessitates copy instructions before a call is initiated. We plan to optimise longer sequences of copy instructions in the future but left that optimisation out for now.
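A minimal sketch of the contiguity requirement follows. All types and the `prepare_call_params` helper are hypothetical names chosen for illustration; the sketch assumes that `free` is the first register of a free area large enough to hold all parameters.

```rust
/// Hypothetical register index type (an assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Register(u16);

/// A span of `len` contiguous registers starting at `head`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RegisterSpan {
    head: Register,
    len: u16,
}

/// Hypothetical copy instruction: `result <- value`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CopyInstr {
    result: Register,
    value: Register,
}

/// Returns `true` if every register directly follows its predecessor.
fn is_contiguous(params: &[Register]) -> bool {
    params
        .windows(2)
        .all(|pair| pair[0].0.checked_add(1) == Some(pair[1].0))
}

/// Returns the span to pass to the call plus the copies that must run first.
fn prepare_call_params(params: &[Register], free: Register) -> (RegisterSpan, Vec<CopyInstr>) {
    let len = params.len() as u16;
    if is_contiguous(params) {
        // Zero or one parameter is trivially contiguous: no copies needed.
        let head = params.first().copied().unwrap_or(free);
        return (RegisterSpan { head, len }, Vec::new());
    }
    // Otherwise copy every parameter into the free area, in order.
    let copies = params
        .iter()
        .enumerate()
        .map(|(i, &value)| CopyInstr {
            result: Register(free.0 + i as u16),
            value,
        })
        .collect();
    (RegisterSpan { head: free, len }, copies)
}
```

Note that a call with exactly one argument never produces copies here, since a single register is always a contiguous span.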
They now have the same form as their nested call counterparts.
This is missing translation tests for now.
We do this by avoiding or at least limiting the procedure to a conservative subset of all instructions that could have been affected by the register space fragmentation.
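One way to limit the defragmentation pass to a conservative subset is to remember where the first register preservation happened and only visit instructions encoded from that point on. The sketch below assumes a simplified instruction model (`Instr` holding only its registers) and a hypothetical `Encoder` type; neither is taken from the `wasmi` code base.

```rust
/// Hypothetical instruction model: only the registers it touches.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Instr {
    regs: Vec<u16>,
}

/// Hypothetical instruction encoder that tracks the first preservation.
#[derive(Debug, Default)]
struct Encoder {
    instrs: Vec<Instr>,
    /// Index of the first instruction encoded at or after the first
    /// register preservation, if any preservation happened at all.
    first_preservation: Option<usize>,
}

impl Encoder {
    fn push(&mut self, instr: Instr) {
        self.instrs.push(instr);
    }

    /// Records that a register was preserved at the current position.
    fn notify_preservation(&mut self) {
        if self.first_preservation.is_none() {
            self.first_preservation = Some(self.instrs.len());
        }
    }

    /// Remaps registers only in potentially affected instructions.
    /// Returns the number of instructions that were visited.
    fn defragment(&mut self, remap: impl Fn(u16) -> u16) -> usize {
        let start = match self.first_preservation {
            Some(start) => start,
            // No preservation happened: skip the pass entirely.
            None => return 0,
        };
        let affected = &mut self.instrs[start..];
        for instr in affected.iter_mut() {
            for reg in instr.regs.iter_mut() {
                *reg = remap(*reg);
            }
        }
        affected.len()
    }
}
```

The loop stays as simple as a pass over all instructions, but it does no work for functions without preservations and skips the prefix that cannot be affected.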
As discussed with @athei, I will merge this PR now and work on the remaining TODO items in isolation via follow-up PRs. To track their progress I am going to write a number of issues. The register-machine backend introduced by this PR passes the entire Wasm spec testsuite. However, this does not mean it is bug-free and stable: I am aware of a few bugs that are going to be fixed soon. The roadmap of
Closes #361.
Precursor: #367
ToDo
These items are things we want to do before merging the PR:
- For function calls we need to set up the arguments so that they are all stored in contiguous registers. Currently we do this even for function calls with exactly 1 argument, although it is not needed there. During translation we should therefore implement a check and simplify (and thus optimize) the argument encoding for those function calls.
- `ConsumeFuel` instructions and their fixed costs. So far we have concentrated on getting the Wasm to `wasmi` bytecode translation up and running with the simplest setup possible. This means we have ignored the translation of `ConsumeFuel` instructions and their associated fuel costs so far and need to do this work before we merge the PR. There should not be any difficulties compared to the already existing implementation for the stack-machine engine backend.
- `local.set` stack preservation: Currently, for every `local.set` or `local.tee` that we encounter while translating Wasm bytecode to `wasmi` bytecode, we iterate over all values on the emulated value stack. Usually this isn't a big deal since the stack rarely grows big in practical workloads. However, this can easily be attacked by a Wasm blob that blows up the stack and then performs tons of `local.set` instructions. We can eliminate this attack vector by storing the indices of the `local.get` providers on the emulated stack in a separate stack and iterating over that instead. While iterating we also remove the `local.get` indices from the stack so that a consecutive `local.set` operation will see an empty stack and thus perform no operations. We simply cache the `local.get` providers this way.
- `local.set` result replacement optimization when preserving `local.get` at the same time: When translating `local.set` or `local.tee` we replace the result register of the previous instruction instead of emitting a copy instruction if possible. However, when preserving a `local.get` on the emulation stack we do not perform this optimization. The problem is that the copy instruction required for the preservation has to take place before the instruction `i` that should have its result register replaced. However, the instruction `i` is already encoded and could in the worst case consist of multiple instruction words, which would require shifting already encoded instructions by one index. In theory this could also interfere with pre-calculated branch offsets, but so far this has not been demonstrated and might not be true.
- After successful translation of Wasm bytecode the `wasmi` translation performs a final pass over all encoded instructions to defragment the register space. This is needed for registers that have been preserved for `local.set` in certain situations. However, this entire process is very costly and can be avoided entirely or partially. We can avoid it entirely when there were no actual register preservations during the translation procedure, since only then does it need to be done at all. Furthermore, we only need to defragment instructions that have been encoded after encountering the first `local.set` register preservation. This way we can keep the simple loop over all instructions but still avoid most of the unnecessary work and thus speed up translation.
- Currently, when preserving `local.get x` values on the stack upon a `local.set x` translation, a new register is reserved for the N `local.get x` values that have been found and replaced on the emulated value stack at the time of translation. When this happens multiple times, a new register slot on the preservation stack is registered each time. Right now, we do not track how many of the preserved `local.get x` values have already been used. If we did, we could recycle preservation register slots that are no longer used instead of allocating a new slot every time, which could lead to fewer registers being used, especially by larger functions. A data structure that would allow this efficiently in O(1) is the so-called Stash data structure.

Plan & Steps
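The slot recycling described in the last ToDo item (the Stash idea) can be sketched with a free list that threads through vacant slots. This is a generic, minimal illustration and not the API of any particular crate; the names `Stash`, `Entry`, `put`, `take` and `slots` are assumptions for this example.

```rust
/// A slot is either in use or part of the free list.
#[derive(Debug)]
enum Entry<T> {
    Occupied(T),
    /// Points to the next vacant slot, if any.
    Vacant { next_free: Option<usize> },
}

/// Minimal stash: O(1) `put` and `take` with slot reuse.
#[derive(Debug)]
struct Stash<T> {
    entries: Vec<Entry<T>>,
    /// Head of the free list of vacant slots.
    free_head: Option<usize>,
}

impl<T> Stash<T> {
    fn new() -> Self {
        Self { entries: Vec::new(), free_head: None }
    }

    /// Stores `value` and returns its slot, reusing a vacant slot if possible.
    fn put(&mut self, value: T) -> usize {
        match self.free_head {
            Some(index) => {
                self.free_head = match &self.entries[index] {
                    Entry::Vacant { next_free } => *next_free,
                    Entry::Occupied(_) => unreachable!("free list must only contain vacant slots"),
                };
                self.entries[index] = Entry::Occupied(value);
                index
            }
            None => {
                self.entries.push(Entry::Occupied(value));
                self.entries.len() - 1
            }
        }
    }

    /// Removes the value at `index` and makes the slot available again.
    fn take(&mut self, index: usize) -> Option<T> {
        let slot = self.entries.get_mut(index)?;
        if matches!(*slot, Entry::Vacant { .. }) {
            return None;
        }
        let vacant = Entry::Vacant { next_free: self.free_head };
        self.free_head = Some(index);
        match std::mem::replace(slot, vacant) {
            Entry::Occupied(value) => Some(value),
            Entry::Vacant { .. } => None,
        }
    }

    /// Number of slots ever allocated (occupied plus vacant).
    fn slots(&self) -> usize {
        self.entries.len()
    }
}
```

For register preservation, each occupied entry could hold the number of remaining uses of a preserved value, with `take` called once that count reaches zero so that the next preservation reuses the slot.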
- […] `wasmi` executor.
  - `unreachable` instruction
  - `select` and `select <ty>` instructions
  - `i32.{eq, eqz, ne, lt_{s|u}, le_{s|u}, gt_{s|u}, ge_{s|u}}` instructions
  - `i64.{eq, eqz, ne, lt_{s|u}, le_{s|u}, gt_{s|u}, ge_{s|u}}` instructions
  - `f32.{eq, ne, lt, le, gt, ge}` instructions
  - `f64.{eq, ne, lt, le, gt, ge}` instructions
  - `load` instructions (or equivalents)
    - `i32.load` and `i32.loadN_{s|u}` instructions
    - `i64.load` and `i64.loadN_{s|u}` instructions
    - `f32.load` instruction
    - `f64.load` instruction
  - `store` instructions (or equivalents)
    - `i32.store` and `i32.storeN` instructions
    - `i64.store` and `i64.storeN` instructions
    - `f32.store` instruction
    - `f64.store` instruction
  - `i32` compute instructions, e.g. `i32.popcnt`, `i32.add`, `i32.rotl` etc.
  - `i64` compute instructions, e.g. `i64.popcnt`, `i64.add`, `i64.rotl` etc.
  - `f32` compute instructions, e.g. `f32.sqrt`, `f32.add`, etc.
  - `f64` compute instructions, e.g. `f64.sqrt`, `f64.add`, etc.
  - `sign-extension` proposal instructions
  - `non-trapping float-to-int conversion` proposal instructions
  - `global.get`
  - `global.set` (and immediate versions)
  - `table` instructions
    - `table.size` instruction
    - `table.grow` instruction
    - `table.get` instruction
    - `table.set` instruction
    - `table.fill` instruction
    - `table.copy` instruction
    - `table.init` instruction
  - `memory` instructions
    - `memory.size` instruction
    - `memory.grow` instruction
    - `memory.fill` instruction
    - `memory.copy` instruction
    - `memory.init` instruction
- […] `wasmi` bytecode
- […] `wasmi` instructions.
  - `block` control flow
  - `loop` control flow
  - `if` control flow
  - `br_table`
  - `select` (from Wasm MVP)
  - `select (result <ty>)` (from `reference-types` proposal)
  - `drop` instruction
  - `unreachable` instruction
  - `local.set` and `local.tee`
  - `i32.{eq, ne, eqz}` instructions
  - `i64.{eq, ne, eqz}` instructions
  - `f32.{eq, ne}` instructions
  - `f64.{eq, ne}` instructions
  - `i32.{lt_s, lt_u, le_s, le_u, gt_s, gt_u, ge_s, ge_u}` instructions
  - `i64.{lt_s, lt_u, le_s, le_u, gt_s, gt_u, ge_s, ge_u}` instructions
  - `f32.{lt, le, gt, ge}` instructions
  - `f64.{lt, le, gt, ge}` instructions
  - `load` instructions (or equivalents)
    - `i32.load` and `i32.loadN_{s|u}` instructions
    - `i64.load` and `i64.loadN_{s|u}` instructions
    - `f32.load` instruction
    - `f64.load` instruction
  - `store` instructions (or equivalents)
    - `i32.store` and `i32.storeN` instructions
    - `i64.store` and `i64.storeN` instructions
    - `f32.store` instruction
    - `f64.store` instruction
  - `i32.{clz, ctz, popcnt}`
  - `i64.{clz, ctz, popcnt}`
  - `f32.{abs, neg, ceil, floor, trunc, nearest, sqrt}`
  - `f64.{abs, neg, ceil, floor, trunc, nearest, sqrt}`
  - `i32.{add, mul, and, or, xor}`
  - `i64.{add, mul, and, or, xor}`
  - `f32.{add, mul, min, max}`
  - `f64.{add, mul, min, max}`
  - `i32.sub`
  - `i64.sub`
  - `f32.{sub, div, copysign}`
  - `f64.{sub, div, copysign}`
  - `i32.{shl, shr_s, shr_u, rotl, rotr}`
  - `i64.{shl, shr_s, shr_u, rotl, rotr}`
  - `i32.{div_u, div_s, rem_u, rem_s}`
  - `i64.{div_u, div_s, rem_u, rem_s}`
  - `sign-extension` proposal instructions
  - `non-trapping float-to-int conversion` proposal instructions
  - `global.get` instruction
  - `global.set` instruction
  - `reftype` instructions
    - `ref.null`
    - `ref.is_null`
    - `ref.func`
  - `table` instructions
    - `table.size` instruction
    - `table.grow` instruction
    - `table.get` instruction
    - `table.set` instruction
    - `table.fill` instruction
    - `table.copy` instruction
    - `table.init` instruction
    - `elem.drop` instruction
  - `memory` instructions
    - `memory.size` instruction
    - `memory.grow` instruction
    - `memory.fill` instruction
    - `memory.copy` instruction
    - `memory.init` instruction
    - `data.drop` instruction
- […] `wasmi` register-machine bytecode
  - `ref.func` instruction
  - `load` instructions
  - `store` instructions
  - `table` instructions
    - `table.get` instructions
    - `table.set` instructions
    - `table.size` instruction
    - `table.copy` instruction
    - `table.init` instruction
    - `table.fill` instruction
    - `table.grow` instruction
    - `elem.drop` instruction
  - `memory` instructions
    - `memory.size` instruction
    - `memory.copy` instruction
    - `memory.init` instruction
    - `memory.fill` instruction
    - `memory.grow` instruction
    - `data.drop` instruction
  - `global` instructions
    - `global.get` instructions
    - `global.set` instructions
  - […] `i32` instructions
  - […] `i64` instructions
  - […] `f32` instructions
  - […] `f64` instructions
  - […] `i32` instructions
  - […] `i64` instructions
  - […] `f32` instructions
  - […] `f64` instructions

Unresolved Questions
Ideas
The following list contains ideas that came up and might be iterated on here for experimentation purposes.
- `br 0` in Wasm basic `block` control frames should not be translated as a branch and instead as a Wasm `end` of the basic `block` since all code after the `br 0` is unreachable. Fortunately this bytecode sequence does not seem to be very common in practical Wasm blobs.
- […] `ConstRef` in order to store the actual value in a const pool which is external to the bytecode itself. This has several downsides for performance and also for bytecode integrity. Performance is affected since an additional indirect memory fetch is required in order to compute on the constant value. Bytecode integrity is worse since analysing `wasmi` bytecode now also needs to inspect the external const pool. The latter point affects testability of `wasmi` bytecode. A way to improve this situation in both areas is to store the 64-bit constants inline. The problem is that this requires 8 bytes, but instruction words only support up to 6 bytes of parameters per word. Thus the 64-bit constant needs to be split up into multiple pieces. The experiment in this GitHub Gist shows that a solution that splits the 64-bit constant value into 3 pieces (2 x 2-byte and 1 x 4-byte pieces) can be done efficiently on x86 and Wasm platforms. However, further benchmark tests are required to prove this.
- For `{i32, i64}.{div_u, div_s, rem_u, rem_s}` we can apply a bytecode encoding optimization for the cases where the right-hand side divisor is a constant value. During translation we guarantee for all those operations that the right-hand side constant value is non-zero, because a zero right-hand side value is translated as a `trap` instruction. Therefore we can replace the `Const32` or `Const16` parameter of those instructions with a `NonZero32` or `NonZero16` value and use Rust's built-in `Div<NonZeroU{32,64}> for u{32,64}` and `Rem<NonZeroU{32,64}> for u{32,64}`. Note that those APIs are only available for unsigned integers.
- […] `i32.add_assign`. The advantage is that we no longer have to store both the `result` and `lhs` fields since they are always the same and always of type `Register`. This also allows for inline `Const32` or `ConstRef` `rhs` fields, and thus we could save some encoding space, too. Another benefit is that the instruction is probably also a bit faster at execution since the compiler has more information about which registers to read from and write to. However, this has to be checked with experiments. A downside of having these op-assign instructions is that they do not play well with `local.set` and `local.tee` optimizations where we replace the `result` register of the previous instruction during the translation phase.
  […] `AddAssign` or `+=` operator:
  - `I32AddAssign(UnaryInstr)`
  - `I32AddAssignImm(UnaryInstrImm32)`: Requires just 1 `Instruction` for encoding.
  - `I64AddAssign(UnaryInstr)`
  - `I64AddAssignImm(UnaryInstrImm)`: Requires just 1 `Instruction` for encoding.
  - `I64AddAssignImm32(UnaryInstrImm32)`: 32-bit small value optimization instead of 16-bit
  - `F32AddAssign(UnaryInstr)`
  - `F32AddAssignImm(UnaryInstrImm32)`: Requires just 1 `Instruction` for encoding.
  - `F64AddAssign(UnaryInstr)`
  - `F64AddAssignImm(UnaryInstrImm)`: Requires just 1 `Instruction` for encoding.
- `global.op_assign` instructions: […] `global.get g; i32.add; global.set g` which we could then represent using a single `wasmi` instruction such as `global.i32.add g r` or `global.i32.add_imm g c`, where `g` represents a global, `r` an input register and `c` a constant value. Further research is needed to find out how common these sequences are and whether the proposed `global.op_assign` instructions actually improve execution performance significantly. […] `wasmi` bytecode. For example, `spidermonkey.wasm` contains exactly one global variable, which is only sparsely used throughout the Wasm file.
- […] `load` and `store` instructions. […] `load` and `store` instructions that oddly compute a `ptr+offset` or `ptr*scale + offset` or even `ptr shift scale + offset` outside of Wasm's `load` and `store` instructions. Technically we can make `load` and `store` instructions very powerful by including simple pointer arithmetic into these instructions.
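The non-zero divisor idea from the list above can be sketched as follows. The `DivRhs` enum and the function names are hypothetical and only illustrate the principle; the `Div<NonZeroU32> for u32` and `Rem<NonZeroU32> for u32` implementations used here are part of the Rust standard library.

```rust
use std::num::NonZeroU32;

/// Hypothetical translation result for a constant right-hand side.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DivRhs {
    /// A zero divisor is translated into a trap.
    Trap,
    /// Any other divisor is stored as a non-zero immediate.
    NonZero(NonZeroU32),
}

/// Decides at translation time how a constant divisor is encoded.
fn translate_const_divisor(rhs: u32) -> DivRhs {
    match NonZeroU32::new(rhs) {
        Some(non_zero) => DivRhs::NonZero(non_zero),
        None => DivRhs::Trap,
    }
}

/// Executes an unsigned division with a divisor known to be non-zero.
/// The standard library operator cannot panic for this operand type.
fn exec_i32_div_u_imm(lhs: u32, rhs: NonZeroU32) -> u32 {
    lhs / rhs
}

/// Executes an unsigned remainder with a divisor known to be non-zero.
fn exec_i32_rem_u_imm(lhs: u32, rhs: NonZeroU32) -> u32 {
    lhs % rhs
}
```

Because the zero check happens once during translation, the executor functions in this sketch need no division-by-zero check of their own. Signed variants would still need separate handling, as noted in the idea itself.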