Optimization: Opcode fusion of branch and comparison instructions #712

Robbepop · 2023-03-30T09:10:36Z

Blocked by: #729

In the current wasmi bytecode comparison instructions such as I32LtS, I64GtU etc. and conditional branch instructions such as BrIfNez and BrIfEqz could be fused since most often they are used together. Executing fewer instructions usually results in improved performance for interpreters like wasmi.

For example the following Wasm bytecode:

(br_if
    0 ;; some branch target
    (i32.lt_s
        (local.get 0)
        (local.get 1)
    )
)

Which represents something similar to:

if get_local(0) as i32 < get_local(1) as i32 {
    break; // goto branch target
}

Could be reduced to the following new wasmi instruction:

(br.i32.lt_s
    (local.get 0)
    (local.get 1)
)

Of course this means that we need to add plenty of new wasmi bytecode instructions which is a big drawback to this potential optimizations. Due to the increased number of instructions it is also unclear whether this optimization actually improves performance. We are not able to replace old unfused wasmi instructions by the new wasmi instructions since they might still occur in isolation.

Bytecode Changes

Compare + Branch

New wasmi bytecode branch instructions include:

br.{i32, i64, f32, f64}.{eq, ne}: 8 new instructions.
br.{i32, i64}.{lt, le, gt, ge}{_s, _u}: 16 new instructions.
br.{f32, f64}.{lt, le, gt, ge}: 8 new instructions.

In total: 32 new branch+compare instructions. Furthermore for immediate versions we are going to need yet another 32 branch+compare with immediate instructions. Totalling 64 instructions.

Note: We only need immediate versions for the second operand since we can rewrite bytecode where the first operand is an immediate value always as follows: br.i32.lt_s 42 reg0 -> br.i32.gt_s reg0 42.

Note: Wasms br_eqz can be represented by br.{i32, i64}.eq_imm reg 0. Where the _imm suffix represents a bytecode instruction that takes an immediate value as its second operator.

Compare + Return

Beyond branch instructions we could and probably should also introduce new return instruction variants for all the comparators since a branch to the outermost level is simply a return instruction and there already exist conditional return instructions in return.nez and return.eqz in wasmi which act as living examples.

return.{i32, i64, f32, f64}.{eq, ne}: 8 new instructions.
return.{i32, i64}.{lt, le, gt, ge}{_s, _u}: 16 new instructions.
return.{f32, f64}.{lt, le, gt, ge}: 8 new instructions.

Which is yet another 32 new instructions.

Blocked By

This optimization is blocked by the implementation of the register machine bytecode.

Conclusion

The biggest advantage of this potential optimization is that those fused instruction sequences usually occur within loops where they would have the biggest positive impact on the performance of the executed Wasm program.

The text was updated successfully, but these errors were encountered:

Robbepop · 2023-07-06T18:24:09Z

Today I thought a bit about this issue and since this is "just" an optimization based on the Instruction type of the wasmi register machine implementation I came up with the following idea:

What if we just provide the Instruction variants that are actually going to be very efficient at runtime?
This will leave us with the following variants:

enum Instruction {
    ...
    BrI32Eq { .. },
    BrI32EqImm { .. },
    BrI64Eq { .. },
    BrI64EqImm { .. },
    BrF32Eq { .. },
    BrF32EqImm { .. },
    BrF64Eq { .. },
    BrF64EqImm { .. },

    BrI32Ne { .. },
    BrI32NeImm { .. },
    BrI64Ne { .. },
    BrI64NeImm { .. },
    BrF32Ne { .. },
    BrF32NeImm { .. },
    BrF64Ne { .. },
    BrF64NeImm { .. },

    BrI32LtS { .. },
    BrI32LtSImm { .. },
    BrI32GtS { .. },
    BrI32GtSImm { .. },
    BrI32LeS { .. },
    BrI32LeSImm { .. },
    BrI32GeS { .. },
    BrI32GeSImm { .. },

    BrI64LtS { .. },
    BrI64LtSImm { .. },
    BrI64GtS { .. },
    BrI64GtSImm { .. },
    BrI64LeS { .. },
    BrI64LeSImm { .. },
    BrI64GeS { .. },
    BrI64GeSImm { .. },

    BrF32Lt { .. },
    BrF32Gt { .. },
    BrF32Le { .. },
    BrF32Ge { .. },

    BrF64Lt { .. },
    BrF64Gt { .. },
    BrF64Le { .. },
    BrF64Ge { .. },
    ...
}

With this internal structure as represented exemplary by BrI32Eq and BrI32EqImm:

BrI32Eq {
    lhs: Register,
    rhs: Register,
    offset: BranchOffset16,
},
BrI32EqImm {
    lhs: Register,
    rhs: Const16,
    offset: BranchOffset16,
},

This implies that we only apply this optimization if BranchOffset (which usually is encoded as 32-bit value) can be encoded as 16-bit value AND in case of instructions with an immutable rhs if the rhs value can also be encoded as 16-bit value.

For example this holds true for the common cases of Wasm's i32.eqz or i64.eqz which are both translated to

I{32,64}EqImm16 { result: Register, lhs: Register, imm: Const16::from(0) }

And if followed by a br_if could be optimized to

BrI{32,64}EqImm { lhs: Register, imm: Const16::from(0) }

For most common and practical use cases I assume that a BranchOffset16 will nearly always be encodable so the only real limitation are the 16-bit encoded immutable values.

Since return instructions are more complex than br instructions in the linked PR implementation I propose NOT to optimize conditional return instructions this way.

Robbepop · 2023-11-18T08:57:45Z

The register-machine wasmi engine backend has been implemented. It is not yet finalized and likely still contains bugs but I was just running some coremark.wasm benchmarks while counting the executions per wasmi instruction and this is the result:
https://gist.github.com/Robbepop/16ac42197e1c745dd5ad3ab41524ae2d

We can clearly see that conditional branches (branch_nez and branch_eqz) are the most common instruction executed. This might indicate that this optimization might yield valuable results.

The wasmi branch that implements instruction execution counting can be found here:
https://github.com/paritytech/wasmi/tree/rf-count-instruction-executions

Robbepop changed the title ~~Optimization: Opcode fusion of brach and comparison instructions~~ Optimization: Opcode fusion of branch and comparison instructions Mar 30, 2023

Robbepop added the optimization An performance optimization issue. label Mar 30, 2023

Robbepop mentioned this issue Jun 2, 2023

Register machine wasmi execution engine #729

Merged

Robbepop added the register-machine A work item for the register-machine engine. label Sep 22, 2023

Robbepop self-assigned this Sep 22, 2023

Robbepop mentioned this issue Nov 19, 2023

Implement cmp+branch instruction fusion #789

Merged

7 tasks

Robbepop closed this as completed in #789 Nov 20, 2023

Robbepop mentioned this issue Nov 21, 2023

Add i32.{and,or,xor} [+ i32.eqz] + br_if instruction fusion #791

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization: Opcode fusion of branch and comparison instructions #712

Optimization: Opcode fusion of branch and comparison instructions #712

Robbepop commented Mar 30, 2023 •

edited

Loading

Robbepop commented Jul 6, 2023 •

edited

Loading

Robbepop commented Nov 18, 2023 •

edited

Loading

Optimization: Opcode fusion of branch and comparison instructions #712

Optimization: Opcode fusion of branch and comparison instructions #712

Comments

Robbepop commented Mar 30, 2023 • edited Loading

Bytecode Changes

Compare + Branch

Compare + Return

Blocked By

Conclusion

Robbepop commented Jul 6, 2023 • edited Loading

Robbepop commented Nov 18, 2023 • edited Loading

Robbepop commented Mar 30, 2023 •

edited

Loading

Robbepop commented Jul 6, 2023 •

edited

Loading

Robbepop commented Nov 18, 2023 •

edited

Loading