Skip to content

Commit

Permalink
Implement lazy funcref table and anyfunc initialization.
Browse files Browse the repository at this point in the history
During instance initialization, we build two sorts of arrays eagerly:

- We create an "anyfunc" (a `VMCallerCheckedAnyfunc`) for every function
  in an instance.

- We initialize every element of a funcref table with an initializer to
  a pointer to one of these anyfuncs.

Most instances will not touch (via call_indirect or table.get) all
funcref table elements. And most anyfuncs will never be referenced,
because most functions are never placed in tables or used with
`ref.func`. Thus, both of these initialization tasks are quite wasteful.
Profiling shows that a significant fraction of the remaining
instance-initialization time after our other recent optimizations is
going into these two tasks.

This PR implements two basic ideas:

- The anyfunc array can be lazily initialized as long as we retain the
  information needed to do so. For now, in this PR, we just recreate the
  anyfunc whenever a pointer is taken to it, because doing so is fast
  enough; in the future we could keep some state to know whether the
  anyfunc has been written yet and skip this work if redundant.

  This technique allows us to leave the anyfunc array as uninitialized
  memory, which can be a significant savings. Filling it with
  initialized anyfuncs is very expensive, but even zeroing it is
  expensive: e.g. in a large module, it can be >500KB.

- A funcref table can be lazily initialized as long as we retain a link
  to its corresponding instance and function index for each element. A
  zero in a table element means "uninitialized", and a slowpath does the
  initialization.

Funcref tables are a little tricky because funcrefs can be null. We need
to distinguish "element was initially non-null, but user stored explicit
null later" from "element never touched" (ie the lazy init should not
blow away an explicitly stored null). We solve this by stealing the LSB
from every funcref (anyfunc pointer): when the LSB is set, the funcref
is initialized and we don't hit the lazy-init slowpath. We insert the
bit on storing to the table and mask it off after loading.

We do have to set up a precomputed array of `FuncIndex`s for the table
in order for this to work. We do this as part of the module compilation.

This PR also refactors the way that the runtime crate gains access to
information computed during module compilation.

Performance effect measured with in-tree benches/instantiation.rs, using
SpiderMonkey built for WASI, and with memfd enabled:

```
BEFORE:

sequential/default/spidermonkey.wasm
                        time:   [68.569 us 68.696 us 68.856 us]
sequential/pooling/spidermonkey.wasm
                        time:   [69.406 us 69.435 us 69.465 us]

parallel/default/spidermonkey.wasm: with 1 background thread
                        time:   [69.444 us 69.470 us 69.497 us]
parallel/default/spidermonkey.wasm: with 16 background threads
                        time:   [183.72 us 184.31 us 184.89 us]
parallel/pooling/spidermonkey.wasm: with 1 background thread
                        time:   [69.018 us 69.070 us 69.136 us]
parallel/pooling/spidermonkey.wasm: with 16 background threads
                        time:   [326.81 us 337.32 us 347.01 us]

WITH THIS PR:

sequential/default/spidermonkey.wasm
                        time:   [6.7821 us 6.8096 us 6.8397 us]
                        change: [-90.245% -90.193% -90.142%] (p = 0.00 < 0.05)
                        Performance has improved.
sequential/pooling/spidermonkey.wasm
                        time:   [3.0410 us 3.0558 us 3.0724 us]
                        change: [-95.566% -95.552% -95.537%] (p = 0.00 < 0.05)
                        Performance has improved.

parallel/default/spidermonkey.wasm: with 1 background thread
                        time:   [7.2643 us 7.2689 us 7.2735 us]
                        change: [-89.541% -89.533% -89.525%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/default/spidermonkey.wasm: with 16 background threads
                        time:   [147.36 us 148.99 us 150.74 us]
                        change: [-18.997% -18.081% -17.285%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/spidermonkey.wasm: with 1 background thread
                        time:   [3.1009 us 3.1021 us 3.1033 us]
                        change: [-95.517% -95.511% -95.506%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/spidermonkey.wasm: with 16 background threads
                        time:   [49.449 us 50.475 us 51.540 us]
                        change: [-85.423% -84.964% -84.465%] (p = 0.00 < 0.05)
                        Performance has improved.
```

So an improvement of something like 80-95% for a very large module (7420
functions in its one funcref table, 31928 functions total).
  • Loading branch information
cfallin committed Feb 9, 2022
1 parent 1019855 commit e7ed67e
Show file tree
Hide file tree
Showing 26 changed files with 1,006 additions and 440 deletions.
2 changes: 1 addition & 1 deletion cranelift/wasm/src/code_translator.rs
Original file line number Diff line number Diff line change
Expand Up @@ -612,7 +612,7 @@ pub fn translate_operator<FE: FuncEnvironment + ?Sized>(
bitcast_arguments(args, &types, builder);

let call = environ.translate_call_indirect(
builder.cursor(),
builder,
TableIndex::from_u32(*table_index),
table,
TypeIndex::from_u32(*index),
Expand Down
20 changes: 10 additions & 10 deletions cranelift/wasm/src/environ/dummy.rs
Original file line number Diff line number Diff line change
Expand Up @@ -404,7 +404,7 @@ impl<'dummy_environment> FuncEnvironment for DummyFuncEnvironment<'dummy_environ

fn translate_call_indirect(
&mut self,
mut pos: FuncCursor,
builder: &mut FunctionBuilder,
_table_index: TableIndex,
_table: ir::Table,
_sig_index: TypeIndex,
Expand All @@ -413,7 +413,7 @@ impl<'dummy_environment> FuncEnvironment for DummyFuncEnvironment<'dummy_environ
call_args: &[ir::Value],
) -> WasmResult<ir::Inst> {
// Pass the current function's vmctx parameter on to the callee.
let vmctx = pos
let vmctx = builder
.func
.special_param(ir::ArgumentPurpose::VMContext)
.expect("Missing vmctx parameter");
Expand All @@ -423,22 +423,22 @@ impl<'dummy_environment> FuncEnvironment for DummyFuncEnvironment<'dummy_environ
// TODO: Generate bounds checking code.
let ptr = self.pointer_type();
let callee_offset = if ptr == I32 {
pos.ins().imul_imm(callee, 4)
builder.ins().imul_imm(callee, 4)
} else {
let ext = pos.ins().uextend(I64, callee);
pos.ins().imul_imm(ext, 4)
let ext = builder.ins().uextend(I64, callee);
builder.ins().imul_imm(ext, 4)
};
let mflags = ir::MemFlags::trusted();
let func_ptr = pos.ins().load(ptr, mflags, callee_offset, 0);
let func_ptr = builder.ins().load(ptr, mflags, callee_offset, 0);

// Build a value list for the indirect call instruction containing the callee, call_args,
// and the vmctx parameter.
let mut args = ir::ValueList::default();
args.push(func_ptr, &mut pos.func.dfg.value_lists);
args.extend(call_args.iter().cloned(), &mut pos.func.dfg.value_lists);
args.push(vmctx, &mut pos.func.dfg.value_lists);
args.push(func_ptr, &mut builder.func.dfg.value_lists);
args.extend(call_args.iter().cloned(), &mut builder.func.dfg.value_lists);
args.push(vmctx, &mut builder.func.dfg.value_lists);

Ok(pos
Ok(builder
.ins()
.CallIndirect(ir::Opcode::CallIndirect, INVALID, sig_ref, args)
.0)
Expand Down
2 changes: 1 addition & 1 deletion cranelift/wasm/src/environ/spec.rs
Original file line number Diff line number Diff line change
Expand Up @@ -219,7 +219,7 @@ pub trait FuncEnvironment: TargetEnvironment {
#[cfg_attr(feature = "cargo-clippy", allow(clippy::too_many_arguments))]
fn translate_call_indirect(
&mut self,
pos: FuncCursor,
builder: &mut FunctionBuilder,
table_index: TableIndex,
table: ir::Table,
sig_index: TypeIndex,
Expand Down
132 changes: 99 additions & 33 deletions crates/cranelift/src/func_environ.rs
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
use cranelift_codegen::cursor::FuncCursor;
use cranelift_codegen::ir;
use cranelift_codegen::ir::condcodes::*;
use cranelift_codegen::ir::immediates::{Offset32, Uimm64};
use cranelift_codegen::ir::immediates::{Imm64, Offset32, Uimm64};
use cranelift_codegen::ir::types::*;
use cranelift_codegen::ir::{AbiParam, ArgumentPurpose, Function, InstBuilder, Signature};
use cranelift_codegen::isa::{self, TargetFrontendConfig, TargetIsa};
Expand All @@ -19,6 +19,7 @@ use wasmtime_environ::{
BuiltinFunctionIndex, MemoryPlan, MemoryStyle, Module, ModuleTranslation, TableStyle, Tunables,
TypeTables, VMOffsets, INTERRUPTED, WASM_PAGE_SIZE,
};
use wasmtime_environ::{FUNCREF_INIT_BIT, FUNCREF_MASK};

/// Compute an `ir::ExternalName` for a given wasm function index.
pub fn get_func_name(func_index: FuncIndex) -> ir::ExternalName {
Expand Down Expand Up @@ -750,6 +751,59 @@ impl<'module_environment> FuncEnvironment<'module_environment> {
pos.ins().uextend(I64, val)
}
}

fn get_or_init_funcref_table_elem(
&mut self,
builder: &mut FunctionBuilder,
table_index: TableIndex,
table: ir::Table,
index: ir::Value,
) -> ir::Value {
let pointer_type = self.pointer_type();

// To support lazy initialization of table
// contents, we check for a null entry here, and
// if null, we take a slow-path that invokes a
// libcall.
let table_entry_addr = builder.ins().table_addr(pointer_type, table, index, 0);
let value = builder
.ins()
.load(pointer_type, ir::MemFlags::trusted(), table_entry_addr, 0);
// Mask off the "initialized bit". See documentation on
// FUNCREF_INIT_BIT in crates/environ/src/ref_bits.rs for more
// details.
let value_masked = builder
.ins()
.band_imm(value, Imm64::from(FUNCREF_MASK as i64));

let null_block = builder.create_block();
let continuation_block = builder.create_block();
let result_param = builder.append_block_param(continuation_block, pointer_type);
builder.set_cold_block(null_block);

builder.ins().brz(value, null_block, &[]);
builder.ins().jump(continuation_block, &[value_masked]);
builder.seal_block(null_block);

builder.switch_to_block(null_block);
let table_index = builder.ins().iconst(I32, table_index.index() as i64);
let builtin_idx = BuiltinFunctionIndex::table_get_lazy_init_funcref();
let builtin_sig = self
.builtin_function_signatures
.table_get_lazy_init_funcref(builder.func);
let (vmctx, builtin_addr) =
self.translate_load_builtin_function_address(&mut builder.cursor(), builtin_idx);
let call_inst =
builder
.ins()
.call_indirect(builtin_sig, builtin_addr, &[vmctx, table_index, index]);
let returned_entry = builder.func.dfg.inst_results(call_inst)[0];
builder.ins().jump(continuation_block, &[returned_entry]);
builder.seal_block(continuation_block);

builder.switch_to_block(continuation_block);
result_param
}
}

impl<'module_environment> TargetEnvironment for FuncEnvironment<'module_environment> {
Expand Down Expand Up @@ -886,13 +940,7 @@ impl<'module_environment> cranelift_wasm::FuncEnvironment for FuncEnvironment<'m
match plan.table.wasm_ty {
WasmType::FuncRef => match plan.style {
TableStyle::CallerChecksSignature => {
let table_entry_addr = builder.ins().table_addr(pointer_type, table, index, 0);
Ok(builder.ins().load(
pointer_type,
ir::MemFlags::trusted(),
table_entry_addr,
0,
))
Ok(self.get_or_init_funcref_table_elem(builder, table_index, table, index))
}
},
WasmType::ExternRef => {
Expand Down Expand Up @@ -1033,9 +1081,18 @@ impl<'module_environment> cranelift_wasm::FuncEnvironment for FuncEnvironment<'m
WasmType::FuncRef => match plan.style {
TableStyle::CallerChecksSignature => {
let table_entry_addr = builder.ins().table_addr(pointer_type, table, index, 0);
builder
// Set the "initialized bit". See doc-comment on
// `FUNCREF_INIT_BIT` in
// crates/environ/src/ref_bits.rs for details.
let value_with_init_bit = builder
.ins()
.store(ir::MemFlags::trusted(), value, table_entry_addr, 0);
.bor_imm(value, Imm64::from(FUNCREF_INIT_BIT as i64));
builder.ins().store(
ir::MemFlags::trusted(),
value_with_init_bit,
table_entry_addr,
0,
);
Ok(())
}
},
Expand Down Expand Up @@ -1253,10 +1310,16 @@ impl<'module_environment> cranelift_wasm::FuncEnvironment for FuncEnvironment<'m
mut pos: cranelift_codegen::cursor::FuncCursor<'_>,
func_index: FuncIndex,
) -> WasmResult<ir::Value> {
let vmctx = self.vmctx(&mut pos.func);
let vmctx = pos.ins().global_value(self.pointer_type(), vmctx);
let offset = self.offsets.vmctx_anyfunc(func_index);
Ok(pos.ins().iadd_imm(vmctx, i64::from(offset)))
let func_index = pos.ins().iconst(I32, func_index.as_u32() as i64);
let builtin_index = BuiltinFunctionIndex::ref_func();
let builtin_sig = self.builtin_function_signatures.ref_func(&mut pos.func);
let (vmctx, builtin_addr) =
self.translate_load_builtin_function_address(&mut pos, builtin_index);

let call_inst = pos
.ins()
.call_indirect(builtin_sig, builtin_addr, &[vmctx, func_index]);
Ok(pos.func.dfg.first_result(call_inst))
}

fn translate_custom_global_get(
Expand Down Expand Up @@ -1459,7 +1522,7 @@ impl<'module_environment> cranelift_wasm::FuncEnvironment for FuncEnvironment<'m

fn translate_call_indirect(
&mut self,
mut pos: FuncCursor<'_>,
builder: &mut FunctionBuilder,
table_index: TableIndex,
table: ir::Table,
ty_index: TypeIndex,
Expand All @@ -1469,21 +1532,17 @@ impl<'module_environment> cranelift_wasm::FuncEnvironment for FuncEnvironment<'m
) -> WasmResult<ir::Inst> {
let pointer_type = self.pointer_type();

let table_entry_addr = pos.ins().table_addr(pointer_type, table, callee, 0);

// Dereference the table entry to get the pointer to the
// `VMCallerCheckedAnyfunc`.
let anyfunc_ptr =
pos.ins()
.load(pointer_type, ir::MemFlags::trusted(), table_entry_addr, 0);
// Get the anyfunc pointer (the funcref) from the table.
let anyfunc_ptr = self.get_or_init_funcref_table_elem(builder, table_index, table, callee);

// Check for whether the table element is null, and trap if so.
pos.ins()
builder
.ins()
.trapz(anyfunc_ptr, ir::TrapCode::IndirectCallToNull);

// Dereference anyfunc pointer to get the function address.
let mem_flags = ir::MemFlags::trusted();
let func_addr = pos.ins().load(
let func_addr = builder.ins().load(
pointer_type,
mem_flags,
anyfunc_ptr,
Expand All @@ -1495,36 +1554,41 @@ impl<'module_environment> cranelift_wasm::FuncEnvironment for FuncEnvironment<'m
TableStyle::CallerChecksSignature => {
let sig_id_size = self.offsets.size_of_vmshared_signature_index();
let sig_id_type = Type::int(u16::from(sig_id_size) * 8).unwrap();
let vmctx = self.vmctx(pos.func);
let base = pos.ins().global_value(pointer_type, vmctx);
let vmctx = self.vmctx(builder.func);
let base = builder.ins().global_value(pointer_type, vmctx);
let offset =
i32::try_from(self.offsets.vmctx_vmshared_signature_id(ty_index)).unwrap();

// Load the caller ID.
let mut mem_flags = ir::MemFlags::trusted();
mem_flags.set_readonly();
let caller_sig_id = pos.ins().load(sig_id_type, mem_flags, base, offset);
let caller_sig_id = builder.ins().load(sig_id_type, mem_flags, base, offset);

// Load the callee ID.
let mem_flags = ir::MemFlags::trusted();
let callee_sig_id = pos.ins().load(
let callee_sig_id = builder.ins().load(
sig_id_type,
mem_flags,
anyfunc_ptr,
i32::from(self.offsets.vmcaller_checked_anyfunc_type_index()),
);

// Check that they match.
let cmp = pos.ins().icmp(IntCC::Equal, callee_sig_id, caller_sig_id);
pos.ins().trapz(cmp, ir::TrapCode::BadSignature);
let cmp = builder
.ins()
.icmp(IntCC::Equal, callee_sig_id, caller_sig_id);
builder.ins().trapz(cmp, ir::TrapCode::BadSignature);
}
}

let mut real_call_args = Vec::with_capacity(call_args.len() + 2);
let caller_vmctx = pos.func.special_param(ArgumentPurpose::VMContext).unwrap();
let caller_vmctx = builder
.func
.special_param(ArgumentPurpose::VMContext)
.unwrap();

// First append the callee vmctx address.
let vmctx = pos.ins().load(
let vmctx = builder.ins().load(
pointer_type,
mem_flags,
anyfunc_ptr,
Expand All @@ -1536,7 +1600,9 @@ impl<'module_environment> cranelift_wasm::FuncEnvironment for FuncEnvironment<'m
// Then append the regular call arguments.
real_call_args.extend_from_slice(call_args);

Ok(pos.ins().call_indirect(sig_ref, func_addr, &real_call_args))
Ok(builder
.ins()
.call_indirect(sig_ref, func_addr, &real_call_args))
}

fn translate_call(
Expand Down
4 changes: 4 additions & 0 deletions crates/environ/src/builtin.rs
Original file line number Diff line number Diff line change
Expand Up @@ -18,8 +18,12 @@ macro_rules! foreach_builtin_function {
memory_fill(vmctx, i32, i64, i32, i64) -> ();
/// Returns an index for wasm's `memory.init` instruction.
memory_init(vmctx, i32, i32, i64, i32, i32) -> ();
/// Returns a value for wasm's `ref.func` instruction.
ref_func(vmctx, i32) -> (pointer);
/// Returns an index for wasm's `data.drop` instruction.
data_drop(vmctx, i32) -> ();
/// Returns a table entry after lazily initializing it.
table_get_lazy_init_funcref(vmctx, i32, i32) -> (pointer);
/// Returns an index for Wasm's `table.grow` instruction for `funcref`s.
table_grow_funcref(vmctx, i32, i32, pointer) -> (i32);
/// Returns an index for Wasm's `table.grow` instruction for `externref`s.
Expand Down
2 changes: 2 additions & 0 deletions crates/environ/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ mod compilation;
mod module;
mod module_environ;
pub mod obj;
mod ref_bits;
mod stack_map;
mod trap_encoding;
mod tunables;
Expand All @@ -39,6 +40,7 @@ pub use crate::builtin::*;
pub use crate::compilation::*;
pub use crate::module::*;
pub use crate::module_environ::*;
pub use crate::ref_bits::*;
pub use crate::stack_map::StackMap;
pub use crate::trap_encoding::*;
pub use crate::tunables::Tunables;
Expand Down
Loading

0 comments on commit e7ed67e

Please sign in to comment.