Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compiling a 8 MB WASM file using Wasmtime dev build takes 20x as long as with Wasmtime 3.0.0 #5328

Closed
kpreisser opened this issue Nov 26, 2022 · 16 comments · Fixed by #5335
Closed

Comments

@kpreisser
Copy link
Contributor

kpreisser commented Nov 26, 2022

Hi!

I noticed that compiling an 8 MiB WASM module (produced by the experimental WASI SDK for .NET) when using Wasmtime from commit 28cf995 (current dev release) takes much longer than when using Wasmtime 3.0.0 (tested on Windows 10 x64, Build 19044).

To reproduce:

On my system, with Wasmtime 3.0.0 this takes about 0.7 seconds, but with the build from commit 28cf995 it takes about 13.8 seconds, which is about 20 times as long.
Is this expected?

Bisecting shows that this starts to appear with commit fc62d4a (#5231) - with a release build from this commit, I get the 13.8s time, but with the previous commit 33a1925, I get a time of about 0.7s.

Versions and Environment

Wasmtime version or commit: Wasmtime v3.0.0 vs 28cf995
Operating system: Windows 10 Pro x64, Build 19044
Architecture: x64 (x86-64)

Thank you!

@kpreisser kpreisser changed the title Cranelift compile time of 8 MB WASM file using current main is 20x as slow as with Wasmtime 3.0.0 Cranelift compile time of 8 MB WASM file using dev build is 20x as slow as with Wasmtime 3.0.0 Nov 26, 2022
@kpreisser kpreisser changed the title Cranelift compile time of 8 MB WASM file using dev build is 20x as slow as with Wasmtime 3.0.0 Cranelift compile time of 8 MB WASM file using Wasmtime dev build is 20x as slow as with Wasmtime 3.0.0 Nov 26, 2022
@kpreisser kpreisser changed the title Cranelift compile time of 8 MB WASM file using Wasmtime dev build is 20x as slow as with Wasmtime 3.0.0 Compiling a 8 MB WASM file using Wasmtime dev build takes 20x as long as with Wasmtime 3.0.0 Nov 27, 2022
@kpreisser
Copy link
Contributor Author

kpreisser commented Nov 28, 2022

Additionally, starting with the above mentioned commit 28cf995, we get an access violation in our application in wasmtime_func_call (via Wasmtime.Dotnet, for wasm function "_start") when using the above mentioned dotnet-codabix-wasm-build.wasm (on Windows 10 x64 buil 19044) (Output/Callstack from Visual Studio 2022):

Exception thrown at 0x000001C8322DBC49 in codabix.exe: 0xC0000005: Access violation writing location 0x0000007915423870.

>	000001c8322dbc49()	Unknown
 	000001c8322db7f4()	Unknown
 	000001c832c4b87c()	Unknown
 	000001c83280b082()	Unknown
 	000001c832811490()	Unknown
 	000001c83280d273()	Unknown
 	000001c8323e68fc()	Unknown
 	000001c8323019df()	Unknown
 	000001c8322db7f4()	Unknown
 	000001c832c4b87c()	Unknown
 	000001c83280b082()	Unknown
 	000001c832811490()	Unknown
 	000001c83280d273()	Unknown
 	000001c83227a9d3()	Unknown
 	000001c8323dc621()	Unknown
 	000001c8323a77ec()	Unknown
 	000001c8322db7f4()	Unknown
 	000001c832c4b87c()	Unknown
 	000001c83280b082()	Unknown
 	000001c832811490()	Unknown
 	000001c83280d273()	Unknown
 	000001c83227a9d3()	Unknown
 	000001c8323dc621()	Unknown
 	000001c8323a77ec()	Unknown
 	000001c8322db7f4()	Unknown
 	000001c832c4b87c()	Unknown
 	000001c83280b082()	Unknown
 	000001c83280a735()	Unknown
 	000001c832478e00()	Unknown
 	000001c832472f78()	Unknown
 	000001c832c448f0()	Unknown
 	000001c832c8f1b4()	Unknown
 	000001c83245a43f()	Unknown
 	000001c8321d174a()	Unknown
 	000001c8321d10d4()	Unknown
 	000001c832e7f3cc()	Unknown
 	000001c832e7f9cb()	Unknown
 	wasmtime.DLL!wasmtime_setjmp�()	C
 	wasmtime.DLL!_ZN16wasmtime_runtime12traphandlers84_$LT$impl$u20$wasmtime_runtime..traphandlers..call_thread_state..CallThreadState$GT$4with17hdfcdc5819fd3d891E�()	Unknown
 	wasmtime.DLL!_ZN16wasmtime_runtime12traphandlers11catch_traps17h264d3d0c425ce7faE�()	Unknown
 	wasmtime.DLL!_ZN8wasmtime4func4Func18call_unchecked_raw17h5925ade4c0f7f70cE.llvm.9894362628245677377�()	Unknown
 	wasmtime.DLL!_ZN8wasmtime4func4Func9call_impl17h082023ddfcd49e51E.llvm.9894362628245677377�()	Unknown
 	wasmtime.DLL!wasmtime_func_call�()	Unknown
 	00007fff8d41c1e6()	Unknown
 	00007fff8d41bfac()	Unknown
 	00007fff8d41bca1()	Unknown
 	00007fff8d41bb35()	Unknown
 	00007fff8d419cdb()	Unknown
 	00007fff8d41930b()	Unknown
 	00007fff8d35bc32()	Unknown
 	00007fff8d352d94()	Unknown
 	00007fff8d3495ff()	Unknown
 	00007fff8cbfd678()	Unknown
 	00007fff8cbfd53d()	Unknown
 	00007fff8cbfd359()	Unknown
 	System.Private.CoreLib.dll!00007fff4f18274f()	Unknown
 	coreclr.dll!00007fffec2eaac3()	Unknown
 	coreclr.dll!00007fffec1e6d9c()	Unknown
 	coreclr.dll!00007fffec2cc413()	Unknown
 	coreclr.dll!00007fffec1c36f5()	Unknown
 	coreclr.dll!00007fffec1c35fa()	Unknown
 	coreclr.dll!00007fffec1c3419()	Unknown
 	kernel32.dll!00007ff80b6e74b4()	Unknown
 	ntdll.dll!00007ff80c3026a1()	Unknown

Using the previous commit (33a1925), the accesss violation does not occur.

While I can consistently reproduce the access violation in our application, it seems a bit hard to reproduce in a simple console app; also I can only reproduce it with a Wasmtime release build, bot nut with a debug build.
However, since it seems to have the same root cause as the excess compilation time, maybe it will automatically fixed together with this issue.

Thanks!

@alexcrichton
Copy link
Member

cc @fitzgen for the slowdown here

For the crash you mention that the commit prior to 28cf995 doesn't have the issue, but you also mention that it seems to have the same root cause with compliation time, and I'm curious what leads you to that conclusion? Additionally would you be able to share the .NET reproduction, even if it's tough to minimize further at this time?

@kpreisser
Copy link
Contributor Author

Hi @alexcrichton,

sorry, the "same root cause" was probably a bit overstatement - I just noticed the crash starts happening together with the compilation slowdown, but I'm not actually sure if it really has the same root cause.

I will try to isolate a .NET repro case of this crash, but it may take a while (I think a few days).

Thank you!

@alexcrichton
Copy link
Member

For the compile-time regression gc.wasm.gz is a single-function reproduction of the issue, taking 16s to compile locally on main and 252ms on 3.0.0 (I'm assuming the above bisection still stands, these just happened to be the builds I had on hand)

@fitzgen
Copy link
Member

fitzgen commented Nov 28, 2022

Thanks for filing this bug @kpreisser! And thanks for the reproducer, Alex. Digging in.

@fitzgen
Copy link
Member

fitzgen commented Nov 28, 2022

All the compilation time is in regalloc:

$ RUST_LOG=wasmtime_cranelift=trace cargo run --release -- compile --disable-cache ~/scratch/slow.wasm 
    Finished release [optimized + debuginfo] target(s) in 0.16s
     Running `target/release/wasmtime compile --disable-cache /home/nick/scratch/slow.wasm`
 DEBUG wasmtime_cranelift::compiler > FuncIndex(171) translated in 11.172192689s
 TRACE wasmtime_cranelift::compiler > FuncIndex(171) timing info
======== ========  ==================================
   Total     Self  Pass
-------- --------  ----------------------------------
   0.094    0.094  Translate WASM function
  22.156    0.087  Compilation passes
   0.003    0.003  Control flow graph
   0.013    0.013  Dominator tree
   0.001    0.001  Loop analysis
   0.001    0.001  Pre-legalization rewriting
   0.003    0.003  Dead code elimination
   0.023    0.023  Global value numbering
   0.208    0.026  Loop invariant code motion
   0.000    0.000  Remove unreachable blocks
   0.003    0.003  Remove constant phi-nodes
   0.051    0.051  VCode lowering
   0.018    0.018  VCode emission
   0.001    0.001  VCode emission finalization
  10.673   10.673  Register allocation
   0.176    0.176  Layout full renumbering
======== ========  ==================================

@fitzgen
Copy link
Member

fitzgen commented Nov 28, 2022

Added some logging for the number of vcode instructions and blocks we get after lowering.

Before:

 DEBUG cranelift_codegen::machinst::compile > Number of lowered vcode instructions: 35194
 DEBUG cranelift_codegen::machinst::compile > Number of lowered vcode blocks: 3972

After:

 DEBUG cranelift_codegen::machinst::compile > Number of lowered vcode instructions: 149067
 DEBUG cranelift_codegen::machinst::compile > Number of lowered vcode blocks: 38237

So 4.2x more instructions and 9.6x more blocks. That's... more than expected.

@fitzgen
Copy link
Member

fitzgen commented Nov 28, 2022

And the number of CLIF instructions and blocks we lower:

Before:

 DEBUG cranelift_codegen::machinst::compile > Number of CLIF instructions to lower: 98854
 DEBUG cranelift_codegen::machinst::compile > Number of CLIF blocks to lower: 2446

After:

 DEBUG cranelift_codegen::machinst::compile > Number of CLIF instructions to lower: 279615
 DEBUG cranelift_codegen::machinst::compile > Number of CLIF blocks to lower: 36468

So 2.8x more clif instructions and 14.9x more clif blocks.

@fitzgen
Copy link
Member

fitzgen commented Nov 28, 2022

Okay so I took a look at the following Wasm, which does an identical load 20 times:

(module
  (memory (export "memory") 1)
  (func (export "f") (param i32) (result i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32)
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
    local.get 0
    i32.load
  )
)

Here is the final CLIF from before:

 TRACE cranelift_codegen::machinst::lower      > about to lower function: function u0:0(i64 vmctx, i64, i32) -> i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 wasmtime_system_v {
    gv0 = vmctx
    gv1 = load.i64 notrap aligned readonly gv0+8
    gv2 = load.i64 notrap aligned gv1
    gv3 = vmctx
    gv4 = load.i64 notrap aligned readonly gv3+80
    heap0 = static gv4, min 0, bound 0x0001_0000_0000, offset_guard 0x8000_0000, index_type i32
    stack_limit = gv2

                                block0(v0: i64, v1: i64, v2: i32):
@0043                               v63 = uextend.i64 v2
@0043                               v64 = load.i64 notrap aligned readonly v0+80
@0043                               v23 = iadd v64, v63
@0043                               v24 = load.i32 little heap v23
@00a5                               jump block1

                                block1:
@00a5                               return v24, v24, v24, v24, v24, v24, v24, v24, v24, v24, v24, v24, v24, v24, v24, v24, v24, v24, v24, v24
}

And here is the final clif after:

 TRACE cranelift_codegen::machinst::lower      > about to lower function: function u0:0(i64 vmctx, i64, i32) -> i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 wasmtime_system_v {
    gv0 = vmctx
    gv1 = load.i64 notrap aligned readonly gv0+8
    gv2 = load.i64 notrap aligned gv1
    gv3 = vmctx
    gv4 = load.i64 notrap aligned readonly gv3+80
    heap0 = static gv4, min 0, bound 0x0001_0000_0000, offset_guard 0x8000_0000, index_type i32
    stack_limit = gv2

                                block0(v0: i64, v1: i64, v2: i32):
@0043                               v63 = uextend.i64 v2
                                    v70 = iconst.i64 0xffff_fffc
@0043                               v64 = icmp ugt v63, v70  ; v70 = 0xffff_fffc
@0043                               brz v64, block3
@0043                               jump block2

                                block2:
@0043                               trap heap_oob

                                block3:
@0043                               v66 = load.i64 notrap aligned readonly v0+80
@0043                               v67 = iadd v66, v63
@0043                               v68 = iconst.i64 0
@0043                               v23 = select_spectre_guard v64, v68, v67  ; v68 = 0
@0043                               v24 = load.i32 little heap v23
@0048                               brz.i8 v64, block5
@0048                               jump block4

                                block4:
@0048                               trap heap_oob

                                block5:
@0048                               v25 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@0048                               v26 = load.i32 little heap v25
@004d                               brz.i8 v64, block7
@004d                               jump block6

                                block6:
@004d                               trap heap_oob

                                block7:
@004d                               v27 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@004d                               v28 = load.i32 little heap v27
@0052                               brz.i8 v64, block9
@0052                               jump block8

                                block8:
@0052                               trap heap_oob

                                block9:
@0052                               v29 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@0052                               v30 = load.i32 little heap v29
@0057                               brz.i8 v64, block11
@0057                               jump block10

                                block10:
@0057                               trap heap_oob

                                block11:
@0057                               v31 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@0057                               v32 = load.i32 little heap v31
@005c                               brz.i8 v64, block13
@005c                               jump block12

                                block12:
@005c                               trap heap_oob

                                block13:
@005c                               v33 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@005c                               v34 = load.i32 little heap v33
@0061                               brz.i8 v64, block15
@0061                               jump block14

                                block14:
@0061                               trap heap_oob

                                block15:
@0061                               v35 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@0061                               v36 = load.i32 little heap v35
@0066                               brz.i8 v64, block17
@0066                               jump block16

                                block16:
@0066                               trap heap_oob

                                block17:
@0066                               v37 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@0066                               v38 = load.i32 little heap v37
@006b                               brz.i8 v64, block19
@006b                               jump block18

                                block18:
@006b                               trap heap_oob

                                block19:
@006b                               v39 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@006b                               v40 = load.i32 little heap v39
@0070                               brz.i8 v64, block21
@0070                               jump block20

                                block20:
@0070                               trap heap_oob

                                block21:
@0070                               v41 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@0070                               v42 = load.i32 little heap v41
@0075                               brz.i8 v64, block23
@0075                               jump block22

                                block22:
@0075                               trap heap_oob

                                block23:
@0075                               v43 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@0075                               v44 = load.i32 little heap v43
@007a                               brz.i8 v64, block25
@007a                               jump block24

                                block24:
@007a                               trap heap_oob

                                block25:
@007a                               v45 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@007a                               v46 = load.i32 little heap v45
@007f                               brz.i8 v64, block27
@007f                               jump block26

                                block26:
@007f                               trap heap_oob

                                block27:
@007f                               v47 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@007f                               v48 = load.i32 little heap v47
@0084                               brz.i8 v64, block29
@0084                               jump block28

                                block28:
@0084                               trap heap_oob

                                block29:
@0084                               v49 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@0084                               v50 = load.i32 little heap v49
@0089                               brz.i8 v64, block31
@0089                               jump block30

                                block30:
@0089                               trap heap_oob

                                block31:
@0089                               v51 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@0089                               v52 = load.i32 little heap v51
@008e                               brz.i8 v64, block33
@008e                               jump block32

                                block32:
@008e                               trap heap_oob

                                block33:
@008e                               v53 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@008e                               v54 = load.i32 little heap v53
@0093                               brz.i8 v64, block35
@0093                               jump block34

                                block34:
@0093                               trap heap_oob

                                block35:
@0093                               v55 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@0093                               v56 = load.i32 little heap v55
@0098                               brz.i8 v64, block37
@0098                               jump block36

                                block36:
@0098                               trap heap_oob

                                block37:
@0098                               v57 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@0098                               v58 = load.i32 little heap v57
@009d                               brz.i8 v64, block39
@009d                               jump block38

                                block38:
@009d                               trap heap_oob

                                block39:
@009d                               v59 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@009d                               v60 = load.i32 little heap v59
@00a2                               brz.i8 v64, block41
@00a2                               jump block40

                                block40:
@00a2                               trap heap_oob

                                block41:
@00a2                               v61 = select_spectre_guard.i64 v64, v68, v67  ; v68 = 0
@00a2                               v62 = load.i32 little heap v61
@00a5                               jump block1

                                block1:
@00a5                               return v24, v26, v28, v30, v32, v34, v36, v38, v40, v42, v44, v46, v48, v50, v52, v54, v56, v58, v60, v62
}

So the before is successfully de-duping all the loads and the after is not, which explains a lot here.

But I'm a bit confused why the before clif does not have a spectre guard at all? That is what is preventing the de-duping in the after clif.

@cfallin
Copy link
Member

cfallin commented Nov 28, 2022

Ah, the Spectre guard isn't needed in this case because the guard region is large enough: for any value in 0..2^32-1 of v2, the base+index will land in the guard. So it seems with the patch we are unnecessarily re-introducing Spectre guards; that's the problem, rather than something about the way we lower them.

@fitzgen
Copy link
Member

fitzgen commented Nov 28, 2022

Okay, before the heap access size is being reported as 1 which makes no sense since an i32 is being loaded. In after, it is correctly reported as 4. This pushes the limit just below the 0xffff_ffff threshold where explicit bounds checks are needed.

Trying to figure out why the access size changed at all here.

@cfallin
Copy link
Member

cfallin commented Nov 28, 2022

The offset guard is sufficient for that, though, no? The heap is specified as 4GiB size plus 2GiB offset guard; so in the worst case if the inbound Wasm pointer (v2 in CLIF) is 0xffff_ffff, we will access up to 0x1_0000_0003 (inclusive), but that's within the +2GiB guard.

@fitzgen
Copy link
Member

fitzgen commented Nov 28, 2022

Yeah it seems this logic for factoring in guard page sizes was in cranelift-wasm rather than heap_addr. Trying to untangle now...

@fitzgen
Copy link
Member

fitzgen commented Nov 28, 2022

Okay I've put up #5335 and also double checked that it returns us to our original compile times for this issue's test case.

@kpreisser
Copy link
Contributor Author

I will try to isolate a .NET repro case of this crash, but it may take a while (I think a few days).

I have now managed to isolate the .NET code required to reproduce the access violation with Wasmtime.Dotnet. I think I will create a new issue for that, as I'm not sure if it is actually related to the regression of the compile time.
Thanks!

fitzgen added a commit that referenced this issue Nov 29, 2022
…5335)

* Cranelift: consider heap's guard pages when legalizing `heap_addr`

Fixes #5328

* Update comment to align more directly with implementation

* Add legalization tests for `heap_addr` and offset guard pages
@kpreisser
Copy link
Contributor Author

Thank you for the quick fix of this issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants