Investigate non-deterministic failures with SIMD spec tests on the x64 backend #2432

abrown · 2020-11-19T17:15:57Z

@jlb6740, I am opening this issue to track a failure I noticed with x64 SIMD: https://github.com/bytecodealliance/wasmtime/pull/2428/checks?check_run_id=1421136203#step:7:1582

abrown · 2020-11-19T17:20:37Z

Looks like the issue is related to f64x2.abs:

(assert_return (invoke "f64x2.abs" (v128.const f64x2 0x1p-1074 0x1p-1074))
                                   (v128.const f64x2 0x0.0000000000001p-1022 0x0.0000000000001p-1022))

abrown · 2020-11-30T22:44:37Z

Looks like another instance of this: https://github.com/bytecodealliance/wasmtime/runs/1476300496?check_suite_focus=true#step:7:1883. It has multiple failures:

failures:
    wast::Cranelift::spec::simd::simd_f32x4
    wast::Cranelift::spec::simd::simd_f64x2
    wast::Cranelift::spec::simd::simd_f64x2_arith
    wast::Cranelift::spec::simd::simd_load
    wast::Cranelift::spec::simd::simd_splat

abrown · 2020-12-03T16:41:24Z

And another one reported by @alexcrichton in #2470 with splat: https://github.com/bytecodealliance/wasmtime/runs/1489518375?check_suite_focus=true#step:7:1807

jlb6740 · 2020-12-03T18:03:28Z

@abrown Hi .. Thanks. It's hard to say if this issue from 2 weeks ago and the one from 3 days ago is the same but they both involve floating point and both involve failures in tests that shouldn't have been impacted by the subject patch. I will look at the failures seen here: https://github.com/bytecodealliance/wasmtime/runs/1476300496?check_suite_focus=true#step:7:1883 during merging. Try to reproduce and also see what valgrind reports.

abrown · 2020-12-03T18:13:09Z

For reference, previously build.rs had some code that would disable certain FP tests when using a certain version of Linux libc that caused FP parsing issues. This was removed here with the switch from wabt to wat. After looking at it, I am discarding this as a cause since we have not seen issues with that in the intervening year (these failures are recent and SIMD-related) but I bring it up since parsing still could be a potential cause.

jlb6740 · 2020-12-08T05:33:52Z

FYI, I am trying to see if I can pinpoint the patch that caused the issue. Basically running:

cargo test -p wasmtime-cli --features experimental_x64 wast::Cranelift::spec::simd::simd_

In a loop like so:

COUNTER=0 && while [ $COUNTER -lt 1000 ] && ./run_test.sh && COUNTER=$((COUNTER+1)) && echo -e "\n\nIteration: $COUNTER\n\n" ; do :; done | tee run_test.out

where the cargo test command is in ./run_test.sh.

It seems the failure usually occurs within 100 iterations but can take as long as 200-300.

abrown · 2020-12-08T17:25:56Z

That's some progress!

julian-seward1 · 2020-12-08T17:40:13Z

@jlb6740 Can you extract from that, the actual non-shell-script program invokation that ./run_test.sh` does, that is failing, and then run that on valgrind, to see if something is reading uninitialised memory?

jlb6740 · 2020-12-09T07:26:36Z

I've been attempting to both (1) isolate the patch where the issue starts and (2) simplify the reproducer but both are proving to be more difficult than anticipated. Some observations ..

W.r.t pinpointing the patch that introduced the issue, I am finding it impossible to reproduce this issue using the script above before #2365 . However I am not thinking this patch introduces this bug since it was seen as earlier as October here: https://github.com/bytecodealliance/wasmtime/runs/1323686030

W.r.t. simplifying the reproducer I am finding the error almost always occurs with the same tests involving negation or abs .. test such as:

(assert_return (invoke "add-neg" (v128.const f64x2 1.125 1.125)
(v128.const f64x2 0.125 0.125))
(v128.const f64x2 -1.0 -1.0))

or

(assert_return (invoke "f64x2.abs" (v128.const f64x2 0x1.fffffffffffffp+1023 0x1.fffffffffffffp+1023))
(v128.const f64x2 0x1.fffffffffffffp+1023 0x1.fffffffffffffp+1023))

But when I try to isolate just these tests and run, the problem is not reproduced. It is hard for me to tell if the issue is with something in the lowering or something in the wast testing path.

I did run valgrind against a277cf for example and there is a conspicuous error:

==4490== Conditional jump or move depends on uninitialised value(s)
==4490==    at 0x48A056: core::iter::traits::iterator::Iterator::all::check::{{closure}} (iterator.rs:2114)
==4490==    by 0x484EBD: <core::iter::adapters::Enumerate<I> as core::iter::traits::iterator::Iterator>::try_fold::enumerate::{{closure}} (mod.rs:1413)
==4490==    by 0x46FECB: core::iter::traits::iterator::Iterator::try_fold (iterator.rs:1888)
==4490==    by 0x484945: <core::iter::adapters::Enumerate<I> as core::iter::traits::iterator::Iterator>::try_fold (mod.rs:1420)
==4490==    by 0x4842C0: core::iter::traits::iterator::Iterator::all (iterator.rs:2117)
==4490==    by 0x4791DD: wasmtime_wast::wast::v128_matches (wast.rs:471)
==4490==    by 0x478B77: wasmtime_wast::wast::val_matches (wast.rs:411)
==4490==    by 0x47456B: wasmtime_wast::wast::WastContext::assert_return (wast.rs:185)
==4490==    by 0x476693: wasmtime_wast::wast::WastContext::run_directive (wast.rs:272)
==4490==    by 0x47530B: wasmtime_wast::wast::WastContext::run_buffer (wast.rs:229)
==4490==    by 0x478201: wasmtime_wast::wast::WastContext::run_file (wast.rs:353)
==4490==    by 0x333EF2: all::wast::run_wast (wast.rs:46)
==4490==  Uninitialised value was created by a stack allocation
==4490==    at 0x56B530: <core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::get_unchecked (zip.rs:291)

Which I think actually involves these lines here: https://github.com/bytecodealliance/wasmtime/blob/main/crates/wast/src/wast.rs#L507-L514 in head.

It is not obvious to me though how to correct this yet. I think this bug should be understood and remove as a next step.

julian-seward1 · 2020-12-09T09:03:32Z

Just to be on the safe side .. is this with valgrind 3.16.0 or later? That has the lowest false-positive rate so far; I would recommend against any investigation based on output from a version before 3.16.0.

cfallin · 2020-12-09T10:07:24Z

So I was nerdsniped by this and chased it for a while -- with valgrind 3.16.1 (Fedora 33) I'm seeing the uninit value in the values_vec through which the trampoline passes the wasm func's return value (crates/wasmtime/src/func.rs) -- insert a println there and the uninit value is used right away, problem isn't in wast checks. The value's origin (valgrind --track-origins=yes) is:

==473121==  Uninitialised value was created by a stack allocation
==473121==    at 0xEFC9F0: wasmtime_runtime::traphandlers::setup_unix_sigaltstack::TLS::__init (local.rs:157)

So what I can gather from that is that we're reading uninitialized stack somewhere. I spent a while staring at the generated trampoline functions and am not seeing anything obviously wrong, but is it possible that maybe we're calling the wrong trampoline (too many/too few args or returns)?

Another oddity I noticed: there is no WasmTy impl for u128. Most vector code seems to work fine without this, though perhaps this is A Problem anyway (@alexcrichton any thoughts on this, plus @jlb6740's note above that the uninit behavior seems to go away prior to #2365?).

julian-seward1 · 2020-12-09T10:24:38Z

insert a println there and the uninit value is used right away

That reduces the chance that it's a false positive. The clincher would however be if you can demonstrate that multiple runs print different values at that point.

alexcrichton · 2020-12-09T15:20:59Z

To confirm @jlb6740, are you testing recent master? #2365 had a use-after-free but if modules/stores were deallocated in the wrong order (later fixed though). That might explain why just after that commit valgrind finds issues.

As for WasmTy for u128, the reason we haven't implemented that yet is mainly that there's no native ABI for Rust to call a cranelift-generated function that takes u128 right now. Or at least no one's said 'yep this works let us add it'. Once that's possible we can add WasmTy for u128, however.

jlb6740 · 2020-12-09T17:44:18Z

@julian-seward1 I have valgrind-3.15.0 installed. I will update to a 3.16 or higher. @alexcrichton the valgrind run is based on commit a277cf which was the last time I could readily reproduce. I've tried patches though since @cfallin disabled the tests and could reproduce up until then. Let me rerun the latest valgrind on the latest build with these tests enabled.

alexcrichton · 2020-12-09T19:52:47Z

Ah ok in that case I'm not entirely sure what would be causing this unfortunately :(

jlb6740 · 2020-12-09T20:11:07Z

@alexcrichton yes, just needs more investigation. I've confirmed that use of unitialized values still occurs with the latest build (e09b940) with valgrind 3.16.1

==22915== Conditional jump or move depends on uninitialised value(s)
==22915==    at 0x50D676: core::iter::traits::iterator::Iterator::all::check::{{closure}} (iterator.rs:2114)
==22915==    by 0x52297D: <core::iter::adapters::Enumerate<I> as core::iter::traits::iterator::Iterator>::try_fold::enumerate::{{closure}} (mod.rs:1413)
==22915==    by 0x50B1EB: core::iter::traits::iterator::Iterator::try_fold (iterator.rs:1888)
==22915==    by 0x522405: <core::iter::adapters::Enumerate<I> as core::iter::traits::iterator::Iterator>::try_fold (mod.rs:1420)
==22915==    by 0x51E660: core::iter::traits::iterator::Iterator::all (iterator.rs:2117)
==22915==    by 0x4FEAB7: wasmtime_wast::wast::v128_matches (wast.rs:507)
==22915==    by 0x4FE497: wasmtime_wast::wast::val_matches (wast.rs:451)
==22915==    by 0x4FC757: wasmtime_wast::wast::WastContext::assert_return (wast.rs:186)
==22915==    by 0x518A7F: wasmtime_wast::wast::WastContext::run_directive (wast.rs:287)
==22915==    by 0x4FD713: wasmtime_wast::wast::WastContext::run_buffer (wast.rs:236)
==22915==    by 0x4FDB01: wasmtime_wast::wast::WastContext::run_file (wast.rs:368)
==22915==    by 0x404274: all::wast::run_wast (wast.rs:48)
==22915==  Uninitialised value was created by a stack allocation
==22915==    at 0x8244E0: <core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::get_unchecked (zip.rs:291)
==22915==

… not being NaN An intermittent failure during SIMD spectests is described in bytecodealliance#2432. This patch corrects code written in a way that assumes comparing fp equality of a register with itself will always return true. This is not true when the register value is NaN as NaN. In this case, and with all ordered comparisons involving NaN, the comparisons will always return false. This patch corrects that assumption for SIMD Fabs and Fneg which seem to be the only instructions generating the failure with bytecodealliance#2432.

…ecodealliance#2470. Enable new tests Re-enables spec tests that were turned off for bytecodealliance#2432 and bytecodealliance#2470 while also enabling tests that now work due to patch pushes in the interim. Currently all SIMD spec tests past. Testing to assure this is ok to enable hasn't been super intense so we should monitor but there was an attempt of doing 1000 runs 3 different times to try and reproduce the issue and it did not occur. In the past would have occurred several times with that many runs.

… not being NaN An intermittent failure during SIMD spectests is described in bytecodealliance#2432. This patch corrects code written in a way that assumes comparing fp equality of a register with itself will always return true. This is not true when the register value is NaN as NaN. In this case, and with all ordered comparisons involving NaN, the comparisons will always return false. This patch corrects that assumption for SIMD Fabs and Fneg which seem to be the only instructions generating the failure with bytecodealliance#2432.

…ecodealliance#2470. Enable new tests Re-enables spec tests that were turned off for bytecodealliance#2432 and bytecodealliance#2470 while also enabling tests that now work due to patch pushes in the interim. Currently all SIMD spec tests past. Testing to assure this is ok to enable hasn't been super intense so we should monitor but there was an attempt of doing 1000 runs 3 different times to try and reproduce the issue and it did not occur. In the past would have occurred several times with that many runs.

… not being NaN An intermittent failure during SIMD spectests is described in #2432. This patch corrects code written in a way that assumes comparing fp equality of a register with itself will always return true. This is not true when the register value is NaN as NaN. In this case, and with all ordered comparisons involving NaN, the comparisons will always return false. This patch corrects that assumption for SIMD Fabs and Fneg which seem to be the only instructions generating the failure with #2432.

…ests Re-enables spec tests that were turned off for #2432 and #2470 while also enabling tests that now work due to patch pushes in the interim. Currently all SIMD spec tests past. Testing to assure this is ok to enable hasn't been super intense so we should monitor but there was an attempt of doing 1000 runs 3 different times to try and reproduce the issue and it did not occur. In the past would have occurred several times with that many runs.

jlb6740 · 2021-01-14T18:57:15Z

@abrown @cfallin Perhaps we can close this now? Please re-open if someone disagrees.

abrown mentioned this issue Nov 19, 2020

wiggle: support overlapping immutable borrows #2428

Merged

cfallin mentioned this issue Dec 3, 2020

Possible nondeterminism bug with SIMD and new new x64 backend #2470

Closed

abrown mentioned this issue Dec 4, 2020

Improve WAST error messaging, NaN comparisons #2475

Merged

abrown changed the title ~~x64 simd failure~~ Investigate non-deterministic failures with SIMD spec tests on the x64 backend Dec 4, 2020

abrown mentioned this issue Dec 4, 2020

[machinst x64]: implement dot product #2480

Closed

abrown mentioned this issue Dec 8, 2020

[machinst x64]: implement load*_zero for x64 #2489

Merged

jlb6740 mentioned this issue Dec 16, 2020

Implement x64 simd loads #2510

Merged

cfallin mentioned this issue Jan 14, 2021

Run Cranelift tests under Valgrind to catch JIT issues #2582

Open

jlb6740 closed this as completed Jan 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate non-deterministic failures with SIMD spec tests on the x64 backend #2432

Investigate non-deterministic failures with SIMD spec tests on the x64 backend #2432

abrown commented Nov 19, 2020 •

edited

Loading

abrown commented Nov 19, 2020

abrown commented Nov 30, 2020 •

edited

Loading

abrown commented Dec 3, 2020

jlb6740 commented Dec 3, 2020

abrown commented Dec 3, 2020

jlb6740 commented Dec 8, 2020

abrown commented Dec 8, 2020

julian-seward1 commented Dec 8, 2020

jlb6740 commented Dec 9, 2020 •

edited

Loading

julian-seward1 commented Dec 9, 2020

cfallin commented Dec 9, 2020

julian-seward1 commented Dec 9, 2020

alexcrichton commented Dec 9, 2020

jlb6740 commented Dec 9, 2020

alexcrichton commented Dec 9, 2020

jlb6740 commented Dec 9, 2020 •

edited

Loading

jlb6740 commented Jan 14, 2021

Investigate non-deterministic failures with SIMD spec tests on the x64 backend #2432

Investigate non-deterministic failures with SIMD spec tests on the x64 backend #2432

Comments

abrown commented Nov 19, 2020 • edited Loading

abrown commented Nov 19, 2020

abrown commented Nov 30, 2020 • edited Loading

abrown commented Dec 3, 2020

jlb6740 commented Dec 3, 2020

abrown commented Dec 3, 2020

jlb6740 commented Dec 8, 2020

abrown commented Dec 8, 2020

julian-seward1 commented Dec 8, 2020

jlb6740 commented Dec 9, 2020 • edited Loading

julian-seward1 commented Dec 9, 2020

cfallin commented Dec 9, 2020

julian-seward1 commented Dec 9, 2020

alexcrichton commented Dec 9, 2020

jlb6740 commented Dec 9, 2020

alexcrichton commented Dec 9, 2020

jlb6740 commented Dec 9, 2020 • edited Loading

jlb6740 commented Jan 14, 2021

abrown commented Nov 19, 2020 •

edited

Loading

abrown commented Nov 30, 2020 •

edited

Loading

jlb6740 commented Dec 9, 2020 •

edited

Loading

jlb6740 commented Dec 9, 2020 •

edited

Loading