
Investigate non-deterministic failures with SIMD spec tests on the x64 backend #2432

Closed
abrown opened this issue Nov 19, 2020 · 17 comments

@abrown
Contributor

abrown commented Nov 19, 2020

@jlb6740, I am opening this issue to track a failure I noticed with x64 SIMD: https://github.com/bytecodealliance/wasmtime/pull/2428/checks?check_run_id=1421136203#step:7:1582

@abrown
Contributor Author

abrown commented Nov 19, 2020

Looks like the issue is related to f64x2.abs:

(assert_return (invoke "f64x2.abs" (v128.const f64x2 0x1p-1074 0x1p-1074))
                                   (v128.const f64x2 0x0.0000000000001p-1022 0x0.0000000000001p-1022))
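
For reference, the two hex-float spellings in that assertion denote the same value: the smallest positive subnormal f64. A quick Rust check (added here for illustration, not part of the original report):

// Both 0x1p-1074 and 0x0.0000000000001p-1022 are the smallest positive
// subnormal f64, whose bit pattern is 0x0000000000000001.
fn main() {
    let smallest_subnormal = f64::from_bits(1);
    assert_eq!(smallest_subnormal, 5e-324); // decimal spelling of 0x1p-1074
    assert!(smallest_subnormal > 0.0 && !smallest_subnormal.is_normal());
}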

@abrown
Contributor Author

abrown commented Nov 30, 2020

Looks like another instance of this: https://github.com/bytecodealliance/wasmtime/runs/1476300496?check_suite_focus=true#step:7:1883. It has multiple failures:

failures:
    wast::Cranelift::spec::simd::simd_f32x4
    wast::Cranelift::spec::simd::simd_f64x2
    wast::Cranelift::spec::simd::simd_f64x2_arith
    wast::Cranelift::spec::simd::simd_load
    wast::Cranelift::spec::simd::simd_splat

@abrown
Contributor Author

abrown commented Dec 3, 2020

@jlb6740
Contributor

jlb6740 commented Dec 3, 2020

@abrown Hi .. thanks. It's hard to say whether the issue from 2 weeks ago and the one from 3 days ago are the same, but they both involve floating point, and both involve failures in tests that shouldn't have been impacted by the patch in question. I will look at the failures seen here: https://github.com/bytecodealliance/wasmtime/runs/1476300496?check_suite_focus=true#step:7:1883. While merging, I'll try to reproduce and also see what valgrind reports.

@abrown
Contributor Author

abrown commented Dec 3, 2020

For reference, build.rs previously had some code that would disable certain FP tests when using a version of the Linux libc that caused FP parsing issues. That code was removed with the switch from wabt to wat. After looking at it, I am discarding this as a cause, since we have not seen issues with it in the intervening year (these failures are recent and SIMD-related), but I bring it up because parsing could still be a potential cause.

abrown changed the title from "x64 simd failure" to "Investigate non-deterministic failures with SIMD spec tests on the x64 backend" on Dec 4, 2020
@jlb6740
Contributor

jlb6740 commented Dec 8, 2020

FYI, I am trying to see if I can pinpoint the patch that caused the issue. Basically running:

cargo test -p wasmtime-cli --features experimental_x64 wast::Cranelift::spec::simd::simd_

In a loop like so:

COUNTER=0 && while [ $COUNTER -lt 1000 ] && ./run_test.sh && COUNTER=$((COUNTER+1)) && echo -e "\n\nIteration: $COUNTER\n\n" ; do :; done | tee run_test.out

where the cargo test command is in ./run_test.sh.

It seems the failure usually occurs within 100 iterations but can take as long as 200-300.

@abrown
Contributor Author

abrown commented Dec 8, 2020

That's some progress!

@julian-seward1
Contributor

@jlb6740 Can you extract from that the actual program invocation that `./run_test.sh` performs (the one that is failing), and then run it under valgrind to see if something is reading uninitialised memory?

@jlb6740
Contributor

jlb6740 commented Dec 9, 2020

I've been attempting both to (1) isolate the patch where the issue starts and (2) simplify the reproducer, but both are proving more difficult than anticipated. Some observations:

W.r.t. pinpointing the patch that introduced the issue: I am finding it impossible to reproduce the issue using the script above on commits before #2365. However, I don't think that patch introduced the bug, since the failure was seen as early as October here: https://github.com/bytecodealliance/wasmtime/runs/1323686030

W.r.t. simplifying the reproducer: I am finding that the error almost always occurs in the same tests involving negation or abs, such as:

(assert_return (invoke "add-neg" (v128.const f64x2 1.125 1.125)
(v128.const f64x2 0.125 0.125))
(v128.const f64x2 -1.0 -1.0))

or

(assert_return (invoke "f64x2.abs" (v128.const f64x2 0x1.fffffffffffffp+1023 0x1.fffffffffffffp+1023))
(v128.const f64x2 0x1.fffffffffffffp+1023 0x1.fffffffffffffp+1023))
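
As background (a sketch of the usual lowering strategy, not Wasmtime's actual code; the helper names are made up): f64 abs and neg are pure sign-bit operations, typically implemented as a bitwise AND/XOR with a sign mask, so NaNs and extreme values must pass through them unchanged:

// Sketch: abs clears the sign bit, neg flips it.
fn f64_abs_bits(x: f64) -> f64 {
    f64::from_bits(x.to_bits() & !(1u64 << 63))
}

fn f64_neg_bits(x: f64) -> f64 {
    f64::from_bits(x.to_bits() ^ (1u64 << 63))
}

fn main() {
    assert_eq!(f64_neg_bits(1.125) + 0.125, -1.0); // mirrors the add-neg case
    assert_eq!(f64_abs_bits(-f64::MAX), f64::MAX); // 0x1.fffffffffffffp+1023 is f64::MAX
    assert!(f64_abs_bits(f64::NAN).is_nan()); // NaN must survive abs
}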

But when I try to isolate just these tests and run them, the problem does not reproduce. It is hard for me to tell whether the issue is in the lowering or somewhere in the wast testing path.

I did run valgrind against a277cf, for example, and there is a conspicuous error:

==4490== Conditional jump or move depends on uninitialised value(s)
==4490==    at 0x48A056: core::iter::traits::iterator::Iterator::all::check::{{closure}} (iterator.rs:2114)
==4490==    by 0x484EBD: <core::iter::adapters::Enumerate<I> as core::iter::traits::iterator::Iterator>::try_fold::enumerate::{{closure}} (mod.rs:1413)
==4490==    by 0x46FECB: core::iter::traits::iterator::Iterator::try_fold (iterator.rs:1888)
==4490==    by 0x484945: <core::iter::adapters::Enumerate<I> as core::iter::traits::iterator::Iterator>::try_fold (mod.rs:1420)
==4490==    by 0x4842C0: core::iter::traits::iterator::Iterator::all (iterator.rs:2117)
==4490==    by 0x4791DD: wasmtime_wast::wast::v128_matches (wast.rs:471)
==4490==    by 0x478B77: wasmtime_wast::wast::val_matches (wast.rs:411)
==4490==    by 0x47456B: wasmtime_wast::wast::WastContext::assert_return (wast.rs:185)
==4490==    by 0x476693: wasmtime_wast::wast::WastContext::run_directive (wast.rs:272)
==4490==    by 0x47530B: wasmtime_wast::wast::WastContext::run_buffer (wast.rs:229)
==4490==    by 0x478201: wasmtime_wast::wast::WastContext::run_file (wast.rs:353)
==4490==    by 0x333EF2: all::wast::run_wast (wast.rs:46)
==4490==  Uninitialised value was created by a stack allocation
==4490==    at 0x56B530: <core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::get_unchecked (zip.rs:291)

I think this actually involves these lines at HEAD: https://github.com/bytecodealliance/wasmtime/blob/main/crates/wast/src/wast.rs#L507-L514
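
For context, the check valgrind is flagging is a lane-by-lane comparison of roughly this shape (a paraphrase of the linked lines, not the exact wast.rs source):

// The actual and expected v128 values are compared lane by lane via
// zip + Iterator::all, which is why Zip::get_unchecked and Iterator::all
// appear in the backtrace. If `actual` holds uninitialised bytes, this
// comparison is where valgrind reports the conditional jump.
fn v128_bytes_match(actual: &[u8; 16], expected: &[u8; 16]) -> bool {
    actual.iter().zip(expected.iter()).all(|(a, e)| a == e)
}

fn main() {
    assert!(v128_bytes_match(&[0u8; 16], &[0u8; 16]));
}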

It is not obvious to me yet how to correct this, though. I think this bug should be understood and removed as a next step.

@julian-seward1
Contributor

Just to be on the safe side .. is this with valgrind 3.16.0 or later? That has the lowest false-positive rate so far; I would recommend against any investigation based on output from a version before 3.16.0.

@cfallin
Member

cfallin commented Dec 9, 2020

So I was nerdsniped by this and chased it for a while. With valgrind 3.16.1 (Fedora 33), I'm seeing the uninit value in the values_vec through which the trampoline passes the wasm function's return value (crates/wasmtime/src/func.rs). If I insert a println there, the uninit value shows up right away, so the problem isn't in the wast checks. The value's origin (valgrind --track-origins=yes) is:

==473121==  Uninitialised value was created by a stack allocation
==473121==    at 0xEFC9F0: wasmtime_runtime::traphandlers::setup_unix_sigaltstack::TLS::__init (local.rs:157)

What I gather from that is that we're reading uninitialized stack somewhere. I spent a while staring at the generated trampoline functions and am not seeing anything obviously wrong, but is it possible that we're calling the wrong trampoline (too many/too few args or returns)?

Another oddity I noticed: there is no WasmTy impl for u128. Most vector code seems to work fine without this, though perhaps this is A Problem anyway (@alexcrichton any thoughts on this, plus @jlb6740's note above that the uninit behavior seems to go away prior to #2365?).

@julian-seward1
Contributor

> insert a println there and the uninit value is used right away

That reduces the chance that it's a false positive. The clincher would however be if you can demonstrate that multiple runs print different values at that point.

@alexcrichton
Member

To confirm, @jlb6740, are you testing recent master? #2365 had a use-after-free if modules/stores were deallocated in the wrong order (later fixed, though). That might explain why valgrind finds issues just after that commit.

As for WasmTy for u128: the reason we haven't implemented that yet is mainly that there's no native ABI for Rust to call a cranelift-generated function that takes u128 right now. Or at least no one's said "yep, this works, let's add it". Once that's possible we can add WasmTy for u128.
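
For context on the ABI point, a minimal illustration (not Wasmtime code): Rust does not consider u128 FFI-safe, so there is no agreed native calling convention for handing a 128-bit value directly to or from generated code.

// Compiling this emits an `improper_ctypes_definitions` warning:
// 128-bit integers currently have no stable C ABI, which is the obstacle
// to a WasmTy impl for u128 described above.
pub extern "C" fn takes_u128(x: u128) -> u128 {
    x
}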

@jlb6740
Contributor

jlb6740 commented Dec 9, 2020

@julian-seward1 I have valgrind-3.15.0 installed; I will update to 3.16 or higher. @alexcrichton the valgrind run is based on commit a277cf, which was the last point where I could readily reproduce. I've tried later patches, though; @cfallin has since disabled the tests, and I could reproduce up until then. Let me rerun the latest valgrind on the latest build with these tests enabled.

@alexcrichton
Member

Ah ok in that case I'm not entirely sure what would be causing this unfortunately :(

@jlb6740
Contributor

jlb6740 commented Dec 9, 2020

@alexcrichton yes, it just needs more investigation. I've confirmed that the use of uninitialized values still occurs with the latest build (e09b940) with valgrind 3.16.1:

==22915== Conditional jump or move depends on uninitialised value(s)
==22915==    at 0x50D676: core::iter::traits::iterator::Iterator::all::check::{{closure}} (iterator.rs:2114)
==22915==    by 0x52297D: <core::iter::adapters::Enumerate<I> as core::iter::traits::iterator::Iterator>::try_fold::enumerate::{{closure}} (mod.rs:1413)
==22915==    by 0x50B1EB: core::iter::traits::iterator::Iterator::try_fold (iterator.rs:1888)
==22915==    by 0x522405: <core::iter::adapters::Enumerate<I> as core::iter::traits::iterator::Iterator>::try_fold (mod.rs:1420)
==22915==    by 0x51E660: core::iter::traits::iterator::Iterator::all (iterator.rs:2117)
==22915==    by 0x4FEAB7: wasmtime_wast::wast::v128_matches (wast.rs:507)
==22915==    by 0x4FE497: wasmtime_wast::wast::val_matches (wast.rs:451)
==22915==    by 0x4FC757: wasmtime_wast::wast::WastContext::assert_return (wast.rs:186)
==22915==    by 0x518A7F: wasmtime_wast::wast::WastContext::run_directive (wast.rs:287)
==22915==    by 0x4FD713: wasmtime_wast::wast::WastContext::run_buffer (wast.rs:236)
==22915==    by 0x4FDB01: wasmtime_wast::wast::WastContext::run_file (wast.rs:368)
==22915==    by 0x404274: all::wast::run_wast (wast.rs:48)
==22915==  Uninitialised value was created by a stack allocation
==22915==    at 0x8244E0: <core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::get_unchecked (zip.rs:291)
==22915== 

jlb6740 added a commit to jlb6740/wasmtime that referenced this issue Jan 14, 2021
… not being NaN

An intermittent failure in the SIMD spec tests is described in bytecodealliance#2432. This patch
corrects code written under the assumption that comparing a register with itself for
floating-point equality always returns true. This is not true when the register holds NaN:
as with all ordered comparisons involving NaN, the comparison always returns false.
This patch corrects that assumption for SIMD Fabs and Fneg, which seem to be the only
instructions generating the failure in bytecodealliance#2432.
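
A minimal illustration of the assumption the patch fixes (not the patch itself): every ordered comparison involving NaN is false, including self-equality.

fn main() {
    let x = f64::NAN;
    // Ordered comparisons with NaN always yield false, so `x == x`
    // cannot be relied on as an "always true" self-comparison.
    assert!(!(x == x));
    assert!(!(x < x) && !(x > x) && !(x <= x));
    // The unordered check is the reliable one:
    assert!(x != x && x.is_nan());
}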
jlb6740 added a commit to jlb6740/wasmtime that referenced this issue Jan 14, 2021
…ecodealliance#2470. Enable new tests

Re-enables spec tests that were turned off for bytecodealliance#2432 and bytecodealliance#2470 while
also enabling tests that now work due to patch pushes in the interim.
Currently all SIMD spec tests pass. Testing to assure this is OK to
enable hasn't been exhaustive, so we should keep monitoring, but three
separate attempts of 1000 runs each did not reproduce the issue. In the
past it would have occurred several times with that many runs.
jlb6740 added a commit that referenced this issue Jan 14, 2021
… not being NaN

jlb6740 added a commit that referenced this issue Jan 14, 2021
…ests
@jlb6740
Contributor

jlb6740 commented Jan 14, 2021

@abrown @cfallin Perhaps we can close this now? Please re-open if someone disagrees.

jlb6740 closed this as completed Jan 14, 2021