This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

wasm trap: out of bounds memory access #10095

Closed
AurevoirXavier opened this issue Oct 25, 2021 · 18 comments · Fixed by #10291

@AurevoirXavier (Contributor) commented Oct 25, 2021

Proposing failed: ClientImport(
	RuntimeApiError(
		Application(
			Execution(
				Other(
					"Wasm execution trapped:
					wasm trap: out of bounds memory access
					wasm backtrace:    
						0: 0x12ab38 - <unknown>!pallet_session::historical::ProvingTrie<T>::generate_for::h3b8e607c714d13ce
						1: 0x11734b - <unknown>!<(TupleElement0,TupleElement1) as frame_support::traits::hooks::OnInitialize<BlockNumber>>::on_initialize::h8592203255e3a184
						2: 0x10d0e0 - <unknown>!frame_executive::Executive<System,Block,Context,UnsignedValidator,AllPallets,COnRuntimeUpgrade>::initialize_block::h06d8223ffc8d1d3b
						3: 0x13901c - <unknown>!Core_initialize_block"
				)
			)
		)
	)
)   

This happened on linux-gnu nightly-2021-09-01; downgrading to nightly-2021-04-22 fixed it.
But it still happened on windows-msvc nightly-2021-04-22, and was fixed there by upgrading to nightly-2021-05-01.
I'm not sure whether this is a Rust issue or a Parity WASM issue.

Other projects have also encountered this. cc @h4x3rotab

darwinia-network/darwinia-common#879
Phala-Network/phala-blockchain#542

@h4x3rotab (Contributor)

Maybe related: Phala-Network/phala-blockchain#542

@bkchr (Member) commented Oct 25, 2021

@AurevoirXavier how can we reproduce this?

@AurevoirXavier (Contributor, Author)

@AurevoirXavier how can we reproduce this?

I'm not sure what causes this, so I can't provide a minimal environment to reproduce it.
Sorry.


git clone https://github.com/darwinia-network/darwinia-common
cd darwinia-common
git checkout v2.6.7
# modify the session time to MINUTE / 2, otherwise you need to wait 10 minutes
# https://github.com/darwinia-network/darwinia-common/blob/77dde36f41/node/runtime/pangolin/src/constants/src/lib.rs#L27
cargo build --release
cargo run --release -- --chain pangolin-dev --alice --tmp

At the end of the session this error pops up, and the chain stalls.

@koute (Contributor) commented Nov 12, 2021

Found the problem. Our WASM instance fast reuse code is not clearing statics between calls.

Here's exactly what happens:

  1. TrieDBMut::new from trie-db is called.
  2. That calls HashSet::new() from hashbrown.
  3. That calls RandomState::new() from ahash.
  4. That initializes an OnceBox from once-cell with a Box<dyn Trait>.
  5. On the first invocation the OnceBox gets initialized and everything works fine.
  6. On the second invocation the WASM instance is reused so the OnceBox is still initialized, but now it points to memory that is technically uninitialized.
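
The hazard in step 6 can be sketched with a std-only Rust analogy (a minimal sketch, not the actual trie-db/ahash code; std's OnceLock stands in for once-cell's OnceBox, and the u64 "seed" is purely illustrative). In native Rust the stale value is merely surprising; in a reused WASM instance the cell holds a pointer into memory that was wiped, hence the out-of-bounds trap:

```rust
use std::sync::OnceLock;

// Stand-in for the state that ahash's RandomState caches in a static.
// (The real cell holds a Box<dyn Trait>, i.e. a pointer into linear memory.)
static SEED: OnceLock<u64> = OnceLock::new();

fn runtime_call(fresh_seed: u64) -> u64 {
    // A genuinely fresh instance would run the init closure again; with fast
    // instance reuse the static survives, so the first value sticks forever.
    *SEED.get_or_init(|| fresh_seed)
}

fn main() {
    let first = runtime_call(1);
    let second = runtime_call(2); // init closure never runs: still 1
    assert_eq!((first, second), (1, 1));
}
```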

A workaround for this issue would be to either switch to an interpreted WASM executor, or to go into substrate/client/executor/src/wasm_runtime.rs and change fast_instance_reuse: true to fast_instance_reuse: false (do not actually do the latter; that path is broken, see my later comment).

@bkchr I know that the fast reuse codepath deliberately retains some things between calls (e.g. the linear memory itself); not sure if this behavior is intended, but if it is then in my opinion this is way too dangerous to leave unfixed.

@bkchr (Member) commented Nov 12, 2021

Okay. That doesn't sound good. Fast instance reuse should explicitly reset static variables.

Cc @pepyakin

@bkchr (Member) commented Nov 12, 2021

fn returns_mutable_static(wasm_method: WasmExecutionMethod) {

This should test it

@pepyakin (Contributor)

Or returns_mutable_static_bss; that one tests the zeroed case.

@koute (Contributor) commented Nov 12, 2021

Interesting; I'll check why the tests don't fail here.

@pepyakin (Contributor)

FWIW, I filed an issue that I'd been meaning to file for some time: #10244

TL;DR: since wasmtime has evolved quite a bit, we may not need fast_instance_reuse anymore.

@koute (Contributor) commented Nov 16, 2021

Okay, it looks like the static clearing actually works, but it needs the (data $.bss ...) WASM section to be present. For the test runtime that section is properly generated; for the production runtime it isn't.

To get rustc to generate it, the -C link-arg=--import-memory argument is necessary (otherwise the linker implicitly assumes the WASM memory is always zeroed), and substrate-wasm-builder does actually pass it. The build script for the production runtime explicitly calls import_memory:

WasmBuilder::new()
    .with_current_project()
    .export_heap_base()
    .import_memory()
    .build()

and the substrate-wasm-builder does set RUSTFLAGS with that argument:

pub fn import_memory(mut self) -> Self {
    self.rust_flags.push("-C link-arg=--import-memory".into());
    self
}
// ...
    build_cmd.args(&["rustc", "--target=wasm32-unknown-unknown"])
// ...
        .env("RUSTFLAGS", rustflags.clone())

But it doesn't work. Why? We're calling substrate-wasm-builder from inside a build script, which means the host cargo exports an empty CARGO_ENCODED_RUSTFLAGS environment variable while the build script is running, and that overrides the RUSTFLAGS we set inside the build script. The cargo subprocess that builds the WASM runtime then ignores our RUSTFLAGS in favor of the empty CARGO_ENCODED_RUSTFLAGS, so -C link-arg=--import-memory is effectively not applied, so there's no BSS data segment, so the statics are not cleared when fast instance reuse is enabled.
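
The interaction can be sketched with std::process::Command (a simplified, hypothetical stand-in for substrate-wasm-builder's actual spawning code; the function name is illustrative). Cargo prefers CARGO_ENCODED_RUSTFLAGS over RUSTFLAGS, so the nested cargo must have the inherited variable removed or our RUSTFLAGS are silently ignored:

```rust
use std::ffi::OsStr;
use std::process::Command;

// Build the nested cargo invocation that compiles the WASM runtime.
fn wasm_build_command(rustflags: &str) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.args(["rustc", "--target=wasm32-unknown-unknown"])
        .env("RUSTFLAGS", rustflags)
        // Without this, the empty CARGO_ENCODED_RUSTFLAGS inherited from the
        // outer cargo (set for every build script) overrides the RUSTFLAGS
        // set just above.
        .env_remove("CARGO_ENCODED_RUSTFLAGS");
    cmd
}

fn main() {
    let cmd = wasm_build_command("-C link-arg=--import-memory");
    // `get_envs` reports variables scheduled for removal as (key, None).
    let removed = cmd
        .get_envs()
        .any(|(k, v)| k == OsStr::new("CARGO_ENCODED_RUSTFLAGS") && v.is_none());
    assert!(removed); // the override can no longer mask our RUSTFLAGS
}
```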

The good news is that Basti has already inadvertently fixed the issue with the CARGO_ENCODED_RUSTFLAGS in #9637:

     // exclusive). The runner project is created in `CARGO_TARGET_DIR` and executing it will
     // create a sub target directory inside of `CARGO_TARGET_DIR`.
     .env_remove("CARGO_TARGET_DIR")
+    // As we are being called inside a build-script, this env variable is set. However, we set
+    // our own `RUSTFLAGS` and thus, we need to remove this. Otherwise cargo favors this
+    // env variable.
+    .env_remove("CARGO_ENCODED_RUSTFLAGS")
     // We don't want to call ourselves recursively
     .env(crate::SKIP_BUILD_ENV, "");

So with that PR merged the runtime is once again built with the necessary flag and is compatible with our current fast instance reuse code. (@AurevoirXavier @h4x3rotab FYI you'll probably want to at least backport this patch into your codebases and rebuild your runtimes.)

Anyway, since using the fast instance reuse codepath on any runtime built without the --import-memory flag is fundamentally broken and unsafe, we should at least modify our WASM executor so that it checks whether the runtime was properly built, and forcibly disables fast instance reuse if it wasn't. @bkchr @pepyakin Sound good? If so, I'll whip up a PR. (Yes, we might remove the fast instance reuse codepath anyway, but evaluating/benchmarking that will probably take a little longer, while checking whether the memory is imported and disabling instance reuse should be relatively simple.)
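
A hypothetical sketch of that check (the names are illustrative, not Substrate's actual executor API): fast instance reuse is only sound when the module imports its memory, because only then is there a (data $.bss ...) segment available to re-zero the statics on reset.

```rust
/// Hypothetical strategy picker; in a real executor `module_imports_memory`
/// would come from inspecting the runtime blob's import section.
fn instantiation_strategy(module_imports_memory: bool, fast_reuse_requested: bool) -> &'static str {
    if fast_reuse_requested && module_imports_memory {
        "fast-instance-reuse"
    } else {
        // Either reuse wasn't requested, or the runtime was built without
        // `--import-memory` and reusing the instance would leave stale
        // statics: fall back to a fresh instance per call.
        "recreate-instance"
    }
}

fn main() {
    assert_eq!(instantiation_strategy(true, true), "fast-instance-reuse");
    // Built without --import-memory: reuse is forcibly disabled.
    assert_eq!(instantiation_strategy(false, true), "recreate-instance");
}
```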

@bkchr (Member) commented Nov 16, 2021

The good news is that Basti has already inadvertently fixed the issue with the CARGO_ENCODED_RUSTFLAGS in #9637:

Now that you say it :D I did not make that change there inadvertently. If I remember correctly, the fast instance reuse test failed and I fixed it...

Anyway, since using the fast instance reuse codepath on any runtime built without the --import-memory flag is fundamentally broken and unsafe we should at least modify our WASM executor so that it checks whether the runtime was properly built and forcefully disable fast instance reuse if it is

Sounds good to me. We should also make import-memory the default in wasm-builder and not configurable.

@AurevoirXavier (Contributor, Author) commented Nov 16, 2021

@koute Is there any other way to pass import-memory to rustc?
Our dependencies are still behind by lots of commits.

@bkchr (Member) commented Nov 16, 2021

@AurevoirXavier you need this patch: 73acb2a

@koute (Contributor) commented Nov 16, 2021

@koute Is there any other way to pass import-memory to rustc? Our dependencies are still behind by lots of commits.

You don't have to upgrade the whole of Substrate; just apply Basti's patch (which I highlighted in my previous post: the single extra .env_remove("CARGO_ENCODED_RUSTFLAGS") in utils/wasm-builder/src/wasm_project.rs) and you should be good to go. (You can then rebuild your runtime and run wasm-dis on the .wasm runtime blob to make sure the (data $.bss ...) section is present.)

[edit]Ninja'd by Basti (:[/edit]

@AurevoirXavier (Contributor, Author)

@AurevoirXavier you need this patch: 73acb2a

Aha, yes! I have just thought about that. Thanks!

@koute (Contributor) commented Nov 18, 2021

Just a PSA: I said earlier that disabling fast instance reuse is one possible workaround for this; do not disable it. There's a bug in wasmtime (or possibly in our code; I'm still investigating) that will crash the node after 10k runtime invocations if instance reuse is disabled. Until I fix that, instance reuse is absolutely necessary when using a compiled runtime.

@pepyakin (Contributor)

10k instances? That sounds as if the pooling strategy (#10244) was already enabled... At least I vaguely remember this limit from there. Likely I am wrong

@koute (Contributor) commented Nov 18, 2021

10k instances? That sounds as if the pooling strategy (#10244) was already enabled... At least I vaguely remember this limit from there. Likely I am wrong

It's because we use the same Store when creating each new instance, and according to the wasmtime docs:

A Store is intended to be a short-lived object in a program. No form of GC is implemented at this time so once an instance is created within a Store it will not be deallocated until the Store itself is dropped. This makes Store unsuitable for creating an unbounded number of instances in it because Store will never release this memory.

So this not only results in a memory leak, but also essentially breaks the executor after 10k invocations. I'm currently in the process of fixing that; afterwards we can turn on pooling (which requires us to actually drop the Store anyway).
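
The leak can be illustrated with a std-only simulation (this is not the wasmtime API, just a toy model of the no-GC Store behavior quoted above; all names are illustrative):

```rust
/// Toy model of a wasmtime `Store`: it owns every instance's memory and never
/// frees any of it until the whole store is dropped.
struct ToyStore {
    instances: Vec<Vec<u8>>,
}

impl ToyStore {
    fn new() -> Self {
        ToyStore { instances: Vec::new() }
    }

    /// Each instantiation permanently parks one "linear memory" in the store.
    /// (Tiny allocation here; a real instance would hold far more.)
    fn instantiate(&mut self) {
        self.instances.push(vec![0u8; 16]);
    }
}

fn main() {
    // Problematic pattern: one long-lived store shared by every runtime call.
    let mut shared = ToyStore::new();
    for _ in 0..10_000 {
        shared.instantiate();
    }
    // Nothing was released along the way; all 10_000 memories are still alive.
    assert_eq!(shared.instances.len(), 10_000);

    // The fix: a short-lived store per call, dropped (and freed) afterwards.
    for _ in 0..3 {
        let mut per_call = ToyStore::new();
        per_call.instantiate();
        assert_eq!(per_call.instances.len(), 1);
    } // `per_call` is dropped here, releasing its instance
}
```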
