
Dev wasm executor panics on block production #11132

Closed
ggwpez opened this issue Mar 28, 2022 · 13 comments · Fixed by #11206
Labels
I3-bug The node fails to follow expected behavior. U3-nice_to_have Issue is worth doing eventually.

Comments

@ggwpez
Member

ggwpez commented Mar 28, 2022

I'm trying to benchmark something, and the wasm executor seems to panic for the polkadot-dev runtime by hitting a memory limit:
Failed to allocate memory: "Requested allocation size is too large" and going to fail due to allocating 66774016.

The setup is as follows:

  • Generate the genesis spec with build-spec --chain=polkadot-dev
  • Insert 1M accounts into the chain spec (download here)
  • Run the node with polkadot --chain dot-spec-1M-accs.json --base-path /tmp/dot-dev --execution wasm --wasm-execution compiled --pruning archive --force-authoring --alice --pool-limit 100000
  • Send 10,800 endowing transfer extrinsics (from //Alice to //Bob/i); I can provide the Python script if needed.

You should see this:

Version: 0.9.18-f2ec25eb0b-x86_64-linux-gnu

   0: sp_panic_handler::set::{{closure}}
   1: std::panicking::rust_panic_with_hook
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:610:17
   2: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:502:13
   3: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/sys_common/backtrace.rs:139:18
   4: rust_begin_unwind
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:498:5
   5: core::panicking::panic_fmt
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:116:14
   6: core::result::unwrap_failed
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/result.rs:1690:5
   7: tracing::span::Span::in_scope
   8: sp_io::allocator::ExtAllocatorMallocVersion1::call
   9: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
  10: <F as wasmtime::func::IntoFunc<T,(wasmtime::func::Caller<T>,A1),R>>::into_func::wasm_to_host_shim
  11: <unknown>
  12: <unknown>
  13: <unknown>
  14: <unknown>
  15: <unknown>
  16: <unknown>
  17: <unknown>
  18: wasmtime_runtime::traphandlers::catch_traps::call_closure
  19: wasmtime_setjmp
  20: wasmtime_runtime::traphandlers::catch_traps
  21: wasmtime::func::invoke_wasm_and_catch_traps
  22: wasmtime::func::typed::TypedFunc<Params,Results>::call
  23: sc_executor_wasmtime::instance_wrapper::EntryPoint::call
  24: sc_executor_wasmtime::runtime::perform_call
  25: <sc_executor_wasmtime::runtime::WasmtimeInstance as sc_executor_common::wasm_runtime::WasmInstance>::call
  26: sc_executor_common::wasm_runtime::WasmInstance::call_export
  27: std::panicking::try
  28: std::thread::local::LocalKey<T>::with
  29: sc_executor::native_executor::WasmExecutor<H>::with_instance::{{closure}}
  30: sc_executor::wasm_runtime::RuntimeCache::with_instance
  31: <sc_executor::native_executor::NativeElseWasmExecutor<D> as sp_core::traits::CodeExecutor>::call
  32: sp_state_machine::execution::StateMachine<B,H,Exec>::execute_aux
  33: sp_state_machine::execution::StateMachine<B,H,Exec>::execute_using_consensus_failure_handler
  34: <sc_service::client::call_executor::LocalCallExecutor<Block,B,E> as sc_client_api::call_executor::CallExecutor<Block>>::contextual_call
  35: polkadot_primitives::v2::runtime_decl_for_ParachainHost::candidate_events_call_api_at
  36: polkadot_primitives::v2::ParachainHost::candidate_events
  37: polkadot_node_core_runtime_api::make_runtime_api_request
  38: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  39: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
  40: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  41: std::thread::local::LocalKey<T>::with
  42: tokio::park::thread::CachedParkThread::block_on
  43: tokio::runtime::handle::Handle::block_on
  44: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  45: tokio::runtime::task::harness::Harness<T,S>::poll
  46: tokio::runtime::blocking::pool::Inner::run
  47: std::sys_common::backtrace::__rust_begin_short_backtrace
  48: core::ops::function::FnOnce::call_once{{vtable.shim}}
  49: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/boxed.rs:1854:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/boxed.rs:1854:9
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/sys/unix/thread.rs:108:17
  50: start_thread
             at /build/glibc-sMfBJT/glibc-2.31/nptl/pthread_create.c:477:8
  51: clone


Thread 'tokio-runtime-worker' panicked at 'Failed to allocate memory: "Requested allocation size is too large"', /home/vados/.cargo/git/checkouts/substrate-7e08433d4c370a21/fe3acf4/primitives/io/src/lib.rs:1295

This is a bug. Please report it at:

        https://github.com/paritytech/polkadot/issues/new

2022-03-23 19:54:20 going to fail due to allocating 66774016 
@ggwpez ggwpez added I5-tests Tests need fixing, improving or augmenting. U3-nice_to_have Issue is worth doing eventually. labels Mar 28, 2022
@koute
Contributor

koute commented Apr 1, 2022

I can take a look at this if you want, if you can provide the Python script necessary to reproduce this.

@ggwpez
Member Author

ggwpez commented Apr 1, 2022

Thanks @koute, here it is killer.py.txt (+pip install substrate-interface).
I just tried it again on latest polkadot master and it still happens.
Make sure to always delete your /tmp/dot-dev folder before re-running the node so that it starts with a fresh genesis.
The node takes a minute to build the genesis block, so you could attach your debugger after that's done; otherwise it probably takes longer.

Just to recap:

  1. Download the spec and the python script from above
  2. Delete /tmp/dot-dev or pick a non-existing dir
  3. polkadot --chain dot-spec-1M-accs.json --base-path /tmp/dot-dev --execution wasm --wasm-execution compiled --pruning archive --force-authoring --alice --pool-limit 100000
  4. Wait until it starts producing blocks
  5. Start the python script
  6. Block producer should panic

@koute
Contributor

koute commented Apr 11, 2022

Stack trace from the runtime (after my two PRs it now gets printed out):

0: 0x47d51e - <unknown>!sp_io::allocator::extern_host_function_impls::malloc::hed26f12cb611fdea
1: 0x703e5 - <unknown>!alloc::raw_vec::finish_grow::h5b508086d31786ec
2: 0x70ec8 - <unknown>!alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle::h196b09d9bc226fe8
3: 0x3a21c - <unknown>!parity_scale_codec::codec::decode_vec_with_len::h45dc2aeb96b8057c
4: 0x399d3b - <unknown>!frame_support::storage::unhashed::get::hefd8b3286c7e61dc
5: 0x399a7c - <unknown>!frame_system::<impl frame_system::pallet::Pallet<T>>::read_events_no_consensus::h3294c8ce39a180eb
6: 0x1ec690 - <unknown>!polkadot_runtime_parachains::runtime_api_impl::v2::candidate_events::heb324a904e8907db
7: 0x3e9c23 - <unknown>!ParachainHost_candidate_events

So it calls this function from polkadot/runtime/parachains/src/runtime_api_impl/v2.rs:

/// Implementation for the `candidate_events` function of the runtime API.
// NOTE: this runs without block initialization, as it accesses events.
// this means it can run in a different session than other runtime APIs at the same block.
pub fn candidate_events<T, F>(extract_event: F) -> Vec<CandidateEvent<T::Hash>>
where
        T: initializer::Config,
        F: Fn(<T as frame_system::Config>::Event) -> Option<inclusion::Event<T>>,
{
        use inclusion::Event as RawEvent;

        <frame_system::Pallet<T>>::read_events_no_consensus()
                .into_iter()
                .filter_map(|record| extract_event(record.event))
                .map(|event| match event {
                        RawEvent::<T>::CandidateBacked(c, h, core, group) =>
                                CandidateEvent::CandidateBacked(c, h, core, group),
                        RawEvent::<T>::CandidateIncluded(c, h, core, group) =>
                                CandidateEvent::CandidateIncluded(c, h, core, group),
                        RawEvent::<T>::CandidateTimedOut(c, h, core) =>
                                CandidateEvent::CandidateTimedOut(c, h, core),
                        RawEvent::<T>::__Ignore(_, _) => unreachable!("__Ignore cannot be used"),
                })
                .collect()
}

which in turn calls this from substrate/frame/system/src/lib.rs:

/// Get the current events deposited by the runtime.
///
/// Should only be called if you know what you are doing and outside of the runtime block
/// execution else it can have a large impact on the PoV size of a block.
pub fn read_events_no_consensus() -> Vec<EventRecord<T::Event, T::Hash>> {
    Events::<T>::get()
}

So it looks like it tries to decode this value, and it's too big, so it can't allocate a big enough Vec to fit all of it.

Curiously, we can see this comment where it is defined:

/// Events deposited for the current block.
///
/// NOTE: This storage item is explicitly unbounded since it is never intended to be read
/// from within the runtime.
#[pallet::storage]
pub(super) type Events<T: Config> =
    StorageValue<_, Vec<EventRecord<T::Event, T::Hash>>, ValueQuery>;

"it is never intended to be read from within the runtime" and yet it is? (Unless "by runtime" here means "during normal block execution"?)

Some potential ways we could fix this:

  1. Don't call the runtime to get this; fetch it directly from the storage.
  2. Call into the runtime and return this data in chunks.
  3. Check how many entries there are and return an error if there are too many (a rough sketch of this option follows the list).
  4. Do not let this value grow this big in the first place.
  5. Something else?
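For illustration, a rough sketch of option 3 (not the fix that eventually landed; MAX_EVENTS is a made-up bound, and it assumes frame_system's per-block event counter is cheap to read and accessible here):

// Rough sketch only: refuse to decode the events if the block deposited more than some
// cap. `event_count()` reads a small counter, so checking it is cheap compared to
// decoding the `Events` vector itself. `MAX_EVENTS` is purely illustrative.
const MAX_EVENTS: u32 = 10_000;

pub fn candidate_events_bounded<T, F>(extract_event: F) -> Option<Vec<CandidateEvent<T::Hash>>>
where
        T: initializer::Config,
        F: Fn(<T as frame_system::Config>::Event) -> Option<inclusion::Event<T>>,
{
        if <frame_system::Pallet<T>>::event_count() > MAX_EVENTS {
                // Too many events in this block; let the node side handle the `None`.
                return None
        }
        Some(candidate_events::<T, F>(extract_event))
}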

Alas, it's outside my area of expertise to tell which of these would be the most appropriate fix.

@ggwpez ggwpez added I3-bug The node fails to follow expected behavior. and removed I5-tests Tests need fixing, improving or augmenting. labels Apr 11, 2022
@kianenigma
Contributor

This is not really a surprise. When you shove a million transfers into one (or a few) blocks, you are going to have millions of events deposited (I think we deposit something like 4 events per transfer), and reading them will cause the runtime to crash.

Interestingly, because we ::append, inserting will not fail, but reading will.

As the name suggests, events should not, exactly for this reason, be read in the runtime. I am not sure why and how it is done in the parachains runtime right now, but it is definitely something to reconsider. cc @ordian @rphmeier.

@kianenigma
Contributor

UPDATE: this code is reading events in a runtime API call, not in consensus-context runtime execution code. So while someone should look at the implications of this runtime API potentially failing, this has no severe consensus issues.

@bkchr
Member

bkchr commented Apr 11, 2022

@koute @ggwpez how many transactions are inside the block when this fails?

Answering myself, around 6800 transactions.

@bkchr
Member

bkchr commented Apr 11, 2022

2. Call into the runtime and return this data in chunks.

I think I would go with this: some kind of special interface, stream_iter or something, that reads from the storage in chunks.

However, before we do this we should check the size of the Polkadot runtime-level Event with size_of. If we added some boxing, we could probably reduce the in-memory size, as we have already done for Call.

@ggwpez could you get us the size with mem::size_of in wasm?
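On the chunking idea: the existing sp_io::storage::read host function already takes an offset and a caller-sized buffer, so a stream_iter-style interface could plausibly be built on something like the loop below (minimal sketch; CHUNK is arbitrary, and the exact return-value semantics of read should be double-checked against sp-io before relying on them):

// Minimal sketch: read a (possibly huge) raw storage value in fixed-size chunks instead
// of decoding it through one big allocation. A real `stream_iter` host interface would
// presumably hide this behind an iterator and decode records incrementally.
const CHUNK: usize = 64 * 1024;

fn read_value_in_chunks(key: &[u8]) -> Option<Vec<Vec<u8>>> {
    let mut chunks = Vec::new();
    let mut offset = 0u32;
    loop {
        let mut buf = vec![0u8; CHUNK];
        // `read` copies at most `buf.len()` bytes starting at `offset` and returns how
        // many bytes of the value remain from that offset (`None` if the key is absent).
        let remaining = sp_io::storage::read(key, &mut buf, offset)? as usize;
        buf.truncate(remaining.min(CHUNK));
        chunks.push(buf);
        if remaining <= CHUNK {
            break
        }
        offset += CHUNK as u32;
    }
    Some(chunks)
}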

@ggwpez
Member Author

ggwpez commented Apr 11, 2022

could you get us the size with mem::size_of in wasm?

@bkchr I assume frame_system::Event since it is being decoded here:
Event [bytes]: Native 1224, Wasm 952
EventRecord<T::Event, T::Hash> [bytes]: Native 1256, Wasm 976

Someone go 🥊 this.

PS:
The error message above states that it failed to allocate 66774016 bytes, which is evenly divisible by 976.
66774016 / 976 = 68416, so that would be 68k events.
On the native executor I got about 54.4k events, so this is reasonable.
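For reference, one way numbers like the size_of measurements above can be obtained (illustrative only; a #[test] gives the native layout, while the wasm column has to come from the same size_of calls logged from inside the runtime, and polkadot_runtime::Runtime is just the assumed crate/type path):

// Illustrative only: print the native in-memory sizes. For the wasm numbers, log the
// same `size_of` values from somewhere inside the runtime (e.g. a temporary log line)
// instead of a host-side test.
#[test]
fn event_sizes() {
    use core::mem::size_of;
    use frame_system::{Config, EventRecord};
    use polkadot_runtime::Runtime;

    type Event = <Runtime as Config>::Event;
    type Hash = <Runtime as Config>::Hash;

    println!("Event: {} bytes", size_of::<Event>());
    println!("EventRecord: {} bytes", size_of::<EventRecord<Event, Hash>>());
}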

@bkchr
Member

bkchr commented Apr 11, 2022

PS:
The error message above states that it failed to allocate 66774016 bytes, which is evenly divisible by 976.
66774016 / 976 = 68416, so that would be 68k events.
On the native executor I got about 54.4k events, so this is reasonable.

You also need to take the allocation strategy into account. The capacity probably doubles each time the vector hits its current limit; otherwise you would constantly run into new allocations. So I would assume it is more like 34-40k events.
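A tiny native sketch of that effect (illustrative only; the wasm-side decode path may grow its buffer somewhat differently): pushing ~976-byte records one at a time, the capacity, and with it the size of the requested allocation, runs ahead of the actual element count by up to a factor of two.

// Illustrative only: Vec grows its capacity in doubling steps, so a single allocation of
// N bytes can correspond to roughly N/2 .. N bytes of actual elements.
#[allow(dead_code)]
struct Record([u8; 976]); // roughly the measured wasm size of one EventRecord

fn main() {
    let mut v: Vec<Record> = Vec::new();
    let mut last_cap = 0;
    for _ in 0..40_000 {
        v.push(Record([0u8; 976]));
        if v.capacity() != last_cap {
            last_cap = v.capacity();
            println!(
                "len = {:>6}, capacity = {:>6} (~{} bytes requested)",
                v.len(),
                last_cap,
                last_cap * 976
            );
        }
    }
}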

@xlc
Contributor

xlc commented Apr 11, 2022

I have brought this up multiple times, but can we just not have all the events stored in a single storage key? It is not light client / XCM friendly anyway. No one is using event topics, but we should be using them.

The only way to prove that something has happened on another chain is via events, but we don't really want to send all the events in a block just to prove a single one.

@rphmeier
Contributor

rphmeier commented Apr 13, 2022

Well, we could also optimize this on the parachains runtime side by not using FRAME events for CandidateEvents (cc @drahnr) and instead having a separate storage entry for tracking candidate events. The migration from one mode to the other would need to be done carefully, since the first block where the new runtime code is used might have a parent that didn't use the new storage entry. I'd recommend a two-part migration off of FRAME events: a first change where the new storage entry is introduced, and a following change (after the first is already running on Polkadot) where the runtime API is fully switched off of FRAME events.
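A hypothetical sketch of what such a dedicated storage entry could look like (names, pallet placement, and the clearing strategy are all assumptions, and this is not the fix that was merged):

// Hypothetical sketch only: a per-block storage item the inclusion logic appends to, so
// the runtime API no longer has to decode the entire frame_system `Events` vector.
// Clearing it in `on_initialize` of the next block is one possible strategy.
#[pallet::storage]
#[pallet::unbounded]
pub(super) type CandidateEventsThisBlock<T: Config> =
    StorageValue<_, Vec<CandidateEvent<T::Hash>>, ValueQuery>;

impl<T: Config> Pallet<T> {
    /// Called wherever a candidate is backed, included, or timed out, alongside (or
    /// instead of) depositing the corresponding FRAME event.
    fn note_candidate_event(ev: CandidateEvent<T::Hash>) {
        CandidateEventsThisBlock::<T>::append(ev);
    }
}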

@bkchr
Member

bkchr commented Apr 13, 2022

@rphmeier we already have a fix for this, so I'm not sure that is needed.

@xlc I think we don't have a proper issue and idea for this yet?

@xlc
Contributor

xlc commented Apr 13, 2022

This is one of the previous discussions: #9480 (comment)

I don't see a simple solution to this, so we really need to spend some time thinking about it and come up with some potential solutions for further discussion.

I asked Gav about it in May last year, and what I got was: "we may refactor the events system with a new cryptographic structure in the not too distant future."
