
Crash - "Signal channel is terminated and empty." #1730

Closed
crystalin opened this issue Sep 27, 2023 · 29 comments
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@crystalin

crystalin commented Sep 27, 2023

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

Moonbeam nodes v0.33.0 (based on Polkadot v0.9.43) are crashing with the following relay chain error.
This happens on multiple relay chain networks (Kusama, Polkadot, ...).

Crash logs: moonbase_crash.log

07:21:42 [Relaychain] ♻️  Reorg on #12248369,0x2556…6c25 to #12248369,0xe211…c557, common ancestor #12248368,0x20e5…ae7c
07:21:42 [Relaychain] ✨ Imported #12248369 (0xe211…c557)
07:21:42 [Relaychain] subsystem exited with error subsystem="availability-distribution-subsystem" err=FromOrigin { origin: "availability-distribution", source: IncomingMessageChannel(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
07:21:42 [Relaychain] subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))
07:21:42 [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
07:21:42 [Relaychain] Essential task `overseer` failed. Shutting down service.
07:21:42 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
07:21:42 [Relaychain] Failed to receive a message from Overseer, exiting err=Generated(Context("Signal channel is terminated and empty."))
07:21:42 [Relaychain] subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] subsystem exited with error subsystem="dispute-distribution-subsystem" err=FromOrigin { origin: "dispute-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
07:21:42 [Relaychain] subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
07:21:42 [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
07:21:42 [Relaychain] subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
07:21:42 [Relaychain] subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] subsystem exited with error subsystem="provisioner-subsystem" err=FromOrigin { origin: "provisioner", source: OverseerExited(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] subsystem exited with error subsystem="candidate-backing-subsystem" err=FromOrigin { origin: "candidate-backing", source: OverseerExited(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
07:22:42 juju-1b6dd3-0 polkadot[1442906]: Error: Service(Other("Essential task failed."))

Steps to reproduce

Running Moonbase alphanet (Moonbeam testnet) node:

@crystalin crystalin added I10-unconfirmed Issue might be valid, but it's not yet known. I2-bug The node fails to follow expected behavior. labels Sep 27, 2023
@crystalin crystalin changed the title Crash - "availability-distribution: Signal channel is terminated and empty." Crash - "Signal channel is terminated and empty." Sep 27, 2023
@bkchr
Member

bkchr commented Sep 27, 2023

Yeah, this should have been fixed for quite some time: paritytech/polkadot#6656

So maybe we are seeing some other cause here.

CC @ordian

@ScepticMatt

I have the same issue running a Zeitgeist node on v0.4.0 and v0.3.11.
Incidentally, I have seen reports on Discord about the same thing happening when running Centrifuge.

2023-10-06 07:43:54 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))
2023-10-06 07:43:54 [Relaychain] Essential task `overseer` failed. Shutting down service.
2023-10-06 07:43:54 [Relaychain] Failed to receive a message from Overseer, exiting err=Generated(Context("Signal channel is terminated and empty."))
2023-10-06 07:43:54 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")>
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="dispute-distribution-subsystem" err=FromOrigin { origin: "dispute-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminat>
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-10-06 07:43:54 [Relaychain] err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empt>
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empt>
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated>
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is term>
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="availability-distribution-subsystem" err=FromOrigin { origin: "availability-distribution", source: IncomingMessageChannel(Generated(Context("Signal cha>
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="collator-protocol-subsystem" err=FromOrigin { origin: "collator-protocol", source: SubsystemReceive(Generated(Context("Signal channel is terminated and>
Oct 06 07:44:54 moonbeam1 zeitgeist_parachain[1242055]: Error: Service(Other("Essential task failed."))
Oct 06 07:44:54 moonbeam1 systemd[1]: zeitgeist1.service: Main process exited, code=exited, status=1/FAILURE

@ordian
Member

ordian commented Oct 7, 2023

Probably another cause indeed, since the node doesn't seem to be in major sync:

07:21:24 [Relaychain] ✨ Imported #12248366 (0xc458…5d6a)
07:21:24 [Relaychain] ♻️  Reorg on #12248366,0xc458…5d6a to #12248366,0x87a6…6d4e, common ancestor #12248365,0x0629…5836
07:21:24 [Relaychain] ✨ Imported #12248366 (0x87a6…6d4e)
07:21:30 [Relaychain] ✨ Imported #12248367 (0x6d79…5208)
07:21:30 [Relaychain] ♻️  Reorg on #12248367,0x6d79…5208 to #12248367,0x9e30…b9e4, common ancestor #12248366,0x87a6…6d4e
07:21:30 [Relaychain] ✨ Imported #12248367 (0x9e30…b9e4)
07:21:30 [Relaychain] ✨ Imported #12248367 (0xf295…fa32)
07:21:36 [Relaychain] ✨ Imported #12248368 (0x20e5…ae7c)
07:21:36 [Relaychain] ✨ Imported #12248368 (0x36f3…e2e3)
07:21:41 [Relaychain] 💤 Idle (8 peers), best: #12248368 (0x20e5…ae7c), finalized #12248365 (0x0629…5836)
07:21:42 [Relaychain] ✨ Imported #12248369 (0x2556…6c25)
07:21:42 [Relaychain] ♻️  Reorg on #12248369,0x2556…6c25 to #12248369,0xe211…c557, common ancestor #12248368,0x20e5…ae7c
07:21:42 [Relaychain] ✨ Imported #12248369 (0xe211…c557)
07:21:42 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))

Other than a surprisingly high number of forks on the relay chain, there's nothing suspicious in the logs. I'm not sure about quite a few occurrences of

 HTTP serve connection failed hyper::Error(Shutdown, Os { code: 107, kind: NotConnected, message: "Transport endpoint is not connected" })

but they're probably unrelated.

@wischli

wischli commented Oct 17, 2023

We are experiencing the same issue syncing a Centrifuge full node from scratch (Centrifuge client v0.10.34, which uses Polkadot v0.9.38).

We are using neither block pruning nor state pruning.

Logs

2023-10-11T22:49:30.399707966Z 2023-10-11 22:49:30.350  INFO tokio-runtime-worker pallet_collator_selection::pallet: [Parachain] assembling new collators for new session 2070 at #3724200
2023-10-11T22:49:33.705322787Z 2023-10-11 22:49:33.701  INFO tokio-runtime-worker substrate: [Relaychain] ⚙️  Syncing 251.8 bps, target=#17679704 (40 peers), best: #2170067 (0xf26e…064b), finalized #2169859 (0xd60c…2293), ⬇ 363.8kiB/s ⬆ 217.1kiB/s
2023-10-11T22:49:33.805511207Z 2023-10-11 22:49:33.781  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Syncing 204.0 bps, target=#4026064 (14 peers), best: #3724870 (0x3c9b…c76a), finalized #0 (0xb3db…9d82), ⬇ 1.8MiB/s ⬆ 1.0kiB/s
2023-10-11T22:49:34.691086522Z 2023-10-11 22:49:34.690 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-10-11T22:49:34.691271477Z 2023-10-11 22:49:34.690 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-10-11T22:49:34.691294775Z 2023-10-11 22:49:34.690 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
2023-10-11T22:49:34.691299829Z 2023-10-11 22:49:34.690 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-10-11T22:49:34.691366965Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691385283Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691390381Z 2023-10-11 22:49:34.690  WARN tokio-runtime-worker parachain::chain-selection: [Relaychain] err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691394899Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691399568Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691404136Z 2023-10-11 22:49:34.691 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-10-11T22:49:34.691412316Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691416282Z 2023-10-11 22:49:34.691 ERROR tokio-runtime-worker parachain::collation-generation: [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2023-10-11T22:49:34.691489567Z 2023-10-11 22:49:34.691 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2023-10-11T22:49:34.691524852Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691531137Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691534507Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691539844Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691543730Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691547497Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691551114Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691554864Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691559127Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691588042Z 2023-10-11 22:49:34.691 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }

@bkchr
Member

bkchr commented Oct 18, 2023

@wischli this Polkadot version is quite old and we have already fixed some of these issues in later releases.

@wischli

wischli commented Oct 18, 2023

@wischli this Polkadot version is quite old and we have already fixed some of these issues in later releases.

Thanks for the quick response. That's what I figured. I am aware we are lagging behind, which should be resolved soon. Can we expect the issue to be fixed with Polkadot v0.9.43, or does it require at least v1.0.0?

@crystalin
Author

0.9.43 is still affected as far as I can tell

@bkchr
Member

bkchr commented Oct 20, 2023

0.9.43 is still affected as far as I can tell

@crystalin how reliably can you reproduce this?

@crystalin
Author

It seems someone is able to reproduce it without too much difficulty, @bkchr: moonbeam-foundation/moonbeam#2540

@bkchr
Member

bkchr commented Oct 30, 2023

@ordian can you please look into it?

@ordian
Member

ordian commented Oct 30, 2023

The linked issue moonbeam-foundation/moonbeam#2540 appears to be different: SubsystemStalled("approval-distribution-subsystem"). There's some ongoing work to fix that (#1178, #1191, #1941).

Regarding #1730 (comment), this indeed should have been fixed as it was caused by unnecessary processing during major sync.

#1730 (comment): Zeitgeist seems to be based on polkadot-v0.9.38, so it's likely the same major-sync issue that was fixed later.

Re original post, it doesn't seem to be easily reproducible according to moonbeam-foundation/moonbeam#2502 (comment). I'll try to repro.

In general, this error (SubsystemStalled) happens when a subsystem's message queue is full and it has been processing something for more than 10 seconds.
Normally this shouldn't happen, but maybe under heavy load db operations can take a long time?
Or maybe it triggers a hard-to-reproduce deadlock.
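
To make that mechanism concrete, here is a minimal, self-contained Rust sketch of the idea only (not the actual orchestra code): signals go into a bounded queue, and the subsystem is declared stalled if the queue stays full past a deadline. The capacity, durations and signal name are illustrative values, not the real ones.

use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Illustrative numbers only; the real node uses its own capacities and
    // a stall timeout on the order of 10 seconds.
    const SIGNAL_CAPACITY: usize = 64;
    const STALL_TIMEOUT: Duration = Duration::from_secs(2);

    let (signal_tx, signal_rx) = sync_channel::<&'static str>(SIGNAL_CAPACITY);

    // "Subsystem": pretend it is stuck in a long db operation and never
    // drains its signal queue.
    thread::spawn(move || {
        thread::sleep(Duration::from_secs(60));
        while signal_rx.recv().is_ok() {}
    });

    // "Overseer": fill the bounded signal queue...
    for _ in 0..SIGNAL_CAPACITY {
        signal_tx.send("ActiveLeaves").unwrap();
    }

    // ...then declare the subsystem stalled if the next signal cannot be
    // queued before the deadline.
    let deadline = Instant::now() + STALL_TIMEOUT;
    let stalled = loop {
        match signal_tx.try_send("ActiveLeaves") {
            Ok(()) => break false,
            Err(TrySendError::Full(_)) if Instant::now() < deadline => {
                thread::sleep(Duration::from_millis(50));
            }
            Err(_) => break true,
        }
    };
    println!("stalled = {stalled}"); // prints `stalled = true` here
}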

For av-store, there's a major difference between the parity-db and rocksdb implementations, since for parity-db we're using a different code path (a BTree index) with a mutex on top. So I am curious whether the issue is more reproducible with parity-db?

One dirty quickfix could be patching the orchestra crate to increase these timeouts.
Polkadot 0.9.43 uses v0.0.5 of orchestra (https://github.com/paritytech/polkadot/blob/v0.9.43/Cargo.lock#L5441).
Here's a branch of orchestra that does that: https://github.com/paritytech/orchestra/compare/quickfix-v0.0.5-increase-timeout.

In order to understand where this slowness is coming from, maybe we could use std::backtrace::Backtrace in subsystems 🤔
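
For illustration, a tiny hypothetical helper along those lines (the function name and the one-second budget are invented; the real subsystems would need this woven into their message loops):

use std::backtrace::Backtrace;
use std::time::{Duration, Instant};

// Hypothetical helper: run one message-processing step and, if it exceeds a
// budget, log how long it took together with the call chain that reached it.
// `Backtrace::force_capture` works even without RUST_BACKTRACE=1 (Rust 1.65+).
fn timed_step<T>(label: &str, budget: Duration, step: impl FnOnce() -> T) -> T {
    let started = Instant::now();
    let out = step();
    let elapsed = started.elapsed();
    if elapsed > budget {
        eprintln!("{label} took {elapsed:?}:\n{}", Backtrace::force_capture());
    }
    out
}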

@ordian
Member

ordian commented Oct 30, 2023

moonbeam-foundation/moonbeam#2540 (comment) this seems interesting, but then why isn't it the runtime-api subsystem that is being stalled 🤔

@ordian
Member

ordian commented Oct 30, 2023

Also, just a reminder that even 0.9.43 supports running the collator with a minimal relay chain client: https://github.com/paritytech/cumulus/blob/9cb14fe3ceec578ccfc4e797c4d9b9466931b711/client/service/src/lib.rs#L270, which doesn't even have the av-store subsystem.

@Ciejo

Ciejo commented Dec 13, 2023

Hello, we have been having the same issue for some months:

2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-12-13 11:39:15 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("dispute-coordinator-subsystem"))
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-13 11:39:15 [Relaychain] Failed to receive a message from Overseer, exiting err=Generated(Context("Signal channel is terminated and empty."))
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="candidate-backing-subsystem" err=FromOrigin { origin: "candidate-backing", source: OverseerExited(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-13 11:39:15 [Relaychain] Essential task `overseer` failed. Shutting down service.
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-13 11:39:15 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="availability-distribution-subsystem" err=FromOrigin { origin: "availability-distribution", source: IncomingMessageChannel(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="dispute-distribution-subsystem" err=FromOrigin { origin: "dispute-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="provisioner-subsystem" err=FromOrigin { origin: "provisioner", source: OverseerExited(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
2023-12-13 11:39:59 [Relaychain] subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: ChainApiSenderDropped }
2023-12-13 11:39:59 [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: FromOrigin { origin: "db", source: NotifyCancellation(Canceled) } }
2023-12-13 11:40:15 Detected running(potentially stalled) tasks on shutdown:
2023-12-13 11:40:15 Task "on-transaction-imported" (Group: transaction-pool) was still running after waiting 60 seconds to finish.
Error:
   0: Other: Essential task failed.

Backtrace omitted.
Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

What can we do?

@itsbirdo
Member

@ordian is there any further progress we can make here? The team have contacted me saying this is still an issue and their RPC is resetting daily.

Thanks in advance.

@bkchr
Member

bkchr commented Jan 23, 2024

@helloitsbirdo we need more logs. At least 10 min before the restart would be good.

@Ciejo

Ciejo commented Jan 23, 2024

@ordian hello, how are you? 21 minutes ago we had a new reboot:

2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="availability-distribution-subsystem" err=FromOrigin { origin: "availability-distribution", source: IncomingMessageChannel(Generated(Context("Signal channel is terminated and empty."))) }
2024-01-23 11:39:34 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="candidate-backing-subsystem" err=FromOrigin { origin: "candidate-backing", source: OverseerExited(Generated(Context("Signal channel is terminated and empty."))) }
2024-01-23 11:39:34 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("dispute-coordinator-subsystem"))
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2024-01-23 11:39:34 [Relaychain] Essential task `overseer` failed. Shutting down service.
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="dispute-distribution-subsystem" err=FromOrigin { origin: "dispute-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="provisioner-subsystem" err=FromOrigin { origin: "provisioner", source: OverseerExited(Generated(Context("Signal channel is terminated and empty."))) }
2024-01-23 11:39:34 [Relaychain] Failed to receive a message from Overseer, exiting err=Generated(Context("Signal channel is terminated and empty."))
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
Error:
   0: Other: Essential task failed.

Backtrace omitted.
Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

10 minutes before the reboot we have logs like the following:

2024-01-23 11:29:35 Accepting new connection 267/10000
2024-01-23 11:29:35 Accepting new connection 268/10000
2024-01-23 11:29:35 [Relaychain] 💤 Idle (18 peers), best: #19166104 (0xa5b9…797e), finalized #19166101 (0x6b26…9532), ⬇ 702.6kiB/s ⬆ 704.6kiB/s    
2024-01-23 11:29:35 [Parachain] 💤 Idle (16 peers), best: #3812258 (0x9b33…eb87), finalized #3812256 (0x0389…49e5), ⬇ 21.2kiB/s ⬆ 31.4kiB/s    
2024-01-23 11:29:35 Accepting new connection 268/10000
2024-01-23 11:29:35 Accepting new connection 268/10000
...
2024-01-23 11:29:42 Accepting new connection 270/10000
2024-01-23 11:29:43 Accepting new connection 269/10000
2024-01-23 11:29:43 [Relaychain] ♻️  Reorg on #19166104,0xa5b9…797e to #19166105,0x3002…1bc4, common ancestor #19166103,0x570d…e12b    
2024-01-23 11:29:43 [Relaychain] ✨ Imported #19166105 (0x3002…1bc4)    
2024-01-23 11:29:43 Accepting new connection 269/10000
2024-01-23 11:29:43 Accepting new connection 269/10000
2024-01-23 11:29:43 Accepting new connection 269/10000
...
2024-01-23 11:30:00 [Parachain] 💤 Idle (16 peers), best: #3812259 (0x239c…ce0d), finalized #3812258 (0x9b33…eb87), ⬇ 0.2kiB/s ⬆ 0.9kiB/s    
2024-01-23 11:30:00 [Relaychain] 💤 Idle (18 peers), best: #19166107 (0x404c…fba0), finalized #19166104 (0xa389…0d4f), ⬇ 207.9kiB/s ⬆ 177.0kiB/s    
2024-01-23 11:30:00 [Relaychain] ✨ Imported #19166108 (0xc30a…af36)    
2024-01-23 11:30:00 Accepting new connection 271/10000
2024-01-23 11:30:00 [Parachain] 💔 The bootnode you want to connect to at `/ip4/104.155.25.67/tcp/30334/p2p/12D3KooWL93x4t8c6SRBwRBwTU2nvjDuLv2uXn1zbgmPSSebGNeC` provided a different peer ID `12D3KooWQYFhu2JAnSFkP97bNuf6wXAY3uvXzK1BYCnNGXJF43tp` than the one you expect `12D3KooWL93x4t8c6SRBwRBwTU2nvjDuLv2uXn1zbgmPSSebGNeC`.    
2024-01-23 11:30:00 Accepting new connection 271/10000
2024-01-23 11:30:00 Accepting new connection 272/10000
2024-01-23 11:30:00 [Parachain] ✨ Imported #3812261 (0xadba…48f2)    

Let me know what else I can give you to help. Thank you!

@ordian
Member

ordian commented Jan 23, 2024

Thanks everyone for the logs. I've asked someone from the team to take a look at the issue. It seems to be happening more on collators than validators.

While it's being investigated, for collators specifically, we have a workaround mentioned here. @skunert do we have a guide for collators running with a minimal relay chain node? I guess that ideally requires running a separate relay chain RPC node locally and specifying --relay-chain-rpc-url to point to that node. Now that I think about it, why don't we start the embedded relay chain node with a minimal overseer?

@Ciejo

Ciejo commented Jan 23, 2024

The flags that we are running are:

--chain=composable --name=composable-network-rpc-01 --listen-addr=/ip4/0.0.0.0/tcp/30334 --prometheus-external --prometheus-port 9615 --base-path /data/parachain --execution=wasm --pruning=archive --node-key-file=/node-key --rpc-external --rpc-methods safe --db paritydb --rpc-cors=all --rpc-port 9933 --in-peers 1000 --out-peers 1000 --rpc-max-connections 10000 -- --execution=wasm --base-path /data/relaychain --listen-addr=/ip4/0.0.0.0/tcp/30333 --db paritydb --sync fast

@skunert
Contributor

skunert commented Jan 23, 2024

While it's being investigated, for collators specifically, we have a workaround mentioned here. @skunert do we have a guide for collators running with a minimal relay chain node? I guess that ideally requires running a separate relay chain RPC node locally and specifying --relay-chain-rpc-url to point to that node. Now that I think about it, why don't we start the embedded relay chain node with a minimal overseer?

The cumulus readme contains a description of the setup: https://github.com/paritytech/polkadot-sdk/tree/master/cumulus#external-relay-chain-node. At this point, I don't think we even need to recommend running the relay chain node on the same machine anymore; we have seen multiple setups that work just fine connecting to some self-hosted machine in the network.
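
As a sketch of that setup (the binary name, base path and RPC endpoint below are placeholders; only --relay-chain-rpc-url is the actual flag), the collator would be started roughly like:

parachain-collator --chain=<parachain-chain-spec> --base-path /data/parachain --relay-chain-rpc-url ws://relay-rpc.internal:9944

With that flag the node talks to the given relay chain RPC node through the minimal relay chain client mentioned above, so the full embedded relay chain overseer (including av-store) is not spawned.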

Using the minimal overseer should for sure be possible for the collators, but it has not been implemented because it was not a priority. I think it is also appealing that the standard embedded mode just spawns an off-the-shelf full node without much modification.

@bkchr
Member

bkchr commented Jan 23, 2024

10 minutes before the reboot we have logs like the following:

@Ciejo please provide the full logs and not just an excerpt. So, all the logs from 10 minutes before the restart until the restart.

@ordian
Member

ordian commented Jan 23, 2024

I'd say a quickfix would be to modify the overseer gen for collator here:

overseer_gen: polkadot_service::RealOverseerGen,

to be that of the minimal relay node, which doesn't contain av-store, dispute-coordinator, etc.
cc @s0me0ne-unkn0wn

@Ciejo

Ciejo commented Jan 23, 2024

rpc-01-logs.json
@bkchr here are the logs, sorry for the delay. Let me know what else I can do to help. Thank you!

@bkchr
Member

bkchr commented Jan 23, 2024

I'd say a quickfix would be to modify the overseer gen for collator here:

This bug has now existed for 2+ years and we haven't fixed it. I'm getting a little bit fed up with this situation. Maybe instead of adding more and more band-aids, we can finally dig in and fix it?

@bkchr
Member

bkchr commented Jan 23, 2024

@Ciejo which polkadot-sdk version is being used by this composable node you are running there?

@ordian
Member

ordian commented Jan 23, 2024

This bug has now existed for 2+ years and we haven't fixed it. I'm getting a little bit fed up with this situation. Maybe instead of adding more and more band-aids, we can finally dig in and fix it?

I don't disagree that the underlying root cause needs to be investigated. But to me it seems there are 2 different issues: one for validators, one for collators. The overseer/subsystems are designed mainly with validators in mind, so my point here is that instead of trying to make them work well on collators, we should not run them on collators in the first place.

The issue for validators definitely deserves a proper fix (and repro). I think the whole system with timeouts is a band-aid by itself; IIRC it was implemented as a poor man's detection of deadlocks between subsystems, and in that case it makes sense to shut down. However, I don't think it's unreasonable that subsystems sometimes actually take a long time to process messages (especially if run in a VM that shares CPU resources), so I would question this mechanism in the first place. If it is caused by an actual deadlock, that definitely needs to be fixed along with better prevention mechanisms.

@Ciejo

Ciejo commented Jan 24, 2024

@Ciejo which polkadot-sdk version is being used by this composable node you are running there?

Hello, we are running: v0.9.43

@bkchr
Member

bkchr commented Jan 25, 2024

Okay, that is quite old. Please upgrade to a newer version.

github-merge-queue bot pushed a commit that referenced this issue Jan 29, 2024

Currently, collators and their alongside nodes spin up a full-scale
overseer running a bunch of subsystems that are not needed if the node
is not a validator. That was considered to be harmless; however, we've
got problems with unused subsystems getting stalled for a reason not
currently known, resulting in the overseer exiting and bringing down the
whole node.

This PR aims to only run needed subsystems on such nodes, replacing the
rest with `DummySubsystem`.

It also enables the collator-optimized availability-recovery subsystem
implementation.

Partially solves #1730.
@eskimor eskimor moved this from Backlog to Completed in parachains team board Jan 30, 2024
@eskimor
Member

eskimor commented Jan 30, 2024

I believe this is fixed by #3061, with tickets existing for further improvements. Feel free to re-open if I got this wrong.

@eskimor eskimor closed this as completed Jan 30, 2024