
Crash - "Signal channel is terminated and empty." #1730

Closed
crystalin opened this issue Sep 27, 2023 · 29 comments
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@crystalin

crystalin commented Sep 27, 2023

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

Moonbeam nodes v0.33.0 (based on Polkadot v0.9.43) are crashing with the following relay chain error.
This happens on multiple relay chain networks (Kusama, Polkadot, ...).

Crash logs: moonbase_crash.log

07:21:42 [Relaychain] ♻️  Reorg on #12248369,0x2556…6c25 to #12248369,0xe211…c557, common ancestor #12248368,0x20e5…ae7c
07:21:42 [Relaychain] ✨ Imported #12248369 (0xe211…c557)
07:21:42 [Relaychain] subsystem exited with error subsystem="availability-distribution-subsystem" err=FromOrigin { origin: "availability-distribution", source: IncomingMessageChannel(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
07:21:42 [Relaychain] subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))
07:21:42 [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
07:21:42 [Relaychain] Essential task `overseer` failed. Shutting down service.
07:21:42 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
07:21:42 [Relaychain] Failed to receive a message from Overseer, exiting err=Generated(Context("Signal channel is terminated and empty."))
07:21:42 [Relaychain] subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] subsystem exited with error subsystem="dispute-distribution-subsystem" err=FromOrigin { origin: "dispute-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
07:21:42 [Relaychain] subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
07:21:42 [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
07:21:42 [Relaychain] subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
07:21:42 [Relaychain] subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] subsystem exited with error subsystem="provisioner-subsystem" err=FromOrigin { origin: "provisioner", source: OverseerExited(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] subsystem exited with error subsystem="candidate-backing-subsystem" err=FromOrigin { origin: "candidate-backing", source: OverseerExited(Generated(Context("Signal channel is terminated and empty."))) }
07:21:42 [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
07:22:42 juju-1b6dd3-0 polkadot[1442906]: Error: Service(Other("Essential task failed."))

Steps to reproduce

Running Moonbase alphanet (Moonbeam testnet) node:

@crystalin crystalin added I10-unconfirmed Issue might be valid, but it's not yet known. I2-bug The node fails to follow expected behavior. labels Sep 27, 2023
@crystalin crystalin changed the title Crash - "availability-distribution: Signal channel is terminated and empty." Crash - "Signal channel is terminated and empty." Sep 27, 2023
@bkchr
Member

bkchr commented Sep 27, 2023

Yeah, this should have been fixed for quite some time: paritytech/polkadot#6656

So maybe we are seeing some other cause here.

CC @ordian

@ScepticMatt

I have the same issue running a Zeitgeist node on v0.4.0 and v0.3.11.
Incidentally, I have seen reports on Discord about the same thing happening when running Centrifuge.

2023-10-06 07:43:54 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))
2023-10-06 07:43:54 [Relaychain] Essential task `overseer` failed. Shutting down service.
2023-10-06 07:43:54 [Relaychain] Failed to receive a message from Overseer, exiting err=Generated(Context("Signal channel is terminated and empty."))
2023-10-06 07:43:54 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")>
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="dispute-distribution-subsystem" err=FromOrigin { origin: "dispute-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminat>
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-10-06 07:43:54 [Relaychain] err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empt>
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empt>
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated>
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is term>
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="availability-distribution-subsystem" err=FromOrigin { origin: "availability-distribution", source: IncomingMessageChannel(Generated(Context("Signal cha>
2023-10-06 07:43:54 [Relaychain] subsystem exited with error subsystem="collator-protocol-subsystem" err=FromOrigin { origin: "collator-protocol", source: SubsystemReceive(Generated(Context("Signal channel is terminated and>
Oct 06 07:44:54 moonbeam1 zeitgeist_parachain[1242055]: Error: Service(Other("Essential task failed."))
Oct 06 07:44:54 moonbeam1 systemd[1]: zeitgeist1.service: Main process exited, code=exited, status=1/FAILURE

@ordian
Member

ordian commented Oct 7, 2023

Probably another cause indeed, since the node doesn't seem to be in major sync:

07:21:24 [Relaychain] ✨ Imported #12248366 (0xc458…5d6a)
07:21:24 [Relaychain] ♻️  Reorg on #12248366,0xc458…5d6a to #12248366,0x87a6…6d4e, common ancestor #12248365,0x0629…5836
07:21:24 [Relaychain] ✨ Imported #12248366 (0x87a6…6d4e)
07:21:30 [Relaychain] ✨ Imported #12248367 (0x6d79…5208)
07:21:30 [Relaychain] ♻️  Reorg on #12248367,0x6d79…5208 to #12248367,0x9e30…b9e4, common ancestor #12248366,0x87a6…6d4e
07:21:30 [Relaychain] ✨ Imported #12248367 (0x9e30…b9e4)
07:21:30 [Relaychain] ✨ Imported #12248367 (0xf295…fa32)
07:21:36 [Relaychain] ✨ Imported #12248368 (0x20e5…ae7c)
07:21:36 [Relaychain] ✨ Imported #12248368 (0x36f3…e2e3)
07:21:41 [Relaychain] 💤 Idle (8 peers), best: #12248368 (0x20e5…ae7c), finalized #12248365 (0x0629…5836)
07:21:42 [Relaychain] ✨ Imported #12248369 (0x2556…6c25)
07:21:42 [Relaychain] ♻️  Reorg on #12248369,0x2556…6c25 to #12248369,0xe211…c557, common ancestor #12248368,0x20e5…ae7c
07:21:42 [Relaychain] ✨ Imported #12248369 (0xe211…c557)
07:21:42 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))

Other than a surprisingly high number of forks on the relay chain, there's nothing suspicious in the logs. I'm not sure about quite a few occurrences of

 HTTP serve connection failed hyper::Error(Shutdown, Os { code: 107, kind: NotConnected, message: "Transport endpoint is not connected" })

but they're probably unrelated.

@wischli

wischli commented Oct 17, 2023

We are experiencing the same issue syncing a Centrifuge full node from scratch (Centrifuge client v0.10.34, which uses Polkadot v0.9.38).

We are using neither block pruning nor state pruning.

Logs

2023-10-11T22:49:30.399707966Z 2023-10-11 22:49:30.350  INFO tokio-runtime-worker pallet_collator_selection::pallet: [Parachain] assembling new collators for new session 2070 at #3724200
2023-10-11T22:49:33.705322787Z 2023-10-11 22:49:33.701  INFO tokio-runtime-worker substrate: [Relaychain] ⚙️  Syncing 251.8 bps, target=#17679704 (40 peers), best: #2170067 (0xf26e…064b), finalized #2169859 (0xd60c…2293), ⬇ 363.8kiB/s ⬆ 217.1kiB/s
2023-10-11T22:49:33.805511207Z 2023-10-11 22:49:33.781  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Syncing 204.0 bps, target=#4026064 (14 peers), best: #3724870 (0x3c9b…c76a), finalized #0 (0xb3db…9d82), ⬇ 1.8MiB/s ⬆ 1.0kiB/s
2023-10-11T22:49:34.691086522Z 2023-10-11 22:49:34.690 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-10-11T22:49:34.691271477Z 2023-10-11 22:49:34.690 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-10-11T22:49:34.691294775Z 2023-10-11 22:49:34.690 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
2023-10-11T22:49:34.691299829Z 2023-10-11 22:49:34.690 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-10-11T22:49:34.691366965Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691385283Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691390381Z 2023-10-11 22:49:34.690  WARN tokio-runtime-worker parachain::chain-selection: [Relaychain] err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691394899Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691399568Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691404136Z 2023-10-11 22:49:34.691 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-10-11T22:49:34.691412316Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691416282Z 2023-10-11 22:49:34.691 ERROR tokio-runtime-worker parachain::collation-generation: [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2023-10-11T22:49:34.691489567Z 2023-10-11 22:49:34.691 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2023-10-11T22:49:34.691524852Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691531137Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691534507Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691539844Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691543730Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691547497Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691551114Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691554864Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691559127Z 2023-10-11 22:49:34.691  WARN tokio-runtime-worker parachain::dispute-coordinator: [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-10-11T22:49:34.691588042Z 2023-10-11 22:49:34.691 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }

@bkchr
Member

bkchr commented Oct 18, 2023

@wischli this Polkadot version is quite old and we have already fixed some of these issues in later releases.

@wischli

wischli commented Oct 18, 2023

@wischli this Polkadot version is quite old and we have already fixed some of these issues in later releases.

Thanks for the quick response. That's what I figured. I am aware we are lagging behind, which should be resolved soon. Can we expect the issue to be fixed with Polkadot v0.9.43, or does it require at least v1.0.0?

@crystalin
Author

0.9.43 is still affected as far as I can tell

@bkchr
Member

bkchr commented Oct 20, 2023

0.9.43 is still affected as far as I can tell

@crystalin how reliably can you reproduce this?

@crystalin
Author

It seems someone is able to reproduce it without too much difficulty, @bkchr: moonbeam-foundation/moonbeam#2540

@bkchr
Member

bkchr commented Oct 30, 2023

@ordian can you please look into it?

@ordian
Member

ordian commented Oct 30, 2023

The linked issue moonbeam-foundation/moonbeam#2540 appears to be different: SubsystemStalled("approval-distribution-subsystem"). There's some ongoing work to fix that (#1178, #1191, #1941).

Regarding #1730 (comment), this indeed should have been fixed as it was caused by unnecessary processing during major sync.

#1730 (comment): Zeitgeist seems to be based on polkadot-v0.9.38, so it's likely the same major-sync issue that was fixed later.

Re original post, it doesn't seem to be easily reproducible according to moonbeam-foundation/moonbeam#2502 (comment). I'll try to repro.

In general, this error (SubsystemStalled) happens when a subsystem's message queue is full and it has been processing something for more than 10 seconds.
Normally this shouldn't happen, but maybe under heavy load db operations can take a long time?
Or maybe it triggers a hard-to-reproduce deadlock.
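
To make that mechanism concrete, here is a minimal, self-contained Rust sketch of the idea only (not the actual orchestra code): signals go into a bounded queue, and the subsystem is declared stalled if the queue stays full past a deadline. The capacity, durations and signal name are illustrative values, not the real ones.

use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Illustrative numbers only; the real node uses its own capacities and
    // a stall timeout on the order of 10 seconds.
    const SIGNAL_CAPACITY: usize = 64;
    const STALL_TIMEOUT: Duration = Duration::from_secs(2);

    let (signal_tx, signal_rx) = sync_channel::<&'static str>(SIGNAL_CAPACITY);

    // "Subsystem": pretend it is stuck in a long db operation and never
    // drains its signal queue.
    thread::spawn(move || {
        thread::sleep(Duration::from_secs(60));
        while signal_rx.recv().is_ok() {}
    });

    // "Overseer": fill the bounded signal queue...
    for _ in 0..SIGNAL_CAPACITY {
        signal_tx.send("ActiveLeaves").unwrap();
    }

    // ...then declare the subsystem stalled if the next signal cannot be
    // queued before the deadline.
    let deadline = Instant::now() + STALL_TIMEOUT;
    let stalled = loop {
        match signal_tx.try_send("ActiveLeaves") {
            Ok(()) => break false,
            Err(TrySendError::Full(_)) if Instant::now() < deadline => {
                thread::sleep(Duration::from_millis(50));
            }
            Err(_) => break true,
        }
    };
    println!("stalled = {stalled}"); // prints `stalled = true` here
}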

For av-store, there's a major difference between the parity-db and rocksdb implementations, since for parity-db we're using a different code path (a BTree index) with a mutex on top. So I am curious whether the issue is more reproducible with parity-db?

One dirty quickfix could be patching the orchestra crate to increase these timeouts.
Polkadot 0.9.43 uses v0.0.5 of orchestra (https://github.com/paritytech/polkadot/blob/v0.9.43/Cargo.lock#L5441).
Here's a branch of orchestra that does that: https://github.com/paritytech/orchestra/compare/quickfix-v0.0.5-increase-timeout.

In order to understand where this slowness is coming from, maybe we could use std::backtrace::Backtrace in subsystems 🤔
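
For illustration, a tiny hypothetical helper along those lines (the function name and the one-second budget are invented; the real subsystems would need this woven into their message loops):

use std::backtrace::Backtrace;
use std::time::{Duration, Instant};

// Hypothetical helper: run one message-processing step and, if it exceeds a
// budget, log how long it took together with the call chain that reached it.
// `Backtrace::force_capture` works even without RUST_BACKTRACE=1 (Rust 1.65+).
fn timed_step<T>(label: &str, budget: Duration, step: impl FnOnce() -> T) -> T {
    let started = Instant::now();
    let out = step();
    let elapsed = started.elapsed();
    if elapsed > budget {
        eprintln!("{label} took {elapsed:?}:\n{}", Backtrace::force_capture());
    }
    out
}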

@ordian
Member

ordian commented Oct 30, 2023

moonbeam-foundation/moonbeam#2540 (comment) this seems interesting, but then why isn't it the runtime-api subsystem that is being stalled 🤔

@ordian
Member

ordian commented Oct 30, 2023

Also, just a reminder that even 0.9.43 supports running the collator with a minimal relay chain client: https://github.com/paritytech/cumulus/blob/9cb14fe3ceec578ccfc4e797c4d9b9466931b711/client/service/src/lib.rs#L270, which doesn't even have the av-store subsystem.

@Ciejo

Ciejo commented Dec 13, 2023

Hello, we have been having the same issue for some months:

2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-12-13 11:39:15 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("dispute-coordinator-subsystem"))
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-13 11:39:15 [Relaychain] Failed to receive a message from Overseer, exiting err=Generated(Context("Signal channel is terminated and empty."))
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="candidate-backing-subsystem" err=FromOrigin { origin: "candidate-backing", source: OverseerExited(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-13 11:39:15 [Relaychain] Essential task `overseer` failed. Shutting down service.
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-13 11:39:15 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="availability-distribution-subsystem" err=FromOrigin { origin: "availability-distribution", source: IncomingMessageChannel(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="dispute-distribution-subsystem" err=FromOrigin { origin: "dispute-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="provisioner-subsystem" err=FromOrigin { origin: "provisioner", source: OverseerExited(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-12-13 11:39:15 [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
2023-12-13 11:39:59 [Relaychain] subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: ChainApiSenderDropped }
2023-12-13 11:39:59 [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: FromOrigin { origin: "db", source: NotifyCancellation(Canceled) } }
2023-12-13 11:40:15 Detected running(potentially stalled) tasks on shutdown:
2023-12-13 11:40:15 Task "on-transaction-imported" (Group: transaction-pool) was still running after waiting 60 seconds to finish.
Error:
   0: Other: Essential task failed.

Backtrace omitted.
Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

What can we do?

@itsbirdo
Member

@ordian is there any further progress we can make here? The team have contacted me saying this is still an issue and their RPC is resetting daily.

Thanks in advance.

@bkchr
Member

bkchr commented Jan 23, 2024

@helloitsbirdo we need more logs. At least 10 min before the restart would be good.

@Ciejo

Ciejo commented Jan 23, 2024

@ordian hello, how are you? 21 minutes ago we had a new reboot:

2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="availability-distribution-subsystem" err=FromOrigin { origin: "availability-distribution", source: IncomingMessageChannel(Generated(Context("Signal channel is terminated and empty."))) }
2024-01-23 11:39:34 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="candidate-backing-subsystem" err=FromOrigin { origin: "candidate-backing", source: OverseerExited(Generated(Context("Signal channel is terminated and empty."))) }
2024-01-23 11:39:34 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("dispute-coordinator-subsystem"))
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2024-01-23 11:39:34 [Relaychain] Essential task `overseer` failed. Shutting down service.
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="dispute-distribution-subsystem" err=FromOrigin { origin: "dispute-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="provisioner-subsystem" err=FromOrigin { origin: "provisioner", source: OverseerExited(Generated(Context("Signal channel is terminated and empty."))) }
2024-01-23 11:39:34 [Relaychain] Failed to receive a message from Overseer, exiting err=Generated(Context("Signal channel is terminated and empty."))
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
2024-01-23 11:39:34 [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
Error:
   0: Other: Essential task failed.

Backtrace omitted.
Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

10 minutes before the reboot we have logs like the following:

2024-01-23 11:29:35 Accepting new connection 267/10000
2024-01-23 11:29:35 Accepting new connection 268/10000
2024-01-23 11:29:35 [Relaychain] 💤 Idle (18 peers), best: #19166104 (0xa5b9…797e), finalized #19166101 (0x6b26…9532), ⬇ 702.6kiB/s ⬆ 704.6kiB/s    
2024-01-23 11:29:35 [Parachain] 💤 Idle (16 peers), best: #3812258 (0x9b33…eb87), finalized #3812256 (0x0389…49e5), ⬇ 21.2kiB/s ⬆ 31.4kiB/s    
2024-01-23 11:29:35 Accepting new connection 268/10000
2024-01-23 11:29:35 Accepting new connection 268/10000
...
2024-01-23 11:29:42 Accepting new connection 270/10000
2024-01-23 11:29:43 Accepting new connection 269/10000
2024-01-23 11:29:43 [Relaychain] ♻️  Reorg on #19166104,0xa5b9…797e to #19166105,0x3002…1bc4, common ancestor #19166103,0x570d…e12b    
2024-01-23 11:29:43 [Relaychain] ✨ Imported #19166105 (0x3002…1bc4)    
2024-01-23 11:29:43 Accepting new connection 269/10000
2024-01-23 11:29:43 Accepting new connection 269/10000
2024-01-23 11:29:43 Accepting new connection 269/10000
...
2024-01-23 11:30:00 [Parachain] 💤 Idle (16 peers), best: #3812259 (0x239c…ce0d), finalized #3812258 (0x9b33…eb87), ⬇ 0.2kiB/s ⬆ 0.9kiB/s    
2024-01-23 11:30:00 [Relaychain] 💤 Idle (18 peers), best: #19166107 (0x404c…fba0), finalized #19166104 (0xa389…0d4f), ⬇ 207.9kiB/s ⬆ 177.0kiB/s    
2024-01-23 11:30:00 [Relaychain] ✨ Imported #19166108 (0xc30a…af36)    
2024-01-23 11:30:00 Accepting new connection 271/10000
2024-01-23 11:30:00 [Parachain] 💔 The bootnode you want to connect to at `/ip4/104.155.25.67/tcp/30334/p2p/12D3KooWL93x4t8c6SRBwRBwTU2nvjDuLv2uXn1zbgmPSSebGNeC` provided a different peer ID `12D3KooWQYFhu2JAnSFkP97bNuf6wXAY3uvXzK1BYCnNGXJF43tp` than the one you expect `12D3KooWL93x4t8c6SRBwRBwTU2nvjDuLv2uXn1zbgmPSSebGNeC`.    
2024-01-23 11:30:00 Accepting new connection 271/10000
2024-01-23 11:30:00 Accepting new connection 272/10000
2024-01-23 11:30:00 [Parachain] ✨ Imported #3812261 (0xadba…48f2)    

Let me know what else I can give you to help. Thank you!

@ordian
Member

ordian commented Jan 23, 2024

Thanks everyone for the logs. I've asked someone from the team to take a look at the issue. It seems to be happening more on collators than validators.

While it's being investigated, for collators specifically, we have a workaround mentioned here. @skunert do we have a guide for collators running with a minimal relay chain node? I guess that ideally requires running a separate relay chain RPC node locally and specifying --relay-chain-rpc-url to point to that node. Now that I think about it, why don't we start the embedded relay chain node with a minimal overseer?

@Ciejo

Ciejo commented Jan 23, 2024

The flags that we are running are:

--chain=composable --name=composable-network-rpc-01 --listen-addr=/ip4/0.0.0.0/tcp/30334 --prometheus-external --prometheus-port 9615 --base-path /data/parachain --execution=wasm --pruning=archive --node-key-file=/node-key --rpc-external --rpc-methods safe --db paritydb --rpc-cors=all --rpc-port 9933 --in-peers 1000 --out-peers 1000 --rpc-max-connections 10000 -- --execution=wasm --base-path /data/relaychain --listen-addr=/ip4/0.0.0.0/tcp/30333 --db paritydb --sync fast

@skunert
Contributor

skunert commented Jan 23, 2024

While it's being investigated, for collators specifically, we have a workaround mentioned here. @skunert do we have a guide for collators running with a minimal relay chain node? I guess that ideally requires running a separate relay chain RPC node locally and specifying --relay-chain-rpc-url to point to that node. Now that I think about it, why don't we start the embedded relay chain node with a minimal overseer?

The cumulus readme contains a description of the setup: https://github.com/paritytech/polkadot-sdk/tree/master/cumulus#external-relay-chain-node. At this point, I don't think we even need to recommend running the relay chain node on the same machine anymore; we have seen multiple setups that work just fine connecting to some self-hosted machine in the network.
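
As a sketch of that setup (the binary name, base path and RPC endpoint below are placeholders; only --relay-chain-rpc-url is the actual flag), the collator would be started roughly like:

parachain-collator --chain=<parachain-chain-spec> --base-path /data/parachain --relay-chain-rpc-url ws://relay-rpc.internal:9944

With that flag the node talks to the given relay chain RPC node through the minimal relay chain client mentioned above, so the full embedded relay chain overseer (including av-store) is not spawned.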

Using the minimal overseer should for sure be possible for the collators, but it has not been implemented because it was not a priority. I think it is also appealing that the standard embedded mode just spawns an off-the-shelf full node without much modification.

@bkchr
Member

bkchr commented Jan 23, 2024

10 minutes before the reboot we have logs like the following:

@Ciejo please provide the full logs and not just an excerpt. So, all the logs from 10 minutes before the restart until the restart.

@ordian
Member

ordian commented Jan 23, 2024

I'd say a quickfix would be to modify the overseer gen for collator here:

overseer_gen: polkadot_service::RealOverseerGen,

to be that of the minimal relay node, which doesn't contain av-store, dispute-coordinator, etc.
cc @s0me0ne-unkn0wn

@Ciejo

Ciejo commented Jan 23, 2024

rpc-01-logs.json
@bkchr here are the logs, sorry for the delay. Let me know what else I can do to help. Thank you!

@bkchr
Member

bkchr commented Jan 23, 2024

I'd say a quickfix would be to modify the overseer gen for collator here:

This bug has now existed for 2+ years and we haven't fixed it. I'm getting a little bit fed up with this situation. Maybe instead of adding more and more band-aids, we can finally dig in and fix it?

@bkchr
Member

bkchr commented Jan 23, 2024

@Ciejo which polkadot-sdk version is being used by this composable node you are running there?

@ordian
Member

ordian commented Jan 23, 2024

This bug has now existed for 2+ years and we haven't fixed it. I'm getting a little bit fed up with this situation. Maybe instead of adding more and more band-aids, we can finally dig in and fix it?

I don't disagree that the underlying root cause needs to be investigated. But to me it seems there are 2 different issues: one for validators, one for collators. The overseer/subsystems are designed mainly with validators in mind, so my point here is that instead of trying to make them work well on collators, we should not run them on collators in the first place.

The issue for validators definitely deserves a proper fix (and repro). I think the whole system with timeouts is a band-aid by itself; IIRC it was implemented as a poor man's detection of deadlocks between subsystems, and in that case it makes sense to shut down. However, I don't think it's unreasonable that subsystems sometimes actually take a long time to process messages (especially if run in a VM that shares CPU resources), so I would question this mechanism in the first place. If it is caused by an actual deadlock, that definitely needs to be fixed along with better prevention mechanisms.

@Ciejo

Ciejo commented Jan 24, 2024

@Ciejo which polkadot-sdk version is being used by this composable node you are running there?

Hello, we are running: v0.9.43

@bkchr
Member

bkchr commented Jan 25, 2024

Okay, that is quite old. Please upgrade to a newer version.

github-merge-queue bot pushed a commit that referenced this issue Jan 29, 2024

Currently, collators and their alongside nodes spin up a full-scale
overseer running a bunch of subsystems that are not needed if the node
is not a validator. That was considered to be harmless; however, we've
got problems with unused subsystems getting stalled for a reason not
currently known, resulting in the overseer exiting and bringing down the
whole node.

This PR aims to only run needed subsystems on such nodes, replacing the
rest with `DummySubsystem`.

It also enables the collator-optimized availability-recovery subsystem
implementation.

Partially solves #1730.
@eskimor eskimor moved this from Backlog to Completed in parachains team board Jan 30, 2024
@eskimor
Member

eskimor commented Jan 30, 2024

I believe this is fixed by #3061, with tickets existing for further improvements. Feel free to re-open if I got this wrong.

@eskimor eskimor closed this as completed Jan 30, 2024