This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Collators stop producing blocks #384

Closed
crystalin opened this issue Mar 30, 2021 · 8 comments

Comments

crystalin commented Mar 30, 2021

Our stagenet network runs on the latest rococo-v1 (#9d89ed65) with 4 validators, 4 collators, 2 bootnodes, and 2 RPC nodes.
After running for ~1 day, the collators stop producing blocks (not all at the same height, but close to each other). We no longer see any "Prepared for block ....." messages.

I purged one of the collators and re-ran it with -l'info,rpc=trace,sycollator=trace,collation=trace,sync=trace,parachain=trace'
moon.log

crystalin commented Mar 30, 2021

To reproduce it (we will keep the network in that state for 1 or 2 days):

docker run -p 9933:9933 -p 9944:9944 -p 30334:30334 -p 30333:30333 \
 purestake/moonbase-stagenet:sha-941a7aac-3 /moonbase-parachain/moonbeam \
 --collator \
 --author-id 0x62d2e7324f9274fac3893a59aff8e944a323a495 \
 --chain /moonbase-parachain/parachain-raw-specs.json \
 --base-path=/data \
 --name="Test Collator Cumulus 1"  \
 --rpc-port 9933  \
 --ws-port 9944 \
 --execution wasm \
 --wasm-execution compiled \
 --pruning=archive  \
 -l'info,rpc=trace,sycollator=trace,collation=trace,sync=trace,parachain=trace'  \
 --  \
   --name 'Test Collator Cumulus 1' \
   --chain /moonbase-parachain/rococo-raw-specs.json \
 2>&1 | tee collator.log

crystalin commented Mar 31, 2021

@rphmeier I identified where block production fails to start.
It happens in handle_new_activations
(https://github.com/paritytech/polkadot/blob/f89020dafeec01f949b42e7d3257b5840be47f96/node/collation-generation/src/lib.rs#L200)
The number of availability_cores is 0:

availability_cores=[]
n_validators=4
groups=([], GroupRotationInfo { session_start_block: 19036, group_rotation_frequency: 20, now: 19042 })

I believe this prevents the collator from producing blocks.

Do you have any idea why no core would be available?

crystalin commented Mar 31, 2021

I used sudo to set max_validators_per_core to 2 (we have 4 validators). The collator logs now show:

availability_cores=[Free, Free]
n_validators=4
groups=([[ValidatorIndex(0), ValidatorIndex(1)], [ValidatorIndex(2), ValidatorIndex(3)]], GroupRotationInfo { session_start_block: 19056, group_rotation_frequency: 20, now: 19061 })
core is free. Keep going. core_idx=0
core is free. Keep going. core_idx=1

It is still not producing blocks, but at least it gets one step further.

(I checked against a freshly restarted network; here are the expected logs.)

availability_cores=[Scheduled(ScheduledCore { para_id: Id(1000), collator: None })] 
n_validators=2 
groups=([[ValidatorIndex(0), ValidatorIndex(1)]], GroupRotationInfo { session_start_block: 40, group_rotation_frequency: 20, now: 45 })
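
The three log excerpts above can be read against a simplified model of the per-core check. This is only a sketch, not the Polkadot code: `CoreState` and `cores_for_para` are hypothetical simplifications, assuming a collator only builds a collation for a core that is scheduled for its own para ID.

```rust
/// Simplified, hypothetical stand-in for the relay chain's per-core state.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum CoreState {
    /// A para is scheduled on this core and may submit a collation.
    Scheduled { para_id: u32 },
    /// The core is still occupied by a previous candidate.
    Occupied,
    /// No para is assigned to this core.
    Free,
}

/// Returns the indices of cores on which `our_para` could collate.
/// An empty result means the collator has nothing to build on.
fn cores_for_para(cores: &[CoreState], our_para: u32) -> Vec<usize> {
    cores
        .iter()
        .enumerate()
        .filter_map(|(idx, core)| match core {
            CoreState::Scheduled { para_id } if *para_id == our_para => Some(idx),
            // Free, occupied, or scheduled for another para: skip this core.
            _ => None,
        })
        .collect()
}
```

Under this model, both `[]` and `[Free, Free]` yield no usable core for para 1000, while the freshly restarted network's `[Scheduled(1000)]` yields core 0. That would explain why adding cores through max_validators_per_core gets further in the logs without restoring block production.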

@crystalin
Author

Could this be related to the new rococo-v1 having leases/auctions/...? We registered our parachain with sudoScheduleParaInitialize.

crystalin changed the title from "Collator stops producing blocks" to "Collators stop producing blocks" on Mar 31, 2021
crystalin commented Mar 31, 2021

This has happened twice on our network. Each time, after a purge, block production stopped around parachain block ~6300 (6266 the first time and 6307 the second, per the logs below), which I don't think is a coincidence.

03/26 - 15:25:48: imported parachain #1 (at relaychain block #127)
03/27 - 15:14:36: last imported parachain #6266 (at relaychain block #14413)

03/29 - 14:25:06: imported parachain #1 (at relaychain block #110)
03/30 - 14:25:54: last imported parachain #6307 (at relaychain block #14410)
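
As a quick sanity check on these timestamps, the average relay block time for each run can be worked out from the log lines above. This is a rough sketch: the helper names are made up for illustration, and it assumes the two log lines of each run are exactly one calendar day apart, as the dates suggest.

```rust
/// Seconds since midnight for a wall-clock time.
fn secs_of_day(h: u64, m: u64, s: u64) -> u64 {
    h * 3600 + m * 60 + s
}

/// Average seconds per relay block between two log lines that are
/// `days_apart` calendar days apart. Times are (hour, minute, second).
fn avg_block_time(
    start: (u64, u64, u64),
    end: (u64, u64, u64),
    days_apart: u64,
    first_block: u64,
    last_block: u64,
) -> f64 {
    // Add the day offset first so the subtraction cannot underflow.
    let elapsed = days_apart * 86_400 + secs_of_day(end.0, end.1, end.2)
        - secs_of_day(start.0, start.1, start.2);
    elapsed as f64 / (last_block - first_block) as f64
}
```

Calling `avg_block_time((15, 25, 48), (15, 14, 36), 1, 127, 14413)` for the first run and `avg_block_time((14, 25, 6), (14, 25, 54), 1, 110, 14410)` for the second gives roughly 6 seconds per relay block in both cases, so both stalls land about one day after relay chain genesis.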

@JoshOrndorff
Contributor

The relay chain blocks are even closer together. 14400 is how many relay blocks we expect in one day. Let's look through our relay chain constants and see whether any are set to one day. If we find anything suspicious, let's lower that parameter and see if we can reproduce this in less than one day.

@crystalin
Author

@rphmeier You mentioned that leasing is on a different layer and should not be related to this issue, but looking at Rococo, the only constant I can find that equals 14400 blocks is https://github.com/paritytech/polkadot/blob/e8050450b71aef41210f7fa5ec2bca6d494ef610/runtime/rococo/src/lib.rs#L722

pub const LeasePeriod: BlockNumber = 1 * DAYS;
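
To see how this constant lines up with the stall, here is a small sketch of the arithmetic. It assumes a 6-second relay block time (which gives the 14400 blocks per day mentioned above) and that lease periods start at block 0 with no offset. `lease_period_index` is a hypothetical helper for illustration, not the runtime's own function.

```rust
type BlockNumber = u32;

// Assumed relay chain timing: one block every 6 seconds.
const MILLISECS_PER_BLOCK: u64 = 6_000;
const MINUTES: BlockNumber = (60_000 / MILLISECS_PER_BLOCK) as BlockNumber;
const HOURS: BlockNumber = MINUTES * 60;
const DAYS: BlockNumber = HOURS * 24;

// Value quoted from the Rococo runtime in the comment above.
const LEASE_PERIOD: BlockNumber = 1 * DAYS;

/// Hypothetical helper: which lease period a relay block falls into,
/// assuming periods start at block 0 with no offset.
fn lease_period_index(relay_block: BlockNumber) -> BlockNumber {
    relay_block / LEASE_PERIOD
}
```

With these assumptions `DAYS` is 14400, the relay blocks at which the parachain was first imported (110 and 127) fall in lease period 0, and the blocks of the last import (14410 and 14413) fall in lease period 1. The collators therefore stopped just after the first lease period boundary.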

crystalin commented Apr 1, 2021

Mystery solved.
The parachain was disabled once its lease expired (it now appears as a parathread in paraLifecycles).
The way to enforce a longer lease period is to call slot::ForceLease via sudo.
