
network: Some network error occurred when fetching erasure chunk #6587

Open
lexnv opened this issue Nov 21, 2024 · 7 comments
Labels
I2-bug The node fails to follow expected behavior.

Comments

lexnv (Contributor) commented Nov 21, 2024

The Kusama litep2p node is spamming the following warnings after a while:

WARN tokio-runtime-worker parachain::availability-distribution: Some network error occurred when fetching erasure chunk origin=Public(a85bf9402a1fc5d3732952744ccb9ff0990e8c101d0da3cd01efdff98e26f420 (GP4tF1if...)) relay_parent=0x01f4bee6dd8c1d2397265722fcdc3ce7946ec33188e7df2d208686bf0ca9a7f9 group_index=GroupIndex(32) session_index=43676 chunk_index=ValidatorIndex(273) candidate_hash=0x6db57a3a09f7caaa617b60db14433895ca6cd9a14b3b3e475ee53f2e1abb9779 err=Network(DialFailure) traceID=145828136324240632105239557157853280405

Offhand it looks like the node is not able to dial peers (DialFailure). Pending confirmation on a libp2p node.

Version deployed: 1.16.1-ca8beaed148

Grafana link: https://grafana.teleport.parity.io/goto/K-F2Od7NR?orgId=1

cc @paritytech/sdk-node

@lexnv lexnv added the I2-bug The node fails to follow expected behavior. label Nov 21, 2024
lexnv (Contributor, Author) commented Nov 21, 2024

Reproduces on libp2p as well, not isolated to litep2p:

2024-11-21 16:16:27.030  INFO main sub-libp2p: Running libp2p network backend
...
2024-11-21 16:16:30.556  WARN tokio-runtime-worker parachain::availability-distribution: Some network error occurred when fetching erasure chunk origin=Public(500cc2cfa106832f103b44c90db4fe4f57dc30ef81073e0bf212c4d0574d355f (EPH935R1...)) relay_parent=0x72fd5cfacc625b24eee0db33cfff3415bb6806d0e2301b6c42db910e082ed409 group_index=GroupIndex(53) session_index=43678 chunk_index=ValidatorIndex(121) candidate_hash=0x68b93303dd6899341aaed21d4de493e8dd56472be8aeb64774fbdf02eeec7bb8 err=Network(DialFailure) traceID=139201321189557106610241167805705917416
...
2024-11-21 16:16:36.595  WARN tokio-runtime-worker parachain::availability-distribution: Some network error occurred when fetching erasure chunk origin=Public(a6fbda06023bce68011ad74b4a520be902f18db526a27f794facf69aedca455d (GMGHKdBn...)) relay_parent=0x72fd5cfacc625b24eee0db33cfff3415bb6806d0e2301b6c42db910e082ed409 group_index=GroupIndex(78) session_index=43678 chunk_index=ValidatorIndex(121) candidate_hash=0xce2ba85e7e1cff23887357dd6ae304692e0187ff24a1b51f3a0556237bac8b81 err=Network(DialFailure) traceID=274047650827900523247050184326805390441

@lexnv lexnv changed the title network/litep2p: Some network error occurred when fetching erasure chunk network: Some network error occurred when fetching erasure chunk Nov 21, 2024
alindima (Contributor) commented:

Are all the dial failures for the same peer? Maybe there are some bad validators that are not responsive.

lexnv (Contributor, Author) commented Nov 22, 2024

After looking a bit into the warnings, some of them share the same origin=Public(..), but I believe multiple peers were reached.

We were able to reproduce this for a brief moment on libp2p as well.
In the morning we redeployed litep2p, but the issue no longer seems to surface.

It might be a side-effect of the Kusama spamming that happened yesterday; we'll try to reproduce this issue on Versi with extra stress testing 🙏

lexnv (Contributor, Author) commented Nov 25, 2024

Logs

Warnings happened for a brief period of 7 minutes over the weekend on the kusama-waw10 (litep2p) validator:

2024-11-24 17:44:30.647  WARN tokio-runtime-worker parachain::availability-distribution: Some network error occurred when fetching erasure chunk origin=Public(7200b2a58aeb7f06a41be0c33f2bc0d38e72f48a21b0d3c7a3e7829672952e2d (F9oBd3AB...)) relay_parent=0xd355773faff27b2ea161a7407d5ae74bf269a882f570a44df5c3831cac85a9e8 group_index=GroupIndex(39) session_index=43752 chunk_index=ValidatorIndex(276) candidate_hash=0x17bb8489b1762f6952f92c32af521d7984deafbb605372381352a316eedc3271 err=Network(DialFailure) traceID=31545891602846748197237261148944997753

2024-11-24 17:44:34.299  WARN tokio-runtime-worker parachain::availability-recovery: Recovery of available data failed. candidate_hash=0x17bb8489b1762f6952f92c32af521d7984deafbb605372381352a316eedc3271 traceID=31545891602846748197237261148944997753

2024-11-24 17:44:34.299  WARN tokio-runtime-worker parachain::approval-voting: Data unavailable for candidate (0x17bb8489b1762f6952f92c32af521d7984deafbb605372381352a316eedc3271, Id(3344)) para_id=Id(3344) candidate_hash=0x17bb8489b1762f6952f92c32af521d7984deafbb605372381352a316eedc3271 traceID=31545891602846748197237261148944997753

Considering this recovered, I would lean towards authority records not being discovered / propagated by the litep2p component. Pending investigation.

Code Logic

The warning originates from availability-distribution:

Err(RequestError::NetworkError(err)) => {
	gum::warn_if_frequent!(
		freq: network_error_freq,
		max_rate: gum::Times::PerHour(100),
		target: LOG_TARGET,
		origin = ?validator,
		relay_parent = ?self.relay_parent,
		group_index = ?self.group_index,
		session_index = ?self.session_index,
		chunk_index = ?self.request.index,
		candidate_hash = ?self.request.candidate_hash,
		err = ?err,
		"Some network error occurred when fetching erasure chunk"
	);
	Err(TaskError::PeerError)
},

The request is initiated with a NetworkBridgeTxMessage::SendRequests message:

self.sender
	.send(FromFetchTask::Message(
		NetworkBridgeTxMessage::SendRequests(
			vec![requests],
			IfDisconnected::ImmediateError,
		)
		.into(),
	))
	.await
	.map_err(|_| TaskError::ShuttingDown)?;
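
Note the IfDisconnected::ImmediateError policy: the request is not allowed to trigger a fresh dial, so if the target peer is not already connected the request fails immediately instead of waiting for a connection attempt. Below is a minimal self-contained sketch of that policy, using hypothetical types that only mirror (and do not reuse) the sc-network ones:

// Hypothetical mirror of the connection policy used above; not the actual sc-network types.
#[derive(Clone, Copy)]
enum ConnectPolicy {
    TryConnect,     // dial the peer first if there is no existing connection
    ImmediateError, // never dial for this request; fail right away when disconnected
}

#[derive(Debug)]
enum SendError {
    NotConnected, // the request never reaches the wire
}

fn send_request(connected: bool, policy: ConnectPolicy) -> Result<(), SendError> {
    match (connected, policy) {
        // An existing connection is always used, regardless of policy.
        (true, _) => Ok(()),
        // No connection, but dialing is allowed: the peer would be dialed first.
        (false, ConnectPolicy::TryConnect) => Ok(()),
        // No connection and dialing is not allowed: fail immediately.
        (false, ConnectPolicy::ImmediateError) => Err(SendError::NotConnected),
    }
}

fn main() {
    assert!(send_request(false, ConnectPolicy::ImmediateError).is_err());
    assert!(send_request(true, ConnectPolicy::ImmediateError).is_ok());
}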

This is then handled by the bridge/tx component, which updates metrics and forwards it to the network service:

network_service
	.start_request(
		&mut authority_discovery_service,
		req,
		req_protocol_names,
		if_disconnected,
	)
	.await;

A dial failure response is provided immediately if the authority discovery mechanism cannot translate the authority ID into a Multiaddr:

let mut found_peer_id = None;
// Note: `get_addresses_by_authority_id` searches a cache and is thus expected
// to be very quick.
for addr in authority_discovery
	.get_addresses_by_authority_id(authority)
	.await
	.into_iter()
	.flat_map(|list| list.into_iter())
{
	let (peer_id, addr) = match parse_addr(addr) {
		Ok(v) => v,
		Err(_) => continue,
	};
	<dyn NetworkService>::add_known_address(&**self, peer_id, addr);
	found_peer_id = Some(peer_id);
}
found_peer_id
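
If the loop above finds no usable address, found_peer_id stays None and, per the description above, the request is answered locally with a dial failure before anything is sent on the wire. Here is a hedged sketch of that branch (names and types are illustrative, not the exact bridge code):

// Illustrative continuation of the lookup above; not the exact bridge code.
#[derive(Debug, Clone, Copy)]
struct PeerId(u64);

#[derive(Debug)]
enum RequestFailure {
    DialFailure, // what shows up as err=Network(DialFailure) in the warning
}

fn dispatch(found_peer_id: Option<PeerId>) -> Result<PeerId, RequestFailure> {
    match found_peer_id {
        // Authority discovery had a cached address: the request goes out to this peer.
        Some(peer_id) => Ok(peer_id),
        // No address known for the authority: the response channel is completed
        // immediately with a dial failure, without any network activity.
        None => Err(RequestFailure::DialFailure),
    }
}

fn main() {
    assert!(dispatch(None).is_err());
    assert!(dispatch(Some(PeerId(42))).is_ok());
}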

Possible Causes

  • authority discovery has no valid address for this authority yet (records are still being discovered)
  • authority discovery provides records with an "empty multiaddr" (addresses containing just /p2p/peerID are unreachable - network: Authority discovery records published without IP addresses #6518) - see the sketch after this list
  • provided addresses are unreachable by the litep2p backend - Timeout / Network Unreachable
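
On the second point, a record that contains only a /p2p/<peer-id> component carries no transport address at all, so there is nothing to dial. An illustrative, string-based check of what such a record looks like (this is not the parse_addr used by the node, just a sketch of the shape of the problem):

// Illustrative only: a bare "/p2p/<peer-id>" multiaddr has no transport component,
// while a dialable record also carries ip/dns and tcp/quic parts.
fn has_transport_address(addr: &str) -> bool {
    addr.split('/').any(|part| {
        matches!(
            part,
            "ip4" | "ip6" | "dns" | "dns4" | "dns6" | "tcp" | "udp" | "ws" | "wss" | "quic-v1"
        )
    })
}

fn main() {
    // Hypothetical peer id, for illustration only.
    assert!(!has_transport_address("/p2p/12D3KooWExamplePeerId"));
    assert!(has_transport_address("/ip4/10.0.0.1/tcp/30333/p2p/12D3KooWExamplePeerId"));
}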

alexggh (Contributor) commented Nov 25, 2024

I've seen this occasionally, either at startup when the network wasn't fully discovered or because we aren't connected to 100% of the nodes. This log should tell you how well connected a given node is: https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/network/gossip-support/src/lib.rs#L608.

lexnv (Contributor, Author) commented Nov 25, 2024

Indeed, I confirm this happened at startup after another crash related to:

Thread 'tokio-runtime-worker' panicked at 'assertion failed: false', /builds/parity/mirrors/polkadot-sdk/substrate/client/network/sync/src/strategy/chain_sync.rs:645
This is a bug. Please report it at:
	https://github.com/paritytech/polkadot-sdk/issues/new

2024-11-24 17:42:12.290  WARN tokio-runtime-worker sync: Unexpected generic response protocol /ksmcc3/sync/2, strategy key StrategyKey("ChainSync")

Grafana link: https://grafana.teleport.parity.io/goto/_2BODBnHg?orgId=1

We also see the following messages at startup. @alexggh did you see them before? Do you think they are related?

2024-11-24 17:44:06.661  WARN tokio-runtime-worker parachain::dispute-coordinator: Received msg before first active leaves update. This is not expected - message will be dropped. msg=DetermineUndisputedChain

2024-11-24 17:44:06.664  WARN tokio-runtime-worker parachain::approval-distribution: Importing locally an already known assignment message_subject=MessageSubject(0xb43df03ee43fef5d5f868424d4a3ac37b76e7dd53629ca1e17a4b612800a277

2024-11-24 17:44:10.757  WARN tokio-runtime-worker parachain::availability-store: Candidate included without being backed? candidate_hash=0xb15054b1da3a857894e278c9ca64f2491948fdf906ec1974fd09a188b0e1439f traceID=235690456815942847617030309054869992009

WARN tokio-runtime-worker parachain::approval-voting: Data unavailable for candidate

WARN tokio-runtime-worker parachain::availability-recovery: Error during recovery: Data is unavailable

2024-11-24 17:44:14.167  WARN tokio-runtime-worker parachain::availability-recovery: Recovery of available data failed. candidate_hash=0x7533a57afce459c8e4f03f71acb71d7ffd58ccdc2a6dcd569c75d5a2e2f35750 traceID=155787838988307144162983257177746513279

alexggh (Contributor) commented Nov 25, 2024

We also see the following messages at startup. @alexggh did you see them before? Do you think they are related?

Yeah, I've seen those in the past at startup; as far as I can tell they are completely harmless.
