
network: Some network error occurred when fetching erasure chunk #6587

Open
lexnv opened this issue Nov 21, 2024 · 7 comments
Labels
I2-bug The node fails to follow expected behavior.

Comments

lexnv (Contributor) commented Nov 21, 2024

The Kusama litep2p node is spamming the following warnings after a while:

WARN tokio-runtime-worker parachain::availability-distribution: Some network error occurred when fetching erasure chunk origin=Public(a85bf9402a1fc5d3732952744ccb9ff0990e8c101d0da3cd01efdff98e26f420 (GP4tF1if...)) relay_parent=0x01f4bee6dd8c1d2397265722fcdc3ce7946ec33188e7df2d208686bf0ca9a7f9 group_index=GroupIndex(32) session_index=43676 chunk_index=ValidatorIndex(273) candidate_hash=0x6db57a3a09f7caaa617b60db14433895ca6cd9a14b3b3e475ee53f2e1abb9779 err=Network(DialFailure) traceID=145828136324240632105239557157853280405

Offhand it looks like the node is not able to dial peers (DialFailure). Pending confirmation on a libp2p node.

Version deployed: 1.16.1-ca8beaed148

Grafana link: https://grafana.teleport.parity.io/goto/K-F2Od7NR?orgId=1

cc @paritytech/sdk-node

@lexnv lexnv added the I2-bug The node fails to follow expected behavior. label Nov 21, 2024
lexnv (Contributor, Author) commented Nov 21, 2024

Reproduces on libp2p as well, not isolated to litep2p:

2024-11-21 16:16:27.030  INFO main sub-libp2p: Running libp2p network backend
...
2024-11-21 16:16:30.556  WARN tokio-runtime-worker parachain::availability-distribution: Some network error occurred when fetching erasure chunk origin=Public(500cc2cfa106832f103b44c90db4fe4f57dc30ef81073e0bf212c4d0574d355f (EPH935R1...)) relay_parent=0x72fd5cfacc625b24eee0db33cfff3415bb6806d0e2301b6c42db910e082ed409 group_index=GroupIndex(53) session_index=43678 chunk_index=ValidatorIndex(121) candidate_hash=0x68b93303dd6899341aaed21d4de493e8dd56472be8aeb64774fbdf02eeec7bb8 err=Network(DialFailure) traceID=139201321189557106610241167805705917416
...
2024-11-21 16:16:36.595  WARN tokio-runtime-worker parachain::availability-distribution: Some network error occurred when fetching erasure chunk origin=Public(a6fbda06023bce68011ad74b4a520be902f18db526a27f794facf69aedca455d (GMGHKdBn...)) relay_parent=0x72fd5cfacc625b24eee0db33cfff3415bb6806d0e2301b6c42db910e082ed409 group_index=GroupIndex(78) session_index=43678 chunk_index=ValidatorIndex(121) candidate_hash=0xce2ba85e7e1cff23887357dd6ae304692e0187ff24a1b51f3a0556237bac8b81 err=Network(DialFailure) traceID=274047650827900523247050184326805390441

@lexnv lexnv changed the title network/litep2p: Some network error occurred when fetching erasure chunk network: Some network error occurred when fetching erasure chunk Nov 21, 2024
alindima (Contributor) commented:

Are all the dial failures for the same peer? Maybe there are some bad validators that are not responsive.

lexnv (Contributor, Author) commented Nov 22, 2024

After looking a bit into the warnings, some of them share the same origin=Public(..), but I believe multiple peers were reached.

We were able to reproduce this for a brief moment on libp2p as well.
In the morning we redeployed litep2p, but the issue no longer seems to surface.

It might be a side-effect of the Kusama spamming that happened yesterday; we'll try to reproduce this issue on Versi with extra stress testing 🙏

lexnv (Contributor, Author) commented Nov 25, 2024

Logs

Warnings happened for a brief period of 7 minutes over the weekend on the kusama-waw10 (litep2p) validator:

2024-11-24 17:44:30.647  WARN tokio-runtime-worker parachain::availability-distribution: Some network error occurred when fetching erasure chunk origin=Public(7200b2a58aeb7f06a41be0c33f2bc0d38e72f48a21b0d3c7a3e7829672952e2d (F9oBd3AB...)) relay_parent=0xd355773faff27b2ea161a7407d5ae74bf269a882f570a44df5c3831cac85a9e8 group_index=GroupIndex(39) session_index=43752 chunk_index=ValidatorIndex(276) candidate_hash=0x17bb8489b1762f6952f92c32af521d7984deafbb605372381352a316eedc3271 err=Network(DialFailure) traceID=31545891602846748197237261148944997753

2024-11-24 17:44:34.299  WARN tokio-runtime-worker parachain::availability-recovery: Recovery of available data failed. candidate_hash=0x17bb8489b1762f6952f92c32af521d7984deafbb605372381352a316eedc3271 traceID=31545891602846748197237261148944997753

2024-11-24 17:44:34.299  WARN tokio-runtime-worker parachain::approval-voting: Data unavailable for candidate (0x17bb8489b1762f6952f92c32af521d7984deafbb605372381352a316eedc3271, Id(3344)) para_id=Id(3344) candidate_hash=0x17bb8489b1762f6952f92c32af521d7984deafbb605372381352a316eedc3271 traceID=31545891602846748197237261148944997753

Considering this recovered, I would lean towards authority records not being discovered / propagated by the litep2p component. Pending investigation.

Code Logic

The warning originates from availability-distribution:

Err(RequestError::NetworkError(err)) => {
	gum::warn_if_frequent!(
		freq: network_error_freq,
		max_rate: gum::Times::PerHour(100),
		target: LOG_TARGET,
		origin = ?validator,
		relay_parent = ?self.relay_parent,
		group_index = ?self.group_index,
		session_index = ?self.session_index,
		chunk_index = ?self.request.index,
		candidate_hash = ?self.request.candidate_hash,
		err = ?err,
		"Some network error occurred when fetching erasure chunk"
	);
	Err(TaskError::PeerError)
},

The request is initiated with a NetworkBridgeTxMessage::SendRequests message:

self.sender
	.send(FromFetchTask::Message(
		NetworkBridgeTxMessage::SendRequests(
			vec![requests],
			IfDisconnected::ImmediateError,
		)
		.into(),
	))
	.await
	.map_err(|_| TaskError::ShuttingDown)?;
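
Note the IfDisconnected::ImmediateError policy: the request is not allowed to trigger a fresh dial, so if the target peer is not already connected the request fails immediately instead of waiting for a connection attempt. Below is a minimal self-contained sketch of that policy, using hypothetical types that only mirror (and do not reuse) the sc-network ones:

// Hypothetical mirror of the connection policy used above; not the actual sc-network types.
#[derive(Clone, Copy)]
enum ConnectPolicy {
    TryConnect,     // dial the peer first if there is no existing connection
    ImmediateError, // never dial for this request; fail right away when disconnected
}

#[derive(Debug)]
enum SendError {
    NotConnected, // the request never reaches the wire
}

fn send_request(connected: bool, policy: ConnectPolicy) -> Result<(), SendError> {
    match (connected, policy) {
        // An existing connection is always used, regardless of policy.
        (true, _) => Ok(()),
        // No connection, but dialing is allowed: the peer would be dialed first.
        (false, ConnectPolicy::TryConnect) => Ok(()),
        // No connection and dialing is not allowed: fail immediately.
        (false, ConnectPolicy::ImmediateError) => Err(SendError::NotConnected),
    }
}

fn main() {
    assert!(send_request(false, ConnectPolicy::ImmediateError).is_err());
    assert!(send_request(true, ConnectPolicy::ImmediateError).is_ok());
}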

This is then handled by the bridge/tx component, which updates metrics and forwards it to the network service:

network_service
	.start_request(
		&mut authority_discovery_service,
		req,
		req_protocol_names,
		if_disconnected,
	)
	.await;

A dial failure response is provided immediately if the authority discovery mechanism cannot translate the authority ID into a Multiaddr:

let mut found_peer_id = None;
// Note: `get_addresses_by_authority_id` searches a cache and is thus expected
// to be very quick.
for addr in authority_discovery
	.get_addresses_by_authority_id(authority)
	.await
	.into_iter()
	.flat_map(|list| list.into_iter())
{
	let (peer_id, addr) = match parse_addr(addr) {
		Ok(v) => v,
		Err(_) => continue,
	};
	<dyn NetworkService>::add_known_address(&**self, peer_id, addr);
	found_peer_id = Some(peer_id);
}
found_peer_id
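
If the loop above finds no usable address, found_peer_id stays None and, per the description above, the request is answered locally with a dial failure before anything is sent on the wire. Here is a hedged sketch of that branch (names and types are illustrative, not the exact bridge code):

// Illustrative continuation of the lookup above; not the exact bridge code.
#[derive(Debug, Clone, Copy)]
struct PeerId(u64);

#[derive(Debug)]
enum RequestFailure {
    DialFailure, // what shows up as err=Network(DialFailure) in the warning
}

fn dispatch(found_peer_id: Option<PeerId>) -> Result<PeerId, RequestFailure> {
    match found_peer_id {
        // Authority discovery had a cached address: the request goes out to this peer.
        Some(peer_id) => Ok(peer_id),
        // No address known for the authority: the response channel is completed
        // immediately with a dial failure, without any network activity.
        None => Err(RequestFailure::DialFailure),
    }
}

fn main() {
    assert!(dispatch(None).is_err());
    assert!(dispatch(Some(PeerId(42))).is_ok());
}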

Possible Causes

  • authority discovery has no valid address for this authority yet (records are still being discovered)
  • authority discovery provides records with an "empty multiaddr" (addresses containing just /p2p/peerID are unreachable - network: Authority discovery records published without IP addresses #6518) - see the sketch after this list
  • provided addresses are unreachable by the litep2p backend - Timeout / Network Unreachable
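
On the second point, a record that contains only a /p2p/<peer-id> component carries no transport address at all, so there is nothing to dial. An illustrative, string-based check of what such a record looks like (this is not the parse_addr used by the node, just a sketch of the shape of the problem):

// Illustrative only: a bare "/p2p/<peer-id>" multiaddr has no transport component,
// while a dialable record also carries ip/dns and tcp/quic parts.
fn has_transport_address(addr: &str) -> bool {
    addr.split('/').any(|part| {
        matches!(
            part,
            "ip4" | "ip6" | "dns" | "dns4" | "dns6" | "tcp" | "udp" | "ws" | "wss" | "quic-v1"
        )
    })
}

fn main() {
    // Hypothetical peer id, for illustration only.
    assert!(!has_transport_address("/p2p/12D3KooWExamplePeerId"));
    assert!(has_transport_address("/ip4/10.0.0.1/tcp/30333/p2p/12D3KooWExamplePeerId"));
}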

alexggh (Contributor) commented Nov 25, 2024

I've seen this occasionally, either at startup when the network wasn't fully discovered or because we aren't connected to 100% of the nodes. This log should tell you how well connected a given node is: https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/network/gossip-support/src/lib.rs#L608.

lexnv (Contributor, Author) commented Nov 25, 2024

Indeed, I confirm this happened at startup after another crash related to:

Thread 'tokio-runtime-worker' panicked at 'assertion failed: false', /builds/parity/mirrors/polkadot-sdk/substrate/client/network/sync/src/strategy/chain_sync.rs:645
This is a bug. Please report it at:
	https://github.com/paritytech/polkadot-sdk/issues/new

2024-11-24 17:42:12.290  WARN tokio-runtime-worker sync: Unexpected generic response protocol /ksmcc3/sync/2, strategy key StrategyKey("ChainSync")

Grafana link: https://grafana.teleport.parity.io/goto/_2BODBnHg?orgId=1

We also see the following messages at startup. @alexggh did you see them before? Do you think they are related?

2024-11-24 17:44:06.661  WARN tokio-runtime-worker parachain::dispute-coordinator: Received msg before first active leaves update. This is not expected - message will be dropped. msg=DetermineUndisputedChain

2024-11-24 17:44:06.664  WARN tokio-runtime-worker parachain::approval-distribution: Importing locally an already known assignment message_subject=MessageSubject(0xb43df03ee43fef5d5f868424d4a3ac37b76e7dd53629ca1e17a4b612800a277

2024-11-24 17:44:10.757  WARN tokio-runtime-worker parachain::availability-store: Candidate included without being backed? candidate_hash=0xb15054b1da3a857894e278c9ca64f2491948fdf906ec1974fd09a188b0e1439f traceID=235690456815942847617030309054869992009

WARN tokio-runtime-worker parachain::approval-voting: Data unavailable for candidate

WARN tokio-runtime-worker parachain::availability-recovery: Error during recovery: Data is unavailable

2024-11-24 17:44:14.167  WARN tokio-runtime-worker parachain::availability-recovery: Recovery of available data failed. candidate_hash=0x7533a57afce459c8e4f03f71acb71d7ffd58ccdc2a6dcd569c75d5a2e2f35750 traceID=155787838988307144162983257177746513279

alexggh (Contributor) commented Nov 25, 2024

We also see the following messages at startup. @alexggh did you see them before? Do you think they are related?

Yeah, I've seen those in the past at startup; as far as I can tell they are completely harmless.
