
Investigate flaky sampling tests #6319

Closed
jimmygchen opened this issue Aug 28, 2024 · 5 comments
Assignees: ackintosh
Labels: bug (Something isn't working), das (Data Availability Sampling), infra-ci

jimmygchen (Member)

Description

The test sync::block_lookups::tests::sampling_with_retries (in the network crate) is failing intermittently on CI; it seems like it could be a race condition somewhere.

https://github.com/sigp/lighthouse/actions/runs/10592162826/job/29350969832?pr=6312

dapplion (Collaborator) commented Sep 2, 2024

I recall the flakiness of this test increasing with #6256; that might be a pointer for where to start investigating.

ackintosh self-assigned this Sep 6, 2024
ackintosh (Member)

When I reproduced the test failure, cx.get_custodial_peers(self.column_index) returned only one peer (a supernode), but the test expects two or more peers to retry sampling.

A solution that came to mind is to add another supernode (making a total of two) to ensure that the node can retry sampling. What do you think?

pub(crate) fn choose_peer<T: BeaconChainTypes>(
    &mut self,
    cx: &SyncNetworkContext<T>,
) -> Option<PeerId> {
    // TODO: When there is a fork and only a subset of your peers know about a block,
    // sampling should only be queried on the peers on that fork. Should this case be
    // handled? How to handle it?
    let mut peer_ids = cx.get_custodial_peers(self.column_index);
    peer_ids.retain(|peer_id| !self.peers_dont_have.contains(peer_id));
    if let Some(peer_id) = peer_ids.choose(&mut thread_rng()) {
        Some(*peer_id)
    } else {
        self.status = Status::NoPeers;
        None
    }
}
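
For reference, a minimal sketch of that proposal, reusing the test rig helper new_connected_supernode_peer shown in a later comment; this is hypothetical, not the fix that eventually landed:

// Hypothetical sketch: register two supernodes so that choose_peer still has
// a candidate after the first supernode lands in peers_dont_have following a
// failed sampling attempt.
fn new_connected_peers_for_peerdas(&mut self) {
    self.new_connected_supernode_peer();
    self.new_connected_supernode_peer();
}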

jimmygchen (Member, Author) commented Sep 9, 2024

Oh, I think it may also be related to #6303: the good_custody_subnet_peer function now checks whether a peer is assigned to a custody subnet:

/// Returns an iterator of all good gossipsub peers that are supposed to be custodying
/// the given subnet id.
pub fn good_custody_subnet_peer(
    &self,
    subnet: DataColumnSubnetId,
) -> impl Iterator<Item = &PeerId> {
    self.peers
        .iter()
        .filter(move |(_, info)| {
            // The custody_subnets hashset can be populated via enr or metadata
            let is_custody_subnet_peer = info.is_assigned_to_custody_subnet(&subnet);
            info.is_connected() && info.is_good_gossipsub_peer() && is_custody_subnet_peer
        })
        .map(|(peer_id, _)| peer_id)
}
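
For context, a hedged sketch of how the sampling side might arrive at this filter; the subnet-conversion helper and the exact types here are assumptions, not code copied from Lighthouse:

// Sketch only: map a sampled column to its subnet, then keep the connected,
// well-behaved peers that custody that subnet.
fn get_custodial_peers<E: EthSpec>(
    globals: &NetworkGlobals<E>,
    column_index: u64,
) -> Vec<PeerId> {
    // Assumed helper converting a column index to its gossip subnet id.
    let subnet = DataColumnSubnetId::from_column_index(column_index, &globals.spec);
    globals
        .peers
        .read()
        .good_custody_subnet_peer(subnet)
        .copied()
        .collect()
}

In the failing runs above, this collection ends up containing only the supernode, which is why a retry has no second peer to choose.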

and with #6308 we no longer use the ENR for choosing peers; instead, we use metadata to populate peer_info.custody_subnets:

if self.network_globals.spec.is_peer_das_scheduled() {
    // Gracefully ignore metadata/v2 peers. Potentially downscore after PeerDAS to
    // prioritize PeerDAS peers.
    if let Some(custody_subnet_count) = custody_subnet_count_opt {
        match self.compute_peer_custody_subnets(peer_id, custody_subnet_count) {
            Ok(custody_subnets) => {
                peer_info.set_custody_subnets(custody_subnets);
            }
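
As a rough, self-contained illustration of what deterministic custody subnet derivation looks like (in the spirit of the PeerDAS spec's get_custody_columns), here is a simplified sketch; the constant and hashing details are assumptions, not Lighthouse's actual compute_peer_custody_subnets:

use sha2::{Digest, Sha256};
use std::collections::HashSet;

// Assumed value, for illustration only.
const DATA_COLUMN_SIDECAR_SUBNET_COUNT: u64 = 128;

// Derive `custody_subnet_count` distinct subnet ids by hashing successive ids
// (simplified to a u64 node id; the spec uses a uint256).
fn compute_custody_subnets(node_id: u64, custody_subnet_count: u64) -> HashSet<u64> {
    let target = custody_subnet_count.min(DATA_COLUMN_SIDECAR_SUBNET_COUNT);
    let mut subnets = HashSet::new();
    let mut current_id = node_id;
    while (subnets.len() as u64) < target {
        let digest = Sha256::digest(current_id.to_le_bytes());
        let subnet_id = u64::from_le_bytes(digest[..8].try_into().unwrap())
            % DATA_COLUMN_SIDECAR_SUBNET_COUNT;
        subnets.insert(subnet_id);
        current_id = current_id.wrapping_add(1);
    }
    subnets
}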

In the test rig, only the supernode has peer_info.custody_subnets populated:

if supernode {
    let peer_info = self.peers.get_mut(&peer_id).expect("peer exists");
    let all_subnets = (0..spec.data_column_sidecar_subnet_count)
        .map(|csc| csc.into())
        .collect();
    peer_info.set_custody_subnets(all_subnets);
}

So this means the remaining 100 peers we added here are not useful, because their custody_subnets are empty:

fn new_connected_peers_for_peerdas(&mut self) {
    // Enough sampling peers with few columns
    for _ in 0..100 {
        self.new_connected_peer();
    }
    // One supernode peer to ensure all columns have at least one peer
    self.new_connected_supernode_peer();
}

I think we can either:

  • add an additional supernode as you suggested, and get rid of the 100 extra peers; OR
  • update __add_connected_peer_testing_only to set custody subnets for all peer types (similar logic to the meta_data_response fn above), so every peer would have some columns too (see the sketch after this list).
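
A hypothetical sketch of that second option, assuming the chain spec's custody_requirement is available to the rig and reusing the setter from the supernode path above (the subnet selection is simplified and not derived from the peer's node id):

fn set_custody_subnets_for_new_peer(&mut self, peer_id: &PeerId, supernode: bool) {
    let peer_info = self.peers.get_mut(peer_id).expect("peer exists");
    let subnet_count = if supernode {
        self.spec.data_column_sidecar_subnet_count
    } else {
        // Regular peers custody only the minimum number of subnets.
        self.spec.custody_requirement
    };
    // Simplified: take the first `subnet_count` subnets instead of deriving
    // them from the peer's node id as compute_peer_custody_subnets would.
    let subnets = (0..subnet_count).map(|csc| csc.into()).collect();
    peer_info.set_custody_subnets(subnets);
}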

dapplion (Collaborator)

> update the __add_connected_peer_testing_only to set custody subnets for all peer types (similar logic to meta_data_response fn above), so every peer would have some columns too.

+1 on being as close to reality as possible

jimmygchen (Member, Author)

Fixed in #6382, thanks @ackintosh! 🚀
