Skip to content
This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Issues during block sync #11265

Merged
merged 25 commits into from
Jun 30, 2020
Merged

Issues during block sync #11265

merged 25 commits into from
Jun 30, 2020

Conversation

grbIzl
Copy link
Collaborator

@grbIzl grbIzl commented Nov 15, 2019

Purpose
This PR aims to fix issues in the following scenario during block sync:

  1. There are two connected nodes A and F

  2. Sync is going up to block N for both nodes

  3. Node A receives notification from F about the new block with unknown parent. This notification can be caused by multiple reasons:
    -- Fork happened and new subchain started from N block, in this case F sends notification about N'+1 block (and parent in N')
    -- There were several "light" (light in terms of difficulty and import speed) blocks N+1, N+2, .. and one "heavy" block N+x after them in blocks import queue for F, in this case F may send notification only about "heavy" one N+x
    -- Notifications about previous blocks were simple lost somehow
    -- Some combination of all three
    NOTE: Current block sync was designed with the assumption, that the node sends NewBlock message, only when the import of this block is completed. This assumption was changed by @tomusdrw in his Improve block and transaction propagation #9954, now the node sends this notification before importing blocks.

  4. Node A starts retraction in order to find the beginning of possible fork, requesting headers for blocks N, N-1, N-2, N-3 etc from F

  5. Meanwhile node F has not yet completed the import of the block, the notification about was sent to A, so it responds with old blocks, that do not consider new blocks, that are being imported

  6. Node A quickly runs into N - 128 block (128 is the limit for number of requested headers)

  7. Node F finishes new blocks import

  8. Unfortunately retraction on node A went too far already and A requested 128 headers beginning from N - M block, where M > 128. As a result, F returns old 128 headers without new N' and N'+ 1
    Also A doesn't receive the notifications about the new blocks from F (that could break the circle), because A constantly requests F, considers node F as "busy" and doesn't start new sync with it correspondingly.

  9. Node A continues retraction to 0 and sync stops.

Addressed issues
This PR fixes several problems in handling of such scenario

  1. Node F should not reply to GetBlockHeaders request until fork import is finished, because such answer won't be legit already. The reply is postponed in such case.
  2. Node A should not immediately request the new block, because most probably its import is not finished yet. Instead, node A uses the parent block of came new block as a further reference point. Please note, that the same logic and assumption exists in geth: https://github.com/ethereum/go-ethereum/blob/master/eth/handler.go#L718
  3. There was an infinity loop in retraction procedure, when it came to block number 0 (it's not a surprise, that this issue was not easily discovered before, because it only matters for archive nodes). Because of this loop the retraction was stucking around 0 block (and sync stopped in result).

Achieved results
So with all fixes in place the possibility of the described scenario is significantly decreased, the node A doesn't request blocks prematurely. Almost. Unfortunately, we observed situations during the testing, when retraction on node A still started and went behind the point of no return (block N - 128). This happened in pretty weird use cases of several forks in the row, when current block sync algorithm just simply could not handle such situation. In these cases I considered described retractions as legit, in spite of its predetermined result and required time (several minutes). I just fixed it in order to be properly restarted at the end, when no fork was found (item 3 in the list of addressed issues). As a result, the sync was paused for several minutes, but automatically recovered by itself and continued.

Comments
This problem is reproduced on configuration from #10724 when A and F run locally on one machine and A is only connected to F (F connected to the whole mainnet). As a result, network latency between A and F is low and A can quickly descend down to N - M block to the stuck condition. Also node A doesn't have any other sources of blocks but node F. For regular configurations stuck is less likely but it still causes overhead by sequences of GetBlock requests emitted by nodes, trying to find the common parent in such situation.

PS Huge shoutout to @iFA88 for the invaluable help in reproducing the scenarios and debugging

@grbIzl grbIzl added A0-pleasereview 🤓 Pull request needs code review. M4-core ⛓ Core client code / Rust. labels Nov 15, 2019
ethcore/verification/src/queue/mod.rs Outdated Show resolved Hide resolved
ethcore/verification/src/queue/mod.rs Outdated Show resolved Hide resolved
ethcore/verification/src/queue/mod.rs Outdated Show resolved Hide resolved
ethcore/src/client/client.rs Outdated Show resolved Hide resolved
ethcore/sync/src/chain/supplier.rs Outdated Show resolved Hide resolved
/// May return false negative for longer queues
pub fn processing_fork(&self, best_block_hash: &H256) -> bool {
let processing = self.processing.read();
if processing.is_empty() || processing.len() > MAX_QUEUE_WITH_FORK {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this optimization is introduced to avoid spending too much time traversing the entire processing map, right?
If we introduce a different mapping of ParentHash -> count_of_blocks_in_processing_that_build_on_top we could do it in constant time for any parent hash.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented this approach, tested it and found one drawback: it doesn't work well in the case, when node is catching up blocks. In this case imported blocks are inserted into the queue one by one, as a result the suggested data structure contains N parent hashes with count 1. N could be pretty huge. So I considered it as a significant overhead. As I didn't find good solution to address both situations, I've decided to revert it and leave the current approach for now.

@grbIzl grbIzl added A4-gotissues 💥 Pull request is reviewed and has significant issues which must be addressed. and removed A0-pleasereview 🤓 Pull request needs code review. labels Nov 20, 2019
@ordian
Copy link
Collaborator

ordian commented Feb 11, 2020

any updates on this?

@grbIzl
Copy link
Collaborator Author

grbIzl commented Feb 19, 2020

any updates on this?

I prepared the changes, requested by @tomusdrw, but I need to test them first before returning to review phase

@ordian ordian added the A3-inprogress ⏳ Pull request is in progress. No review needed at this stage. label Feb 21, 2020
@grbIzl grbIzl changed the title Pause GetBlockHeaders requests while importing blocks with possible fork Issues during block sync Mar 31, 2020
@grbIzl grbIzl force-pushed the HandleImportOfForks2 branch from c1b5893 to 9fe53d6 Compare March 31, 2020 13:14
self.reset();
if n == 0 {
debug_sync!(self, "Header not found, bottom line reached, resetting, last imported: {}", self.last_imported_hash);
self.reset_to_block(&best_hash, best);
Copy link
Collaborator Author

@grbIzl grbIzl Mar 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, here the calling of simple reset is not enough, because it resets counters (last_imported_block) to the current round. In this case the rounds went too far and simple rollback inside round doesn't help. reset_to_block here actually resets block sync to its initial state

let block = Unverified::from_rlp(r.at(0)?.as_raw().to_vec())?;
let hash = block.header.hash();
let number = block.header.number();
trace!(target: "sync", "{} -> NewBlock ({})", peer_id, hash);
if number > sync.highest_block.unwrap_or(0) {
sync.highest_block = Some(number);
}
let parent_hash = block.header.parent_hash();
let difficulty: U256 = r.val_at(1)?;
// Most probably the sent block is being imported by peer right now
Copy link
Collaborator Author

@grbIzl grbIzl Mar 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See item 2 in the addressed issues in PR's description

@grbIzl grbIzl added A0-pleasereview 🤓 Pull request needs code review. and removed A3-inprogress ⏳ Pull request is in progress. No review needed at this stage. A4-gotissues 💥 Pull request is reviewed and has significant issues which must be addressed. labels Apr 2, 2020
@ordian ordian requested a review from tomusdrw May 25, 2020 08:45
@grbIzl grbIzl force-pushed the HandleImportOfForks2 branch from f1e0856 to 062cbec Compare May 27, 2020 13:07
@grbIzl
Copy link
Collaborator Author

grbIzl commented May 27, 2020

Rebased to the new master cc @vorot93

ethcore/sync/src/api.rs Outdated Show resolved Hide resolved
self.imported_this_round = None;
self.round_parents = VecDeque::new();
self.target_hash = None;
self.retract_step = 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about self.blocks? It seems that reset_to does reset them too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's being reset inside self.reset() call.

ethcore/sync/src/block_sync.rs Outdated Show resolved Hide resolved
let block = Unverified::from_rlp(r.at(0)?.as_raw().to_vec())?;
let hash = block.header.hash();
let number = block.header.number();
trace!(target: "sync", "{} -> NewBlock ({})", peer_id, hash);
if number > sync.highest_block.unwrap_or(0) {
sync.highest_block = Some(number);
}
let parent_hash = block.header.parent_hash();
let difficulty: U256 = r.val_at(1)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why read that from RLP again if we have decoded block already? Isn't that simply part of the block.header?
EDIT: Ah, seems it's expected to be Total Difficulty, isn't that part of the decode struct anyway?

Copy link
Collaborator Author

@grbIzl grbIzl Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's not. There is the usual Header there only. Also see some details in the the next question

let difficulty: U256 = r.val_at(1)?;
// Most probably the sent block is being imported by peer right now
// Use td and hash, that peer must have for now
let parent_td = difficulty - block.header.difficulty();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Would be good to add overflow protection
  2. How is that non zero? Does RLP really contain total difficulty and the header just this block's difficulty? I thought that we don't really send TD over the wire.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Added.
  2. Please refer to your code for sending NewBlock packet: https://github.com/openethereum/openethereum/pull/9954/files#diff-4cbf12eb53ed0891acc41ffd155a8930R478
    As I've written in the description, I've looked through geth's logic for it as well. And they are doing exactly the same (send TD and extract it)

ethcore/sync/src/chain/mod.rs Outdated Show resolved Hide resolved
@@ -672,6 +703,8 @@ pub struct ChainSync {
/// Connected peers pending Status message.
/// Value is request timestamp.
handshaking_peers: HashMap<PeerId, Instant>,
/// Requests, that can not be processed at the moment
delayed_requests: HashSet<(PeerId, u8, Vec<u8>)>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we maintain ordering of the requests? With HashSet they will get randomized.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked to vec. Standard option with linked_hash_map doesn't suit because of MallocSizeOf bound

Comment on lines 434 to 437
match response {
Err(e) => Err(e),
Ok(Some((packet_id, rlp_stream))) => {
io.respond(packet_id.id(), rlp_stream.out()).unwrap_or_else(
|e| debug!(target: "sync", "{:?}", error_func(e)));
Ok(())
}
_ => Ok(())
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
match response {
Err(e) => Err(e),
Ok(Some((packet_id, rlp_stream))) => {
io.respond(packet_id.id(), rlp_stream.out()).unwrap_or_else(
|e| debug!(target: "sync", "{:?}", error_func(e)));
Ok(())
}
_ => Ok(())
}
if let Some(packet_id, rlp_stream) = response? {
io.respond(packet_id.id(), rlp_stream.out()).unwrap_or_else(
|e| debug!(target: "sync", "{:?}", error_func(e)));
}
Ok(())

nit: That's a tiny bit shorter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No )) It's shorter because you omitted Err processing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error is returned with ? after response, but TBH, it makes the code harder to read, so I would be in favor of keeping it as is

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaa, sorry I missed this ?

ethcore/verification/src/queue/mod.rs Outdated Show resolved Hide resolved
ethcore/verification/src/queue/mod.rs Outdated Show resolved Hide resolved
ethcore/verification/src/queue/mod.rs Outdated Show resolved Hide resolved
ethcore/verification/src/queue/mod.rs Outdated Show resolved Hide resolved
ethcore/verification/src/queue/mod.rs Outdated Show resolved Hide resolved
}

/// Dispatch delayed requests
/// The main difference with dispatch packet is the direct send of the responses to the peer
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check if the peer is still connected? Not sure if it's worth it but it seems possible they went away?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added cleanup for disconnected peer

Comment on lines +168 to +171
_ => {
debug!(target:"sync", "Unexpected packet {} was dispatched for delayed processing", packet_id);
Ok(())
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this case really possible?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, no indeed. But I've left this for future need, if some other packet is dispatched in postponed manner

})
match result {
Err(PacketProcessError::Decoder(e)) => debug!(target:"sync", "{} -> Malformed packet {} : {}", peer, packet_id, e),
Err(PacketProcessError::ClientBusy) => sync.write().add_delayed_request(peer, packet_id, data),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can ensure that only GetBlockHeadersPackets get added here instead of performing that check when processing delayed requests?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to create a generic mechanism of packets with delayed response. IMO it's pretty useful and may be used in future

@@ -672,6 +700,8 @@ pub struct ChainSync {
/// Connected peers pending Status message.
/// Value is request timestamp.
handshaking_peers: HashMap<PeerId, Instant>,
/// Requests, that can not be processed at the moment
delayed_requests: Vec<(PeerId, u8, Vec<u8>)>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit worried that this is unbounded. How do we know it doesn't grow too big?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed this by cleaning requests for disconnected peer (in terms of the other your question)

@@ -821,6 +852,21 @@ impl ChainSync {
self.active_peers = self.peers.keys().cloned().collect();
}

/// Add a request for later processing
pub fn add_delayed_request(&mut self, peer: PeerId, packet_id: u8, data: &[u8]) {
if let Some(position) = self.delayed_requests.iter().position(|request| request.0 == peer && request.1 == packet_id) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more efficient to use binary_search_by() here, to keep the vector sorted at all times?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made it a vector due to @tomusdrw very good suggestion to keep the order of insertions. So sorting won't help (if i understand your suggestion correctly).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the deduplication needed anyway? Is data expected to be different? I'd just push unconditionally and rather restrict the size of delayed_requests to make sure we don't use up too much resources.

@grbIzl grbIzl requested review from dvdplm and tomusdrw June 8, 2020 08:24
ethcore/sync/src/chain/handler.rs Outdated Show resolved Hide resolved
@@ -821,6 +852,21 @@ impl ChainSync {
self.active_peers = self.peers.keys().cloned().collect();
}

/// Add a request for later processing
pub fn add_delayed_request(&mut self, peer: PeerId, packet_id: u8, data: &[u8]) {
if let Some(position) = self.delayed_requests.iter().position(|request| request.0 == peer && request.1 == packet_id) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the deduplication needed anyway? Is data expected to be different? I'd just push unconditionally and rather restrict the size of delayed_requests to make sure we don't use up too much resources.

@grbIzl
Copy link
Collaborator Author

grbIzl commented Jun 8, 2020

Why is the deduplication needed anyway? Is data expected to be different? I'd just push unconditionally and rather restrict the size of delayed_requests to make sure we don't use up too much resources.

@tomusdrw I thought here about spamming prevention. If node receives too many requests in such mode (processing fork), some requests will be lost, if we simply limit the size of this container. It would be a weird attack ofc, but nevertheless

@grbIzl grbIzl requested a review from tomusdrw June 17, 2020 11:31
@tomusdrw
Copy link
Collaborator

I thought here about spamming prevention. If node receives too many requests in such mode (processing fork), some requests will be lost, if we simply limit the size of this container. It would be a weird attack ofc, but nevertheless

Right, with your solution you can still spam the node, but it's limited by number of peers * 256 (packet_id size) which seems reasonable.
Can we add an extra HashSet<(PeerId, PacketId)> to avoid doing linear check here? I feel the Vec might indeed have excessive CPU cost.

Apart from that looks okay to me.

@grbIzl grbIzl force-pushed the HandleImportOfForks2 branch from e7f253a to 4968d14 Compare June 18, 2020 13:01
Copy link
Collaborator

@tomusdrw tomusdrw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

@vorot93 vorot93 merged commit 63e2781 into master Jun 30, 2020
@vorot93 vorot93 deleted the HandleImportOfForks2 branch June 30, 2020 14:50
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-pleasereview 🤓 Pull request needs code review. M4-core ⛓ Core client code / Rust.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants