
fix: Don't drop messages. RouteBack cache update #2841

Merged
7 commits merged into master from route_back_cache on Jun 18, 2020

Conversation

mfornet
Member

@mfornet mfornet commented Jun 12, 2020

Increased the route-back cache size to avoid dropping messages that
are supposed to be routed back. The problem was that when some
nodes are syncing, several chunk requests are issued, and the requests
arrive faster than the responses, effectively invalidating the cache:
most of the responses are dropped, since the route back hash was
already evicted from the cache.

This PR also introduces the cache strategy proposed here

Fix #2756

Test plan

  • Reproduced locally a scenario where several routed messages were dropped, and tested with both the new cache and the old cache. The fix was the size increase.
  • Unit tests to check that the new strategy works properly.
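The failure mode described above can be sketched as follows. This is a minimal illustration with hypothetical names (`RouteBackCache`, `insert`, `get`), not the PR's actual code: the cache maps the hash of a forwarded request to the peer its response must be routed back to, and if an entry is evicted before the response arrives, the response is dropped.

```rust
use std::collections::HashMap;

// Hypothetical sketch: request hash -> previous-hop peer id.
struct RouteBackCache {
    capacity: usize,
    main: HashMap<u64, String>,
}

impl RouteBackCache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        RouteBackCache { capacity, main: HashMap::new() }
    }

    fn insert(&mut self, hash: u64, peer: String) {
        if self.main.len() >= self.capacity {
            // Naive eviction of an arbitrary entry. Under sync load, requests
            // outpace responses, so the entry a response needs may already be
            // gone -- the PR mitigates this by increasing capacity and evicting
            // from the peer with the most entries instead.
            if let Some(victim) = self.main.keys().next().copied() {
                self.main.remove(&victim);
            }
        }
        self.main.insert(hash, peer);
    }

    fn get(&self, hash: &u64) -> Option<&String> {
        self.main.get(hash)
    }
}
```

With a capacity smaller than the number of in-flight requests, responses routinely miss the cache, which is why the size increase alone already fixed the reproduced scenario.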


@mfornet mfornet force-pushed the route_back_cache branch from 1e1509f to 9caf3b0 Compare June 12, 2020 04:21
Collaborator

@bowenwang1996 bowenwang1996 left a comment


We should in general avoid negative numbers as much as possible, especially when they are not really needed.

chain/network/src/cache.rs (outdated)
let records = entry.get_mut();
let first = records.iter().next().cloned().unwrap();
Collaborator

Am I correct that you are using BTreeSet as a heap? If so, why bother with the negative number trick and not use BinaryHeap instead?

Member Author

Not exactly, because I need to insert/remove arbitrary elements from the collection.
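The trade-off in this exchange can be illustrated in isolation (this is not the PR's code): a `BTreeSet` of ordered tuples supports both "peek/pop the smallest" (the heap-like use) and removal of an arbitrary element by value, which `BinaryHeap` does not offer.

```rust
use std::collections::BTreeSet;

// Returns (smallest priority seen, remaining element count) to show both
// heap-like access and arbitrary removal on the same structure.
fn btreeset_as_heap() -> (u64, usize) {
    let mut set: BTreeSet<(u64, &str)> = BTreeSet::new();
    set.insert((5, "e"));
    set.insert((1, "a"));
    set.insert((3, "c"));

    // Peek the minimum, as a heap would.
    let first = *set.iter().next().unwrap();

    // Remove an arbitrary element by value in O(log n);
    // BinaryHeap has no equivalent operation.
    set.remove(&(3, "c"));

    (first.0, set.len())
}
```

Both operations are O(log n) on a `BTreeSet`, so using it as an "indexed heap" costs only a constant factor over `BinaryHeap` while keeping arbitrary removal.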

chain/network/src/cache.rs (outdated)
@mfornet mfornet force-pushed the route_back_cache branch from d2a2ae7 to 2eb0d0b Compare June 12, 2020 17:04
@mfornet mfornet requested a review from bowenwang1996 June 12, 2020 17:04
fn remove_frequent(&mut self) {
    let (mut size, target) = self.size_per_target.iter().next().cloned().unwrap();
Collaborator

If the cache is empty, wouldn't this line crash the node?

Member Author

No, this function is only called when the cache is full, so it should have at least one element. There is also an assertion that the cache capacity is a positive number.
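The invariant being claimed here can be sketched like this (hypothetical field names, not the merged code): the `unwrap` is safe because `remove_frequent` is only reached when the cache is full, and the constructor asserts a positive capacity, so a full cache always has at least one per-target record.

```rust
use std::collections::BTreeSet;

struct Cache {
    capacity: usize,
    len: usize,
    // (negated message count, peer id): the smallest tuple corresponds to
    // the peer with the most messages (the negative-number trick debated
    // earlier in this thread).
    size_per_target: BTreeSet<(i64, u64)>,
}

impl Cache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0); // guards the unwrap below
        Cache { capacity, len: 0, size_per_target: BTreeSet::new() }
    }

    fn is_full(&self) -> bool {
        self.len >= self.capacity
    }

    fn remove_frequent(&mut self) -> u64 {
        // Only called when is_full(), hence the set is non-empty.
        let (size, target) = *self.size_per_target.iter().next().unwrap();
        self.size_per_target.remove(&(size, target));
        self.len -= 1;
        target
    }
}
```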

chain/network/src/cache.rs (outdated)
chain/network/src/cache.rs
@mfornet mfornet force-pushed the route_back_cache branch from 2eb0d0b to 2fedb67 Compare June 12, 2020 22:20
@mfornet mfornet requested a review from bowenwang1996 June 12, 2020 22:20
@SkidanovAlex
Collaborator

With this implementation, once we get to is_full, each insert will likely trigger the following:

  1. Check is_full; it is, so do remove_evicted, which is O(num_peers). In most cases there will be nothing to evict.
  2. Call remove_frequent and actually remove one entry, which is also rather expensive (multiple accesses to a BTreeMap).
  3. Insert.

This will happen on each insert, which makes the cache pretty expensive compared to a regular cache that does one hashtable insert and one linked-list insertion.

One simple-ish approach is to change remove_frequent to remove several (say, 100) entries from the most offending peer. With a cache size of 1M and 40 peers we expect more than 25K messages from the offending peer, so removing 100 isn't much different from removing 1: it's still a tiny percentage of all the messages, but it removes the need to call remove_evicted for 100 iterations, and also makes the cost per removal in remove_frequent slightly lower, since iterating over the first 100 elements is cheaper than pulling the first element 100 times.

With that, I would also always call remove_frequent from remove_evicted, for reasons similar to the above (remove_frequent is cheaper than remove_evicted, so it doesn't add much to the cost, but it makes it more predictable that the expensive evictions won't happen more than once every 100 inserts).
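The batched eviction suggested above could look roughly like this. The structure and names (`record_per_target`, `remove_frequent_batch`, `REMOVE_BATCH`) are hypothetical, not the merged code: once the cache is full, up to 100 entries are evicted from the peer holding the most records, so the expensive eviction path runs at most once per ~100 inserts instead of on every one.

```rust
use std::collections::{BTreeMap, BTreeSet};

const REMOVE_BATCH: usize = 100;

// peer id -> set of cached route-back hashes for that peer.
// Returns how many entries were evicted.
fn remove_frequent_batch(record_per_target: &mut BTreeMap<u64, BTreeSet<u64>>) -> usize {
    // Find the peer with the most cached entries (the "most offending" peer).
    let peer = match record_per_target.iter().max_by_key(|(_, v)| v.len()) {
        Some((&peer, _)) => peer,
        None => return 0,
    };
    let records = record_per_target.get_mut(&peer).unwrap();
    // Iterating over the first REMOVE_BATCH elements is cheaper than
    // pulling the first element REMOVE_BATCH times.
    let victims: Vec<u64> = records.iter().take(REMOVE_BATCH).copied().collect();
    for v in &victims {
        records.remove(v);
    }
    if records.is_empty() {
        record_per_target.remove(&peer);
    }
    victims.len()
}
```

As the comment argues, with ~25K expected entries per peer, evicting 100 at a time barely changes which messages survive while amortizing the O(num_peers) bookkeeping across many inserts.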

@mfornet mfornet force-pushed the route_back_cache branch from 1c058cf to b2b3880 Compare June 13, 2020 00:17
self.size_per_target.remove(&(size, target.clone()));
// Since size has negative sign, adding 1 is equivalent to subtracting 1 from the size.
Collaborator

this comment is no longer correct?

mfornet added 7 commits June 18, 2020 17:43
fix #2756

Increased the route-back cache size to avoid dropping messages that
are supposed to be routed back. The problem was that when some
nodes are syncing, several chunk requests are issued, and the requests
arrive faster than the responses, effectively invalidating the cache:
most of the responses are dropped, since the route back hash was
already evicted from the cache.

Note: this is an attack vector if we rely on route-back messages,
since a malicious actor can feed the cache with new information,
effectively invalidating our current cache.

Test plan
=========
Reproduced this scenario locally, and tested with and without the cache.
This cache is resistant to malicious actors that try
to poison the cache with useless messages.
@mfornet mfornet force-pushed the route_back_cache branch from b2b3880 to c289912 Compare June 18, 2020 21:48
@nearprotocol-bulldozer nearprotocol-bulldozer bot merged commit 9bcc37b into master Jun 18, 2020
@nearprotocol-bulldozer nearprotocol-bulldozer bot deleted the route_back_cache branch June 18, 2020 21:55
Development

Successfully merging this pull request may close these issues.

Message loss when network rebalances
3 participants