Message loss when network rebalances #2756
It is not clear why a routed message would get lost during a rebalance. If nodes are removing connections, it is expected that they already have a high number of connections, so they should still have some route to the target. Messages are only dropped when no route to the target is found or when a message has been routed through more than 100 hops. How do you know that messages are being dropped? From the logs?
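For reference, the two drop conditions described above amount to something like the following sketch; all names and types here are hypothetical stand-ins for illustration, not nearcore's actual routing API.

// Maximum number of hops before a routed message is discarded,
// per the 100-hop limit mentioned above.
const MAX_HOPS: u32 = 100;

// Stand-in type for illustration.
type PeerId = u64;

struct RoutedMessage {
    target: PeerId,
    hops: u32, // how many times this message has been forwarded so far
}

// Returns true when the message should be dropped: either there is no
// known route to the target, or the hop budget is exhausted.
fn should_drop(msg: &RoutedMessage, has_route: impl Fn(PeerId) -> bool) -> bool {
    !has_route(msg.target) || msg.hops > MAX_HOPS
}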
Is it guaranteed in any way? I didn't see it.
What is guaranteed is that nodes will drop connections if the number of active peers goes over some threshold (currently 35). This is a high enough number to be confident that the network should remain connected through the whole process. Again, are you sure that messages are being dropped (did you see it in the logs)? If that is the case, maybe the root cause is something else.
@bowenwang1996 Before jumping into this issue we should investigate further how many messages are being dropped and the real cause of them being dropped. I mean, they are certainly dropped if there is no route available (it is also possible that a message is routed in a cycle and gets dropped after TTL=0, but that is very unlikely unless there is a bug or very pathological behavior from peers in the network).
@mfornet I agree. How about we add a counter to record when a message fails to deliver, as well as its type?
Sounds good, let's see if that helps us. I'll add it as a Prometheus counter.
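A minimal sketch of such a counter, assuming the prometheus and lazy_static crates; the metric name and labels here are illustrative, not necessarily what the eventual change landed as.

use lazy_static::lazy_static;
use prometheus::{register_int_counter_vec, IntCounterVec};

lazy_static! {
    /// Cumulative count of routed messages that could not be delivered,
    /// partitioned by the drop reason and the message type.
    static ref DROPPED_MESSAGES: IntCounterVec = register_int_counter_vec!(
        "near_dropped_routed_messages_total",
        "Routed messages dropped because no route was found or the hop limit was exceeded",
        &["reason", "msg_type"]
    )
    .unwrap();
}

/// Hypothetical hook: call this wherever routing gives up on a message.
fn record_drop(reason: &str, msg_type: &str) {
    DROPPED_MESSAGES.with_label_values(&[reason, msg_type]).inc();
}

Because the counter is cumulative, the graph of this metric over time directly shows drop bursts (steep slopes) versus steady background loss.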
@bowenwang1996 did this in #2804. Check the graph.
@mfornet that's a lot of dropped messages. Is what's shown in the graph cumulative?
Yes, it is cumulative. I saw that ~10K messages were dropped in 20 minutes, and since then it has been dropping messages slowly, mostly
The high number of dropped messages is due to the small size of the route-back cache: records are removed from the cache faster than the responses arrive. There is an easy fix, which is just increasing the size of the cache to account for this; however, as it is implemented right now it can be easily abused. I'm working on a fix that I want to discuss with you:

use std::collections::{BTreeMap, BTreeSet, HashMap};
// CryptoHash and PeerId are nearcore's own types.

/// Cache to store route back messages.
///
/// The interface of the cache is similar to a regular HashMap:
/// elements can be inserted, fetched and removed.
///
/// Motivation behind the following (complex) design:
///
/// Since this cache is populated with messages that should be routed
/// back, it is important that elements are not removed unless there is
/// some confidence that the response is not going to arrive.
///
/// A naive cache can be easily abused, since a malicious actor can send
/// void messages that should be routed back, letting it replace useful
/// entries in the cache with fake ones, and causing most of the messages
/// that require routing back to be dropped.
///
/// Solution:
///
/// The cache will accept new elements until it is full. If a new element
/// would push it over capacity, the message to remove is selected as
/// follows:
///
/// 1. For every message, store the time it arrived.
/// 2. For every peer, store how many messages should be routed to it.
///
/// First, messages that have been in the cache longer than evict_timeout
/// are removed. If no message was removed, the oldest message from the
/// peer with the most messages in the cache is removed.
///
/// Rationale:
///
/// - Old entries in the cache will eventually be removed (no memory leak).
/// - If the cache is not at full capacity, all new records will be stored.
/// - If a peer tries to abuse the system, it will be able to allocate at
///   most capacity / number_of_active_connections entries.
///
pub struct RouteBackCache {
    /// Number of records currently in the cache.
    size: i64,
    /// Maximum number of records allowed in the cache.
    capacity: i64,
    /// Maximum time allowed before removing a record from the cache.
    evict_timeout: i64,
    /// Main map from message hash to the time it was created plus the target peer.
    /// Size: O(capacity)
    main: HashMap<CryptoHash, (i64, PeerId)>,
    /// Number of records allocated by each PeerId.
    /// The size is stored with a negative sign, to order PeerIds in decreasing order.
    /// Size: O(number of active connections)
    size_per_target: BTreeSet<(i64, PeerId)>,
    /// All hashes associated with each PeerId. Hashes within each PeerId
    /// are sorted by the time they arrived, from oldest to newest.
    /// Size: O(capacity)
    record_per_target: BTreeMap<PeerId, BTreeSet<(i64, CryptoHash)>>,
}
@mfornet is there a per-peer limit? Also, what is roughly the size of this cache?
Not explicitly. While the cache is not at max capacity, all messages will be stored; but if it is working at capacity, each peer will use at most capacity / active_peers entries (perfectly balanced) as other peers start allocating new messages in the cache. Notice I will remove older messages from the peers that have allocated the largest number of records in the cache.
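For a rough sense of the bound (numbers purely illustrative): with a hypothetical capacity of 100,000 entries and the 35 active connections mentioned earlier in the thread, a single abusive peer could hold at most about 100,000 / 35 ≈ 2,857 entries once the cache is saturated, since every further insertion it forces evicts its own oldest record first.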
I modified the definition of
fix #2756 Increased route back cache to avoid dropping messages which are supposed to be routed back. The problem was that when some nodes are syncing, several chunk requests are made, and requests arrive faster than responses, effectively invalidating the cache; most of the responses are dropped, since the route back hash was dropped from the cache. Note: This is a vector of attack if we rely on route-back messages, since a malicious actor can feed the cache with new information, effectively invalidating our current cache.

Test plan
=========
Reproduced this scenario locally, and tested with and without the cache.
Increased route back cache to avoid dropping messages which are supposed to be routed back. The problem was that when some nodes are syncing, several chunk requests are made, and requests arrive faster than responses, effectively invalidating the cache; most of the responses are dropped, since the route back hash was dropped from the cache. This PR also introduces the cache strategy proposed [here](#2756 (comment)). Fix #2756

Test plan
=========
- Reproduced the scenario where several routed messages were dropped locally, and tested with the new cache and with the old cache. The fix was the size increase.
- Unit tests to check that the new strategy is working properly.
In #2490, we introduced network balancing to avoid reaching the maximum number of peers. However, it comes at a cost -- when a message is being routed, it can get dropped because the connection between two nodes on the route is dropped as they rebalance their connections. This makes the system slower overall. In some cases, the requester will re-request or request from other peers (for example, for partial encoded chunks). In other cases, the message is simply dropped. This can have some fairly undesirable consequences, such as block producers not getting enough approvals to produce their blocks. It seems that this also allows an attacker to initiate an eclipse attack on a node, which could have a profound impact on the overall network topology.
It is not clear to me, however, how to best solve this issue. Some ideas: