Message loss when network rebalances #2756
It is not clear why a routed message would get lost during a rebalance. If nodes are removing connections, it is expected that they already have a high number of connections, so they should still have some route to the target. Messages are only dropped when no route to the target is found or when a message has been routed through more than 100 hops. How do you know that messages are being dropped? From the logs?
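For reference, the two drop conditions described above amount to something like the following sketch; all names and types here are hypothetical stand-ins for illustration, not nearcore's actual routing API.

// Maximum number of hops before a routed message is discarded,
// per the 100-hop limit mentioned above.
const MAX_HOPS: u32 = 100;

// Stand-in type for illustration.
type PeerId = u64;

struct RoutedMessage {
    target: PeerId,
    hops: u32, // how many times this message has been forwarded so far
}

// Returns true when the message should be dropped: either there is no
// known route to the target, or the hop budget is exhausted.
fn should_drop(msg: &RoutedMessage, has_route: impl Fn(PeerId) -> bool) -> bool {
    !has_route(msg.target) || msg.hops > MAX_HOPS
}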
Is it guaranteed in any way? I didn't see it.
What is guaranteed is that nodes will drop connections if the number of active peers goes over some threshold (currently 35). This is a high enough number to be confident that the network should remain connected through the whole process. Again, are you sure that messages are being dropped (did you see it in the logs)? If that is the case, maybe the root cause is something else.
@bowenwang1996 Before jumping into this issue we should investigate further how many messages are being dropped and the real cause of them being dropped. I mean, they are certainly dropped if there is no route available (it is also possible that a message is routed in a cycle and gets dropped after TTL=0, but that is very unlikely unless there is a bug or very pathological behavior from peers in the network).
@mfornet I agree. How about we add a counter to record when a message fails to deliver, as well as its type?
Sounds good, let's see if that helps us. I'll add it as a Prometheus counter.
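A minimal sketch of such a counter, assuming the prometheus and lazy_static crates; the metric name and labels here are illustrative, not necessarily what the eventual change landed as.

use lazy_static::lazy_static;
use prometheus::{register_int_counter_vec, IntCounterVec};

lazy_static! {
    /// Cumulative count of routed messages that could not be delivered,
    /// partitioned by the drop reason and the message type.
    static ref DROPPED_MESSAGES: IntCounterVec = register_int_counter_vec!(
        "near_dropped_routed_messages_total",
        "Routed messages dropped because no route was found or the hop limit was exceeded",
        &["reason", "msg_type"]
    )
    .unwrap();
}

/// Hypothetical hook: call this wherever routing gives up on a message.
fn record_drop(reason: &str, msg_type: &str) {
    DROPPED_MESSAGES.with_label_values(&[reason, msg_type]).inc();
}

Because the counter is cumulative, the graph of this metric over time directly shows drop bursts (steep slopes) versus steady background loss.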
@bowenwang1996 did this in #2804. Check the graph.
@mfornet that's a lot of dropped messages. Is what's shown in the graph cumulative?
Yes, it is cumulative. I saw that ~10K messages were dropped in 20 minutes, and since then it has been dropping messages slowly, mostly
The high number of dropped messages is due to the small size of the route-back cache: records are removed from the cache faster than the responses arrive. There is an easy fix, which is just increasing the size of the cache to account for this; however, as it is implemented right now it can be easily abused. I'm working on a fix that I want to discuss with you:

use std::collections::{BTreeMap, BTreeSet, HashMap};
// CryptoHash and PeerId are nearcore's own types.

/// Cache to store route back messages.
///
/// The interface of the cache is similar to a regular HashMap:
/// elements can be inserted, fetched and removed.
///
/// Motivation behind the following (complex) design:
///
/// Since this cache is populated with messages that should be routed
/// back, it is important that elements are not removed unless there is
/// some confidence that the response is not going to arrive.
///
/// A naive cache can be easily abused, since a malicious actor can send
/// void messages that should be routed back, letting it replace useful
/// entries in the cache with fake ones, and causing most of the messages
/// that require routing back to be dropped.
///
/// Solution:
///
/// The cache will accept new elements until it is full. If a new element
/// would push it over capacity, the message to remove is selected as
/// follows:
///
/// 1. For every message, store the time it arrived.
/// 2. For every peer, store how many messages should be routed to it.
///
/// First, messages that have been in the cache longer than evict_timeout
/// are removed. If no message was removed, the oldest message from the
/// peer with the most messages in the cache is removed.
///
/// Rationale:
///
/// - Old entries in the cache will eventually be removed (no memory leak).
/// - If the cache is not at full capacity, all new records will be stored.
/// - If a peer tries to abuse the system, it will be able to allocate at
///   most capacity / number_of_active_connections entries.
///
pub struct RouteBackCache {
    /// Number of records currently in the cache.
    size: i64,
    /// Maximum number of records allowed in the cache.
    capacity: i64,
    /// Maximum time allowed before removing a record from the cache.
    evict_timeout: i64,
    /// Main map from message hash to the time it was created plus the target peer.
    /// Size: O(capacity)
    main: HashMap<CryptoHash, (i64, PeerId)>,
    /// Number of records allocated by each PeerId.
    /// The size is stored with a negative sign, to order PeerIds in decreasing order.
    /// Size: O(number of active connections)
    size_per_target: BTreeSet<(i64, PeerId)>,
    /// All hashes associated with each PeerId. Hashes within each PeerId
    /// are sorted by the time they arrived, from oldest to newest.
    /// Size: O(capacity)
    record_per_target: BTreeMap<PeerId, BTreeSet<(i64, CryptoHash)>>,
}
@mfornet is there a per-peer limit? Also, what is roughly the size of this cache?
Not explicitly. While the cache is not at max capacity, all messages will be stored; but if it is working at capacity, each peer will use at most capacity / active_peers entries (perfectly balanced) as other peers start allocating new messages in the cache. Notice I will remove older messages from the peers that have allocated the largest number of records in the cache.
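For a rough sense of the bound (numbers purely illustrative): with a hypothetical capacity of 100,000 entries and the 35 active connections mentioned earlier in the thread, a single abusive peer could hold at most about 100,000 / 35 ≈ 2,857 entries once the cache is saturated, since every further insertion it forces evicts its own oldest record first.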
I modified the definition of
fix #2756 Increased route back cache to avoid dropping messages which are supposed to be routed back. The problem was that when some nodes are syncing, several chunk requests are made, and requests arrive faster than responses, effectively invalidating the cache; most of the responses are dropped, since the route back hash was dropped from the cache. Note: This is a vector of attack if we rely on route-back messages, since a malicious actor can feed the cache with new information, effectively invalidating our current cache.

Test plan
=========
Reproduced this scenario locally, and tested with and without the cache.
Increased route back cache to avoid dropping messages which are supposed to be routed back. The problem was that when some nodes are syncing, several chunk requests are made, and requests arrive faster than responses, effectively invalidating the cache; most of the responses are dropped, since the route back hash was dropped from the cache. This PR also introduces the cache strategy proposed [here](#2756 (comment)). Fix #2756

Test plan
=========
- Reproduced the scenario where several routed messages were dropped locally, and tested with the new cache and with the old cache. The fix was the size increase.
- Unit tests to check that the new strategy is working properly.
In #2490, we introduced network balancing to avoid reaching the maximum number of peers. However, it comes at a cost -- when a message is being routed, it can get dropped because the connection between two nodes on the route is dropped as they rebalance their connections. This makes the system slower overall. In some cases, the requester will re-request or request from other peers (for example, for partial encoded chunks). In other cases, the message is simply dropped. This can have some fairly undesirable consequences, such as block producers not getting enough approvals to produce their blocks. It seems that this also allows an attacker to initiate an eclipse attack on a node, which could have a profound impact on the overall network topology.
It is not clear to me, however, how to best solve this issue. Some ideas: