Improving Latency Measurement #166

Closed
dirkmc opened this issue Aug 2, 2019 · 11 comments

@dirkmc
Contributor

dirkmc commented Aug 2, 2019

#165 outlines latency measurement in Bitswap. The latency of the connection to a peer is maintained per-session by measuring the time between a request for a CID and receipt of the corresponding block.

There are a few issues with measuring latency in this fashion:

  1. Multiple sessions may concurrently request the same block.
    For example:

    • session A requests CID1
    • <1 second>
    • session B requests CID1
    • <100 ms>
    • the block for CID1 arrives (from the request by session A)
    • session A records latency as <1s + 100ms>
    • session B records latency as <100ms>
  2. Incoming blocks are processed one at a time.
    The latency measurement is adjusted once per block as each is processed, even though a message may contain many blocks that all arrived together.

  3. If a peer doesn't have the block, it simply doesn't respond.
    For broadcast requests, we ignore timeouts.
    For targeted requests (requests sent to specific peers) we calculate latency based on the timeout. This isn't really accurate, as the peer may not have responded simply because it didn't have the block.

  4. Ideally we would measure the "bandwidth delay product".
    The bandwidth delay product is the <bandwidth of the connection> x <the latency>. It measures the amount of data that can fit in the pipe, and can be used to ensure that the pipe is always as full as possible.
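
To make that concrete (numbers invented for illustration): on a 100 Mbit/s (12.5 MB/s) connection with 80 ms of latency, the bandwidth delay product is 12.5 MB/s × 0.08 s = 1 MB, so roughly 1 MB of requested-but-not-yet-received data has to be in flight at all times to keep the pipe full.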

Issues 1 & 2 can be addressed by measuring latency per-peer instead of per-session. This would also likely improve the accuracy of latency measurement as there would be a higher sample size.
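
As a thought experiment, a minimal sketch of what per-peer tracking could look like (all names invented here, not the go-bitswap API): sessions feed a shared moving average per peer.

```go
package main

import (
	"sync"
	"time"
)

const alpha = 0.5 // EWMA smoothing factor (assumed value)

// peerLatency is shared by all sessions talking to one peer, so
// concurrent requests enlarge a single sample set instead of
// producing conflicting per-session estimates.
type peerLatency struct {
	mu       sync.Mutex
	estimate time.Duration
	samples  int
}

// RecordResponse folds one request->block round trip into the estimate.
func (p *peerLatency) RecordResponse(rtt time.Duration) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.samples++
	if p.samples == 1 {
		p.estimate = rtt
		return
	}
	p.estimate = time.Duration(alpha*float64(rtt) + (1-alpha)*float64(p.estimate))
}
```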

Issue 3 can be addressed by either

  • changing the protocol such that if a peer doesn't have a CID it sends back a NOT_FOUND message
  • ignoring timeouts in latency calculations (this would be much simpler)

Issue 4 is more complex, and needs to consider transport specific nuances, such as TCP slow start. Latency may be a reasonable stand-in for now.

@Stebalien
Member

Stebalien commented Aug 2, 2019

I'm not sure how much effort we want to spend on accurately modeling the network as there are likely too many unknowns.

Knowns:

  • Our bandwidth (sort of). Unfortunately, knowing this accurately requires reserving bandwidth.
  • Latency (we could even use a separate protocol for this).

Unknowns:

  • Their bandwidth (fluctuates based on load).
  • Whether or not they have something (although I guess we could know this).

So, I'm thinking we should try to come up with a scheme that optimizes for what we want regardless of the behavior of the underlying system.

Let's start with a greedy approach: always ask the least loaded peer (i.e., the peer with the fewest blocks in its per-session wantlist). In general, the least loaded peer should be the fastest, as it's emptying its wantlist faster than the others.

This is basically the "max everyone out" strategy. A couple of nice properties are:

  1. We don't care about other sessions. If multiple sessions ask the same peer for the same block, that peer is, effectively, faster.
  2. We don't care about block sizes as we're not even trying to measure rates.

The issue with this approach is that it assumes we can fill all the wantlists. Given the iterative nature of bitswap, this approach will work best for large datasets (with large fanout).
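
A minimal sketch of that selection rule (types invented for illustration; the real logic would live in the session code):

```go
package main

import "math"

type peerID string

// leastLoadedPeer picks the peer with the fewest blocks outstanding in
// its per-session wantlist; ok is false when no peers are known.
func leastLoadedPeer(pending map[peerID]int) (best peerID, ok bool) {
	bestLoad := math.MaxInt32
	for p, n := range pending {
		if n < bestLoad {
			best, bestLoad, ok = p, n, true
		}
	}
	return best, ok
}
```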

Addendum 1: If we're using this approach and we notice that wantlists are maxing out, we'd increase the max size of that wantlist. (Edit: actually the opposite; we need to notice that the wantlist is emptying, which may be hard to determine...) We'd determine a peer's load based on the percentage of their wantlist that's full.

Addendum 2: We'd probably want to move peers out of the session (or into the bad set) when they stall for some period of time.

Coming up: Part two where we don't have a backlog (deep graph, low fanout, latency sensitive)...

@Stebalien
Member

After working through that design...

Multiple sessions may concurrently request the same block.

I'm not sure if this is really an issue in any case. The effective latency of that peer is lower than other peers.

Incoming blocks are processed one at a time.

This is annoying... It's actually worse than that because we still wait to consume the entire block before counting it (up to 2MiB).

If a peer doesn't have the block, it simply doesn't respond.

I think it's time to think about extending the protocol a bit. We should consider adding a set of flags for each entry in the wantlist:

  • Ack requested - please ack if you have this ASAP (before sending).
  • Nack requested - please nack if you don't have this (doesn't cancel, just tells us to look elsewhere).
  • ack-only - please don't actually send the block, just ack (item not added to wantlist, should we just have an entirely separate "ack only" list?).

It should be safe to ignore these flags if we don't understand them. Alternatively, we could bump the protocol (but that adds logic...).
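
For illustration, a hypothetical shape for an extended wantlist entry (field names invented; the real change would be to the bitswap protobuf). Unknown protobuf fields are skipped by older decoders, which is what makes it safe to bolt these on.

```go
// WantEntry sketches the proposed per-entry flags.
type WantEntry struct {
	Cid      string // stands in for a go-cid CID
	Priority int32
	Cancel   bool

	WantAck  bool // "ack requested": say you have it ASAP, before sending
	WantNack bool // "nack requested": say you don't have it; not a cancel
	AckOnly  bool // only ack, never send the block or add it to the wantlist
}
```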

@Stebalien
Member

Coming up: Part two where we don't have a backlog (deep graph, low fanout, latency sensitive)...

The tricky part here is that it will often be faster to ask the same peer for two blocks than to ask two peers for one block each. However, asking the same peer for 10 blocks may be slower than asking two peers.

So, my idea is to simply optimize this instead of trying to track latency/bandwidth. Basically, we ask peer A for 2 blocks and peer B for 1 block; if peer A returns both blocks first, we ask peer A for 3 blocks to peer B's 1 block.

If we aren't processing enough CIDs at a time, we may need to add some randomization into the mix. For example, once we prepare a set of CIDs to send (with the appropriate ratios), we could (occasionally) pick a peer that hasn't been queried in a while and move it just before the last peer in the set of peers we're querying (bumping out the last peer).

However, I'm not sure how to do this precisely.
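
One possible reading of this, as a sketch (all names invented): give each peer a share of every batch of CIDs, and bump whichever peer finishes its share first, so the ratios adapt without any explicit latency or bandwidth measurement.

```go
package main

type peerID string

type shares map[peerID]int

// split deals out a batch of CIDs according to the current shares.
// (Map iteration order is random in Go, which is fine for a sketch.)
func (s shares) split(cids []string) map[peerID][]string {
	out := make(map[peerID][]string)
	for len(cids) > 0 {
		progress := false
		for p, n := range s {
			take := n
			if take > len(cids) {
				take = len(cids)
			}
			if take > 0 {
				out[p] = append(out[p], cids[:take]...)
				cids = cids[take:]
				progress = true
			}
		}
		if !progress {
			break // all shares are zero; nothing left to assign
		}
	}
	return out
}

// wonBatch bumps the fastest peer's share for the next round.
func (s shares) wonBatch(p peerID) { s[p]++ }
```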

@dirkmc
Contributor Author

dirkmc commented Aug 3, 2019

The max-out-everyone strategy is a really nice idea 💡👍

It allows us to optimize for throughput instead of latency (which is what we really care about), adjusts dynamically as conditions in the network change, simplifies timeout tracking, and uses little memory (just short queues of pointers).

@dirkmc
Contributor Author

dirkmc commented Aug 5, 2019

If we add a NACK mechanism it should

  • help keep the size of the "live" request queue low
  • reduce the need for timers
  • allow us to more accurately track the likelihood that a peer has the block we want

Currently CANCEL messages are broadcast to all connected peers. We can use CANCEL as a "notification" that a peer has a block. For example when several nodes are attempting to download blocks from a "seed" peer, they can listen for CANCEL messages from the other downloaders and request those blocks from each other instead of the seed.

Conceptually the ideal protocol would be streaming, with two channels for:

  1. Control messages (WANT / CANCEL, ACK / NACK) - high priority
  2. Blocks - low priority

When the local node requests block CIDs

  • peers respond with ACK / NACK for each CID and immediately start sending blocks, in order, for ACKed CIDs.
  • as the local node receives blocks, it broadcasts CANCEL to all connected peers
  • peers that receive CANCEL remove any pending block from the queue of blocks to be sent to the local node

As nodes receive ACK / NACK / CANCEL messages they keep track of which CIDs each of their peers have.
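
A tiny sketch of that bookkeeping (names invented):

```go
package main

type peerID string
type cidKey string

// haveMap tracks which CIDs each peer is believed to have, updated as
// ACK / NACK / CANCEL control messages arrive.
type haveMap map[peerID]map[cidKey]bool

func (h haveMap) record(p peerID, c cidKey, has bool) {
	if h[p] == nil {
		h[p] = make(map[cidKey]bool)
	}
	h[p][c] = has // ACK/CANCEL => true, NACK => false
}
```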

One way to reason about which block CIDs to send to which peers is to think of each block as having a "potential" that we want to decrease to zero. When we send a WANT to a peer, we decrease the block's potential by some weight, according to the likelihood (potential) that the peer will respond with the block. The peer's potential for a block can be:

  • High: Peer sent a CANCEL message for the block CID
  • High: Peer is a provider for the block CID
  • Medium: Peer ACKed the CID of a different block in this session (weighted by ratio of ACK / NACK, with a higher weight for ACKs in this block's GetBlock() group)
  • Medium: Peer is a provider for a different block CID in the session / GetBlock() group
  • Low: Peer sent a WANT for the block CID (but didn't yet send a CANCEL)
  • Low: Peer sent a WANT for a different block CID in the session / GetBlock() group
  • Very low: Random peer

For example:

  • block CID1 has a potential of 2.2
  • peer A has sent a CANCEL message for the block CID1 (eg potential 1.0)
  • peer B has a high ACK / NACK ratio for blocks in this session (eg potential 0.7)
  • peer C sent a WANT for CID1 (eg potential 0.2)

Bitswap sends WANT CID1 to peers A, B and C, so the potential for CID1 becomes
2.2 - (1.0 + 0.7 + 0.2) = 0.3

Under this model Bitswap should choose the Peer for each CID so as to decrease the total potential (of all blocks) by the maximum amount. For example if Bitswap has a choice of

  • send CID5 to Peer A (potential 0.1)
  • send CID5 to Peer B (potential 0.7)
  • send CID8 to Peer A (potential 0.2)
  • send CID8 to Peer B (potential 0.5)

It would choose to send CID5 to Peer B (potential 0.7).

Each peer has a "live" send queue, ie a limited-size queue of requests that have been sent but no response has been received yet. As responses come in and slots open in the "live" queue for the peer, Bitswap performs the CID / Peer matching algorithm described above to choose which CID to put into the "live" queue and send to the peer. When a block is received the potential drops to zero and the block CID is removed from the queues. Bitswap adjusts the potential of CIDs as it receives WANT, CANCEL, ACK or NACK (including for CIDs in the "live" queue).
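
A sketch of that matching step (illustrative names only): among candidate (peer, CID) pairs with open slots, greedily send the WANT that removes the most potential. In the example above this picks CID5 for Peer B (0.7) first.

```go
package main

type peerID string
type cidKey string

// candidate is a (peer, CID) pair we could send a WANT for, carrying
// the peer's estimated potential for that CID (from CANCELs seen,
// provider records, ACK ratios, and so on).
type candidate struct {
	peer      peerID
	cid       cidKey
	potential float64
}

// nextSend greedily picks the candidate whose WANT would decrease the
// total potential the most; ok is false when there are no candidates.
func nextSend(cands []candidate) (best candidate, ok bool) {
	for _, c := range cands {
		if !ok || c.potential > best.potential {
			best, ok = c, true
		}
	}
	return best, ok
}
```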

@hannahhoward
Contributor

Quick thoughts:

  1. Agree with multiple issues identified with current latency tracking.
  2. To me, the question is one of time invested vs. payoff. The only real purpose latency serves (right now) is ordering the peers, so it doesn't have to be super precise. The real question is which of these is most important (or which is the lowest-hanging fruit).

I agree with @Stebalien that I would really like to see the protocol extended, and some of the proposals above look super useful. Note that each message queue currently rebroadcasts all wants every 30 seconds, and this is because the protocol has no ack or error correction. See ipfs/specs#201 for a proposed fix there.

My totally un-backed-by-data hunch is that the best kinds of payoff are in protocol extension, because currently we really don't do a lot in terms of negotiation of wants.

@Stebalien
Member

If we add a NACK mechanism it should
help keep the size of the "live" request queue low

That depends on how we implement this. In many cases, we won't want to cancel the live want as we still want the block.

But we can treat this leftover want differently (and not count it against the "pending" wants).

Currently CANCEL messages are broadcast to all connected peers.

Ideally, they should be sent to peers to which we've sent the want we're canceling. Is that not the case?

We can use CANCEL as a "notification" that a peer has a block.

That's not that reliable (e.g., context cancellation). But you're right in that we can use it as a signal that the peer might have the block and that we should consider sending them a want for it.

Conceptually the ideal protocol would be streaming, with two channels for:

This is effectively what we have. Every peer opens a single control channel and a new stream per set of blocks.

ACK / NACK

Really, these should probably be HAVE and HAVNT given that bitswap doesn't have requests (so ack/nack doesn't really make sense).

Really, we could eagerly send HAVE messages in some cases (more reliable than assuming that CANCEL means HAVE). We could also say that HAVE implies CANCEL?

peers respond with ACK / NACK for each CID and immediately start sending blocks, in order, for ACKed CIDs.

IMO, when sending wantlists, peers should indicate whether or not they want HAVE/HAVNT messages. For example, I might be sending an opportunistic want to the entire network. In that case, I likely want HAVE messages but not HAVNT messages. On the other hand, if I'm asking a single peer, I only want HAVNT messages and don't want HAVE messages.

The issue here isn't really bandwidth, it's packet count. For these messages to be useful, we need to send them quickly. That makes it hard to batch them with outgoing blocks. If we can't do that, we have to send two packets.

potential...

We should totally do it this way (ideally with a pluggable predictor engine).

Some day, bitswap will become sentient.

@dirkmc
Contributor Author

dirkmc commented Aug 7, 2019

Currently CANCEL messages are broadcast to all connected peers.

Ideally, they should be sent to peers to which we've sent the want we're canceling. Is that not the case?

If I'm reading this right, it seems like when the session receives a block it broadcasts CANCEL to all connected peers. Perhaps this is a bug, not a feature :)

You're right that it would be better to use distinct HAVE and CANCEL messages to distinguish the shutdown case. I like the idea of eagerly sending HAVE messages. It's particularly useful in the case where several peers want to get the same data, as it allows them to quickly form an overlay for a particular session and judiciously request blocks. Perhaps we could put a limit on the broadcast size, although it seems like it should be a cheap message to send so maybe it doesn't matter?

If we add a NACK mechanism it should help keep the size of the "live" request queue low

That depends on how we implement this. In many cases, we won't want to cancel the live want as we still want the block.

That's true, we'll still need to remember that we want it. But we can immediately remove the CID from the peer's request queue (instead of waiting for a timeout), which frees up a slot in the queue for another request. When we receive NACK CID1 from peer A we would:

  • remove CID1 from the request queue for peer A
  • increase CID1's potential
  • perform the peer/block matching algorithm to decide which other block to add to peer A's request queue
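
As a sketch (hypothetical names; not go-bitswap's session code), the key point being that step 1 frees the slot immediately rather than after a timeout:

```go
package main

type peerID string
type cidKey string

const nackBump = 0.7 // assumed weight restored when a peer NACKs

type session struct {
	live       map[peerID]map[cidKey]bool // per-peer "live" request queues
	potentials map[cidKey]float64         // remaining potential per CID
}

func (s *session) onNack(p peerID, c cidKey) {
	delete(s.live[p], c)        // 1. remove CID1 from peer A's queue
	s.potentials[c] += nackBump // 2. increase CID1's potential
	s.fillSlot(p)               // 3. re-run matching to refill the free slot
}

// fillSlot would run the peer/block matching described earlier.
func (s *session) fillSlot(peerID) {}
```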

IMO, when sending wantlists, peers should indicate whether or not they want HAVE/HAVNT messages

👍

Some day, bitswap will become sentient.

We're almost there :)

Sending eager HAVE messages is really just a way of spreading knowledge about block locations. It may be worth fleshing this out a little - for example, peers could include location information in HAVENT messages, e.g. in the scenario where there are connections peer A <-> peer B and peer B <-> peer C:

  • peer A -> [HAVE CID1] -> <broadcast>
  • peer C -> [WANT CID1] -> peer B
  • peer B -> [HAVENT CID1 / ASK peer A] -> peer C

@Stebalien
Member

If I'm reading this right, it seems like when the session receives a block it broadcasts CANCEL to all connected peers. Perhaps this is a bug, not a feature :)

Only when it's in the last wantlist we sent them:

```go
// From the message queue: a CANCEL is only queued if the CID was still
// in the wantlist we track for that session.
for _, e := range entries {
	if e.Cancel {
		if mq.wl.Remove(e.Cid, ses) {
			work = true
			mq.nextMessage.Cancel(e.Cid)
		}
	} else {
		if mq.wl.Add(e.Cid, e.Priority, ses) {
			work = true
			mq.nextMessage.AddEntry(e.Cid, e.Priority)
		}
	}
}
return work
```

Perhaps we could put a limit on the broadcast size, although it seems like it should be a cheap message to send so maybe it doesn't matter?

Unfortunately, small packets aren't much better than large packets. It may make sense to gossip some HAVE messages along with other messages but I'd punt on that till later.

That's true we'll still need to remember that we want it. But we can immediately remove the CID from the peer's request queue (instead of waiting for a timeout) which frees up a slot in the queue for another request. When we receive NACK CID1 from peer A we would

Exactly (but in addition to remembering that we want it, we shouldn't tell the peer that we no longer want it).

Really, we need to distinguish between wants we expect to be answered and wants we don't.

for example peers could include location information in HAVENT messages. eg in the scenario where there are connections

👍. I'd make that a new message type (ASK).

@dirkmc
Contributor Author

dirkmc commented Aug 7, 2019

Ah I see, thanks 👍

Agreed, it's simpler if we ignore broadcasting / gossiping for now.

@dirkmc
Contributor Author

dirkmc commented Mar 9, 2020

Fixed in ipfs/kubo#6782

@dirkmc dirkmc closed this as completed Mar 9, 2020