Improving Latency Measurement #166
Comments
I'm not sure how much effort we want to spend on accurately modeling the network as there are likely too many unknowns. Knowns:
Unknowns:
So, I'm thinking we should try to come up with a scheme that optimizes for what we want regardless of the behavior of the underlying system. Let's start with a greedy approach: always ask the least loaded peer (i.e., the peer with the fewest blocks in its per-session wantlist). In general, the least loaded peer should be the fastest as it's emptying its wantlist faster than the others. This is basically the "max everyone out" strategy. A couple of nice properties are:
The issue with this approach is that it assumes we can fill all the wantlists. Given the iterative nature of bitswap, this approach will work best for large datasets (with large fanout). Addendum 1: Addendum 2: We'd probably want to move peers out of the session (or into the bad set) when they stall for some period of time. Coming up: Part two where we don't have a backlog (deep graph, low fanout, latency sensitive)...
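For illustration, here is a minimal Go sketch of the greedy "ask the least loaded peer" rule. All types and names below are hypothetical and don't mirror the actual go-bitswap session code:

```go
package main

import "fmt"

// peerState tracks how many wants are outstanding on a peer for this session.
// These types are illustrative only and don't mirror go-bitswap's internals.
type peerState struct {
	id        string
	liveWants int  // blocks currently on this peer's per-session wantlist
	stalled   bool // peer moved to the "bad" set after stalling for too long
}

// leastLoadedPeer returns the non-stalled peer with the fewest live wants.
// Under the "max everyone out" strategy this is the peer to ask next, since
// it is emptying its wantlist faster than the others.
func leastLoadedPeer(peers []*peerState) *peerState {
	var best *peerState
	for _, p := range peers {
		if p.stalled {
			continue
		}
		if best == nil || p.liveWants < best.liveWants {
			best = p
		}
	}
	return best
}

func main() {
	peers := []*peerState{
		{id: "A", liveWants: 4},
		{id: "B", liveWants: 1},
		{id: "C", liveWants: 7, stalled: true},
	}
	next := leastLoadedPeer(peers)
	fmt.Println("ask peer", next.id) // ask peer B
	next.liveWants++                 // the new want now counts against B's load
}
```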
After working through that design...
I'm not sure if this is really an issue in any case. The effective latency of that peer is lower than that of other peers.
This is annoying... It's actually worse than that because we still wait to consume the entire block before counting it (up to 2MiB).
I think it's time to think about extending the protocol a bit. We should consider adding a set of flags for each entry in the wantlist:
It should be safe to ignore these flags if we don't understand them. Alternatively, we could bump the protocol (but that adds logic...).
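To make the flag idea a bit more concrete, here is one possible shape for per-entry flags, sketched as Go structs. The field names are hypothetical (this is not the actual wire format), and per the comment above, peers that don't understand a flag should be able to ignore it:

```go
package main

import "fmt"

// WantFlags is a hypothetical set of per-entry wantlist flags.
type WantFlags struct {
	WantHave     bool // reply with a HAVE if you have the block (don't send the block itself)
	SendDontHave bool // reply with a HAVNT/NOT_FOUND if you don't have the block
}

// WantEntry pairs a CID with a priority and its flags (the CID is shown as a
// plain string here purely for brevity).
type WantEntry struct {
	Cid      string
	Priority int32
	Cancel   bool
	Flags    WantFlags
}

func main() {
	// A want that asks the peer to acknowledge either way.
	entry := WantEntry{
		Cid:      "<cid>",
		Priority: 1,
		Flags:    WantFlags{WantHave: true, SendDontHave: true},
	}
	fmt.Printf("%+v\n", entry)
}
```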
The tricky part here is that it will often be faster to ask the same peer for two blocks than to ask two peers for one block each. However, asking the same peer for 10 blocks may be slower than asking two peers. So, my idea is to simply optimize this ratio instead of trying to track latency/bandwidth. Basically, we ask peer A for 2 blocks and peer B for 1 block; if peer A returns both blocks first, we ask peer A for 3 blocks to peer B's 1 block. If we aren't processing enough CIDs at a time, we may need to add some randomization into the mix. For example, once we prepare a set of CIDs to send (with the appropriate ratios), we could (occasionally) pick a peer that hasn't been queried in a while and move it just before the last peer in the set of peers we're querying (bumping out the last peer). However, I'm not sure how to do this precisely.
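A rough sketch of that ratio idea (hypothetical names, not the actual go-bitswap request splitter): split each batch of CIDs according to a per-peer share, and bump the share of whichever peer finishes its part of the batch first.

```go
package main

import "fmt"

// peerShare is how many blocks out of each batch a peer is asked for.
type peerShare struct {
	id    string
	share int
}

// splitBatch hands out CIDs round-robin, weighted by each peer's share.
func splitBatch(cids []string, peers []*peerShare) map[string][]string {
	out := make(map[string][]string)
	i := 0
	for i < len(cids) {
		for _, p := range peers {
			for n := 0; n < p.share && i < len(cids); n++ {
				out[p.id] = append(out[p.id], cids[i])
				i++
			}
		}
	}
	return out
}

// reward bumps the share of the peer that returned its batch first, so next
// time it is asked for proportionally more blocks.
func reward(p *peerShare) { p.share++ }

func main() {
	a := &peerShare{id: "A", share: 2}
	b := &peerShare{id: "B", share: 1}
	batch := []string{"cid1", "cid2", "cid3", "cid4", "cid5", "cid6"}

	fmt.Println(splitBatch(batch, []*peerShare{a, b})) // A is asked for 2 blocks to B's 1

	reward(a) // A returned both of its blocks before B returned one
	fmt.Println(splitBatch(batch, []*peerShare{a, b})) // the ratio shifts to 3:1 in A's favour
}
```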
The max-out-everyone strategy is a really nice idea 💡👍 It allows us to optimize for throughput instead of latency (which is what we really care about), adjusts dynamically as conditions in the network change, simplifies timeout tracking, and uses little memory (just short queues of pointers).
If we add a NACK mechanism it should
Currently CANCEL messages are broadcast to all connected peers. We can use CANCEL as a "notification" that a peer has a block. For example when several nodes are attempting to download blocks from a "seed" peer, they can listen for CANCEL messages from the other downloaders and request those blocks from each other instead of the seed. Conceptually the ideal protocol would be streaming, with two channels for:
When the local node requests block CIDs
As nodes receive ACK / NACK / CANCEL messages they keep track of which CIDs each of their peers has. One way to reason about which block CIDs to send to which peers is to think of each block as having a "potential" that we want to decrease to zero. When we send a WANT to a peer, we decrease the block's "potential" by some weight, according to the likelihood (potential) that the peer will respond with the block. The peer's potential for a block can be:
For example:
Bitswap sends WANT CID1 to peers A, B and C, so the potential for CID1 is reduced by the weight of each of those peers.

Under this model Bitswap should choose the Peer for each CID so as to decrease the total potential (of all blocks) by the maximum amount. For example if Bitswap has a choice of
Each peer has a "live" send queue, i.e. a limited-size queue of requests that have been sent but for which no response has been received yet. As responses come in and slots open in the "live" queue for the peer, Bitswap performs the CID / Peer matching algorithm described above to choose which CID to put into the "live" queue and send to the peer. When a block is received, the potential drops to zero and the block CID is removed from the queues. Bitswap adjusts the potential of CIDs as it receives WANT, CANCEL, ACK or NACK messages (including for CIDs in the "live" queue).
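Below is a condensed Go sketch of this "potential" bookkeeping (hypothetical types and weights, not a real implementation): every wanted CID starts with potential 1.0, each WANT sent to a peer subtracts that peer's weight, and free live-queue slots are filled with whichever CID/peer pair removes the most potential.

```go
package main

import "fmt"

// Illustrative weights: how likely a peer is to return a block for a CID.
const (
	weightHave    = 0.8 // peer ACKed / sent HAVE for this CID
	weightCancel  = 0.6 // peer CANCELed this CID (it got the block from someone else)
	weightDefault = 0.3 // no information about this peer/CID pair
)

type session struct {
	potential map[string]float64            // remaining potential per wanted CID
	weight    map[string]map[string]float64 // peer -> CID -> weight
	liveFree  map[string]int                // free slots in each peer's live send queue
}

// bestAssignment returns the (peer, CID) pair that would remove the most
// potential, considering only peers with a free live-queue slot.
func (s *session) bestAssignment() (bestPeer, bestCid string, found bool) {
	bestDrop := 0.0
	for p, free := range s.liveFree {
		if free == 0 {
			continue
		}
		for c, pot := range s.potential {
			if pot <= 0 {
				continue
			}
			drop := s.weightFor(p, c)
			if pot < drop {
				drop = pot
			}
			if drop > bestDrop {
				bestDrop, bestPeer, bestCid, found = drop, p, c, true
			}
		}
	}
	return
}

// sendWant records that a WANT for cid went into peer's live queue.
func (s *session) sendWant(peer, cid string) {
	s.potential[cid] -= s.weightFor(peer, cid)
	s.liveFree[peer]--
}

func (s *session) weightFor(peer, cid string) float64 {
	if w, ok := s.weight[peer][cid]; ok {
		return w
	}
	return weightDefault
}

func main() {
	s := &session{
		potential: map[string]float64{"cid1": 1.0},
		weight:    map[string]map[string]float64{"A": {"cid1": weightHave}},
		liveFree:  map[string]int{"A": 1, "B": 2},
	}
	if p, c, ok := s.bestAssignment(); ok {
		fmt.Println("send WANT", c, "to", p) // peer A removes the most potential
		s.sendWant(p, c)
	}
	fmt.Printf("remaining potential: %.1f\n", s.potential["cid1"]) // 0.2: maybe ask B too
}
```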
Quick thoughts:
I agree with @Stebalien that I would really like to see the protocol extended, and some of the proposals above look super useful. Note that each message queue currently rebroadcasts all wants every 30 seconds, and this is because the protocol has no ack or error correction. See ipfs/specs#201 for a proposed fix there. My totally un-backed-by-data hunch is that the best kinds of payoff are in protocol extension, because currently we really don't do a lot in terms of negotiation of wants.
That depends on how we implement this. In many cases, we won't want to cancel the live want as we still want the block. But we can treat this leftover want differently (and not count it against the "pending" wants).
Ideally, they should be sent to peers to which we've sent the want we're canceling. Is that not the case?
That's not that reliable (e.g., context cancellation). But you're right in that we can use it as a signal that the peer might have the block and that we should consider sending them a want for it.
This is effectively what we have. Every peer opens a single control channel and a new stream per set of blocks.
Really, these should probably be
Really, we could eagerly send HAVE messages in some cases (more reliable than assuming that
IMO, when sending wantlists, peers should indicate whether or not they want HAVE/HAVNT messages. For example, I might be sending an opportunistic want to the entire network. In that case, I likely want HAVE messages but not HAVNT messages. On the other hand, if I'm asking a single peer, I only want HAVNT messages and don't want HAVE messages. The issue here isn't really bandwidth, it's packet count. For these messages to be useful, we need to send them quickly. That makes it hard to batch them with outgoing blocks. If we can't do that, we have to send two packets.
We should totally do it this way (ideally with a pluggable predictor engine). Some day, bitswap will become sentient.
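A tiny sketch of the per-want preference described above (hypothetical names for the HAVE/HAVNT idea): broadcast wants ask for HAVE but not HAVNT, while targeted wants ask for HAVNT but not HAVE.

```go
package main

import "fmt"

// havePrefs says which presence responses we want back for a given want.
// Hypothetical names, mirroring the HAVE/HAVNT discussion above.
type havePrefs struct {
	WantHave  bool // please tell me if you DO have the block
	WantHavnt bool // please tell me if you DON'T have the block
}

// prefsFor chooses presence preferences based on how the want is being sent.
func prefsFor(broadcast bool) havePrefs {
	if broadcast {
		// Opportunistic want to many peers: HAVEs locate the block, HAVNTs are mostly noise.
		return havePrefs{WantHave: true, WantHavnt: false}
	}
	// Targeted want to a single peer: we only need to hear if it can't serve the block.
	return havePrefs{WantHave: false, WantHavnt: true}
}

func main() {
	fmt.Printf("broadcast: %+v\n", prefsFor(true))
	fmt.Printf("targeted:  %+v\n", prefsFor(false))
}
```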
If I'm reading this right, it seems like when the session receives a block it broadcasts CANCEL to all connected peers. Perhaps this is a bug, not a feature :) You're right that it would be better to use distinct HAVE and CANCEL messages to distinguish the shutdown case. I like the idea of eagerly sending HAVE messages. It's particularly useful in the case where several peers want to get the same data, as it allows them to quickly form an overlay for a particular session and judiciously request blocks. Perhaps we could put a limit on the broadcast size, although it seems like it should be a cheap message to send so maybe it doesn't matter?
That's true, we'll still need to remember that we want it. But we can immediately remove the CID from the peer's request queue (instead of waiting for a timeout), which frees up a slot in the queue for another request. When we receive
👍
We're almost there :) Sending eager HAVE messages is really just a way of spreading knowledge about block locations. It may be worth fleshing this out a little - for example, peers could include location information in HAVENT messages, e.g. in the scenario where there are connections
Only when it's in the last wantlist we sent them: see go-bitswap/messagequeue/messagequeue.go, lines 159-173 at f62bf54.
Unfortunately, small packets aren't much better than large packets. It may make sense to gossip some HAVE messages along with other messages but I'd punt on that till later.
Exactly (but in addition to remembering that we want it, we shouldn't tell the peer that we no longer want it). Really, we need to distinguish between wants we expect to be answered and wants we don't.
👍. I'd make that a new message type (ASK).
Ah I see, thanks 👍 Agreed, it's simpler if we ignore broadcasting / gossiping for now.
Fixed in ipfs/kubo#6782
#165 outlines latency measurement in Bitswap. The latency of the connection to a peer is maintained per-session, by measuring the time between a request for a CID and receipt of the corresponding block.
There are a few issues with measuring latency in this fashion:
Multiple sessions may concurrently request the same block.
For example:
- Session 1 requests a block from a peer
- <1 second> later, Session 2 requests the same block from the same peer
- <100 ms> later, the block arrives
- Session 1 records a latency of <1s + 100ms>, while Session 2 records a latency of <100ms>
Incoming blocks are processed one at a time.
The latency measurement is adjusted for each block in a message (a message may contain many blocks).
If a peer doesn't have the block, it simply doesn't respond.
For broadcast requests, we ignore timeouts.
For targeted requests (requests sent to specific peers) we calculate latency based on the timeout. This isn't really accurate, as the peer may not have responded simply because it didn't have the block.
Ideally we would measure the "bandwidth delay product".
The bandwidth delay product is `<bandwidth of the connection> x <the latency>`. It measures the amount of data that can fit in the pipe, and can be used to ensure that the pipe is always as full as possible (e.g., a 10 Mbit/s connection with 100 ms of latency can hold 1.25 MB/s x 0.1 s = 125 KB in flight).

Issues 1 & 2 can be addressed by measuring latency per-peer instead of per-session. This would also likely improve the accuracy of latency measurement, as there would be a larger sample size.
Issue 3 can be addressed by (among other options) adding a `NOT_FOUND` message.

Issue 4 is more complex, and needs to consider transport-specific nuances, such as TCP slow start. Latency may be a reasonable stand-in for now.
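On the per-peer measurement suggested for issues 1 & 2, here is a minimal sketch of what a shared, per-peer latency estimate could look like, using an exponentially weighted moving average (hypothetical names; not the existing go-bitswap code):

```go
package main

import (
	"fmt"
	"time"
)

// peerLatency keeps one latency estimate per peer, shared across sessions,
// so concurrent sessions and per-block adjustments don't skew each other.
type peerLatency struct {
	alpha     float64                  // EWMA smoothing factor, e.g. 0.5
	estimates map[string]time.Duration // peer ID -> smoothed latency
}

func newPeerLatency(alpha float64) *peerLatency {
	return &peerLatency{alpha: alpha, estimates: make(map[string]time.Duration)}
}

// observe folds one request->block round trip into the peer's estimate.
func (pl *peerLatency) observe(peer string, sample time.Duration) {
	prev, ok := pl.estimates[peer]
	if !ok {
		pl.estimates[peer] = sample
		return
	}
	smoothed := time.Duration(pl.alpha*float64(sample) + (1-pl.alpha)*float64(prev))
	pl.estimates[peer] = smoothed
}

func main() {
	pl := newPeerLatency(0.5)
	pl.observe("QmPeerA", 200*time.Millisecond)
	pl.observe("QmPeerA", 100*time.Millisecond)
	fmt.Println(pl.estimates["QmPeerA"]) // 150ms: samples from all sessions feed one estimate
}
```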