PoC of Bitswap protocol extensions implementation #189
force-pushed from baada99 to 60107ca
force-pushed from a319ef1 to e7c6084
Currently the proof-of-concept has some contention because of the way sessions with wants are matched with peers. This image from a recent presentation about Bitswap Improvements demonstrates the functionality we would like to implement: Currently we have a
The problem with this approach is that the
Ideally the session would be able to perform the matching algorithm asynchronously so as not to block the
This proposal eliminates the
When a session becomes interested in a peer, the session registers with the
The session keeps an ordered list of wants. Each want has a "potential gain" for each peer, depending on how confident the local node is that the peer has the block. The wants are ordered by maximum potential gain, then by FIFO. The "sent potential" for a want is the sum of the potential gains of the want / peer combinations that have already been sent to peers. For example, consider a scenario in which CID1 has a potential gain of
Initially CID1 has
The local node sends WANT CID1 to Peer A.
The order changes when:
In practice this sort can be a little fuzzy; it's not necessary for it to make exactly the right choice. The interfaces are:
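The interface definitions themselves are omitted above. Purely to illustrate the ordering described in this comment (maximum potential gain first, FIFO as the tie-breaker), here is a small self-contained sketch with made-up types — not the PR's actual code:

```go
package main

import (
	"fmt"
	"sort"
)

// want tracks one CID the session is trying to fetch.
// All names here are illustrative, not the PR's real types.
type want struct {
	cid       string             // the block's CID (a string for brevity)
	seq       int                // FIFO order: when the want was added
	potential map[string]float64 // potential gain per peer
	sent      float64            // "sent potential": sum of potentials already sent to peers
}

// maxPotential is the largest per-peer potential gain for this want.
func (w *want) maxPotential() float64 {
	m := 0.0
	for _, p := range w.potential {
		if p > m {
			m = p
		}
	}
	return m
}

// orderWants sorts by maximum potential gain (descending), then FIFO (ascending).
func orderWants(ws []*want) {
	sort.SliceStable(ws, func(i, j int) bool {
		pi, pj := ws[i].maxPotential(), ws[j].maxPotential()
		if pi != pj {
			return pi > pj
		}
		return ws[i].seq < ws[j].seq
	})
}

func main() {
	ws := []*want{
		{cid: "CID2", seq: 1, potential: map[string]float64{"PeerA": 0.5}},
		{cid: "CID1", seq: 0, potential: map[string]float64{"PeerA": 0.8, "PeerB": 0.5}},
	}
	orderWants(ws)
	for _, w := range ws {
		fmt.Printf("%s max potential %.1f sent %.1f\n", w.cid, w.maxPotential(), w.sent)
	}
}
```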
Benchmark Comparison: master vs proof-of-concept
All benchmarks assume a fixed network delay of 10ms. Code: benchmarks_test.go

3Nodes-AllToAll-OneAtATime
Fetch from two seed nodes that both have all 100 blocks.
This test fetches one block at a time until all 100 blocks have been fetched. The time taken is about the same, but the proof-of-concept fetches fewer duplicates.

3Nodes-AllToAll-BigBatch
Fetch from two seed nodes that both have all 100 blocks.
The proof-of-concept branch has less restrictive rate-limiting, so it fetches faster.

3Nodes-Overlap1-OneAtATime
Fetch from two seed nodes, one at a time, where:
The only difference here is that on the proof-of-concept branch, the remote peer responds with DONT_HAVE if it doesn't have a block. The session sees that all peers (just peer A in this case) have responded with DONT_HAVE for CID 75, so it immediately broadcasts (see the sketch after these benchmark notes).

3Nodes-Overlap3
The

3Nodes-Overlap3-OneAtATime
Request 100 blocks, one at a time, in series.
This benchmark tests that the session can retrieve blocks at a consistent rate from inconsistently populated seeds.

3Nodes-Overlap3-BatchBy10
Request 100 blocks, 10 at a time, in series.
The proof-of-concept branch sends HAVE and DONT_HAVE messages so that the requesting node can quickly determine where the desired blocks are, so overall it runs faster.

3Nodes-Overlap3-AllConcurrent
Request all 100 blocks in parallel as individual
As above, the proof-of-concept branch can quickly determine the block distribution and has less restrictive rate-limiting, so it's faster and there are fewer duplicates.

3Nodes-Overlap3-BigBatch
Request all 100 blocks at once with a single
Same reasons as above.

3Nodes-Overlap3-UnixfsFetch
Similar to how IPFS requests blocks in a DAG: request 1, then 10, then 89 blocks.
Same reasons as above.

10Nodes-AllToAll-OneAtATime
Request 100 blocks, one by one, from 9 seeds that each have all of the blocks.
In this case all seeds have all the blocks, so there is no advantage to the proof-of-concept's use of HAVE / DONT_HAVE messages.

10Nodes-AllToAll-BatchFetchBy10
Request 100 blocks, 10 at a time, from 9 seeds that each have all of the blocks.
Same as above.

10Nodes-AllToAll-BigBatch
Request all 100 blocks with a single
The proof-of-concept has less restrictive rate-limiting so it can fetch faster.

10Nodes-AllToAll-AllConcurrent
Request all 100 blocks in parallel as individual
The proof-of-concept has less restrictive rate-limiting so it can fetch faster.

10Nodes-AllToAll-UnixfsFetch
Similar to how IPFS requests blocks in a DAG: request 1, then 10, then 89 blocks.
The proof-of-concept has less restrictive rate-limiting and better duplicate block management.

10Nodes-AllToAll-UnixfsFetchLarge
Similar to how IPFS requests blocks in a DAG: request 1, then fetch 10 at a time up to 1000.
The proof-of-concept has less restrictive rate-limiting and better duplicate block management.

10Nodes-OnePeerPerBlock
The

10Nodes-OnePeerPerBlock-OneAtATime
Request 100 blocks, one by one, where blocks are randomly distributed across seeds.

10Nodes-OnePeerPerBlock-BigBatch
Request all 100 blocks with a single
The proof-of-concept branch has less restrictive rate-limiting and can quickly determine where the blocks are with HAVE / DONT_HAVE messages.

10Nodes-OnePeerPerBlock-UnixfsFetch
Similar to how IPFS requests blocks in a DAG: request 1, then 10, then 89 blocks, where blocks are randomly distributed across seeds.
Same reasons as above.

200Nodes-AllToAll-BigBatch
Fetch from 199 seed nodes, where all nodes have all blocks; fetch all 20 blocks with a single
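The broadcast trigger mentioned under 3Nodes-Overlap1-OneAtATime — all session peers have answered DONT_HAVE for a CID, so the session stops waiting and broadcasts — can be sketched roughly as follows (made-up types, not the actual session code):

```go
package main

import "fmt"

// sessionWants tracks, per CID, which session peers have replied DONT_HAVE.
// Illustrative sketch only; names do not match the PR's code.
type sessionWants struct {
	peers     map[string]bool            // peers currently in the session
	dontHaves map[string]map[string]bool // cid -> peers that sent DONT_HAVE
}

func newSessionWants() *sessionWants {
	return &sessionWants{
		peers:     make(map[string]bool),
		dontHaves: make(map[string]map[string]bool),
	}
}

// onDontHave records a DONT_HAVE and reports whether every session peer has
// now said DONT_HAVE for the CID, in which case the session should
// immediately broadcast the want to all connected peers.
func (sw *sessionWants) onDontHave(peer, cid string) bool {
	if sw.dontHaves[cid] == nil {
		sw.dontHaves[cid] = make(map[string]bool)
	}
	sw.dontHaves[cid][peer] = true
	for p := range sw.peers {
		if !sw.dontHaves[cid][p] {
			return false
		}
	}
	return true
}

func main() {
	sw := newSessionWants()
	sw.peers["PeerA"] = true
	// The only session peer says DONT_HAVE for CID 75: broadcast right away.
	fmt.Println("broadcast CID75:", sw.onDontHave("PeerA", "CID75"))
}
```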
Previously in the proof-of-concept engine we tried to pull data from the request queue up to a maximum size (in this case 1MiB). I benchmarked this approach against EC2 and it appears to take about the same amount of time but produce a lot less duplicate data:
1 leech / 4 seeds
9 leeches / 1 seed
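As a rough sketch of the mechanism discussed here — pulling tasks off the request queue until an outgoing message reaches a target size (1MiB in this case) — with made-up types rather than the engine's real task queue:

```go
package main

import "fmt"

// task represents one queued block to send to a peer (illustrative only).
type task struct {
	cid  string
	size int // block size in bytes
}

const maxMessageSize = 1 << 20 // 1MiB target per outgoing message

// nextMessage pops tasks from the head of the queue until adding another
// task would push the message past maxMessageSize. It always takes at
// least one task so oversized blocks can still be sent.
func nextMessage(queue []task) (msg []task, rest []task) {
	total := 0
	for i, t := range queue {
		if i > 0 && total+t.size > maxMessageSize {
			return queue[:i], queue[i:]
		}
		total += t.size
	}
	return queue, nil
}

func main() {
	q := []task{{"cid1", 600_000}, {"cid2", 600_000}, {"cid3", 100_000}}
	msg, rest := nextMessage(q)
	fmt.Println(len(msg), len(rest)) // 1 2: the second block would exceed 1MiB
}
```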
Hypotheses: Because we pack fewer blocks into a single message (as long as we include enough data to get over the minimum size), we've reduced the latency to the first block (because we no longer need to read a large message to get at the first block).
Suggestion: Consider making our timeouts variable (or just longer). This suggests that we're not waiting long enough.
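A back-of-the-envelope calculation of the "latency to first block" effect, assuming an illustrative 10 MiB/s link and the benchmarks' 10ms delay (the bandwidth figure is an assumption, not something measured here):

```go
package main

import "fmt"

func main() {
	// Assumed figures for illustration only.
	const bandwidth = 10 * 1024 * 1024 // bytes per second (10 MiB/s)
	const oneWayDelay = 0.010          // seconds (10ms network delay)

	// The first block in a message can only be read once the whole message
	// has arrived, so time-to-first-block grows with message size.
	ttfb := func(messageBytes float64) float64 {
		return oneWayDelay + messageBytes/bandwidth
	}

	fmt.Printf("1 MiB message:  ~%.0fms to first block\n", ttfb(1<<20)*1000)  // ~110ms
	fmt.Printf("64 KiB message: ~%.0fms to first block\n", ttfb(64<<10)*1000) // ~16ms
}
```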
Right, I think reduced latency to the first block makes sense. In any case it's demonstrably better, so let's keep this change :) With regards to the timeouts, I'm not sure which ones you mean?
Just popping by to say the results above are really awesome! Nice work @dirkmc
I assume we still have timeouts such that, if we don't receive a block from a peer within a period of time, we ask another peer for that block. Higher block latencies might cause us to trip over this timeout. Alternatively, it could be that we're now able to send a cancel fast enough to actually cancel these duplicate blocks. If this is the case, we might want to tune down how likely we are to ask for duplicate blocks.
Ah, I see. So the only timeout we have now is for when we don't get a response from all peers in the session: if that times out, then we broadcast (to all peers we're connected to). Currently I've implemented requests for a CID such that
In addition, because we have better knowledge of the distribution of blocks we can
So in practice we shouldn't be getting much duplicate data once we've started receiving responses and can get a sense of the block distribution.
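A toy model of the broadcast-on-idle-timeout behaviour described here — if no session peer responds within the timeout, rebroadcast the wants to all connected peers (hypothetical names; the PR's actual session logic is more involved):

```go
package main

import (
	"fmt"
	"time"
)

// runSession is a toy model: if no peer in the session responds within
// idleTimeout, broadcast the live wants to all connected peers.
// Illustrative sketch only, not the PR's actual code.
func runSession(idleTimeout time.Duration, responses <-chan string, broadcast func(), done <-chan struct{}) {
	for {
		select {
		case p := <-responses:
			// Any block / HAVE / DONT_HAVE response resets the idle timer,
			// because time.After is re-armed on the next loop iteration.
			fmt.Println("response from", p)
		case <-time.After(idleTimeout):
			broadcast()
		case <-done:
			return
		}
	}
}

func main() {
	responses := make(chan string)
	done := make(chan struct{})
	go runSession(50*time.Millisecond, responses, func() {
		fmt.Println("idle timeout: broadcasting wants to all connected peers")
	}, done)

	responses <- "PeerA"
	time.Sleep(120 * time.Millisecond) // no further responses: expect broadcasts
	close(done)
}
```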
Hm. Looking at those tables, we're still getting more duplicate data than I'd expect. It's much less than before, but we should be getting a few duplicate blocks at the beginning and then almost nothing.
For the 1 leech / 4 seed case the dup data is close to optimal. For the 9 leech / 1 seed case there is about 20% dup data. The reason I'm testing with the 9 leech / 1 seed use case is that it puts a lot of stress on the seed node. In practice I would assume it's quite unlikely for 9 nodes to start asking for data at exactly the same time - more likely in this kind of scenario the requests would be slightly staggered, which should reduce duplication. Do you think that's a reasonable assumption?
Somewhat, but we should test what happens when they're slightly staggered. I am a bit concerned about the streaming video use case, but we can optimize that later.
Musings for future improvement: I think we're missing a classification here. We currently have:
We probably want a third category:
Agreed 👍 Let's add that as a test plan parameter in testground (it's not currently possible in p2plab). With respect to sending wants, note that
@dirkmc I'm having trouble parsing the benchmarks above. It looks like the sum of the data sent is significantly more than the sum of the data received.
There was a lot of output, so I reduced it to just the relevant lines. I will run the benchmark again with full output to give you an idea.
Previous PoC: 7a00e140b73eec57141ae977fd782084dacaf59f
1 leech / 4 seeds
9 leeches / 1 seed
Something funky is going on here:
...
You're right, there is some weirdness going on there. I think maybe it's getting confused between messages received and blocks received (seeds should receive control messages but not blocks). I'll dig in and make sure it's reporting the right values. I've noticed that when there are only a few peers the stats seem a little wonky, but when there are several peers they seem to average out. Note that these stats are coming from
I will dig in and see if I can understand where the discrepancies are coming from for Bitswap; the libp2p part may take a little more digging, as I'm unfamiliar with that code.
Ah. Yeah, that may be: libp2p/go-flow-metrics#11.
I think the benchmarking data was just getting corrupted for that particular p2plab cluster. I created a new one and the results look reasonable:
Initial comments.
engine review
Working through a few more files.
Curious if this is intentionally a "Draft" PR? What's the timeline / blockers for merging this (and unblocking ipfs/kubo#6776)?
I consider this a draft PR. There are some things remaining for discussion in the PR itself, e.g.
In order to merge the PR, I would like to have corresponding repeatable testground tests that demonstrate
The testground tests should help answer some of the outstanding discussion points above.
(I'm currently reviewing this bit by bit; there's just a lot here.)
@@ -73,6 +78,19 @@ func New(ctx context.Context, id uint64, tagger PeerTagger, providerFinder PeerP
    return spm
}

func (spm *SessionPeerManager) ReceiveFrom(p peer.ID, ks []cid.Cid, haves []cid.Cid) bool {
Everything on this type goes through an event loop. This is unlikely to be safe.
All this logic should be handled by s.sprm.RecordPeerResponse (or would be if we passed haves to that function). Is that not the case?
If we need to be able to determine the number of active peers in the session, we probably need to send a request to the session peer manager over the channel.
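A generic sketch of that pattern — state owned by the run loop, queried via a request carrying a reply channel (hypothetical names, not the SessionPeerManager's real API):

```go
package main

import "fmt"

// countReq asks the run loop for the number of active peers and carries a
// channel on which the answer is returned. Names here are illustrative.
type countReq struct {
	resp chan int
}

type peerManager struct {
	requests chan countReq
	addPeer  chan string
	done     chan struct{}
}

// run owns all state; nothing outside this goroutine touches `peers`.
func (pm *peerManager) run() {
	peers := make(map[string]struct{})
	for {
		select {
		case p := <-pm.addPeer:
			peers[p] = struct{}{}
		case req := <-pm.requests:
			req.resp <- len(peers)
		case <-pm.done:
			return
		}
	}
}

// ActivePeerCount is safe to call from any goroutine: it sends a request
// into the event loop and waits for the reply.
func (pm *peerManager) ActivePeerCount() int {
	resp := make(chan int)
	pm.requests <- countReq{resp: resp}
	return <-resp
}

func main() {
	pm := &peerManager{
		requests: make(chan countReq),
		addPeer:  make(chan string),
		done:     make(chan struct{}),
	}
	go pm.run()
	pm.addPeer <- "PeerA"
	fmt.Println("active peers:", pm.ActivePeerCount())
	close(pm.done)
}
```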
@@ -41,6 +44,7 @@ type SessionPeerManager struct {
    ctx            context.Context
    tagger         PeerTagger
    providerFinder PeerProviderFinder
    peers          *peer.Set
duplicate of "activePeers"?
@@ -8,11 +8,14 @@ import (
    "time"
How much code in this service is dead? It looks like we're no longer using the peer optimizer, are we?
We don't really need the peer optimizer any more, as we just send want-haves to all peers and we have a different mechanism for selecting which peer to send a want-block to for a given CID (based on HAVE / DONT_HAVE responses from peers, instead of latency).
I wanted to check in with you before making drastic changes here. I think the parts of the SessionPeerManager that we still need are:
- add peer to the session when we get a block / HAVE from that peer for a CID in the session
- find more peers periodically, and also when the session gets a timeout for a CID
SessionPeerManager also takes care of tagging peers. I'm not so familiar with peer tagging, so I'm not sure if it still belongs here.
I believe the session peer manager just keeps track of which peers are in the session, so tagging makes sense.
Mostly documentation (and fixing some channel issues).
Before we merge, could we walk through the code and document everything? Importantly, document what each subsystem is supposed to do.
Let's merge master back in and squash before merging.
This commit extends the bitswap protocol with two additional wantlist properties:
* WANT_HAVE/HAVE: Instead of asking for a block, a node can specify that they want to know if any peers "have" the block.
* WANT_HAVE_NOT/HAVE_NOT: Instead of waiting for a timeout, a node can explicitly request to be told immediately if their peers don't currently have the given block.
Additionally, nodes now tell their peers how much data they have queued to send them when sending messages. This allows peers to better distribute requests, keeping all peers busy but not overloaded.
Changes in this PR are described in: #186
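A standalone sketch of what these extensions add, using made-up Go types rather than go-bitswap's actual message structures:

```go
package main

import "fmt"

// WantType distinguishes asking for the block itself from asking whether a
// peer has it. Illustrative sketch only, not go-bitswap's real types.
type WantType int

const (
	WantBlock WantType = iota // "send me the block"
	WantHave                  // "just tell me whether you have the block"
)

// Entry is one wantlist entry sent to a peer.
type Entry struct {
	Cid          string
	WantType     WantType
	SendDontHave bool // ask the peer to reply immediately if it lacks the block
}

// Response models what a peer can send back, plus how much data it still has
// queued for us, which lets the requester spread want-blocks across peers.
type Response struct {
	Cid          string
	Have         bool
	DontHave     bool
	PendingBytes int
}

func main() {
	e := Entry{Cid: "CID1", WantType: WantHave, SendDontHave: true}
	r := Response{Cid: "CID1", DontHave: true, PendingBytes: 0}
	fmt.Printf("%+v\n%+v\n", e, r)
}
```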
force-pushed from e93e3cd to b3a47bc
@dirkmc could you turn the remaining TODOs into issues? The ones in the main comment and:
I believe those were the main ones.
Shipppppeeeddddd 🚀 I'll make those issues tomorrow 👍
PoC of Bitswap protocol extensions implementation
This commit was moved from ipfs/go-bitswap@86178ba
This is a proof of concept of the Bitswap protocol extensions outlined in #186.
TODO:
wait at least this interval after the last invocation before calling the target function
if we're still debouncing after this amount of time, don't wait any more, call the target function anyway
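These two comments describe debounce semantics: wait for a quiet interval after the last call, but never postpone beyond a maximum wait. A minimal sketch of that behaviour (hypothetical helper, not the PR's implementation):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// debouncer calls fn only after `interval` has passed since the most recent
// trigger, but never later than `maxWait` after the first pending trigger.
// Illustrative sketch only.
type debouncer struct {
	interval time.Duration
	maxWait  time.Duration
	fn       func()

	mu      sync.Mutex
	timer   *time.Timer
	started time.Time // when the current pending batch began
}

func (d *debouncer) trigger() {
	d.mu.Lock()
	defer d.mu.Unlock()

	now := time.Now()
	if d.timer == nil {
		d.started = now
	} else {
		d.timer.Stop()
	}

	// Wait `interval` after this trigger, but don't exceed `maxWait`
	// measured from the first trigger of the batch.
	wait := d.interval
	if remaining := d.maxWait - now.Sub(d.started); remaining < wait {
		wait = remaining
	}
	if wait < 0 {
		wait = 0
	}

	d.timer = time.AfterFunc(wait, func() {
		d.mu.Lock()
		d.timer = nil
		d.mu.Unlock()
		d.fn()
	})
}

func main() {
	d := &debouncer{
		interval: 20 * time.Millisecond,
		maxWait:  100 * time.Millisecond,
		fn:       func() { fmt.Println("flushed at", time.Now().Format("15:04:05.000")) },
	}
	// Rapid triggers are coalesced; maxWait guarantees a flush even if
	// triggers never stop arriving.
	for i := 0; i < 10; i++ {
		d.trigger()
		time.Sleep(15 * time.Millisecond)
	}
	time.Sleep(200 * time.Millisecond)
}
```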