Skip to content
This repository has been archived by the owner on Jul 21, 2023. It is now read-only.

Limit scope of queries to k closest peers #97

Merged
merged 7 commits into from
Apr 17, 2019

Conversation

dirkmc
Copy link
Contributor

@dirkmc dirkmc commented Apr 10, 2019

Implements libp2p/go-libp2p-kad-dht#290

Stop querying once we have successfully queried >= K peers, and the only remaining peers waiting to be queried are further from the key than those K peers.

I also fixed a bug where we would return peers that we hadn't successfully connected to and refactored the Query class.

@ghost ghost assigned dirkmc Apr 10, 2019
@ghost ghost added the status/in-progress In progress label Apr 10, 2019
@dirkmc
Copy link
Contributor Author

dirkmc commented Apr 10, 2019

There are four places we perform queries:

  • getClosestPeers()
  • findPeer()
  • findProviders()
  • getMany()

I think it makes sense to implement this behaviour for the first two, but I'm not sure about findProviders() and getMany(). @jhiesey @vasco-santos @jacobheun I'd like to hear your thoughts.

Copy link
Member

@vasco-santos vasco-santos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree on moving in this direction! If the other folks agree, I will have a deeper look to the PR codebase

@jacobheun
Copy link
Contributor

I think the same logic applies for findProviders() and getMany(). Is your concern around these two not going deep enough into the network to find potentially closer peers?

@dirkmc
Copy link
Contributor Author

dirkmc commented Apr 11, 2019

findProviders() and getMany() use disjoint paths whereas getClosestPeers() and findPeer() do not, so I wanted to consider what effect that might have.

For example when running getMany(), we first work out the number of disjoint paths, which will be equal to alpha (ie 3). Results are divided among the paths, such that each path can return at most Math.ceil(valueCount / numPaths). So if we attempt to fetch 16 values with getMany() (as IPNS does by default) the number of values per path will be Math.ceil(16 / 3) = 6. If K=20, the value is stored at 20 peers. Consider a topology where there are 8 values along one path, 9 values along the second path and 3 along the third path. In the best case we will return 6, 6 and 3 = 15 values (because the maximum per path is 6).

Limiting the scope of the query makes it run faster, because we examine less nodes, but it also makes it less likely that we will discover all possible paths to the values, meaning getMany() is more likely to fail.

@jacobheun
Copy link
Contributor

Limiting the scope of the query makes it run faster, because we examine less nodes, but it also makes it less likely that we will discover all possible paths to the values, meaning getMany() is more likely to fail.

It sounds like this is where properly streaming responses would make it easier to achieve both. Until we add the streaming responses, we'll likely have to have the limits on those two be higher, because that seems likely to fail a lot. The math is not in IPNS's favor.

@dirkmc
Copy link
Contributor Author

dirkmc commented Apr 11, 2019

Streaming responses should help. For now we could also decrease the required number of values for IPNS to be less than 16. There can only be one source crypto key for IPNS records, which in practice usually means only one source node, so IPNS records will almost always all have the same value for the same key (they would only differ when an IPNS publish hasn't reached all the nodes that a previous IPNS publish reached).

I implemented a streaming version of the DHT in this PR: #82. It's waiting on async/await PRs for a few dependencies to be approved.

@dirkmc dirkmc force-pushed the fix/find-peer-limit-scope branch 3 times, most recently from c955bc5 to c59a738 Compare April 11, 2019 22:59
test/kad-dht.spec.js Outdated Show resolved Hide resolved
@dirkmc dirkmc marked this pull request as ready for review April 12, 2019 00:16
@vasco-santos
Copy link
Member

I agree with @dirkmc and starting by targetting getClosestPeers() and findPeer()!

After that, once we introduce the stream we figure out what limits we can add to the remaining two.

@dirkmc
Copy link
Contributor Author

dirkmc commented Apr 13, 2019

I think I've found a reason that the find peer test is often failing.

I noticed that by default we limit the number of peers returned from the FIND_NODE RPC to 6, whereas the Kademlia spec (and go-libp2p-kad-dht) specify that it should be KValue (20 by default).

This will cause a problem in networks where many nodes have accurate knowledge of the closest peers to a particular key (eg if that key is requested frequently). For example in the following scenario:

  • alpha (3) peers are chosen to initiate a getClosestPeers() request
  • each of the 3 peers returns the 6 closest peers to the key
  • each of those 6 closest peers return each other
  • the total number of peers queried will end up being low, potentially as low as 10

This will cause a problem for some DHT operations, for example when doing a PUT to the DHT, this will cause a problem because it needs to store the PUT value on at least KValue (20) nodes.

Does anyone know why this limit of 6 was chosen?

Here's what the Kademlia spec says:

FIND NODE takes a 160-bit ID as an argument. The recipient of the RPC returns <IP address, UDP port, Node ID> triples for the k nodes it knows about closest to the target ID. These triples can come from a single k-bucket, or they may come from multiple k-buckets if the closest k-bucket is not full. In any case, the RPC recipient must return k items (unless there are fewer than k nodes in all its k-buckets combined, in which case it returns every node it knows about).

@dirkmc
Copy link
Contributor Author

dirkmc commented Apr 13, 2019

I fixed a bug in the queueing logic, and changed the test so that the topology is more realistic (nodes know about peers close to them), so the tests now pass consistently

@jacobheun
Copy link
Contributor

Does anyone know why this limit of 6 was chosen?

No, and it looks like it's been there since day 1, #1. We should use what's in the spec unless we have clear, documented reasoning for deviating.

It looks like the bundlesize is failing, with the added code this is understandable. We should bump https://github.com/libp2p/js-libp2p-kad-dht/blob/v0.14.12/.aegir.js#L2 to 192kb.

@vasco-santos
Copy link
Member

@dirkmc did you check if in go land, they also use 6 ? If they do not, I am totally in favor of following the spec

@jacobheun
Copy link
Contributor

whereas the Kademlia spec (and go-libp2p-kad-dht) specify that it should be KValue (20 by default)

While go is following the spec on this item, we really should avoid doing things here based on the go implementation. The go implementation has known issues and regularly references needing to follow the Kademlia spec properly. I think we should focus energy in making sure we're abiding by original Kademlia spec and pushing the libp2p spec for it, libp2p/specs#108, forward, rather than attempting to mimic the go implementation.

As long as we're not breaking interop, and we document where and why we're deviating from go (and informing them if they're not aware), I think that's the approach we should be taking.

test/query.spec.js Outdated Show resolved Hide resolved
@dirkmc
Copy link
Contributor Author

dirkmc commented Apr 16, 2019

I changed the number of closer peers to be returned to K (@vasco-santos the go DHT does use the k value also: https://github.com/libp2p/go-libp2p-kad-dht/blob/master/handlers.go#L23)

src/query.js Outdated Show resolved Hide resolved
Copy link
Contributor

@jacobheun jacobheun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dirkmc
Copy link
Contributor Author

dirkmc commented Apr 16, 2019

@jacobheun do you think we should just enable this for getClosestPeers() and findPeer(), or for all four queries including findProviders() and getMany()?

@jacobheun
Copy link
Contributor

Perhaps we should start with getClosestPeers() and findPeer(), and then make a determination about the other two? Outlining the pros and cons for them would be helpful.

It might be valuable to have the logic in findProviders() moreso than getMany(), but it we can't yet stream, having outright failures is not good.

One of the current unfortunate things about findProviders() having that threshold, is that getting new content can be potentially impossible to get. For example, in ipfs-bitswap, when I try to get new content you've created, if you're the only provider when I search, I'll never get the content because 1 is less than my minimum threshold. It will eventually timeout because I can't find 20 peers. For DHT performance in IPFS we have to fix that.

Maybe it's worth adding findProviders until we can properly support streaming?

@vasco-santos
Copy link
Member

@jacobheun totally agree that we should follow the kademlia spec first and then try to iterate on improving it. In that time, we may check the go implementation looking for possible optimizations they did to the paper. We just need to make this visible for everyone being on the same page.

Regarding the 4 functions that perform queries, I think we should do this by steps. I would go with these two now. However, I agree that it would be worth exploring this for findProviders as well. My proposal is to get the first two merged and then work on adding the findProviders in a separate PR. That way, we can easily test the differences and revert findProviders if needed. What do you think? @dirkmc @jacobheun

@dirkmc
Copy link
Contributor Author

dirkmc commented Apr 16, 2019

Thinking about it a bit more, we may need to decide for each case in which we call any of those four functions. For example when doing a DHT PUT we call getClosestPeers() and store the value at each of those peers. We want to be sure that those really are the closest peers or it will decrease the chance of getMany() finding enough values.

@vasco-santos
Copy link
Member

@dirkmc do you think it is worth putting together a list of all the possible cases and discuss for each of them which is the best solution. We should also get @jhiesey involved on these decisions

@dirkmc
Copy link
Contributor Author

dirkmc commented Apr 17, 2019

Yes I think that makes sense, I will put together a list

@dirkmc
Copy link
Contributor Author

dirkmc commented Apr 17, 2019

The paths through the code that create a Query are:

  • getMany()
  • findProviders()
  • findPeer()
    • As a public API method
    • Called by random walk to discover peers
  • getClosestPeers()
    • As a public API method
    • Called by provide() to add self as a provider for a key
    • Called by put() to find nodes on which to store a key / value in the DHT

@jacobheun
Copy link
Contributor

So getClosestPeers() belongs only to write actions to the network. This makes sense because we want to be accurate when we're storing values. The downside of this is that getClosestPeers should, by design, be "slow", moreso for newer peers with sparse routing tables.

For the fetching functions getMany, findPeer, findProviders we should be as quick as possible. As mentioned, I think this may warrant reducing the tolerance on number of peers required for a success. Finding 16 records when there are potentially only 20 nodes who have values feels too high to me, especially if the query should be relatively quick. With a network that is still in flux, with so many new nodes coming and going, I think lowering the threshold at least initially is a reasonable approach.

For eventually getting the tolerance back up to a higher level (16), there's discussion about adopting sloppiness at libp2p/specs#163. While it's not the same concept of sloppiness, I was thinking about how being sloppy with the number of writes, combined with eventual accuracy, might be helpful in alleviating some of this. For example, we discussed concerns about spending enough time finding closest peers when performing a PUT, so maybe exiting early isn't ideal, there could be closer peers. If we operated "sloppily" we could perform a PUT with the closest kValue (20) we have when we would normally end the query, and then continue searching looking for eventual accuracy. Once the query was "done", we could PUT to any closer peers we hadn't already. This breaks the Spec a bit because we're potentially writing to more than kValue peers, but I think there is merit in the fault tolerance it provides to the network.

@dirkmc
Copy link
Contributor Author

dirkmc commented Apr 17, 2019

That's an interesting idea. I wonder in practice how often we won't find the actual closest 20 peers. Maybe it's worth merging this in and trying it out in the real network?

@jacobheun
Copy link
Contributor

Maybe it's worth merging this in and trying it out in the real network?

Yes, I think getting this out so we can test with it and iterate is a good path forward.

@vasco-santos
Copy link
Member

Thanks for the list @dirkmc

I liked your proposal @jacobheun

I will merge this as it looks good and we need to test it, but I believe we will want to iterate on this.
There is important information that we will need for the next iterations, so I will create an issue for tracking the good ideas from this PR and we can keep discussing there.

@vasco-santos vasco-santos merged commit f03619e into libp2p:master Apr 17, 2019
@ghost ghost removed the status/in-progress In progress label Apr 17, 2019
@vasco-santos vasco-santos mentioned this pull request Apr 18, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants