
DHT Query Performance #88

Closed
vyzo opened this issue Sep 12, 2017 · 52 comments
Labels: P0 (Critical: Tackled by core team ASAP), status/ready (Ready to be worked)

Comments

@vyzo (Contributor) commented Sep 12, 2017

DHT queries are slow, on the order of several seconds, which is a performance problem.
This issue is here to discuss performance and propose ways to measure and improve it.

@vyzo (Contributor, Author) commented Sep 12, 2017

At first glance, a potentially big chunk of the query slowdown comes from the need to resolve the public key in order to verify the record in the response.

Specifically, every time we get a record in getValueOrPeers, we invoke verifyRecordOnline, which in turn invokes GetPublicKey. This first attempts to extract the key from the peer ID (Ed25519 keys), then tries to query the node directly, and finally hits the DHT again to retrieve the key.

This suggests a possible optimization: piggyback the key for the record in the query response.

Note that Ed25519 keys will make this point moot, as the key will be extractable from the peer ID. Nonetheless, we will have to deal with RSA keys for years to come, so it's worthwhile to optimize this case.
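
For illustration, here is a minimal sketch of the Ed25519 case, assuming the go-libp2p-core APIs (module paths have moved around across versions): small keys are inlined into the peer ID itself, so no lookup is needed, whereas an RSA peer ID is only a hash of the key and forces the slow path described above.

```go
package main

import (
	"crypto/rand"
	"fmt"

	"github.com/libp2p/go-libp2p-core/crypto"
	"github.com/libp2p/go-libp2p-core/peer"
)

func main() {
	// Generate an Ed25519 identity; keys this small are inlined into the
	// peer ID as an identity multihash.
	_, pub, err := crypto.GenerateEd25519Key(rand.Reader)
	if err != nil {
		panic(err)
	}
	id, err := peer.IDFromPublicKey(pub)
	if err != nil {
		panic(err)
	}

	// No network round trip: the key comes straight out of the ID.
	// For RSA peer IDs this fails, and the key has to be fetched from the
	// peer or from the DHT instead.
	extracted, err := id.ExtractPublicKey()
	if err != nil {
		panic(err)
	}
	fmt.Println("extracted key matches:", extracted.Equals(pub))
}
```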

@vyzo (Contributor, Author) commented Sep 12, 2017

Another possible issue is the big lock in messageSender.SendRequest.

The lock is held for the duration of each request (send plus receive of the response), which means we can only have a single outstanding request per peer at a time.
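
To make the serialization concrete, here is a simplified sketch of the pattern (not the actual go-libp2p-kad-dht code; Message and the wire interface are stand-ins for the protobuf message and the single stream we keep open to a peer):

```go
package dhtsketch

import (
	"context"
	"sync"
)

// Message stands in for the DHT protobuf message type.
type Message struct{ Data []byte }

// wire stands in for the single stream kept open to a peer.
type wire interface {
	write(*Message) error
	read() (*Message, error)
}

type messageSender struct {
	lk sync.Mutex
	w  wire
}

// SendRequest holds the lock for the full send+receive round trip, so at
// most one request per peer is ever in flight -- the bottleneck noted above.
func (ms *messageSender) SendRequest(ctx context.Context, req *Message) (*Message, error) {
	ms.lk.Lock()
	defer ms.lk.Unlock()

	if err := ms.w.write(req); err != nil {
		return nil, err
	}
	// A concurrent caller is blocked on the mutex until this read returns.
	return ms.w.read()
}
```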

@whyrusleeping (Contributor)

One other related thing I noticed was that it appears we are sending out provider messages to the same peers for the same keys multiple times. This should probably get investigated too at some point.

@vyzo (Contributor, Author) commented Sep 22, 2017

That definitely sounds like it needs to be investigated.

@vyzo (Contributor, Author) commented Sep 22, 2017

Some analysis of the results with pipelining (#92) suggests that we have a 75ms response time in findPeerSingle:

{"event":"findPeerSingleBegin","peerID":"QmdFJrz6FXQU15UVCV1Evs4Ftq8xVvAtxfacC23cXySpQZ","system":"dht","time":"2017-09-21T14:54:58.09138292Z"}
{"event":"findPeerSingleBegin","peerID":"QmdFJrz6FXQU15UVCV1Evs4Ftq8xVvAtxfacC23cXySpQZ","system":"dht","time":"2017-09-21T14:54:58.091476318Z"}
{"event":"findPeerSingleBegin","peerID":"QmdFJrz6FXQU15UVCV1Evs4Ftq8xVvAtxfacC23cXySpQZ","system":"dht","time":"2017-09-21T14:54:58.09169362Z"}
{"duration":107551379,"event":"findPeerSingle","peerID":"QmdFJrz6FXQU15UVCV1Evs4Ftq8xVvAtxfacC23cXySpQZ","system":"dht","time":"2017-09-21T14:54:58.199282563Z"}
{"duration":175856230,"event":"findPeerSingle","peerID":"QmdFJrz6FXQU15UVCV1Evs4Ftq8xVvAtxfacC23cXySpQZ","system":"dht","time":"2017-09-21T14:54:58.267292144Z"}
{"duration":256608137,"event":"findPeerSingle","peerID":"QmdFJrz6FXQU15UVCV1Evs4Ftq8xVvAtxfacC23cXySpQZ","system":"dht","time":"2017-09-21T14:54:58.348002273Z"}

@vyzo (Contributor, Author) commented Sep 26, 2017

Analysis of event log data from the SOL indicates a very long-tailed distribution in our response times for all request types. Most requests (80%) take a short time -- under 10ms -- but there is a very long tail that reaches into the tens of seconds.

Processed data and plots from the analysis are pinned at /ipfs/QmfW3YHad4RGHu899u8GopvoihgWgo58Az5Mv6VvdsjX1N

@dirkmc (Contributor) commented Feb 8, 2018

@vyzo @whyrusleeping @Stebalien I'm interested in helping out on this, what's the current plan?

@Stebalien (Member)

@dirkmc

  • Now: finish DHT Request pipelining #92.
  • Soon: implement a request-based protocol, although this may not be necessary if DHT Request pipelining (#92) ends up working.
  • Medium-Short Term: finish getting QUIC integrated (will make dialing faster).
  • Long Term: Get a (possibly unreliable) request/message transport in libp2p.

@dirkmc (Contributor) commented Feb 8, 2018

Ok thanks for the update @Stebalien, sounds like you guys have this covered :)
I'll ask the same over at ipfs/kubo#2934

@dirkmc (Contributor) commented Feb 12, 2018

I'm not sure how often this occurs in practice, but I noticed that if GetPublicKey can't extract the public key, get it from the peerstore, or get it from the node itself, then it will call GetValue(). GetValue() will in turn request 16 values from the closest peers. However, the public key is immutable, so if I understand correctly it should only need one value, right? Would it make sense to call getPublicKeyFromNode() and GetValues(ctx, key, 1) in parallel?
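
A rough sketch of that parallel lookup, for illustration only -- the two fetchers are injected as closures standing in for getPublicKeyFromNode and a GetValues(ctx, key, 1) call, and PubKey is a stand-in for the libp2p crypto.PubKey type:

```go
package dhtsketch

import (
	"context"
	"errors"
)

// PubKey stands in for the libp2p crypto.PubKey type.
type PubKey interface {
	Verify(data, sig []byte) (bool, error)
}

// getPublicKeyParallel races a direct request to the peer against a
// single-value DHT lookup and returns whichever valid key arrives first.
func getPublicKeyParallel(
	ctx context.Context,
	fromNode, fromDHT func(context.Context) (PubKey, error),
) (PubKey, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stop the slower fetcher once we have a winner

	type result struct {
		key PubKey
		err error
	}
	results := make(chan result, 2)
	for _, fetch := range []func(context.Context) (PubKey, error){fromNode, fromDHT} {
		go func(f func(context.Context) (PubKey, error)) {
			k, err := f(ctx)
			results <- result{k, err}
		}(fetch)
	}

	var lastErr error
	for i := 0; i < 2; i++ {
		r := <-results
		if r.err == nil && r.key != nil {
			return r.key, nil // first success wins
		}
		if r.err != nil {
			lastErr = r.err
		}
	}
	if lastErr == nil {
		lastErr = errors.New("public key not found")
	}
	return nil, lastErr
}
```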

@whyrusleeping (Contributor)

@dirkmc Yeah, it might help to set that down to one value. We will need to expose the GetValues method on the routing/dht interface for that.

I'm curious how often that occurs in practice. Maybe throw a log in there and see how often it gets hit?

@dirkmc (Contributor) commented Feb 12, 2018

It looks like GetValues is already exposed: https://github.com/libp2p/go-libp2p-routing/blob/master/routing.go#L58

@dirkmc (Contributor) commented Feb 12, 2018

When trying this out today, I found that my node rarely receives a value from more than 8 or 9 nodes for a particular key before the one-minute timeout is reached. If I reduce the number of required values to one (e.g. for retrieving a public key), it usually resolves within about 10 seconds.

When calling GetPublicKey() it was rarely able to resolve the public key by asking the peer with that public key directly (maybe the peer itself was offline). Most of the time it would go out to the DHT to retrieve the public key. This seems to be pretty common for the use case of namesys resolution of an arbitrarily chosen name.

One complication is that the nvals parameter in GetValues(ctx context.Context, key string, nvals int) refers to the number of responses to wait for, but those responses can include failures. So if we reduce nvals to one and the one node we query responds with a failure (e.g. because that particular node doesn't have a route to find the value), then we fail. Would it make sense to provide another parameter to GetValues() to tell it the minimum number of successful values to wait for?

@whyrusleeping (Contributor)

@dirkmc great findings on the number of values received. We should look into different ways of getting that 8 or 9 as high as we can.

As for nvals, it shouldn't include failures. It should only count successful values received. I think that's a bug otherwise.

@whyrusleeping (Contributor)

Ah, it counts invalid records: https://github.com/libp2p/go-libp2p-kad-dht/blob/master/routing.go#L208

So that's not for arbitrary failures, but specifically for when we receive a value that is not valid. In the case of a public key, this could mean they sent us back the wrong key. So you're right, this can't be dropped to 1.

@dirkmc (Contributor) commented Feb 12, 2018

Ah yes, you're right, it's only invalid records. That seems like a pretty rare case; maybe the answer in that case is just to return an error from GetPublicKey()?

It seems like the successful set of values is reached relatively fast, maybe in the first 20 seconds, and then it waits for the rest of the dial attempts to time out. Is there a way to fail fast for dial errors, rather than waiting for timeouts? For example, if it's a connection refused error, that should return pretty quickly, right?
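
For reference, a small sketch of how one might separate the fast-failing case from a timeout, using ordinary Go error inspection (this is not existing libp2p code):

```go
package dhtsketch

import (
	"errors"
	"net"
	"syscall"
)

// classifyDialErr separates dial errors that fail fast (connection refused)
// from those that only surface after a timeout, e.g. a firewall silently
// dropping the SYN, which is the case discussed in the next comment.
func classifyDialErr(err error) string {
	var nerr net.Error
	switch {
	case errors.Is(err, syscall.ECONNREFUSED):
		return "refused: fails fast"
	case errors.As(err, &nerr) && nerr.Timeout():
		return "timeout: likely dropped packets"
	default:
		return "other"
	}
}
```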

@whyrusleeping (Contributor)

Connection refused returns pretty quickly, but most failed dials on the internet don't hit that. Most firewalls on the internet just drop packets rather than rejecting them. If a firewall drops your SYN packet, you're SOL until the timeout.

@dirkmc (Contributor) commented Feb 13, 2018

That makes sense.

Peers should only get into my kbucket if either I have been able to ping them, or a node that I can reach has been able to ping them. So in practice I would imagine the reason for not being able to reach a peer in the kbucket should only infrequently be because of firewalls, right?

@whyrusleeping (Contributor)

Firewalls, or if they've gone offline.

@dirkmc (Contributor) commented Feb 13, 2018

I'm going to dig into the network code a little more tomorrow.
One way of improving the response rate would be to keep the kbuckets more up to date somehow.
Another way that occurs to me would be to keep a record of the ping time to each peer and use some multiple of that time as the timeout, rather than assuming one minute for every node. E.g. if I have a 200ms ping to a node, the next time I contact it I can probably assume it will respond within a second or two.
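
A sketch of that idea, purely illustrative (peers are keyed by ID string and the multiplier, floor, and fallback values are made up): keep a smoothed RTT per peer and derive the dial timeout from it instead of using a flat one minute.

```go
package dhtsketch

import (
	"sync"
	"time"
)

// rttBook keeps a smoothed RTT per peer and derives a dial timeout from it.
type rttBook struct {
	mu   sync.Mutex
	rtts map[string]time.Duration
}

func newRTTBook() *rttBook {
	return &rttBook{rtts: make(map[string]time.Duration)}
}

// Observe folds a new RTT sample into an exponentially weighted average.
func (b *rttBook) Observe(peer string, sample time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if old, ok := b.rtts[peer]; ok {
		b.rtts[peer] = (old*7 + sample) / 8
		return
	}
	b.rtts[peer] = sample
}

// DialTimeout returns a small multiple of the smoothed RTT, with a floor,
// and a conservative fallback when we have never pinged the peer.
func (b *rttBook) DialTimeout(peer string) time.Duration {
	b.mu.Lock()
	defer b.mu.Unlock()
	rtt, ok := b.rtts[peer]
	if !ok {
		return 60 * time.Second // no history: keep the existing default
	}
	t := rtt * 10
	if t < 2*time.Second {
		t = 2 * time.Second
	}
	return t
}
```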

@whyrusleeping (Contributor)

@dirkmc Thinking through things, I actually think that we remove peers from our routing table once we are no longer connected to them (see #45). While this might technically be a bug, it makes the problem a little simpler: pretty much every peer we get handed right now via DHT queries should be there, it's just a matter of actually connecting to them.

@vyzo (Contributor, Author) commented Feb 18, 2018

that's very interesting indeed; thanks @dirkmc.

@dirkmc (Contributor) commented Feb 19, 2018

It seems like the reason for the successful Connect after exactly 60 seconds is this: once the rate-limited dialer has hit its limit of 160 concurrent dials, it blocks until it gets a success or error response (e.g. a 60-second dial timeout) before trying the next dial in its queue. 60 seconds after a request begins, a whole bunch of these time out at once, freeing up slots in the dialer. The dialer then dials a batch of new addresses, sometimes for a peer where a previous address failed. When one of these succeeds, we get the successful connect after exactly 60 seconds mentioned above.

Quick fixes:

  • In the logs I see that we sometimes dial private addresses, e.g. 10.x.x.x and 192.168.x.x
  • The dialer sometimes dials different ports on the same IP address, which doesn't make sense for some kinds of errors (e.g. no route to host)

Other ideas:

  • It's still not clear to me why such a high percentage of dials time out. It would be good to get to the bottom of this.
  • Currently the dialer will exhaust the 8 addresses for a peer before moving on to the next peer. There may be performance gains if the dialer instead selected addresses round-robin across the waiting peers (see the sketch after this list), e.g.
    peer[0].addr[0], peer[1].addr[0], peer[2].addr[0]...
    peer[0].addr[1], peer[1].addr[1], peer[2].addr[1]...
    This would mean the dialer spreads its attempts across the breadth of peers quickly, hopefully reaching a live peer sooner, rather than going deep into one peer's addresses before starting on the next peer.
  • I think the most dramatic performance improvement would be to decrease the 60-second dial timeout. One suggestion would be for peers to include their RTT to each node in the response to the closest-peers request. If the RTT from Alice to Bob is 200ms, and from Bob to Carol is 300ms, then a reasonable estimate can be made for the RTT from Alice to Carol, and a correspondingly lower dial timeout can be used.
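
Here is a minimal sketch of the breadth-first ordering from the second bullet. Addresses are plain strings here for simplicity; the real dialer works with multiaddrs and peer address records.

```go
package dhtsketch

// interleaveAddrs orders dial candidates breadth-first: every peer's first
// address before any peer's second address, and so on, so we spread attempts
// across many peers quickly instead of exhausting one peer's list first.
func interleaveAddrs(peers [][]string) []string {
	var out []string
	for i := 0; ; i++ {
		added := false
		for _, addrs := range peers {
			if i < len(addrs) {
				out = append(out, addrs[i])
				added = true
			}
		}
		if !added {
			return out // no peer has an i-th address left
		}
	}
}
```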

@Stebalien (Member)

> It's still not clear to me why such a high percentage of dials time out. It would be good to get to the bottom of this.

Part of this is because we were remembering (indefinitely) and gossiping bad peer addresses. This will be partially fixed in the next release (once a large fraction of the network has upgraded), but we should put some work into fixing this issue in general. I've filed libp2p/libp2p#27 to discuss.

> The dialer sometimes dials different ports on the same IP address, which doesn't make sense for some kinds of errors (e.g. no route to host)

I assume these dials are failing quickly, so I don't think handling that error would help.

> It's still not clear to me why such a high percentage of dials time out. It would be good to get to the bottom of this.

  • Lots of ephemeral nodes that act as DHT nodes. Ideally, nodes would only start acting as DHT servers after a period of stable uptime and after noticing that they appear to be reachable from the network.
  • Lots of ephemeral addresses that we shouldn't be recording (recently fixed in "fix peerstore apocalypse redux", go-libp2p#265).

> I think the most dramatic performance improvement would be to decrease the 60-second dial timeout. One suggestion would be for peers to include their RTT to each node in the response to the closest-peers request. If the RTT from Alice to Bob is 200ms, and from Bob to Carol is 300ms, then a reasonable estimate can be made for the RTT from Alice to Carol, and a correspondingly lower dial timeout can be used.

I believe the real trick will be introducing finer-grained timeouts. Currently we have a global dial timeout (which includes, e.g., security transport negotiation), but we really need a shorter timeout at the TCP level. Issue: libp2p/go-tcp-transport#22
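
As a rough illustration of the finer-grained timeout idea, using only the standard library (the actual libp2p transports configure this through their own options; see the linked issue): give the raw TCP connect a short budget of its own while the caller's context still bounds the whole dial.

```go
package dhtsketch

import (
	"context"
	"net"
	"time"
)

// dialWithTCPBudget bounds just the TCP connect to a few seconds, so a
// silently dropped SYN fails long before the global one-minute dial
// timeout, while ctx still covers the overall dial plus handshake.
func dialWithTCPBudget(ctx context.Context, addr string) (net.Conn, error) {
	d := net.Dialer{Timeout: 5 * time.Second} // TCP connect budget only
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return nil, err
	}
	// Security/muxer negotiation would follow here, under ctx.
	return conn, nil
}
```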


Note: Many of these issues will be fixed when we get QUIC support. As QUIC connections don't take up file descriptors, we'll be able to spam as many concurrent dials as we want and cancel all but the first that succeeds.

@MichaelMure (Contributor)

Some thoughts, as this problem has really bitten me in my project.

As I understand it, when a DHT query is done we accumulate values until we get nvals = 16 of them or until the 60-second timeout is triggered. As @dirkmc suggested, nvals could be reduced or made configurable. But what about simply answering the query with the values we have when the timeout hits, even if we don't have nvals of them?
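
To illustrate the proposal, a hedged sketch (Record and the input channel are simplified stand-ins for the DHT's record stream): stop at nvals if possible, but when the deadline fires, return whatever has been collected instead of failing outright.

```go
package dhtsketch

import "context"

// Record stands in for a DHT record received from a peer.
type Record struct{ Value []byte }

// collectValues gathers records until it has nvals of them, but if the
// context deadline fires first (or the query runs out of peers), it returns
// whatever it has collected instead of failing outright.
func collectValues(ctx context.Context, in <-chan Record, nvals int) []Record {
	var got []Record
	for len(got) < nvals {
		select {
		case r, ok := <-in:
			if !ok {
				return got // no more peers to ask
			}
			got = append(got, r)
		case <-ctx.Done():
			return got // timeout: answer with what we have, possibly degraded
		}
	}
	return got
}
```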

Another point that's worth looking into, IMHO, is how fast and how well values are propagated when doing a publish. In my real-world, NAT-heavy tests, I have many DHT queries (IPNS) that don't resolve at all for a long time, but after a while (read: many minutes) queries resolve decently. My feeling is that by then the values have been replicated to enough reachable peers to reach the nvals threshold; I didn't dig deep enough to confirm that, though.

@dirkmc (Contributor) commented Mar 22, 2018

@MichaelMure this PR will allow clients to specify the timeout duration and the number of values to collect: ipfs/kubo#4733

@MichaelMure (Contributor)

@dirkmc this PR allows setting the level of confirmation requested for a query, but my point is different. Most of the time, if you can't reach that level you still want to use whatever information you did get. Why should a query completely fail when we got only N-1 values instead of N?

In bad network conditions, answering with whatever we got when the timeout hits would make the difference between "might be degraded" and "not working at all".

@dirkmc (Contributor) commented Mar 27, 2018

@MichaelMure actually the query does return the best record it has found so far when it hits the timeout:

> ./cmd/ipfs/ipfs name resolve --dhtt=5 QmSgxPeqLrnM1ZB1pno3tYMQfqLB4cWXuAQNfSWsk77apK
/ipfs/QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn

Note that this PR has now been merged and will go out with the next release.

@MichaelMure (Contributor)

@dirkmc Oh indeed, I missed a code path. As it was also working that way even before your PR, I wonder how a query can completely fail...

@dirkmc (Contributor) commented Mar 27, 2018

Yes, you're right, it was already working that way before. My PR just allows you to set a specific timeout and a specific number of values to try to collect.

The query fails to resolve an IPNS name if for example it is unable to find any peers with a value for the target key.

@whyrusleeping (Contributor)

Just a quick update here: the current primary cause of DHT slowness appears to be poor connectivity. We are working to address this by adding better relay support throughout the network (and other, longer-term solutions).

@dirkmc (Contributor) commented May 2, 2018

@whyrusleeping have you guys done research that indicates poor connectivity is definitely the cause? I'm wondering how much is caused by some of the things @Stebalien mentioned above, e.g. gossiping bad peer addresses and ephemeral nodes.

@vyzo (Contributor, Author) commented May 2, 2018

@dirkmc we have a crawler which you can use to collect network connectivity diagnostic data from your own network vantage point.

It seems that addrsplosion is under control, and the majority of dials in the DHT are timing out at about 1 minute.
That certainly explains the pattern we've been seeing as the network has been growing over the last few months.

@dirkmc (Contributor) commented May 2, 2018

@vyzo I commented above about why timeouts take exactly one minute, and @Stebalien suggested that this issue should be alleviated by providing a QUIC transport because "As QUIC connections don't take up file descriptors, we'll be able to spam as many concurrent dials as we want and cancel all but the first that succeeds." Has there been any progress on adding a QUIC transport?

@vyzo (Contributor, Author) commented May 2, 2018

@dirkmc QUIC is still some ways out...

@whyrusleeping (Contributor)

@dirkmc For this issue, adding more concurrent dials won't fix it. Many of the nodes we are failing to dial have fewer addresses than the concurrency limit; it's just that they don't have any addresses that work.

@ajbouh commented May 3, 2018

Sorry if this is a silly comment but perhaps we should be scoring the addresses that work and gossiping those?

One way to do this is to rely on newcomers. When a new node joins the DHT, each node that it contacts during its initial join operations can ping it again some fixed number of minutes later. This helps ensure accurate mixing and mapping without adding much additional network burden.

Perhaps preferring node addresses recently gossiped via this mechanism would help spread out the work of pruning and sorting live node addresses?
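
A tiny sketch of that re-check idea (hypothetical helper; the ping and confirm callbacks are injected and stand in for whatever would probe the peer and mark its address as gossip-worthy): when a newcomer first contacts us, schedule one follow-up ping a fixed interval later and only then treat its address as confirmed.

```go
package dhtsketch

import "time"

// scheduleRecheck pings a newly seen peer again after `delay` and, if it is
// still reachable, marks its address as confirmed for gossiping.
func scheduleRecheck(peerID string, delay time.Duration,
	ping func(string) bool, confirm func(string)) *time.Timer {

	return time.AfterFunc(delay, func() {
		if ping(peerID) {
			confirm(peerID) // address survived the recheck; safe to gossip
		}
		// Otherwise: drop or deprioritize the address (not shown here).
	})
}
```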

@whyrusleeping (Contributor)

Not a stupid question; the answer is "it depends". Sometimes the answer is "I don't know", sometimes it's "the node thinks it can be dialed there, but it really can't", and sometimes it's "other nodes have told the node that they see it at that address (for outbound dials from the node)".
I think this is an important thing to dig further into. If you feel like doing some analysis, there are a bunch of logs here: https://ipfs.io/ipfs/QmU3CtjYS12wUjkczBUnuRrj6CzxKLxpoVTb2ZNZ8wFaDH
and some plots that @vyzo made here: libp2p/go-libp2p-swarm#57 (comment)

@dirkmc (Contributor) commented May 3, 2018

@ajbouh that's the million-dollar question: why exactly are these nodes unreachable? It would be great to have some solid statistics on what percentage of nodes fall into each of the categories @whyrusleeping mentioned.

@dirkmc (Contributor) commented May 3, 2018

@whyrusleeping currently if (for example) the first nodes to be dialled have 160 TCP addresses between them that all time out, then that ties up all of the file descriptors allocated to the dialer until the timeout occurs (after one minute). QUIC should help alleviate this problem because, as I understand it, QUIC doesn't use up file descriptors, so we can dial out to as many peers as we want and the dialer doesn't get clogged with slow connections.
Additionally, @Stebalien suggested decreasing the TCP timeout to a few seconds; I'm not sure what progress has been made on that issue: libp2p/go-tcp-transport#22

@Stebalien (Member)

This issue is now old enough that it contains a lot of historical information that's not relevant. The main DHT query logic has been re-written.

Note: The fact that we serialize all queries to a specific peer through a single stream is still a performance bottleneck, but the real performance bottleneck was always the DHT query logic itself.
