Send multiaddress together with PeerID even after the routing table refresh interval #9264
Comments
Copying discussion from Slack. From @Stebalien: The largest issue here is that libp2p won't try the DHT if it already thinks it knows a peer's addresses. This is a long-standing issue that needs to be fixed by refactoring the host/network.
One solution would be to:
Coincidentally, I wrote up an issue on how to make libp2p's dial more efficient and concurrent: libp2p/go-libp2p#1785. TL;DR: libp2p should sort addresses before starting to dial, and start dial attempts with short intervals in between. With this proposal, DNS address resolution will be run in parallel. One could design logic (and a whole new API) to feed new addresses from the DHT into an existing dial attempt. What do you think of the following logic?
Possible optimization: one could start querying the DHT while the first dial attempt is still in progress. All of this logic would live inside the routed host. This seems like a fairly self-contained change; I don't expect any large refactor to be necessary for this. @yiannisbot @Stebalien, wdyt?
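To make that routed-host logic concrete, here is a minimal sketch assuming a host plus a `routing.PeerRouting` (DHT) implementation. `connectWithFallback` and `dhtFallbackDelay` are made-up names, the 750 ms delay is just one of the values discussed further down, and this is only an illustration of the race, not the API proposed in go-libp2p#1785:

```go
// Package routedconnect is an illustrative sketch, not part of go-libp2p.
package routedconnect

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/routing"
)

// dhtFallbackDelay is an assumed value; the thread discusses 500-750 ms.
const dhtFallbackDelay = 750 * time.Millisecond

// connectWithFallback dials the addresses we already know for p (e.g. the
// ones returned alongside a provider record). If that hasn't succeeded after
// dhtFallbackDelay, it also resolves fresh addresses via the DHT and dials
// those, returning as soon as either attempt connects.
func connectWithFallback(ctx context.Context, h host.Host, r routing.PeerRouting, p peer.ID) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	results := make(chan error, 2)

	// Attempt 1: known addresses from the peerstore.
	go func() {
		results <- h.Connect(ctx, peer.AddrInfo{ID: p, Addrs: h.Peerstore().Addrs(p)})
	}()

	// Attempt 2: after a short delay, ask the DHT and dial whatever it finds.
	go func() {
		select {
		case <-time.After(dhtFallbackDelay):
		case <-ctx.Done():
			results <- ctx.Err()
			return
		}
		ai, err := r.FindPeer(ctx, p)
		if err != nil {
			results <- err
			return
		}
		results <- h.Connect(ctx, ai)
	}()

	var firstErr error
	for i := 0; i < 2; i++ {
		if err := <-results; err == nil {
			return nil // one path connected; cancel() aborts the other
		} else if firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}
```

Whichever attempt connects first wins and the other is cancelled; if both fail, the first error is returned.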
Hey! I've been taking a closer look at how provider records are shared when looking for a CID in the DHT, and I found a possible limitation in the current implementation. Just a few pointers related to this issue:
This is something that I could clearly see in my measurements, summarized here. For some context, I was publishing random CIDs to the network and tracking the peers keeping the records. The first measurement shows the number of peers contacted to keep the PR that shared back both the PeerID and the multiaddrs. I got a similar result when checking the results of the DHT lookups for the same CIDs; the major difference is that here the lookup was modified not to check locally first. The interesting part comes when comparing this with the result that the Lookup function returned: there were always some peers returning the full record (PeerID plus multiaddrs).
Coming back to the actual discussion, we could avoid having to make an extra lookup if we share the provider's ID and the multiaddrs for at least the 12 or 24 hours after a CID is published. What are the chances that a publisher updates its IP after publishing content?
About @marten-seemann's comment, I am not sure which is the most convenient way of addressing it, but being able to add multiaddrs to an ongoing connection attempt is something that could work, and it would fit with a parallel lookup to find the multiaddrs if we are not sure whether the one reported with the PR is valid.
Thanks folks for continuing the discussion here! Replying to a couple of notes:
What I'm arguing here is that we end up doing that second lookup anyway (assuming that the majority of content is requested >30mins after (re-)publication, which I would argue is true) so why not do it earlier. So I'm definitely in favour of the suggested logic, plus the optimisation :)
:-o This is way too long! :-D I think if a peer has not responded within 1 s, the peer is slow anyway, so I would argue it's not worth the wait. Is there any other reason why this is set to such a high value? I would probably set the dial timeout to 1 s (maybe a bit more) and do the optimisation (start the DHT query) at 500-750 ms after the initial dial. We could set up monitoring infra to see how often we actually get to connect through the pre-existing multiaddress vs. having to fall back to the DHT (e.g., through experiments using Thunderdome) and then increase/decrease that interval.
Who would hang for a few more milliseconds? I'm not sure I'm getting this.
I think that's ^ the bottom line here. @marten-seemann do you have any idea why this is happening? Otherwise I'm in favour of:
with the caveat that we'd have to have that timeout (at half the dial timeout) to avoid hanging forever.
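On the dial-timeout side of this, here is a minimal sketch of bounding a single connection attempt with go-libp2p's `network.WithDialPeerTimeout` context helper; the 2 s value and the `dialWithShortTimeout` name are assumptions for illustration, not recommended defaults:

```go
// Illustrative only: cap one connection attempt with a per-dial timeout.
package dialtimeout

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
)

func dialWithShortTimeout(ctx context.Context, h host.Host, ai peer.AddrInfo) error {
	// Bound this particular dial instead of relying on the much longer
	// default timeout discussed above.
	dctx := network.WithDialPeerTimeout(ctx, 2*time.Second)
	return h.Connect(dctx, ai)
}
```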
Agreed that 15 s is too long, but 1 s is definitely too short. Keep in mind that a libp2p handshake over TCP takes at least 4 RTTs, and more if there's packet loss and you need to retransmit packets; at a 150 ms RTT that is already 600 ms before any retransmission. Other transports take even more RTTs (WebSocket, WebRTC).
Not sure why the 30 min matters here. IIUC, if we request after 30 minutes, we don't get any addresses from our initial DHT query, right? We need to distinguish two cases then:
That's what we want to change: the peers that hold provider records would return the multiaddress of the providing peer for longer than the current 30 min interval. In this case we get an address to dial directly, and if the peer has not changed its address (quite likely for peers running in DHT server mode) we'll be able to connect and avoid going through the DHT to do the second lookup.
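On the record-holder side, a hedged sketch of that idea using the existing peerstore API: when a provider record arrives, keep the provider's addresses around for roughly the lifetime of the record so they can keep being returned with GET_PROVIDERS responses. `providerAddrKeepAlive` and `rememberProvider` are made-up names, and the 24 h value is just one of the durations discussed in this thread; the real handler lives in go-libp2p-kad-dht:

```go
// Sketch only, not the actual go-libp2p-kad-dht handler.
package providerttl

import (
	"time"

	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/peerstore"
)

// providerAddrKeepAlive stands in for whatever value is finally chosen.
const providerAddrKeepAlive = 24 * time.Hour

func rememberProvider(ps peerstore.Peerstore, provider peer.AddrInfo) {
	// Today the addresses are added with a much shorter TTL
	// (peerstore.ProviderAddrTTL); here we keep them for the assumed
	// lifetime of the provider record instead.
	ps.AddAddrs(provider.ID, provider.Addrs, providerAddrKeepAlive)
}
```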
I think the improvements to the dialling process are complementary to what Yiannis is suggesting. This might be obvious, but the discussion seemed to gravitate towards that. If I understand correctly, we need to change two things:
From my understanding, the logic around IP/transport prioritization, like the Happy Eyeballs mechanism, would be a separate, additional improvement and would not conflict with the above two changes, right?
I had a look at the code and indeed the effort looks manageable (I'd be happy to step in and make that addition). However, I'm wondering whether the sketched logic is always the desired behaviour for consumers of the API. I'm not sure where it's used and whether it could lead to undesired side effects, as it's such a central API.
Yes.
The longer the period is, the higher the probability that these addresses have become stale in the meantime, so implementing 2. becomes more pressing.
Happy to review a PR! I don't think there will be unintended side effects; users of a routed host should expect the host to do everything it can to resolve the address. But who knows what we'll discover when it's implemented... ;)
Yup, I also think that
Great, let me check what's involved in increasing the TTL and get back here. Another thing: are the multiaddresses for provider records only stored in memory? Could it then become an issue to keep those addresses around for all provider records? Not sure if we can derive a rough estimate of the increased resource consumption for a normal kubo node (whatever that is :D).
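For what it's worth, a purely illustrative back-of-envelope; every number below is an assumption. Assuming the addresses continue to live in the peerstore keyed by peer ID, the extra memory should scale with the number of distinct providing peers seen within the TTL window, not with the number of provider records (CIDs):

```go
package main

import "fmt"

func main() {
	const (
		distinctProviders = 50_000 // assumed providing peers seen within the TTL
		addrsPerPeer      = 4      // assumed multiaddrs kept per peer
		bytesPerAddr      = 100    // assumed bytes per stored multiaddr, incl. overhead
	)
	extra := distinctProviders * addrsPerPeer * bytesPerAddr
	fmt.Printf("~%d MiB extra\n", extra/(1<<20)) // ~19 MiB under these assumptions
}
```

Under those assumptions the overhead looks modest, but the real figure depends on how many distinct providers a DHT server actually sees within the chosen TTL.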
Good points brought up! I do think that trying the existing multiaddresses first makes sense, but for how long is still to be figured out. A straightforward setting is the provider record republish interval, although that is several times higher than the 30 min value we have today. Also, setting it to a fixed value leaves a magic number in the codebase :) But I'm not sure it can be any different/dynamic; an alternative would be for provider record holders to ping the multiaddresses they have periodically and return, or not, the multiaddress. But that would be significant extra overhead for record holders, which is not worth the effort (IMO). In that sense, the timeout after which the requestor goes the DHT route is important. Would we want the DHT lookup to start in parallel (and abort if the multiaddress is still valid)? Complementary things we could do, though I'm not sure how practically useful they would be, are:
We keep the Multiaddresses of providing peers around for much longer as this means we'll return them alongside the provider records and likely make the second DHT lookup for the peer record obsolete. The assumption is that peers don't change network addresses often. For context: ipfs/kubo#9264
Created Pull Requests for the two changes:
I think this issue can be closed now? Or is something left? @yiannisbot
This is completed now and could be closed, but let's leave it open until we have some results to show the performance impact. We'll have everything in place to capture the performance difference and will report here.
@yiannisbot: I'm embarrassed to say I missed this work. A couple of things:
Not sure about this. I didn't track it, unfortunately.
This has been forgotten, unfortunately, and we didn't get the right measurement scripts in place to see the performance difference it made in practice. It also coincided with a few other incidents at the beginning of the year, so it would be difficult to isolate the performance difference due to this anyway. Maybe that's why we didn't bother back then. In any case, this is good to be closed.
Checklist
Description
Context
Currently, when someone publishes content on IPFS, their multiaddress is included in the provider record for the first 10 minutes after publication (or republication) of the record. The 10 min setting comes from the routing table refresh interval. After that, only the PeerID is stored on the record and served to requesting clients. So, if a client asks for the CID within the first 10 minutes after publication, they get the provider's multiaddress and can connect directly. If they ask after the 10-minute mark, they get the PeerID and have to walk the DHT a second time to map the PeerID to a multiaddress. This is to avoid the situation where a peer has changed its multiaddress (e.g., reconnected from a different location).
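To make the two-step flow concrete, here is a rough sketch using the go-libp2p routing interfaces; `findAndConnect` is an illustrative helper, not kubo's actual code path:

```go
package lookup

import (
	"context"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/routing"
)

func findAndConnect(ctx context.Context, h host.Host, rt routing.Routing, c cid.Cid) error {
	for prov := range rt.FindProvidersAsync(ctx, c, 1) {
		if len(prov.Addrs) == 0 {
			// The record came back without addresses (published more than
			// ~10 minutes ago), so a second DHT walk is needed to map the
			// PeerID to its multiaddresses.
			ai, err := rt.FindPeer(ctx, prov.ID)
			if err != nil {
				return err
			}
			prov = ai
		}
		return h.Connect(ctx, prov)
	}
	return routing.ErrNotFound
}
```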
Proposal
But why can we not do both? The following sounds like a reasonable setting:
Impact & Tradeoff
This approach doesn't add any overhead, as clients have to do the DHT walk anyway (after the first 10 minutes, which should be the majority of cases) according to the current setting. However, if the multiaddress is still valid (which is not unlikely, given that DHT servers are not expected to switch their connectivity addresses very frequently), the client saves the latency of an entire DHT walk, i.e., it reduces resolution time to approximately half of what it is today.