Peerstore reports empty protocolID list #2643
Comments
It sounds like there are 2 issues here:
Regarding (1), I suspect that this is due to the cleanup logic, although I can't see any bug. When a peer disconnects (i.e. when all of the (potentially multiple) connections we had with a peer are gone), we clean up the protocol list from the peer store after a short period. You can find the logic here:
The bug here is pretty obvious, we call
Regarding (2), you're probably running into this error: go-libp2p/p2p/host/basic/basic_host.go Line 696 in fc1c9f5
I'm not sure why that happens; the peer should have completed the multistream protocol negotiation. |
I'm not sure if this is true. We clean up go-libp2p/p2p/host/pstoremanager/pstoremanager.go Lines 108 to 111 in fc1c9f5
I think there's still a race condition, since eventbus notifications are async. The peer might have reconnected, identify completed, the timer fired and This sounds like an unlikely event though, and I'm not sure if it explains what you're seeing. |
@juligasa How do you know that you had the list of protocols for the peer at one point, and that they then disappear? Do you have any logs? |
In our particular case we always check the protocols right after doing |
Do you make sure that this code is run after Identify completed? This line doesn't look correct, you'll never be able to return a stream opened after the list was clear. https://github.com/MintterHypermedia/mintter/blob/7666ed6622e48abe494da77fb946e14f9cabe92f/backend/mttnet/connect.go#L98 |
In production we have a retry loop (that eventually times out) that gets the protocols every 50ms in the hope that identify completes first, but no luck. The stream creation was for testing purposes, to see if there was any hint. |
@marten-seemann I wonder if this looks similar to this problem https://discuss.ipfs.tech/t/files-are-rarely-available-through-public-gateways/17144 (which you thought might be resource manager related) |
Are you familiar with the event bus? You can subscribe to the identify event, which will fire once the list of protocols is available. |
@marten-seemann I guess that wouldn't be necessary, because libp2p currently waits for Identify to complete during go-libp2p/p2p/host/basic/basic_host.go Line 751 in 5c95834
So it looks like the problem is that we sometimes have peers that we've connected to and completed identify with, yet we still have 0 protocols for them in our peer store. |
I see, we could do that. However, I wonder why the response from some peers is instant, while for other peers it takes more than 10 seconds. We give up retrying at 10 seconds, which is already beyond what a user expects when connecting to a peer that is apparently online. |
To summarize, the problem we're facing at Mintter is:
I was hoping that libp2p v0.32 would have fixed the issue (there's a mention of fixing the error of not waiting for transient connections, which sounded like exactly the problem we face), but according to the tests @juligasa made recently, the problem is still happening. |
I'm confused how we can get Context Deadline Exceeded (from NewStream) after the peer reported zero protocols. In MintterHypermedia/mintter@playground/zero-protocols/backend/mttnet/connect.go#L86-L124, I see that the error on NewStream is not wrapped or printed anywhere. Can you print
@aschmahmann that's unlikely because in that case the err is supposed to be stream reset or protocol negotiation failed. We shouldn't see Context Deadline Exceeded with those issues. |
Sure. I re-ran the whole test. The results are a bit different from last time: now I get fewer zero-protocol errors, and most of them are at the beginning. At the next sync round (we do one every minute) they seem to have protocols. However, for the fewer cases where the protocol is still failing, we get three types of outcomes.
The difference at the network level from the tests of previous weeks is that, by now, most of the network should have upgraded to the latest code, which includes the libp2p upgrade from v0.31.0 to v0.32.1 |
Apologies for the delay here. I don't have the complete explanation of what's causing this, but my best guess currently is that the peer has gone away and is unresponsive. Can you run your tests with |
I ran the test again. This time the difference is that, after zero protocols, I can always open a stream with the peer reporting zero protocols. So those peers shouldn't have gone away. The timeout does not make any difference here. |
As I understand, now you can open the streams so the only problem is that these peers have provided zero protocols in identify or potentially identify has failed. In the identify logs can you check if there are any entries for the peer that has zero protocols? |
In particular, I'm interested in this line: https://github.com/libp2p/go-libp2p/blob/master/p2p/protocol/identify/id.go#L399C2-L399C2 |
Correct
But none of the peers in those traces correspond to any peer reporting 0 protocols. Actually, the peers reporting zero protocols have quite sensible traces, like
What about those resource limit errors? Should we be worried? Are they related to #2628? |
In our network, we identify peers talking the hypermedia protocol by checking if that protocol is present in the list of protocols when connecting to a peer. We get the list of protocols by calling <caller>.Peerstore().GetProtocols(<called_pid>) after connecting with a peer on the network (we could be using FirstSupportedProtocol, but for the sake of this issue, getting all supported protocols is best). The usual response is like:
[/ipfs/lan/kad/1.0.0 /hypermedia/0.2.0 /ipfs/bitswap /ipfs/bitswap/1.0.0 /ipfs/bitswap/1.1.0 /ipfs/bitswap/1.2.0 /ipfs/id/push/1.0.0 /ipfs/id/1.0.0 /ipfs/ping/1.0.0 /libp2p/circuit/relay/0.2.0/stop]
However, some peers report no protocols at all, [] (as if we were issuing RemoveProtocols), and the only way to make them work properly again is to restart either the called node or the caller node. Note that while they are in this protocol-less state, any subsequent calls to NewStream() will fail (with the error "failed to negotiate protocol: context deadline exceeded"), which is the real blocker in our case.
After restarting the caller, a new random set of peers will face the same issue.
In unit tests, this seems to work. Any idea what's going on?
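The protocol check described in this report can be sketched as follows, using the protocol list quoted above. This is standard-library Go under stated assumptions: `support` and `checkSupport` are hypothetical names, the list is passed in as plain strings rather than read from a peerstore, and treating an empty list as "unknown" rather than "unsupported" is a design choice of the sketch, not something the thread prescribes.

```go
package main

import (
	"fmt"
	"strings"
)

// support is the outcome of looking for a protocol family in a peer's list.
type support int

const (
	unknown     support = iota // empty list: nothing has been learned about the peer
	supported                  // at least one entry starts with the prefix
	unsupported                // the list is non-empty but has no matching entry
)

// checkSupport reports whether any protocol in protos starts with prefix.
// An empty list yields unknown, since it cannot prove the protocol is absent.
func checkSupport(protos []string, prefix string) support {
	if len(protos) == 0 {
		return unknown
	}
	for _, p := range protos {
		if strings.HasPrefix(p, prefix) {
			return supported
		}
	}
	return unsupported
}

func main() {
	usual := []string{
		"/ipfs/lan/kad/1.0.0", "/hypermedia/0.2.0", "/ipfs/bitswap",
		"/ipfs/id/1.0.0", "/ipfs/ping/1.0.0",
	}
	fmt.Println(checkSupport(usual, "/hypermedia/") == supported)
	fmt.Println(checkSupport(nil, "/hypermedia/") == unknown)
}
```

Keeping the empty list as a separate state lets the caller retry or wait for identify instead of concluding that the peer does not speak the protocol.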
Version Information