identify: stuck at reading multistream header #2379
We had similar issues a while ago, and one of the possible reasons is explained in #2361. This behavior happens when the remote peer deadlocks during connection handling in the Swarm; the linked case happens because the Swarm is deadlocked on emitting an event. This recent issue may also explain it, so trying the upcoming v0.28.1 is worth a shot.
@Wondertan thanks for the pointer 👍
Even if the remote peer hangs, I still think the local peer should terminate the stream/connection after a timeout.
You're right, we should set a stream deadline. Unfortunately, the msmux doesn't take a context, which would be the better way to solve this. Maybe we should add that API, even if it means spawning an additional goroutine. This has bitten us before.
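A rough sketch of that goroutine-based workaround (the `blockingCall` closure stands in for the context-less msmux negotiation; none of these names are existing go-libp2p APIs):

```go
package negotiate

import "context"

// withContext runs a blocking, context-unaware call (e.g. a
// multistream-select negotiation) in its own goroutine so the caller
// can give up once ctx expires. The spawned goroutine still blocks
// until the underlying stream is closed or hits a deadline, so the
// caller should also reset the stream on timeout to avoid leaking it.
func withContext(ctx context.Context, blockingCall func() error) error {
	done := make(chan error, 1)
	go func() { done <- blockingCall() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```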
I'm still curious what caused the original hang.
I recently updated Nebula to use go-libp2p v0.28.0. Further, when I crawl another peer, I started to explicitly wait for the Identify exchange to complete before extracting data from the peerstore. Over the weekend, I saw that two crawls, which usually take ~5m, didn't terminate after >12h. I extracted a goroutine dump, which is attached below. From that dump, I found out that one of the 1,000 crawl workers is waiting to receive an event from a channel:
I dug deeper into what it's waiting for and found that it should be this select statement. Here's the excerpt:
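(The actual excerpt lives in Nebula's code; schematically, the worker is parked on a select like this, assuming go-libp2p v0.28's import paths and the identify service's `IdentifyWait` channel — the function name is a placeholder:)

```go
package crawl

import (
	"context"

	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/p2p/protocol/identify"
)

// waitForIdentify is a schematic of the kind of wait the worker is
// blocked on: it returns once the identify exchange on conn completes
// or the context is cancelled.
func waitForIdentify(ctx context.Context, ids identify.IDService, conn network.Conn) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-ids.IdentifyWait(conn):
		return nil
	}
}
```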
So, it's waiting for the identify exchange to complete. If I interpret the goroutine dump correctly, it's hanging at:

- SelectProtoOrFail
- readMultistreamHeader (go-multistream)
- ReadNextToken (go-multistream)

It seems like it can't read the multistream header from the identify stream. My hypotheses: 1) the remote peer is just super slow to respond, 2) something hung internally, or 3) something else?
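(For context on hypotheses 1 and 2: the blocked call is essentially a delimited read with no deadline. A simplified sketch of that read pattern, not the actual go-multistream code:)

```go
package msdemo

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
)

// readToken sketches a multistream-select style delimited read: a
// uvarint length prefix followed by a newline-terminated token. If the
// remote never writes anything and the stream carries no read
// deadline, ReadUvarint simply blocks forever, which would match the
// goroutine dump.
func readToken(r io.Reader) (string, error) {
	br := bufio.NewReader(r)
	length, err := binary.ReadUvarint(br)
	if err != nil {
		return "", err
	}
	buf := make([]byte, length)
	if _, err := io.ReadFull(br, buf); err != nil {
		return "", err
	}
	if length == 0 || buf[length-1] != '\n' {
		return "", fmt.Errorf("token not newline-terminated")
	}
	return string(buf[:length-1]), nil
}
```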
I had expected some of the various timeouts across the stack to kick in (transport dial timeouts, security handshake timeouts, connection timeouts, etc.) and cancel the exchange.
My fix, for now, is to time out after 15s:
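(The actual change is in Nebula; in essence it bounds the wait with a 15-second context, roughly like the sketch below — the helper name is a placeholder, not the real code:)

```go
package crawl

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/p2p/protocol/identify"
)

// waitForIdentifyWithTimeout gives up on the identify exchange after
// 15s so a silent remote can't park a crawl worker indefinitely. The
// identify goroutine itself may still be stuck on its read, though,
// which is the resource leak mentioned below.
func waitForIdentifyWithTimeout(ctx context.Context, ids identify.IDService, conn network.Conn) error {
	ctx, cancel := context.WithTimeout(ctx, 15*time.Second)
	defer cancel()

	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-ids.IdentifyWait(conn):
		return nil
	}
}
```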
However, I think this leaks resources if the above happens again.
Hyper speculative: if this proves not to be an issue on my end, could this be abused in a slowloris-like fashion?
goroutine.zip