
identify: stuck at reading multistream header #2379

Closed
dennis-tra opened this issue Jun 19, 2023 · 4 comments · Fixed by #2382
dennis-tra (Contributor) commented Jun 19, 2023

I recently updated Nebula to use go-libp2p v0.28.0. In addition, when crawling another peer, I now explicitly wait for the identify exchange to complete before extracting data from the peerstore.

Over the weekend, I noticed that two crawls, which usually take ~5 minutes, still hadn't terminated after more than 12 hours. I extracted a goroutine dump, which is attached below.

From that dump, I found that one of the 1,000 crawl workers is waiting to receive an event from a channel:

[screenshot of the goroutine dump showing the blocked crawl worker]

I dug deeper into what it's waiting for and found that it should be this select statement. Here's the excerpt:

select {
case <-ctx.Done():
   // ...
case <-c.host.IDService().IdentifyWait(conn):
   // ...
}

So, it's waiting for the identify exchange to complete. If I interpret the goroutine dump correctly, it's hanging while reading the multistream header from the identify stream. My hypotheses: 1) the remote peer is just super slow to respond, or 2) something hung internally, or 3) something else?

I had expected some of the various timeouts across the stack to kick in (transport dial timeouts, security handshake timeouts, connection timeouts, etc.) and cancel the exchange.

My fix, for now, is to time out after 15s:

timeoutCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
defer cancel()

select {
case <-timeoutCtx.Done():
   // ...
case <-c.host.IDService().IdentifyWait(conn):
   // ...
}

However, I think this leaks resources if the above happens again.
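
To be concrete, the cleanup I have in mind would look roughly like the following. This is just a sketch, and it assumes that dropping the whole connection on timeout is acceptable for the crawler (conn is the network.Conn being identified):

timeoutCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
defer cancel()

select {
case <-timeoutCtx.Done():
   // Tear down the connection so the pending identify stream is
   // released instead of lingering (assumes dropping the connection
   // is acceptable here).
   _ = conn.Close()
case <-c.host.IDService().IdentifyWait(conn):
   // identify completed; read from the peerstore as usual
}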

Highly speculative: if this proves not to be an issue on my end, could this be abused in a slowloris-like fashion?


goroutine.zip

@dennis-tra dennis-tra changed the title Identify hangs indefinitely identify: stuck at reading multistream headers Jun 19, 2023
@dennis-tra dennis-tra changed the title identify: stuck at reading multistream headers identify: stuck at reading multistream header Jun 19, 2023
Wondertan (Contributor) commented Jun 19, 2023

We had similar issues a while ago, and one possible reason is explained in #2361. This behavior happens when the remote peer deadlocks while handling the connection in the Swarm. In the linked case, the Swarm deadlocks on emitting an event. That recent issue may also explain what you're seeing, so trying the upcoming v0.28.1 is worth a shot.

dennis-tra (Contributor, Author) commented

@Wondertan thanks for the pointer 👍

This behavior happens when remote peer deadlocks during the connection handling in Swarm

Even if the remote peer hangs, I still think the local peer should terminate the stream/connection after a timeout.

marten-seemann (Contributor) commented

Even if the remote peer hangs, I still think the local peer should terminate the stream/connection after a timeout.

You're right. We should set a stream deadline. Unfortunately, the msmux doesn't take a context, which would be the better way to solve this.
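
Something along these lines, as a rough sketch; the placement and the timeout value are illustrative, not the actual identify code:

// Rough sketch: bound the multistream negotiation with a deadline on the
// stream itself, since the muxer doesn't accept a context. The 30s value
// is purely illustrative.
if err := s.SetDeadline(time.Now().Add(30 * time.Second)); err != nil {
   s.Reset()
   return err
}
// Clear the deadline again when this handler returns.
defer s.SetDeadline(time.Time{})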

Maybe we should add that API, even if it means spawning an additional goroutine. This has bitten us before.
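
For reference, the kind of wrapper I mean would look roughly like this; negotiateWithContext is a hypothetical helper, not an existing msmux API:

// Hypothetical helper (not part of msmux today): run a blocking negotiation
// in its own goroutine and abandon it when the context is cancelled,
// resetting the stream so the blocked read inside negotiate errors out
// instead of leaking forever.
// Assumes imports: "context" and "github.com/libp2p/go-libp2p/core/network".
func negotiateWithContext(ctx context.Context, s network.Stream, negotiate func() error) error {
   done := make(chan error, 1)
   go func() { done <- negotiate() }()

   select {
   case err := <-done:
      return err
   case <-ctx.Done():
      s.Reset() // unblocks the pending read/write inside negotiate
      return ctx.Err()
   }
}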

MarcoPolo (Collaborator) commented

I'm still curious what caused the original hang.
