chain sync deadlocked trying to dial peers #518

Closed
whyrusleeping opened this issue Oct 31, 2019 · 10 comments

Comments

@whyrusleeping
Member

This is really weird. Not sure yet how to handle it.

goroutine 639413 [select, 324 minutes]:
github.com/libp2p/go-yamux.(*Stream).Read(0xc006a300e0, 0xc010873b0c, 0x1, 0x1, 0x0, 0x0, 0x0)
	/home/why/gopkg/pkg/mod/github.com/libp2p/go-yamux@v1.2.3/stream.go:113 +0x1f5
github.com/libp2p/go-libp2p-swarm.(*Stream).Read(0xc00c1503c0, 0xc010873b0c, 0x1, 0x1, 0xc0004d3880, 0x7fa3af1bad98, 0x0)
	/home/why/gopkg/pkg/mod/github.com/libp2p/go-libp2p-swarm@v0.2.1/swarm_stream.go:64 +0x61
github.com/multiformats/go-multistream.(*byteReader).ReadByte(0xc00e181330, 0xc00c455228, 0x4e8998, 0x10)
	/home/why/gopkg/pkg/mod/github.com/multiformats/go-multistream@v0.1.0/multistream.go:442 +0x63
encoding/binary.ReadUvarint(0x1cdcbc0, 0xc00e181330, 0xc00c1503c0, 0x0, 0x0)
	/home/why/go/src/encoding/binary/varint.go:110 +0x7b
github.com/multiformats/go-multistream.lpReadBuf(0x7fa3880c73e8, 0xc00c1503c0, 0xc00c1503c0, 0x7fa3880c73e8, 0xc00c1503c0, 0x0, 0x1878b60)
	/home/why/gopkg/pkg/mod/github.com/multiformats/go-multistream@v0.1.0/multistream.go:409 +0x68
github.com/multiformats/go-multistream.ReadNextTokenBytes(0x7fa3880c73c0, 0xc00c1503c0, 0xc00c455358, 0x51435e, 0xc00c455328, 0xc00c455358, 0x4e6105)
	/home/why/gopkg/pkg/mod/github.com/multiformats/go-multistream@v0.1.0/multistream.go:388 +0x5d
github.com/multiformats/go-multistream.ReadNextToken(...)
	/home/why/gopkg/pkg/mod/github.com/multiformats/go-multistream@v0.1.0/multistream.go:377
github.com/multiformats/go-multistream.readMultistreamHeader(0x7fa3880c73c0, 0xc00c1503c0, 0xc00c1503c0, 0x7fa3880c73c0)
	/home/why/gopkg/pkg/mod/github.com/multiformats/go-multistream@v0.1.0/client.go:90 +0x39
github.com/multiformats/go-multistream.SelectProtoOrFail(0x1a3993f, 0x13, 0x7fa38839c4f0, 0xc00c1503c0, 0x7fa38839c4f0, 0x0)
	/home/why/gopkg/pkg/mod/github.com/multiformats/go-multistream@v0.1.0/client.go:31 +0xd7
github.com/multiformats/go-multistream.SelectOneOf(0xc00e1812f0, 0x1, 0x1, 0x7fa38839c4f0, 0xc00c1503c0, 0x1d0cca0, 0xc00c1503c0, 0x0, 0x0)
	/home/why/gopkg/pkg/mod/github.com/multiformats/go-multistream@v0.1.0/client.go:57 +0x69
github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).NewStream(0xc00063e2c0, 0x1cfafe0, 0xc008bd4ea0, 0xc0062d99b0, 0x26, 0xc00e1812c0, 0x1, 0x1, 0x10, 0x10, ...)
	/home/why/gopkg/pkg/mod/github.com/libp2p/go-libp2p@v0.3.0/p2p/host/basic/basic_host.go:448 +0x1d7
github.com/libp2p/go-libp2p/p2p/host/routed.(*RoutedHost).NewStream(0xc00065a560, 0x1cfafe0, 0xc008bd4ea0, 0xc0062d99b0, 0x26, 0xc00e1812c0, 0x1, 0x1, 0x10, 0x183a800, ...)
	/home/why/gopkg/pkg/mod/github.com/libp2p/go-libp2p@v0.3.0/p2p/host/routed/routed.go:185 +0xf3
github.com/filecoin-project/lotus/chain.(*BlockSync).sendRequestToPeer(0xc0004f25a0, 0x1cfafe0, 0xc008bd4d20, 0xc0062d99b0, 0x26, 0xc008bd4d50, 0x2, 0x0, 0x0)
	/home/why/code/go-lotus/chain/blocksync.go:399 +0x134
github.com/filecoin-project/lotus/chain.(*BlockSync).GetBlocks(0xc0004f25a0, 0x1cfafe0, 0xc008bd4d20, 0xc00903a180, 0x3, 0x3, 0x64, 0x0, 0x0, 0x0, ...)
	/home/why/code/go-lotus/chain/blocksync.go:289 +0x2c9
github.com/filecoin-project/lotus/chain.(*Syncer).syncFork(0xc0004d0300, 0x1cfafe0, 0xc008bd4cc0, 0xc00f42d040, 0xc00b94d800, 0x4, 0x4, 0x0, 0x0, 0x0)
	/home/why/code/go-lotus/chain/sync.go:749 +0x9e
github.com/filecoin-project/lotus/chain.(*Syncer).collectHeaders(0xc0004d0300, 0x1cfafe0, 0xc008bd4cc0, 0xc00f42d040, 0xc00b94d800, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/why/code/go-lotus/chain/sync.go:737 +0xdb4
github.com/filecoin-project/lotus/chain.(*Syncer).collectChain(0xc0004d0300, 0x1cfafe0, 0xc008bd4c90, 0xc00f42d040, 0x0, 0x0)
	/home/why/code/go-lotus/chain/sync.go:884 +0x16c
github.com/filecoin-project/lotus/chain.(*Syncer).Sync(0xc0004d0300, 0x1cfafe0, 0xc00d4f4690, 0xc00f42d040, 0x0, 0x0)
	/home/why/code/go-lotus/chain/sync.go:360 +0x19c
github.com/filecoin-project/lotus/chain.(*Syncer).InformNewHead.func1(0xc0004d0300, 0x1cfaf60, 0xc000038068, 0xc00d5c05c0)
	/home/why/code/go-lotus/chain/sync.go:119 +0x61
created by github.com/filecoin-project/lotus/chain.(*Syncer).InformNewHead
	/home/why/code/go-lotus/chain/sync.go:118 +0x290
@whyrusleeping
Member Author

cc @magik6k @Kubuxu @raulk @vyzo

@Stebalien
Member

I have also seen this on the gateway. Is this a private network or the public libp2p network?

  • If it's a private network, something isn't resetting streams (or yamux is stalling internally).
  • If it's the public network, someone may have incorrectly implemented yamux (?).

Regardless, we should add a global stream negotiation timeout.

@Kubuxu
Contributor

Kubuxu commented Oct 31, 2019

@Stebalien it is a private network.

@Stebalien
Member

Ok, then we have a problem.

@raulk
Member

raulk commented Nov 1, 2019

I wonder if I/O on the connection is still operational. It's possible that the connection is broken, but net.TCPConn does set a SO_KEEPALIVE of 15s by default, and yamux also performs keepalives every 30s, so I doubt that's the case.

Alternatively the connection could be congested. It could be the other peer that's deadlocked somehow. multistream-select v1 is an interactive protocol, but we don't set read deadlines IIRC, so if the other party is frozen, we can freeze as well.

Is this TCP or QUIC?

What could be potentially happening here is that you have a misbehaving Lotus protocol that's not emptying I/O properly, and the underlying transport hits congestion. I'm assuming the transport is TCP, because with QUIC in theory you wouldn't observe this head-of-line blocking.


My reading of the situation, to aid the debugging process:

  1. a connection existed, or was opened by NewStream.
  2. the connection was successfully upgraded, which means that the transport was operational and performing correct I/O at some point. (q: what transport is this? have you enabled QUIC on this network?)
  3. we open a new stream, send our protocol proposal and wait for an ACK/NACK that never arrives.
  4. since we don't set a read deadline, we end up freezing.

@raulk
Member

raulk commented Nov 1, 2019

I'm working on a fix to introduce deadline/timeouts in multistream-select.

@magik6k
Contributor

magik6k commented Nov 22, 2019

@raulk any updates here?

@magik6k
Contributor

magik6k commented Dec 9, 2019

@whyrusleeping did anything change here?

@magik6k magik6k closed this as completed Dec 9, 2019
@magik6k magik6k reopened this Dec 9, 2019
@whyrusleeping
Member Author

Well, in any case, I haven't seen this happen for a long time, and I think several related changes were merged into libp2p.

@raulk
Member

raulk commented Jun 1, 2020

SGTM!
