Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Peering mismatch #526

Open
tomaka opened this issue Oct 6, 2022 · 5 comments · Fixed by paritytech/substrate#13396
Open

Peering mismatch #526

tomaka opened this issue Oct 6, 2022 · 5 comments · Fixed by paritytech/substrate#13396
Assignees
Labels
I2-bug The node fails to follow expected behavior.

Comments

@tomaka
Copy link
Contributor

tomaka commented Oct 6, 2022

When connecting a local Substrate node to a bootnode, the local Substrate node thinks that it has successfully peered with the bootnode, while in reality the bootnode has kicked out the Substrate node.

This has been investigated on Westend, but I've originally seen it happen on Polkadot.

The bootnode logs:

image

Here is the local Substrate node logs:

2022-10-06 13:49:56.411 TRACE tokio-runtime-worker sub-libp2p: PSM => Connect(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(0)): Starting to connect    
2022-10-06 13:49:56.411 TRACE tokio-runtime-worker sub-libp2p: Libp2p <= Dial 12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K    
2022-10-06 13:49:56.413 TRACE tokio-runtime-worker sub-libp2p: Addresses of PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"): ["/dns/3.westend.paritytech.net/tcp/30333", "/dns/3.westend.paritytech.net/tcp/30334/ws", "/dns/3.westend.paritytech.net/tcp/30333", "/dns/3.westend.paritytech.net/tcp/30334/ws", "/dns/3.westend.paritytech.net/tcp/30333", "/dns/3.westend.paritytech.net/tcp/30334/ws"]    
2022-10-06 13:49:56.413 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Dialing(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"))    
2022-10-06 13:49:56.414 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Dial failure for PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K")    
2022-10-06 13:49:56.414 TRACE tokio-runtime-worker sub-libp2p: PSM <= Dropped(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(0))    
2022-10-06 13:49:56.414 TRACE tokio-runtime-worker sub-libp2p: PSM => Connect(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(0)): Will start to connect at until Instant { tv_sec: 7908, tv_nsec: 768666620 }    
2022-10-06 13:49:56.576 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Connected(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(0), Dialer { address: "/dns/3.westend.paritytech.net/tcp/30333/p2p/12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K" }): Connection was requested by PSM.    
2022-10-06 13:49:56.576 TRACE tokio-runtime-worker sub-libp2p: Handler(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"), ConnectionId(3)) <= Open(SetId(0))    
2022-10-06 13:49:56.576 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Connected(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(1), Dialer { address: "/dns/3.westend.paritytech.net/tcp/30333/p2p/12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K" }, ConnectionId(3)): Not requested by PSM, disabling.    
2022-10-06 13:49:56.576 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Connected(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(2), Dialer { address: "/dns/3.westend.paritytech.net/tcp/30333/p2p/12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K" }, ConnectionId(3)): Not requested by PSM, disabling.    
2022-10-06 13:49:56.576 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Connected(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(3), Dialer { address: "/dns/3.westend.paritytech.net/tcp/30333/p2p/12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K" }, ConnectionId(3)): Not requested by PSM, disabling.    
2022-10-06 13:49:56.576 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Connected(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(4), Dialer { address: "/dns/3.westend.paritytech.net/tcp/30333/p2p/12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K" }, ConnectionId(3)): Not requested by PSM, disabling.    
2022-10-06 13:49:56.576 DEBUG tokio-runtime-worker sub-libp2p: Libp2p => Connected(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K")) with errors: []    
2022-10-06 13:49:56.621 TRACE tokio-runtime-worker sub-libp2p: Ping time with PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"): 45.006854ms    
2022-10-06 13:49:56.665 TRACE tokio-runtime-worker sub-libp2p: Handler(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, ConnectionId(3)) => OpenResultOk(SetId(0))    
2022-10-06 13:49:56.665 TRACE tokio-runtime-worker sub-libp2p: External API <= Open(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(0))    
2022-10-06 13:49:56.665 DEBUG tokio-runtime-worker sync: New peer with best hash 0x861c…bc6c (12771923).    
2022-10-06 13:49:56.665 DEBUG tokio-runtime-worker sync: Connected 12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K    
2022-10-06 13:49:56.665 TRACE tokio-runtime-worker sub-libp2p: PSM => Connect(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(1)): Enabling connections.    
2022-10-06 13:49:56.665 TRACE tokio-runtime-worker sub-libp2p: Handler(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"), ConnectionId(3)) <= Open(SetId(1))    
2022-10-06 13:49:56.665 TRACE tokio-runtime-worker sub-libp2p: PSM => Connect(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(2)): Enabling connections.    
2022-10-06 13:49:56.665 TRACE tokio-runtime-worker sub-libp2p: Handler(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"), ConnectionId(3)) <= Open(SetId(2))    
2022-10-06 13:49:56.716 TRACE tokio-runtime-worker sub-libp2p: Handler(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"), ConnectionId(3)) => OpenResultErr(SetId(1))    
2022-10-06 13:49:56.716 TRACE tokio-runtime-worker sub-libp2p: PSM <= Dropped(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"))    
2022-10-06 13:49:56.716 TRACE tokio-runtime-worker sub-libp2p: PSM => Connect(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(1)): But peer is backed-off until Instant { tv_sec: 7912, tv_nsec: 70919904 }    
2022-10-06 13:49:56.760 TRACE tokio-runtime-worker sub-libp2p: Handler(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"), ConnectionId(3)) => OpenResultErr(SetId(2))    
2022-10-06 13:49:56.760 TRACE tokio-runtime-worker sub-libp2p: PSM <= Dropped(PeerId("12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K"))    
2022-10-06 13:49:56.761 TRACE tokio-runtime-worker sub-libp2p: PSM => Connect(12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K, SetId(2)): But peer is backed-off until Instant { tv_sec: 7912, tv_nsec: 115314318 }    

(note: the logs are obviously truncated, but nothing relevant happens for several seconds afterwards)

As you can see, there's no sync: Disconnected 12D3KooWGi1tCpKXLMYED9y28QXLnwgD4neYb1Arqq4QpeV1Sv3K like expected.
The sync state machine thinks that we're still peered with the node. And this is confirmed by the fact that the local Substrate node then sends block requests to the bootnode.

The difference on the lower level is that the bootnode closes the block announces substream (SetId(0)), but the local Substrate node doesn't seem to pick up this closing.

@tomaka
Copy link
Contributor Author

tomaka commented Oct 6, 2022

Another maybe important piece of information is that I was originally debugging this between smoldot and the bootnode. The bootnode refuses smoldot (just like it refuses Substrate), and closes the substream, and smoldot picks up the substream closing.

So I would lean towards the bug being on the receiving side, where the substream is actually closed by the bootnode but Substrate doesn't detect it.

@bkchr
Copy link
Member

bkchr commented Oct 6, 2022

CC @altonen @dmitry-markin something for you to look into. :)

@bkchr bkchr added the I3-bug label Oct 6, 2022
@tomaka
Copy link
Contributor Author

tomaka commented Oct 6, 2022

To debug, I would suggest to start two Substrate nodes, on one of them replace this with if true, connect the nodes together, and check whether what happens in the NotificationsInSubstream and NotificationsOutOpen (in client/network/src/protocol/notifications/upgrade/notifications.rs) of the two nodes matches.

There are normally 3 notification substreams being opened, so if you do println-debugging should probably print the main_protocol_name that we originally pass, as it can be used to match the substreams on both sides.

@dmitry-markin
Copy link
Contributor

So far it boils down to the local node not learning that the bootnode has closed sync notifications substream (protocol index 0). This should happen here https://github.com/paritytech/substrate/blob/master/client/network/src/protocol/notifications/handler.rs#L788-L795, but poll_flush() does not return an error unless something was written to out_substream with start_send_unpin(). This may be a libp2p bug, if libp2p actually knows that the substream was closed by the remote node but only reports it after some data was written (makes sense filing libp2p bug after further investigation).

In the mean time the issue will likely be fixed by #556 when Notifications will relay the handshake to the syncing code and only reply with its own part of handshake if the connection accepted. (Currently the connection is fully negotiated first, and only then dropped by Protocol code.)

@dmitry-markin
Copy link
Contributor

Reopened, as the PR paritytech/substrate#13396 was reverted

@altonen altonen transferred this issue from paritytech/substrate Aug 24, 2023
@the-right-joyce the-right-joyce added I2-bug The node fails to follow expected behavior. and removed I3-bug labels Aug 25, 2023
@altonen altonen moved this from Blocked ⛔️ to Backlog 🗒 in Networking Jan 2, 2024
bkchr pushed a commit that referenced this issue Apr 10, 2024
* add more substrate accounts

* 's

* Update deployments/README.md

Co-authored-by: Hernando Castano <HCastano@users.noreply.github.com>

* Update deployments/README.md

Co-authored-by: Hernando Castano <HCastano@users.noreply.github.com>

* more fixes

* with messages to -> which contain messages for

Co-authored-by: Hernando Castano <HCastano@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I2-bug The node fails to follow expected behavior.
Projects
Status: Backlog 🗒
Development

Successfully merging a pull request may close this issue.

5 participants