
Handshake protocol: cannot decode stage 1/2 MessageBuffer: length less than 96 bytes #5372

Closed
prestonvanloon opened this issue Apr 10, 2020 · 8 comments
Labels
Blocked Blocked by research or external factors Bug Something isn't working Priority: High High priority item

Comments

@prestonvanloon
Member

Observed in internal testnet today. Affected node cannot peer with other nodes.

{"activePeers":0,"direction":2,"error":"failed to dial : all dials failed\n  * [/ip4/10.0.5.181/tcp/5000] failed to negotiate security protocol: runHandshake initiator stage 1 fail: recvHandshakeMessage decode
msg err=cannot decode stage 1/2 MessageBuffer: length less than 96 bytes","message":"Handshake failed","multiAddr":"/ip4/10.0.5.181/tcp/5000/p2p/16Uiu2HAmGBHq5gWCBkwR2v2cBni45TUzk6TSuBP6darEPEbgKbUb","peer":"16Uiu2HAmGBHq5gWCBkwR2v2cBni45TUzk6TSuBP6darEPEbgKbUb","prefix":"p2p","severity":"DEBUG"}

Running from commit 9cf22ac

@prestonvanloon prestonvanloon added the Bug Something isn't working label Apr 10, 2020
@yusefnapora

Hey @prestonvanloon - this seems like it could be a bug in go-libp2p-noise. Do you have any more details about when this is happening, or more log output?

Are you able to reproduce this locally?

@prestonvanloon
Member Author

@yusefnapora I have not personally encountered this locally, but it was happening very consistently after restarting Prysm in our internal testnet. We would have to restart the process 3 or 4 times before it would work.

Do you have any advice on how to troubleshoot this issue? Thanks
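As a general starting point for troubleshooting, the per-subsystem debug logging in the libp2p stack can be turned up via go-log. A minimal sketch, assuming the `github.com/ipfs/go-log` package that go-libp2p used at the time (the subsystem names match the prefixes in the debug output later in this thread):

```go
package main

import (
	logging "github.com/ipfs/go-log"
)

func main() {
	// Raise the log level for the subsystems involved in dialing and the
	// security handshake ("swarm2", "tcp-tpt"). SetLogLevel returns an
	// error for subsystems that have not registered a logger yet (i.e.
	// before the corresponding libp2p packages are imported); the error
	// is ignored in this sketch.
	for _, system := range []string{"swarm2", "tcp-tpt"} {
		_ = logging.SetLogLevel(system, "DEBUG")
	}

	// ...then construct and start the libp2p host as usual.
}
```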

@prestonvanloon
Member Author

@rauljordan updated some libp2p dependencies and now we see a slightly different message:

 [2020-04-10 14:36:29] DEBUG p2p: Handshake failed activePeers=0 direction=2 error=failed to dial : all dials failed
  * [/ip4/192.168.4.38/tcp/13000] failed to negotiate security protocol: error reading handshake message: noise: message is too short multiAddr=/ip4/192.168.4.38/tcp/13000/p2p/16Uiu2HAm1PULWpBCFrtvmGC26QT54iryMz1Gyraf16n57ir58bxN peer=16Uiu2HAm1PULWpBCFrtvmGC26QT54iryMz1Gyraf16n57ir58bxN

@prestonvanloon
Member Author

I believe this is where the error originates from: https://github.com/flynn/noise/blob/2492fe189ae688d7edbeae0fd575de2f1c5fec8e/state.go#L420
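For context on the 96-byte figure: in the Noise XX pattern, the stage-0 message carries only a 32-byte ephemeral key, while the stage-1 reply carries an ephemeral key plus an AEAD-encrypted static key and an encrypted (possibly empty) payload. A minimal sketch of that arithmetic, assuming the standard 25519/ChaChaPoly field sizes rather than anything copied from go-libp2p-noise:

```go
package main

import "fmt"

// Standard Noise field sizes for 25519/ChaChaPoly, assumed here for
// illustration -- not copied from go-libp2p-noise.
const (
	ephemeralKeyLen = 32 // unencrypted ephemeral public key
	staticKeyLen    = 32 // static public key, sent encrypted in stage 1
	aeadTagLen      = 16 // ChaCha20-Poly1305 authentication tag
)

func main() {
	// Stage 0 (XX message 1): just the initiator's ephemeral key.
	stage0 := ephemeralKeyLen // 32 bytes

	// Stage 1 (XX message 2): responder ephemeral + encrypted static key
	// (+ tag) + encrypted payload (at least its own tag, even if empty).
	stage1Min := ephemeralKeyLen + (staticKeyLen + aeadTagLen) + aeadTagLen // 96 bytes

	fmt.Println(stage0, stage1Min)
	// A peer expecting a stage-1 message but handed another peer's
	// 32-byte stage-0 message sees "length less than 96 bytes".
}
```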

One of the concerns here is that libp2p did not fall back to secio even though both security options were given to the host. This failure rendered the host unable to peer with anyone.
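For reference, a host with both security transports registered would be configured roughly as follows. This is a sketch using the go-libp2p option API and import paths of that era (Noise has since been folded into go-libp2p itself), not Prysm's actual setup:

```go
package main

import (
	"context"
	"fmt"

	libp2p "github.com/libp2p/go-libp2p"
	noise "github.com/libp2p/go-libp2p-noise"
	secio "github.com/libp2p/go-libp2p-secio"
)

func main() {
	ctx := context.Background()

	// Register both security transports; the dialer and listener are
	// expected to negotiate which one to use per connection.
	h, err := libp2p.New(ctx,
		libp2p.Security(noise.ID, noise.New),
		libp2p.Security(secio.ID, secio.New),
	)
	if err != nil {
		panic(err)
	}
	defer h.Close()

	fmt.Println("host up with peer ID", h.ID())
}
```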

@prestonvanloon prestonvanloon added the Priority: Critical Highest, immediate priority item label Apr 12, 2020
@prestonvanloon
Member Author

@rauljordan mentioned that this issue is very easy to reproduce locally and happens quite often.

This is the last item blocking our testnet restart, so it is high priority for us. Do you have any suggestions or feedback, @yusefnapora?

@nisdas
Member

nisdas commented Apr 12, 2020

Update on this issue: it seems both the local and remote peer regard themselves as the one initiating the outbound handshake. Because both peers mark themselves as the initiator, the handshake ends up malformed.

I added my own custom logging in tcp-tpt to illustrate the outbound connection.

Peer A:

17:34:00.901 DEBUG     swarm2: [16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj] opening stream to peer [16Uiu2HAmBqdwBtkYHRUdTSNnJkb21EdAMAghcvfFby9hfo7TfpwT] swarm.go:292
17:34:00.901 DEBUG     swarm2: [16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj] swarm dialing peer [16Uiu2HAmBqdwBtkYHRUdTSNnJkb21EdAMAghcvfFby9hfo7TfpwT] swarm_dial.go:193
17:34:00.901 DEBUG net/identi: error opening initial stream for /ipfs/id/1.0.0: session shutdown id.go:187
17:34:00.901 DEBUG     swarm2: 16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj swarm dialing 16Uiu2HAmBqdwBtkYHRUdTSNnJkb21EdAMAghcvfFby9hfo7TfpwT swarm_dial.go:371
17:34:00.901 DEBUG     swarm2: [limiter] adding a dial job through limiter: /ip4/192.168.1.104/tcp/13001 limiter.go:192
17:34:00.901 DEBUG     swarm2: [limiter] taking FD token: peer: 16Uiu2HAmBqdwBtkYHRUdTSNnJkb21EdAMAghcvfFby9hfo7TfpwT; addr: /ip4/192.168.1.104/tcp/13001; prev consuming: 0 limiter.go:160
17:34:00.901 DEBUG     swarm2: [limiter] executing dial; peer: 16Uiu2HAmBqdwBtkYHRUdTSNnJkb21EdAMAghcvfFby9hfo7TfpwT; addr: /ip4/192.168.1.104/tcp/13001; FD consuming: 1; waiting: 0 limiter.go:166
17:34:00.901 DEBUG     swarm2: 16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj swarm dialing 16Uiu2HAmBqdwBtkYHRUdTSNnJkb21EdAMAghcvfFby9hfo7TfpwT /ip4/192.168.1.104/tcp/13001 swarm_dial.go:455
17:34:00.901 DEBUG    tcp-tpt: Applying outbound upgrader for remote peer 16Uiu2HAmBqdwBtkYHRUdTSNnJkb21EdAMAghcvfFby9hfo7TfpwT tcp.go:107
17:34:00.902 DEBUG     swarm2: [limiter] freeing FD token; waiting: 0; consuming: 1 limiter.go:82
17:34:00.902 DEBUG     swarm2: [limiter] freeing peer token; peer 16Uiu2HAmBqdwBtkYHRUdTSNnJkb21EdAMAghcvfFby9hfo7TfpwT; addr: /ip4/192.168.1.104/tcp/13001; active for peer: 1; waiting on peer limit: 0 limiter.go:109
17:34:00.902  INFO     swarm2: got error on dial: failed to negotiate security protocol: runHandshake initiator stage 1 fail: recvHandshakeMessage decode msg err=cannot decode stage 1/2 MessageBuffer: length less than 96 bytes swarm_dial.go:420
17:34:00.902 DEBUG     swarm2: [limiter] clearing all peer dials: 16Uiu2HAmBqdwBtkYHRUdTSNnJkb21EdAMAghcvfFby9hfo7TfpwT limiter.go:200
17:34:00.902 WARNI     pubsub: opening new stream to peer:  failed to dial : all dials failed
  * [/ip4/192.168.1.104/tcp/13001] failed to negotiate security protocol: runHandshake initiator stage 1 fail: recvHandshakeMessage decode msg err=cannot decode stage 1/2 MessageBuffer: length less than 96 bytes 16Uiu2HAmBqdwBtkYHRUdTSNnJkb21EdAMAghcvfFby9hfo7TfpwT comm.go:63

Peer B

17:34:00.901 DEBUG net/identi: error opening initial stream for /ipfs/id/1.0.0: error writing length prefix: write tcp4 192.168.1.104:13001->192.168.1.104:52210: write: connection reset by peer id.go:187
17:34:00.901 DEBUG     swarm2: [16Uiu2HAmBqdwBtkYHRUdTSNnJkb21EdAMAghcvfFby9hfo7TfpwT] opening stream to peer [16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj] swarm.go:292
17:34:00.901 DEBUG     swarm2: [16Uiu2HAmBqdwBtkYHRUdTSNnJkb21EdAMAghcvfFby9hfo7TfpwT] swarm dialing peer [16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj] swarm_dial.go:193
17:34:00.901 DEBUG     swarm2: [limiter] freeing FD token; waiting: 0; consuming: 1 limiter.go:82
17:34:00.901 DEBUG     swarm2: [limiter] freeing peer token; peer 16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj; addr: /ip4/192.168.1.104/tcp/13000; active for peer: 1; waiting on peer limit: 0 limiter.go:109
17:34:00.901 DEBUG     swarm2: 16Uiu2HAmBqdwBtkYHRUdTSNnJkb21EdAMAghcvfFby9hfo7TfpwT swarm dialing 16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj swarm_dial.go:371
17:34:00.901 DEBUG     swarm2: [limiter] adding a dial job through limiter: /ip4/192.168.1.104/tcp/13000 limiter.go:192
17:34:00.901 DEBUG     swarm2: [limiter] taking FD token: peer: 16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj; addr: /ip4/192.168.1.104/tcp/13000; prev consuming: 0 limiter.go:160
17:34:00.901 DEBUG     swarm2: [limiter] executing dial; peer: 16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj; addr: /ip4/192.168.1.104/tcp/13000; FD consuming: 1; waiting: 0 limiter.go:166
17:34:00.901 DEBUG     swarm2: 16Uiu2HAmBqdwBtkYHRUdTSNnJkb21EdAMAghcvfFby9hfo7TfpwT swarm dialing 16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj /ip4/192.168.1.104/tcp/13000 swarm_dial.go:455
17:34:00.901 DEBUG    tcp-tpt: Applying outbound upgrader for remote peer 16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj tcp.go:107
17:34:00.902 DEBUG     swarm2: [limiter] freeing FD token; waiting: 0; consuming: 1 limiter.go:82
17:34:00.902 DEBUG     swarm2: [limiter] freeing peer token; peer 16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj; addr: /ip4/192.168.1.104/tcp/13000; active for peer: 1; waiting on peer limit: 0 limiter.go:109
17:34:00.902  INFO     swarm2: got error on dial: failed to negotiate security protocol: runHandshake initiator stage 1 fail: recvHandshakeMessage decode msg err=cannot decode stage 1/2 MessageBuffer: length less than 96 bytes swarm_dial.go:420
17:34:00.902 DEBUG     swarm2: [limiter] clearing all peer dials: 16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj limiter.go:200
17:34:00.902 WARNI     pubsub: opening new stream to peer:  failed to dial : all dials failed
  * [/ip4/192.168.1.104/tcp/13000] failed to negotiate security protocol: runHandshake initiator stage 1 fail: recvHandshakeMessage decode msg err=cannot decode stage 1/2 MessageBuffer: length less than 96 bytes 16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj comm.go:63
17:34:00.902 DEBUG     pubsub: PEERDOWN: Remove disconnected peer 16Uiu2HAmG5fvaMdfYroVxYaELR9No3B18XWbWjfESuYiB3MZHQxj gossipsub.go:90

It seems like both of these peers dial each other at the exact same time and each regards itself as the initiator of the connection and handshake. I am not sure how this is possible, but it would point to the issue potentially being in swarm. @yusefnapora, any ideas?

You can probably reproduce this with a libp2p host that has Noise enabled: just disconnecting one peer and reconnecting it a few times should trigger the issue, as in the sketch below.
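A rough local reproduction along those lines: start two Noise-only hosts and have each dial the other at the same time, so both run the handshake as initiator. This is a sketch assuming the same-era go-libp2p APIs; the ports and helper names are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"sync"

	libp2p "github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/peer"
	noise "github.com/libp2p/go-libp2p-noise"
)

// newNoiseHost builds a TCP host that only speaks Noise on the given port.
func newNoiseHost(ctx context.Context, port int) (host.Host, error) {
	return libp2p.New(ctx,
		libp2p.ListenAddrStrings(fmt.Sprintf("/ip4/127.0.0.1/tcp/%d", port)),
		libp2p.Security(noise.ID, noise.New),
	)
}

func main() {
	ctx := context.Background()

	a, err := newNoiseHost(ctx, 13000)
	if err != nil {
		panic(err)
	}
	b, err := newNoiseHost(ctx, 13001)
	if err != nil {
		panic(err)
	}

	infoA := peer.AddrInfo{ID: a.ID(), Addrs: a.Addrs()}
	infoB := peer.AddrInfo{ID: b.ID(), Addrs: b.Addrs()}

	// Dial in both directions at (roughly) the same time, so each side
	// opens an outbound connection and runs the handshake as initiator.
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); fmt.Println("A->B:", a.Connect(ctx, infoB)) }()
	go func() { defer wg.Done(); fmt.Println("B->A:", b.Connect(ctx, infoA)) }()
	wg.Wait()
}
```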

@yusefnapora

Hey, sorry I wasn't able to help much on Friday & thanks for digging deeper and discovering that it's a simultaneous connection issue.

I think we can fix this by detecting the size mismatch and having one peer re-initialize the handshake as a responder. I'll start on this today & keep you guys posted.
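As a purely hypothetical illustration of that idea (the names below are made up for this sketch, not the actual go-libp2p-noise fix): when an initiator reads something shorter than a stage-1 message, it could check whether it received another peer's stage-0 message and use a deterministic tie-break, such as comparing peer IDs, to decide which side restarts as responder.

```go
package noisefallback

import (
	"fmt"

	"github.com/libp2p/go-libp2p-core/peer"
)

// Hypothetical sizes -- illustrative, not the real go-libp2p-noise API.
const (
	stage0Len    = 32 // ephemeral key only
	stage1MinLen = 96 // ephemeral + encrypted static key + payload tag
)

// handleSimultaneousOpen decides what an initiator should do when the
// message it reads in place of a stage-1 reply is suspiciously short.
func handleSimultaneousOpen(msg []byte, localID, remoteID peer.ID) (restartAsResponder bool, err error) {
	if len(msg) >= stage1MinLen {
		// Normal stage-1 message: keep acting as initiator.
		return false, nil
	}
	if len(msg) != stage0Len {
		return false, fmt.Errorf("malformed handshake message: %d bytes", len(msg))
	}
	// Both sides sent a stage-0 message, i.e. a simultaneous open.
	// Break the tie deterministically so exactly one peer re-runs the
	// handshake as the responder.
	return localID > remoteID, nil
}
```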

@rauljordan rauljordan added Priority: High High priority item and removed Priority: Critical Highest, immediate priority item labels Apr 13, 2020
@rauljordan rauljordan added this to the Multiclient Readiness milestone Apr 13, 2020
@nisdas nisdas added the Blocked Blocked by research or external factors label Apr 30, 2020
@rauljordan
Contributor

Resolved in v0.12, now running in dev testnet
