Bootnode and P2P discovery doesn't work sometimes #3423

Closed

skylenet opened this issue Nov 9, 2021 · 28 comments

Labels: scope-networking All issues related to networking, gossip, and libp2p.

skylenet commented Nov 9, 2021

Describe the bug

I have a private cluster where I'm running multiple lodestar nodes. I've noticed that sometimes the bootnode connectivity works fine and lodestar gets peers. Other times it just doesn't, and nothing else happens. See the following example of 2 nodes, where 1 is working fine and the other isn't discovering any peers, even though they have the same bootnode configured.

Expected behavior

The bootnode connection should work, or on failure I should see logs explaining why it didn't work.

Steps to Reproduce

Config

I'm running many lodestar clients with this config. Note that each client has its own /data dir, which is not shared between them.

node /usr/app/node_modules/.bin/lodestar beacon \
  --rootDir=/data \
  --network.discv5.bindAddr=/ip4/0.0.0.0/udp/9000 \
  --network.localMultiaddrs=/ip4/0.0.0.0/tcp/9000 \
  --enr.ip=10.244.6.73 \
  --enr.tcp=9000 \
  --enr.udp=9000 \
  --api.rest.enabled=true \
  --api.rest.host=0.0.0.0 \
  --api.rest.port=9596 \
  --metrics.enabled=true \
  --metrics.listenAddr=0.0.0.0 \
  --metrics.serverPort=8008 \
  --genesisStateFile=/data/testnet_spec/genesis.ssz \
  --paramsFile=/data/testnet_spec/config.yaml \
  --network.discv5.bootEnrs=enr:-Ly4QPfI8vDLJRr5z14BBEVq1bPq07BhNbCcyftRUhmFf1eWR2Wn_VVnsEelOzYecvJBROTKUsxotWYO3CKPHcmwQIQBh2F0dG5ldHOIAAAAAAAAAACEZXRoMpAy95SQAQAQIDSPAAAAAAAAgmlkgnY0gmlwhAr0BsCJc2VjcDI1NmsxoQM-99Lt2OAst8ZF343Up2AGD_dOY1BzO85p2bGjm__AgohzeW5jbmV0cwCDdGNwgiMog3VkcIIjKA \
  --network.connectToDiscv5Bootnodes=true \
  --eth1.enabled \
  --eth1.providerUrls=http://geth.ethereum-private.svc.cluster.local:8545 \
  --logLevel=silly \
  --eth1.depositContractDeployBlock=0

Working node that was able to discover peers

lodestar_works.log

Broken node that failed to discover peers

lodestar_fail.log

Desktop (please complete the following information):

  • OS: Linux docker container
  • Docker image: chainsafe/lodestar:next
  • Version: v0.31.0/master/+93/08dbb215 (git)

dapplion commented Nov 9, 2021

@skylenet Thanks for reporting! Could you detail the node configuration a bit more so that we can recreate this locally?

  • Is the bootnode a Lodestar node?
  • How many nodes are there in the network in total during the test?
  • How many peers in the same network is the bootnode's DHT aware of?

A potential issue causing this is that we only request ENRs from a single bucket of the bootnode, so in a tiny network we might not find any ENRs. However, the connectToDiscv5Bootnodes flag should still land you with one peer.
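
For illustration, here is a minimal TypeScript sketch of that idea; the Discv5Querier interface and findNodeAtDistances name are assumptions for the sketch, not the actual ChainSafe/discv5 API:

// Sketch: widen the query when a single-bucket FINDNODE returns nothing.
// Discv5Querier/findNodeAtDistances are assumed names, not the real API.
interface Discv5Querier {
  // discv5 v5.1 FINDNODE accepts a list of distances (buckets) per request
  findNodeAtDistances(nodeId: string, distances: number[]): Promise<string[]>;
}

async function findPeersViaBootnode(
  discv5: Discv5Querier,
  bootnodeId: string
): Promise<string[]> {
  // Ask for a single bucket first, as described above
  const enrs = await discv5.findNodeAtDistances(bootnodeId, [256]);
  if (enrs.length > 0) return enrs;
  // In a tiny network that bucket may well be empty, so fall back to
  // requesting several buckets in one query
  return discv5.findNodeAtDistances(bootnodeId, [255, 254, 253, 252]);
}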


skylenet commented Nov 9, 2021

Hey @dapplion thanks for looking into this. Here are your answers:

  • 1) The bootnode is a lighthouse node (max peers configured to 500).
  • 2 & 3) The network is just starting from scratch. All nodes wait for the lighthouse bootnode before starting, so the network size depends on when each node happens to connect to the bootnode. It could range from 1 (just the bootnode) to N-1 (the max network size minus the node that is connecting). In my case, N is currently around 25 nodes across different clients.


twoeths commented Nov 10, 2021

@skylenet I found that lodestar did not receive any discv5 peers. Would you mind adding DEBUG=discv5:service before your command so that I can see more discv5 logs? I expect to see some logs like:

  discv5:service Sending lookup. Id: 1, Iteration: 1, Node: 2cebc6f01b68e8331be00929f6440bfa4aa71c403524c60c80129ea064be844b +1ms
  discv5:service Sending lookup. Id: 1, Iteration: 1, Node: 03ac675cea6260f833f5ea8d562f5a21e75ca9f7005173504b717887866a7116 +3ms
  discv5:service Sending lookup. Id: 1, Iteration: 1, Node: 2066746905d616b29a21d25b6afa60dee2f3e73ac69484ee25af436252ad8a2a +1ms
Nov-10 08:56:08.839 [NETWORK]          info: PeerId 16Uiu2HAmTnbMNfQhbXtkQXjGiFZUVJfYY56VXeARx6cDqiWtsPJW, Multiaddrs /ip4/127.0.0.1/tcp/9000,/ip4/192.168.15.250/tcp/9000
Nov-10 08:56:08.840 []                 warn: Low peer count peers=0
Nov-10 08:56:08.842 []                 info: Searching peers - peers: 0 - slot: 1666780 (skipped 92) - head: 1666688 0x55a5…943f - finalized: 0xd337…da2c:52082
  discv5:service Sending PING to 03ac675cea6260f833f5ea8d562f5a21e75ca9f7005173504b717887866a7116 +458ms


skylenet commented Nov 10, 2021

@tuyennhv thanks for the ENV var, I didn't know about that one! I've added it and collected logs from a working and a broken node again:

I've left it running, and every lookup Id shows the same thing on the failed node: Session is being established, request failed. It seems like the session was never fully established and got stuck in a weird state. 1-3 reboots normally help and it establishes the session, but without a reboot it's just stuck in that state.

...
2021-11-10T15:58:54.331Z discv5:service Starting a new lookup. Id: 34
2021-11-10T15:58:54.332Z discv5:service Sending lookup. Id: 34, Iteration: 1, Node: 71becdbc8ccf15794899298b6d11dd0dc6237ca3fc2ea59aa1cf181db6ecb8dd
2021-11-10T15:58:54.333Z discv5:service Sending request to node: 71becdbc8ccf15794899298b6d11dd0dc6237ca3fc2ea59aa1cf181db6ecb8dd failed: error: Session is being established, request failed
...

@dapplion (Contributor)

@wemeetagain Can you take a look at the debug logs too?


twoeths commented Nov 11, 2021

@skylenet there are some issues with the session service timing out: we failed the 1st lookup after 1s, and from the 2nd attempt on we got Session is being established, request failed, as you noted. This does not happen on Prater.

Could you help try again with DEBUG=discv5:service,discv5:sessionService so that we have more logs from the session service? Thanks.

@skylenet (Author)

@tuyennhv some more logs with DEBUG=discv5:service,discv5:sessionService


twoeths commented Nov 12, 2021

Thanks a lot for the logs, @skylenet.

Here's how the working session proceeds:

  1. Lodestar sends a FINDNODES request to the boot node
  2. There is no session yet, so lodestar creates a random packet and sends it to the boot node
  3. Lodestar receives a WHOAREYOU request and sends back the handshake/auth packet
  4. Lodestar receives the NODES response, and the session is established

In the broken session:

  • On the 1st try, lodestar did not receive a WHOAREYOU request, and the session is deleted
  • On the 2nd try, lodestar did receive the WHOAREYOU request and sent back the handshake/auth packet, but then never received any more packets/messages from the boot node. The session is not deleted, so it logs Session is being established all the time, and the handshake process is never able to recover

To proceed with this issue:

  1. It seems lodestar tried to send the 1st FINDNODES message too early, when the transport was not yet ready:

2021-11-11T11:44:03.605Z discv5:service Starting discv5 service with node id a94979c6e3449280c849ddfcde174ddadd3cd4db280193d28743aeeef291a3fb
2021-11-11T11:44:03.605Z discv5:sessionService Starting session service with node id a94979c6e3449280c849ddfcde174ddadd3cd4db280193d28743aeeef291a3fb
2021-11-11T11:44:03.611Z discv5:service Starting a new lookup. Id: 1

  2. We need to make sure we delete the pending session after some time so that we can start the next session from scratch (see the sketch after this list).
  3. On the other hand, we need to figure out why the bootnode (lighthouse node?) did not accept lodestar's auth packet. @skylenet are you able to get the bootnode logs? If you have the log file, I can filter the log myself (I'm not sure if discv5 logging is available in the boot node; as for lodestar, we have to turn on some flags to do that). If not, could you give us some instructions so that we can set up the dev network ourselves?
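
For point 2, a minimal sketch of what expiring a pending session could look like; the state names and the timeout value are assumptions for illustration, not the actual @chainsafe/discv5 internals:

// Sketch of point 2: drop stalled handshakes so the next attempt starts
// from scratch instead of failing with "Session is being established".
// State names and the timeout value are assumptions, not the real code.
type SessionState =
  | "RandomSent"   // random packet sent, waiting for WHOAREYOU
  | "AuthSent"     // handshake/auth packet sent, waiting for a response
  | "Established";

interface PendingSession {
  state: SessionState;
  createdAtMs: number;
}

const PENDING_SESSION_TIMEOUT_MS = 15_000; // assumed value
const pendingSessions = new Map<string, PendingSession>();

// Run periodically, e.g. setInterval(() => expirePendingSessions(Date.now()), 1000)
function expirePendingSessions(nowMs: number): void {
  for (const [nodeId, session] of pendingSessions) {
    if (
      session.state !== "Established" &&
      nowMs - session.createdAtMs > PENDING_SESSION_TIMEOUT_MS
    ) {
      // The broken case above gets stuck in "AuthSent" forever; deleting it
      // lets the next FINDNODES trigger a fresh handshake
      pendingSessions.delete(nodeId);
    }
  }
}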

@skylenet (Author)

@tuyennhv thank you for looking into this :) Here are some sample logs from a fresh network that I just spun up. The lodestar flags are the same ones we've been using on the previous runs. The lighthouse node was launched with --debug-level=trace:

lighthouse

# cmdline
lighthouse beacon_node \
  --datadir=/data \
  --disable-upnp \
  --disable-enr-auto-update \
  --enr-address=10.244.13.252 \
  --enr-tcp-port=9000 \
  --enr-udp-port=9000 \
  --listen-address=0.0.0.0 \
  --port=9000 \
  --discovery-port=9000 \
  --http \
  --http-address=0.0.0.0 \
  --http-port=5052 \
  --metrics \
  --metrics-address=0.0.0.0 \
  --metrics-port=5054 \
  --testnet-dir=/data/testnet_spec \
  --eth1 \
  --eth1-endpoints=http://dshackle.ethereum-private.svc.cluster.local:8545/eth \
  --target-peers=500 \
  --debug-level=trace

lodestar

# env vars
DEBUG=discv5:service,discv5:sessionService
# cmdline
node /usr/app/node_modules/.bin/lodestar beacon \
  --rootDir=/data \
  --network.discv5.bindAddr=/ip4/0.0.0.0/udp/9000 \
  --network.localMultiaddrs=/ip4/0.0.0.0/tcp/9000 \
  --enr.ip=10.244.10.81 \
  --enr.tcp=9000 \
  --enr.udp=9000 \
  --api.rest.enabled=true \
  --api.rest.host=0.0.0.0 \
  --api.rest.port=9596 \
  --metrics.enabled=true \
  --metrics.listenAddr=0.0.0.0 \
  --metrics.serverPort=8008 \
  --genesisStateFile=/data/testnet_spec/genesis.ssz \
  --paramsFile=/data/testnet_spec/config.yaml \
  --network.discv5.bootEnrs=enr:-Ly4QFFuVbzqSau0O65t_DuIuM0HrJ8ZrcjTmuyqqutRQGjdC3bbj2y4rxL2blRYcE3LWGyZVf7JCang344BBXWse6MBh2F0dG5ldHOIAAAAAAAAAACEZXRoMpAy95SQAQAQIDSPAAAAAAAAgmlkgnY0gmlwhAr0DfyJc2VjcDI1NmsxoQOjH7_20zVhSlb17Rfg6EBkfayyBguzK66WGjZGUBs_n4hzeW5jbmV0cwCDdGNwgiMog3VkcIIjKA \
  --network.connectToDiscv5Bootnodes=true \
  --eth1.enabled \
  --eth1.providerUrls=http://dshackle.ethereum-private.svc.cluster.local:8545/eth \
  --logLevel=silly \
  --eth1.depositContractDeployBlock=0

Logs

The whole code to launch this is open source and I'm actively working on it atm. It depends on kubernetes: ethpandaops/ethereum-helm-charts#27


twoeths commented Nov 15, 2021

Thanks @skylenet for the log. It seems the --debug-level=trace flag of lighthouse does not help, since discv5 sits at a lower layer; I could not find any useful log there (we could dig into it a little if the suggestion below does not work).

We've just had a fix not to start discv5 on the 1st heartbeat. Could you give that a try (by pulling our latest master or our latest docker image)? If you go with DEBUG=discv5:service,discv5:sessionService, the log should look something like:

2021-11-15T10:10:51.530Z discv5:service Starting discv5 service with node id 5ba50119c3f282470441414915f897f587c1fb48ac2bdaef82039a9dc8053038
2021-11-15T10:10:51.531Z discv5:sessionService Starting session service with node id 5ba50119c3f282470441414915f897f587c1fb48ac2bdaef82039a9dc8053038

// notice that the below happens after 30s

2021-11-15T10:11:21.539Z discv5:service Starting a new lookup. Id: 1
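
For reference, a minimal sketch of the idea behind that fix; the names below (Discovery, onHeartbeat, findRandomNode) are assumptions for illustration, not the actual Lodestar code:

// Sketch: skip discovery on the very first heartbeat so the UDP transport
// is up before the first FINDNODES is sent. Names are assumed, not real.
interface Discovery {
  findRandomNode(): Promise<void>;
}

let firstHeartbeat = true;

// Called every ~30s by the peer manager heartbeat (assumed interval)
function onHeartbeat(discovery: Discovery, peerCount: number, targetPeers: number): void {
  if (firstHeartbeat) {
    // Right after startup the transport may not be ready yet; waiting one
    // heartbeat avoids sending the 1st FINDNODES too early
    firstHeartbeat = false;
    return;
  }
  if (peerCount < targetPeers) {
    void discovery.findRandomNode();
  }
}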

@skylenet (Author)

@tuyennhv I've just tried it with the recent Version: v0.32.0/master/+2/0f5ee0e5 (git) and all my lodestar nodes were able to find peers now! 👍 🎉

@dapplion (Contributor)

@wemeetagain I think we could add a condition: when contacting a node that returns 0 results, we could just request all the buckets. If that's dangerous, we could limit it to initial bootnode queries only.

@skylenet (Author)

@tuyennhv @dapplion err.. I think the problem might still be around. I just gave it another try and I'm seeing nodes hanging again. I'll provide you with some logs in a couple of hours.

@skylenet (Author)

@tuyennhv here we go: I just started 6 lodestar nodes from scratch and here are all of their logs (2/6 didn't make it):


twoeths commented Nov 16, 2021

Thanks @skylenet, this time it does show that we only issue FINDNODES 30s after we start discv5, and it's still an issue. Again, the lodestar node sent the authentication packet but lighthouse did not respond.

We are trying to set up the devnet ourselves. In the meantime, could you try again with the flag RUST_LOG=discv5=trace added to the lighthouse node? Thank you so much!


skylenet commented Nov 16, 2021

@tuyennhv I've added RUST_LOG=discv5=trace to the lighthouse bootnode. I'm not sure if it made a difference to the lighthouse logs, but I double-checked that the ENV var was set (by checking /proc/1/environ in the lighthouse container):

FYI: I also have other lighthouse nodes, prysm, nimbus and teku running on the same network, and lodestar is currently the only one with this behavior.


twoeths commented Nov 18, 2021

Thanks @skylenet for the log; actually, I don't see any lighthouse discv5 logs in there. Could you please also try RUST_LOG="discv5=debug" and RUST_LOG=discv5=debug (without the double quotes)? I'm sorry that I can't find a quick/easy way to set up the test network to reproduce the issue myself.

I double-checked with lighthouse and they confirmed that's really the only thing needed to get discv5 logs. Here's the repo: https://github.com/sigp/discv5 . I found that there is a lot of logging there, for example https://github.com/sigp/discv5/blob/06e8af6e80bb4cacb909ff8a4637187d7bf9f7df/src/service.rs#L1010

@skylenet (Author)

@tuyennhv I found out that I was missing the -l flag on lighthouse which, according to --help, "Enables environment logging giving access to sub-protocol logs such as discv5 and libp2p". That together with RUST_LOG=discv5=trace works fine now :)

New logs:


twoeths commented Nov 21, 2021

Looks like this is something related to rate limiting on the lighthouse side.

  • lodestar log
2021-11-19T17:28:35.644Z discv5:sessionService No session established, sending a random packet to: 51e817c3af7ac0508de69b3049c1396f58a94d42c60c41b11328d9c7e6b1b948 on /ip4/10.244.10.227/udp/9000
2021-11-19T17:28:35.652Z discv5:sessionService Received a WHOAREYOU packet. source: /ip4/10.244.10.227/udp/9000
2021-11-19T17:28:35.657Z discv5:sessionService Sending authentication message: { type: 3, id: 5146642605715401423n, distances: [ 256, 255, 254 ] } to node: 51e817c3af7ac0508de69b3049c1396f58a94d42c60c41b11328d9c7e6b1b948 on /ip4/10.244.10.227/udp/9000
Nov-19 17:28:36.001 []                 info: Searching peers - peers: 0 - slot: -13 - head: 0 0xbcea…9dc3 - finalized: 0x0000…0000:0
2021-11-19T17:28:36.647Z discv5:service Lookup Id: 1 finished, 0 total found
  • lighthouse log
[2021-11-19T17:28:35Z TRACE discv5::handler] Received a message without a session. Node: 0x044e..db03, addr: 10.244.15.4:9000 
[2021-11-19T17:28:35Z TRACE discv5::handler] Requesting a WHOAREYOU packet to be sent. 
[2021-11-19T17:28:35Z DEBUG discv5::service] NodeId unknown, requesting ENR. Node: 0x044e..db03, addr: 10.244.15.4:9000 
[2021-11-19T17:28:35Z DEBUG discv5::handler] Sending WHOAREYOU to Node: 0x044e..db03, addr: 10.244.15.4:9000 
[2021-11-19T17:28:35Z DEBUG discv5::socket::filter] Dropped unsolicited packet from RPC limit: 10.244.15.4 
[2021-11-19T17:28:35Z TRACE discv5::socket::recv] Packet filtered from source: 10.244.15.4:9000 


twoeths commented Nov 21, 2021

@skylenet I just talked to the Lighthouse team; we can disable rate limiting with --disable-packet-filter, could you give it a try? If rate limiting is the cause, this does not look like a code issue in lodestar's discv5.

@dapplion (Contributor)

@tuyennhv Our current discv5 implementation sends a ton of FINDNODE requests to a node: during a single random node lookup, about 501 over 1 minute. Could this trigger Lighthouse's rate limiting? How many FINDNODE requests do they allow per time unit?


twoeths commented Nov 24, 2021

@dapplion this log Dropped unsolicited packet from RPC limit shows that the request failed due to the total limit, not the node limit or IP limit (https://github.com/sigp/discv5/blob/06e8af6e80bb4cacb909ff8a4637187d7bf9f7df/src/socket/filter/mod.rs#L134). Also, this failed right after we started the node, on the 1st/2nd request, so we hadn't abused the boot node yet.

This shows the rate limit params per request (UDP request, not FINDNODE request): https://github.com/sigp/discv5/blob/06e8af6e80bb4cacb909ff8a4637187d7bf9f7df/src/config.rs#L104

  • total: 10 per sec
  • per node: 8 per sec
  • per ip: 9 per sec

@skylenet also, one thing we can try is not starting all lodestar nodes at the same time, just one by one, in order to get past this total rate limit issue.
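
For context on why the very first packets can be dropped when many nodes come up at once, here is a rough TypeScript sketch of a fixed-window filter using the limits above; it mirrors the behavior described for sigp/discv5, not its actual code:

// Sketch of a fixed-window packet filter with the quoted limits
// (10 total/s, 9 per IP/s); illustrative only, not the sigp/discv5 code.
class WindowCounter {
  private windowStartMs = 0;
  private count = 0;
  constructor(private readonly limit: number) {}

  allow(nowMs: number): boolean {
    if (nowMs - this.windowStartMs >= 1000) {
      this.windowStartMs = nowMs; // start a new 1s window
      this.count = 0;
    }
    return ++this.count <= this.limit;
  }
}

const totalLimiter = new WindowCounter(10);
const perIpLimiters = new Map<string, WindowCounter>();

// If ~25 nodes all send their first packets within the same second, anything
// past the 10th is dropped ("Dropped unsolicited packet from RPC limit"),
// which can kill a handshake mid-flight, as seen in the logs above.
function allowPacket(srcIp: string, nowMs: number): boolean {
  if (!totalLimiter.allow(nowMs)) return false;
  let ipLimiter = perIpLimiters.get(srcIp);
  if (!ipLimiter) {
    ipLimiter = new WindowCounter(9);
    perIpLimiters.set(srcIp, ipLimiter);
  }
  return ipLimiter.allow(nowMs);
}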

@dapplion (Contributor)

@tuyennhv @wemeetagain Though our node should be able to recover from this: realize that the handshake did not complete and start over, right?

@skylenet (Author)

@tuyennhv Disabling the rate limiting on lighthouse, or starting the lodestar nodes slowly one after the other, would kind of work as a workaround for my case, but I think the client should be able to recover from this situation.

> @tuyennhv @wemeetagain Though our node should be able to recover from this: realize that the handshake did not complete and start over, right?

I agree with this, and this is also what happens with other clients: they are eventually able to connect without me having to reboot them manually. Lodestar seems to be the only one right now that gets the session stuck and never manages to connect if lighthouse throttles it.


twoeths commented Nov 30, 2021

@skylenet lodestar is reworking discv5 in ChainSafe/discv5#155; I think this is a good test for that PR @wemeetagain

@philknows added the scope-networking All issues related to networking, gossip, and libp2p. label Jan 24, 2022

dapplion commented Feb 4, 2022

The new discv5 version from ChainSafe/discv5#155 has been merged to Lodestar master. @tuyennhv could you please check if the issue is still happening?


twoeths commented Feb 10, 2022

> The new discv5 version from ChainSafe/discv5#155 has been merged to Lodestar master. @tuyennhv could you please check if the issue is still happening?

I'm not able to reproduce the issue. @skylenet could you help us verify this again? Thanks.

@skylenet (Author)

I think we can close this one. I haven't noticed this any more. Thanks guys! 💯 🎉
