Bootnode and P2P discovery doesn't work sometimes #3423

Closed

skylenet opened this issue Nov 9, 2021 · 28 comments

Labels: scope-networking All issues related to networking, gossip, and libp2p.

skylenet commented Nov 9, 2021

Describe the bug

I have a private cluster where I'm running multiple lodestar nodes. I've noticed that sometimes the bootnode connectivity works fine and lodestar gets peers. Other times it just doesn't, and nothing else happens. See the following example of 2 nodes, where 1 is working fine and the other isn't discovering any peers, even though they have the same bootnode configured.

Expected behavior

The bootnode connection should work, or on failure I should see logs explaining why it didn't work.

Steps to Reproduce

Config

I'm running many lodestar clients with this config. Note that each client has its own /data dir, which is not shared between them.

node /usr/app/node_modules/.bin/lodestar beacon \
  --rootDir=/data \
  --network.discv5.bindAddr=/ip4/0.0.0.0/udp/9000 \
  --network.localMultiaddrs=/ip4/0.0.0.0/tcp/9000 \
  --enr.ip=10.244.6.73 \
  --enr.tcp=9000 \
  --enr.udp=9000 \
  --api.rest.enabled=true \
  --api.rest.host=0.0.0.0 \
  --api.rest.port=9596 \
  --metrics.enabled=true \
  --metrics.listenAddr=0.0.0.0 \
  --metrics.serverPort=8008 \
  --genesisStateFile=/data/testnet_spec/genesis.ssz \
  --paramsFile=/data/testnet_spec/config.yaml \
  --network.discv5.bootEnrs=enr:-Ly4QPfI8vDLJRr5z14BBEVq1bPq07BhNbCcyftRUhmFf1eWR2Wn_VVnsEelOzYecvJBROTKUsxotWYO3CKPHcmwQIQBh2F0dG5ldHOIAAAAAAAAAACEZXRoMpAy95SQAQAQIDSPAAAAAAAAgmlkgnY0gmlwhAr0BsCJc2VjcDI1NmsxoQM-99Lt2OAst8ZF343Up2AGD_dOY1BzO85p2bGjm__AgohzeW5jbmV0cwCDdGNwgiMog3VkcIIjKA \
  --network.connectToDiscv5Bootnodes=true \
  --eth1.enabled \
  --eth1.providerUrls=http://geth.ethereum-private.svc.cluster.local:8545 \
  --logLevel=silly \
  --eth1.depositContractDeployBlock=0

Working node that was able to discover peers

lodestar_works.log

Broken node that failed to discover peers

lodestar_fail.log

Desktop (please complete the following information):

  • OS: Linux docker container
  • Docker image: chainsafe/lodestar:next
  • Version: v0.31.0/master/+93/08dbb215 (git)

dapplion commented Nov 9, 2021

@skylenet Thanks for reporting! Could you detail the node configuration a bit more so that we can recreate this locally?

  • Is the bootnode a Lodestar node?
  • How many nodes are there in the network in total during the test?
  • How many peers in the same network is the bootnode's DHT aware of?

A potential issue causing this is that we only request ENRs from a single bucket of the bootnode, so in a tiny network we might not find any ENRs. However, the connectToDiscv5Bootnodes flag should still land you with one peer.
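
For illustration, here is a minimal TypeScript sketch of that idea; the Discv5Querier interface and findNodeAtDistances name are assumptions for the sketch, not the actual ChainSafe/discv5 API:

// Sketch: widen the query when a single-bucket FINDNODE returns nothing.
// Discv5Querier/findNodeAtDistances are assumed names, not the real API.
interface Discv5Querier {
  // discv5 v5.1 FINDNODE accepts a list of distances (buckets) per request
  findNodeAtDistances(nodeId: string, distances: number[]): Promise<string[]>;
}

async function findPeersViaBootnode(
  discv5: Discv5Querier,
  bootnodeId: string
): Promise<string[]> {
  // Ask for a single bucket first, as described above
  const enrs = await discv5.findNodeAtDistances(bootnodeId, [256]);
  if (enrs.length > 0) return enrs;
  // In a tiny network that bucket may well be empty, so fall back to
  // requesting several buckets in one query
  return discv5.findNodeAtDistances(bootnodeId, [255, 254, 253, 252]);
}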


skylenet commented Nov 9, 2021

Hey @dapplion thanks for looking into this. Here are your answers:

  • 1) The bootnode is a lighthouse node (max peers configured to 500).
  • 2 & 3) The network is just starting from scratch. All nodes wait for the lighthouse bootnode before starting, so the network size depends on when each node happens to connect to the bootnode. It could range from 1 (just the bootnode) to N-1 (the max network size minus the node that is connecting). In my case, N is currently around 25 nodes across different clients.


twoeths commented Nov 10, 2021

@skylenet I found that lodestar did not receive any discv5 peers. Would you mind adding DEBUG=discv5:service before your command so that I can see more discv5 logs? I expect to see some logs like:

  discv5:service Sending lookup. Id: 1, Iteration: 1, Node: 2cebc6f01b68e8331be00929f6440bfa4aa71c403524c60c80129ea064be844b +1ms
  discv5:service Sending lookup. Id: 1, Iteration: 1, Node: 03ac675cea6260f833f5ea8d562f5a21e75ca9f7005173504b717887866a7116 +3ms
  discv5:service Sending lookup. Id: 1, Iteration: 1, Node: 2066746905d616b29a21d25b6afa60dee2f3e73ac69484ee25af436252ad8a2a +1ms
Nov-10 08:56:08.839 [NETWORK]          info: PeerId 16Uiu2HAmTnbMNfQhbXtkQXjGiFZUVJfYY56VXeARx6cDqiWtsPJW, Multiaddrs /ip4/127.0.0.1/tcp/9000,/ip4/192.168.15.250/tcp/9000
Nov-10 08:56:08.840 []                 warn: Low peer count peers=0
Nov-10 08:56:08.842 []                 info: Searching peers - peers: 0 - slot: 1666780 (skipped 92) - head: 1666688 0x55a5…943f - finalized: 0xd337…da2c:52082
  discv5:service Sending PING to 03ac675cea6260f833f5ea8d562f5a21e75ca9f7005173504b717887866a7116 +458ms


skylenet commented Nov 10, 2021

@tuyennhv thanks for the ENV var, I didn't know about that one! I've added it and collected logs from a working and a broken node again:

I've left it running, and every lookup Id shows the same thing on the failed node: Session is being established, request failed. It seems like the session was never fully established and got stuck in a weird state. 1-3 reboots normally help and it establishes the session, but without a reboot it's just stuck in that state.

...
2021-11-10T15:58:54.331Z discv5:service Starting a new lookup. Id: 34
2021-11-10T15:58:54.332Z discv5:service Sending lookup. Id: 34, Iteration: 1, Node: 71becdbc8ccf15794899298b6d11dd0dc6237ca3fc2ea59aa1cf181db6ecb8dd
2021-11-10T15:58:54.333Z discv5:service Sending request to node: 71becdbc8ccf15794899298b6d11dd0dc6237ca3fc2ea59aa1cf181db6ecb8dd failed: error: Session is being established, request failed
...

@dapplion (Contributor)

@wemeetagain Can you take a look at the debug logs too?


twoeths commented Nov 11, 2021

@skylenet there are some issues with the session service timing out: we failed the 1st lookup after 1s, and from the 2nd attempt on we got Session is being established, request failed, as you noted. This does not happen on Prater.

Could you help try again with DEBUG=discv5:service,discv5:sessionService so that we have more logs from the session service? Thanks.

@skylenet (Author)

@tuyennhv some more logs with DEBUG=discv5:service,discv5:sessionService


twoeths commented Nov 12, 2021

Thanks a lot for the logs, @skylenet.

Here's how the working session proceeds:

  1. Lodestar sends a FINDNODES request to the boot node
  2. There is no session yet, so lodestar creates a random packet and sends it to the boot node
  3. Lodestar receives a WHOAREYOU request and sends back the handshake/auth packet
  4. Lodestar receives the NODES response, and the session is established

In the broken session:

  • On the 1st try, lodestar did not receive a WHOAREYOU request, and the session is deleted
  • On the 2nd try, lodestar did receive the WHOAREYOU request and sent back the handshake/auth packet, but then never received any more packets/messages from the boot node. The session is not deleted, so it logs Session is being established all the time, and the handshake process is never able to recover

To proceed with this issue:

  1. It seems lodestar tried to send the 1st FINDNODES message too early, when the transport was not yet ready:

2021-11-11T11:44:03.605Z discv5:service Starting discv5 service with node id a94979c6e3449280c849ddfcde174ddadd3cd4db280193d28743aeeef291a3fb
2021-11-11T11:44:03.605Z discv5:sessionService Starting session service with node id a94979c6e3449280c849ddfcde174ddadd3cd4db280193d28743aeeef291a3fb
2021-11-11T11:44:03.611Z discv5:service Starting a new lookup. Id: 1

  2. We need to make sure we delete the pending session after some time so that we can start the next session from scratch (see the sketch after this list).
  3. On the other hand, we need to figure out why the bootnode (lighthouse node?) did not accept lodestar's auth packet. @skylenet are you able to get the bootnode logs? If you have the log file, I can filter the log myself (I'm not sure if discv5 logging is available in the boot node; as for lodestar, we have to turn on some flags to do that). If not, could you give us some instructions so that we can set up the dev network ourselves?
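
For point 2, a minimal sketch of what expiring a pending session could look like; the state names and the timeout value are assumptions for illustration, not the actual @chainsafe/discv5 internals:

// Sketch of point 2: drop stalled handshakes so the next attempt starts
// from scratch instead of failing with "Session is being established".
// State names and the timeout value are assumptions, not the real code.
type SessionState =
  | "RandomSent"   // random packet sent, waiting for WHOAREYOU
  | "AuthSent"     // handshake/auth packet sent, waiting for a response
  | "Established";

interface PendingSession {
  state: SessionState;
  createdAtMs: number;
}

const PENDING_SESSION_TIMEOUT_MS = 15_000; // assumed value
const pendingSessions = new Map<string, PendingSession>();

// Run periodically, e.g. setInterval(() => expirePendingSessions(Date.now()), 1000)
function expirePendingSessions(nowMs: number): void {
  for (const [nodeId, session] of pendingSessions) {
    if (
      session.state !== "Established" &&
      nowMs - session.createdAtMs > PENDING_SESSION_TIMEOUT_MS
    ) {
      // The broken case above gets stuck in "AuthSent" forever; deleting it
      // lets the next FINDNODES trigger a fresh handshake
      pendingSessions.delete(nodeId);
    }
  }
}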

@skylenet (Author)

@tuyennhv thank you for looking into this :) Here are some sample logs from a fresh network that I just spun up. The lodestar flags are the same ones we've been using on the previous runs. The lighthouse node was launched with --debug-level=trace:

lighthouse

# cmdline
lighthouse beacon_node \
  --datadir=/data \
  --disable-upnp \
  --disable-enr-auto-update \
  --enr-address=10.244.13.252 \
  --enr-tcp-port=9000 \
  --enr-udp-port=9000 \
  --listen-address=0.0.0.0 \
  --port=9000 \
  --discovery-port=9000 \
  --http \
  --http-address=0.0.0.0 \
  --http-port=5052 \
  --metrics \
  --metrics-address=0.0.0.0 \
  --metrics-port=5054 \
  --testnet-dir=/data/testnet_spec \
  --eth1 \
  --eth1-endpoints=http://dshackle.ethereum-private.svc.cluster.local:8545/eth \
  --target-peers=500 \
  --debug-level=trace

lodestar

# env vars
DEBUG=discv5:service,discv5:sessionService
# cmdline
node /usr/app/node_modules/.bin/lodestar beacon \
  --rootDir=/data \
  --network.discv5.bindAddr=/ip4/0.0.0.0/udp/9000 \
  --network.localMultiaddrs=/ip4/0.0.0.0/tcp/9000 \
  --enr.ip=10.244.10.81 \
  --enr.tcp=9000 \
  --enr.udp=9000 \
  --api.rest.enabled=true \
  --api.rest.host=0.0.0.0 \
  --api.rest.port=9596 \
  --metrics.enabled=true \
  --metrics.listenAddr=0.0.0.0 \
  --metrics.serverPort=8008 \
  --genesisStateFile=/data/testnet_spec/genesis.ssz \
  --paramsFile=/data/testnet_spec/config.yaml \
  --network.discv5.bootEnrs=enr:-Ly4QFFuVbzqSau0O65t_DuIuM0HrJ8ZrcjTmuyqqutRQGjdC3bbj2y4rxL2blRYcE3LWGyZVf7JCang344BBXWse6MBh2F0dG5ldHOIAAAAAAAAAACEZXRoMpAy95SQAQAQIDSPAAAAAAAAgmlkgnY0gmlwhAr0DfyJc2VjcDI1NmsxoQOjH7_20zVhSlb17Rfg6EBkfayyBguzK66WGjZGUBs_n4hzeW5jbmV0cwCDdGNwgiMog3VkcIIjKA \
  --network.connectToDiscv5Bootnodes=true \
  --eth1.enabled \
  --eth1.providerUrls=http://dshackle.ethereum-private.svc.cluster.local:8545/eth \
  --logLevel=silly \
  --eth1.depositContractDeployBlock=0

Logs

The whole code to launch this is open source and I'm actively working on it atm. It depends on kubernetes: ethpandaops/ethereum-helm-charts#27


twoeths commented Nov 15, 2021

Thanks @skylenet for the log. It seems the --debug-level=trace flag of lighthouse does not help, since discv5 sits at a lower layer; I could not find any useful log there (we could dig into it a little if the suggestion below does not work).

We've just had a fix not to start discv5 on the 1st heartbeat. Could you give that a try (by pulling our latest master or our latest docker image)? If you go with DEBUG=discv5:service,discv5:sessionService, the log should look something like:

2021-11-15T10:10:51.530Z discv5:service Starting discv5 service with node id 5ba50119c3f282470441414915f897f587c1fb48ac2bdaef82039a9dc8053038
2021-11-15T10:10:51.531Z discv5:sessionService Starting session service with node id 5ba50119c3f282470441414915f897f587c1fb48ac2bdaef82039a9dc8053038

// notice that the below happens after 30s

2021-11-15T10:11:21.539Z discv5:service Starting a new lookup. Id: 1
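
For reference, a minimal sketch of the idea behind that fix; the names below (Discovery, onHeartbeat, findRandomNode) are assumptions for illustration, not the actual Lodestar code:

// Sketch: skip discovery on the very first heartbeat so the UDP transport
// is up before the first FINDNODES is sent. Names are assumed, not real.
interface Discovery {
  findRandomNode(): Promise<void>;
}

let firstHeartbeat = true;

// Called every ~30s by the peer manager heartbeat (assumed interval)
function onHeartbeat(discovery: Discovery, peerCount: number, targetPeers: number): void {
  if (firstHeartbeat) {
    // Right after startup the transport may not be ready yet; waiting one
    // heartbeat avoids sending the 1st FINDNODES too early
    firstHeartbeat = false;
    return;
  }
  if (peerCount < targetPeers) {
    void discovery.findRandomNode();
  }
}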

@skylenet (Author)

@tuyennhv I've just tried it with the recent Version: v0.32.0/master/+2/0f5ee0e5 (git) and all my lodestar nodes were able to find peers now! 👍 🎉

@dapplion (Contributor)

@wemeetagain I think we could add a condition: when contacting a node that returns 0 results, we could just request all the buckets. If that's dangerous, we could limit it to initial bootnode queries only.

@skylenet (Author)

@tuyennhv @dapplion err.. I think the problem might still be around. I just gave it another try and I'm seeing nodes hanging again. I'll provide you with some logs in a couple of hours.

@skylenet (Author)

@tuyennhv here we go: I just started 6 lodestar nodes from scratch and here are all of their logs (2/6 didn't make it):


twoeths commented Nov 16, 2021

Thanks @skylenet, this time it does show that we only issue FINDNODES 30s after we start discv5, and it's still an issue. Again, the lodestar node sent the authentication packet but lighthouse did not respond.

We are trying to set up the devnet ourselves. In the meantime, could you try again with the flag RUST_LOG=discv5=trace added to the lighthouse node? Thank you so much!


skylenet commented Nov 16, 2021

@tuyennhv I've added RUST_LOG=discv5=trace to the lighthouse bootnode. I'm not sure if it made a difference to the lighthouse logs, but I double-checked that the ENV var was set (by checking /proc/1/environ in the lighthouse container):

FYI: I also have other lighthouse nodes, prysm, nimbus and teku running on the same network, and lodestar is currently the only one with this behavior.


twoeths commented Nov 18, 2021

Thanks @skylenet for the log; actually, I don't see any lighthouse discv5 logs in there. Could you please also try RUST_LOG="discv5=debug" and RUST_LOG=discv5=debug (without the double quotes)? I'm sorry that I can't find a quick/easy way to set up the test network to reproduce the issue myself.

I double-checked with lighthouse and they confirmed that's really the only thing needed to get discv5 logs. Here's the repo: https://github.com/sigp/discv5 . I found that there is a lot of logging there, for example https://github.com/sigp/discv5/blob/06e8af6e80bb4cacb909ff8a4637187d7bf9f7df/src/service.rs#L1010

@skylenet (Author)

@tuyennhv I found out that I was missing the -l flag on lighthouse which, according to --help, "Enables environment logging giving access to sub-protocol logs such as discv5 and libp2p". That together with RUST_LOG=discv5=trace works fine now :)

New logs:


twoeths commented Nov 21, 2021

Looks like this is something related to rate limiting on the lighthouse side.

  • lodestar log
2021-11-19T17:28:35.644Z discv5:sessionService No session established, sending a random packet to: 51e817c3af7ac0508de69b3049c1396f58a94d42c60c41b11328d9c7e6b1b948 on /ip4/10.244.10.227/udp/9000
2021-11-19T17:28:35.652Z discv5:sessionService Received a WHOAREYOU packet. source: /ip4/10.244.10.227/udp/9000
2021-11-19T17:28:35.657Z discv5:sessionService Sending authentication message: { type: 3, id: 5146642605715401423n, distances: [ 256, 255, 254 ] } to node: 51e817c3af7ac0508de69b3049c1396f58a94d42c60c41b11328d9c7e6b1b948 on /ip4/10.244.10.227/udp/9000
Nov-19 17:28:36.001 []                 info: Searching peers - peers: 0 - slot: -13 - head: 0 0xbcea…9dc3 - finalized: 0x0000…0000:0
2021-11-19T17:28:36.647Z discv5:service Lookup Id: 1 finished, 0 total found
  • lighthouse log
[2021-11-19T17:28:35Z TRACE discv5::handler] Received a message without a session. Node: 0x044e..db03, addr: 10.244.15.4:9000 
[2021-11-19T17:28:35Z TRACE discv5::handler] Requesting a WHOAREYOU packet to be sent. 
[2021-11-19T17:28:35Z DEBUG discv5::service] NodeId unknown, requesting ENR. Node: 0x044e..db03, addr: 10.244.15.4:9000 
[2021-11-19T17:28:35Z DEBUG discv5::handler] Sending WHOAREYOU to Node: 0x044e..db03, addr: 10.244.15.4:9000 
[2021-11-19T17:28:35Z DEBUG discv5::socket::filter] Dropped unsolicited packet from RPC limit: 10.244.15.4 
[2021-11-19T17:28:35Z TRACE discv5::socket::recv] Packet filtered from source: 10.244.15.4:9000 


twoeths commented Nov 21, 2021

@skylenet I just talked to the Lighthouse team; we can disable rate limiting with --disable-packet-filter, could you give it a try? If rate limiting is the cause, this does not look like a code issue in lodestar's discv5.

@dapplion (Contributor)

@tuyennhv Our current discv5 implementation sends a ton of FINDNODE requests to a node: during a single random node lookup, about 501 over 1 minute. Could this trigger Lighthouse's rate limiting? How many FINDNODE requests do they allow per time unit?


twoeths commented Nov 24, 2021

@dapplion this log Dropped unsolicited packet from RPC limit shows that the request failed due to the total limit, not the node limit or IP limit (https://github.com/sigp/discv5/blob/06e8af6e80bb4cacb909ff8a4637187d7bf9f7df/src/socket/filter/mod.rs#L134). Also, this failed right after we started the node, on the 1st/2nd request, so we hadn't abused the boot node yet.

This shows the rate limit params per request (UDP request, not FINDNODE request): https://github.com/sigp/discv5/blob/06e8af6e80bb4cacb909ff8a4637187d7bf9f7df/src/config.rs#L104

  • total: 10 per sec
  • per node: 8 per sec
  • per ip: 9 per sec

@skylenet also, one thing we can try is not starting all lodestar nodes at the same time, just one by one, in order to get past this total rate limit issue.
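
For context on why the very first packets can be dropped when many nodes come up at once, here is a rough TypeScript sketch of a fixed-window filter using the limits above; it mirrors the behavior described for sigp/discv5, not its actual code:

// Sketch of a fixed-window packet filter with the quoted limits
// (10 total/s, 9 per IP/s); illustrative only, not the sigp/discv5 code.
class WindowCounter {
  private windowStartMs = 0;
  private count = 0;
  constructor(private readonly limit: number) {}

  allow(nowMs: number): boolean {
    if (nowMs - this.windowStartMs >= 1000) {
      this.windowStartMs = nowMs; // start a new 1s window
      this.count = 0;
    }
    return ++this.count <= this.limit;
  }
}

const totalLimiter = new WindowCounter(10);
const perIpLimiters = new Map<string, WindowCounter>();

// If ~25 nodes all send their first packets within the same second, anything
// past the 10th is dropped ("Dropped unsolicited packet from RPC limit"),
// which can kill a handshake mid-flight, as seen in the logs above.
function allowPacket(srcIp: string, nowMs: number): boolean {
  if (!totalLimiter.allow(nowMs)) return false;
  let ipLimiter = perIpLimiters.get(srcIp);
  if (!ipLimiter) {
    ipLimiter = new WindowCounter(9);
    perIpLimiters.set(srcIp, ipLimiter);
  }
  return ipLimiter.allow(nowMs);
}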

@dapplion (Contributor)

@tuyennhv @wemeetagain Though our node should be able to recover from this: realize that the handshake did not complete and start over, right?

@skylenet (Author)

@tuyennhv Disabling the rate limiting on lighthouse, or starting the lodestar nodes slowly one after the other, would kind of work as a workaround for my case, but I think the client should be able to recover from this situation.

> @tuyennhv @wemeetagain Though our node should be able to recover from this: realize that the handshake did not complete and start over, right?

I agree with this, and this is also what happens with other clients: they are eventually able to connect without me having to reboot them manually. Lodestar seems to be the only one right now that gets the session stuck and never manages to connect if lighthouse throttles it.


twoeths commented Nov 30, 2021

@skylenet lodestar is reworking discv5 in ChainSafe/discv5#155; I think this is a good test for that PR @wemeetagain

@philknows added the scope-networking All issues related to networking, gossip, and libp2p. label Jan 24, 2022

dapplion commented Feb 4, 2022

The new discv5 version from ChainSafe/discv5#155 has been merged to Lodestar master. @tuyennhv could you please check if the issue is still happening?


twoeths commented Feb 10, 2022

> The new discv5 version from ChainSafe/discv5#155 has been merged to Lodestar master. @tuyennhv could you please check if the issue is still happening?

I'm not able to reproduce the issue. @skylenet could you help us verify this again? Thanks.

@skylenet (Author)

I think we can close this one. I haven't noticed this any more. Thanks guys! 💯 🎉
