Bootnode and P2P discovery doesn't work sometimes #3423
Comments
@skylenet Thanks for reporting! Could you give a bit more detail on the node configuration so we can recreate this locally?
A potential issue causing this is that we only request ENRs from a single bucket from the bootnode, so in a tiny network we might not find any ENRs. However the
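For context on what "a single bucket" means here: in discv5's Kademlia-style routing table a bucket corresponds to one log2 XOR distance between node IDs, so a FINDNODE query for a single distance can only return peers that sit at exactly that distance from the queried node. A minimal sketch of the distance computation (illustrative only, not Lodestar's actual code):

```ts
// Illustrative only: log2 XOR distance between two equal-length node IDs
// (32 bytes for discv5). A FINDNODE request that asks for one distance can
// only be answered with peers from that single bucket of the remote's table,
// which may well be empty on a tiny network.
function log2Distance(a: Uint8Array, b: Uint8Array): number {
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ^ b[i];
    if (x !== 0) {
      // full bytes remaining after this one, plus the bit length of x
      return (a.length - i - 1) * 8 + (32 - Math.clz32(x));
    }
  }
  return 0; // identical node IDs
}
```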
Hey @dapplion thanks for looking into this. Here are your answers:
@skylenet I found that lodestar did not receive any discv5 peers. Would you mind adding
@tuyennhv thanks for the ENV var, didn't know about that one! I've added it and collected logs from a working and a broken node again: I've left it running and every lookup ID shows the same thing on the failed node:
@wemeetagain Can you take a look at the debug logs too?
@skylenet there are some issues with the session service timing out, and we failed the 1st lookup after 1s; from the 2nd time we got, could you help try again with
@tuyennhv some more logs with
Thanks a lot @skylenet for the logs. Here's how the working session happens:
In the broken session:
To proceed with this issue:
@tuyennhv thank you for looking into this :) Here are some sample logs from a fresh network that I just spun up. The lodestar flags are the same ones we've been using on the previous run. The lighthouse node was launched with:

lighthouse
# cmdline
lighthouse beacon_node
--datadir=/data
--disable-upnp
--disable-enr-auto-update
--enr-address=10.244.13.252
--enr-tcp-port=9000
--enr-udp-port=9000
--listen-address=0.0.0.0
--port=9000
--discovery-port=9000
--http
--http-address=0.0.0.0
--http-port=5052
--metrics
--metrics-address=0.0.0.0
--metrics-port=5054
--testnet-dir=/data/testnet_spec
--eth1
--eth1-endpoints=http://dshackle.ethereum-private.svc.cluster.local:8545/eth
--target-peers=500
--debug-level=trace

lodestar
# env vars
DEBUG=discv5:service,discv5:sessionService
# cmdline
node /usr/app/node_modules/.bin/lodestar beacon
--rootDir=/data
--network.discv5.bindAddr=/ip4/0.0.0.0/udp/9000
--network.localMultiaddrs=/ip4/0.0.0.0/tcp/9000
--enr.ip=10.244.10.81
--enr.tcp=9000
--enr.udp=9000
--api.rest.enabled=true
--api.rest.host=0.0.0.0
--api.rest.port=9596
--metrics.enabled=true
--metrics.listenAddr=0.0.0.0
--metrics.serverPort=8008
--genesisStateFile=/data/testnet_spec/genesis.ssz
--paramsFile=/data/testnet_spec/config.yaml
--network.discv5.bootEnrs=enr:-Ly4QFFuVbzqSau0O65t_DuIuM0HrJ8ZrcjTmuyqqutRQGjdC3bbj2y4rxL2blRYcE3LWGyZVf7JCang344BBXWse6MBh2F0dG5ldHOIAAAAAAAAAACEZXRoMpAy95SQAQAQIDSPAAAAAAAAgmlkgnY0gmlwhAr0DfyJc2VjcDI1NmsxoQOjH7_20zVhSlb17Rfg6EBkfayyBguzK66WGjZGUBs_n4hzeW5jbmV0cwCDdGNwgiMog3VkcIIjKA
--network.connectToDiscv5Bootnodes=true
--eth1.enabled
--eth1.providerUrls=http://dshackle.ethereum-private.svc.cluster.local:8545/eth
--logLevel=silly
--eth1.depositContractDeployBlock=0

Logs

The whole code to launch this is open source and I'm actively working on it at the moment. It depends on kubernetes: ethpandaops/ethereum-helm-charts#27
Thanks @skylenet for the log. It seems the
We've just had a fix not to start discv5 on the 1st heartbeat, could you give that a try (by pulling our latest master or our latest docker image)? If you go with
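For context, a rough sketch of what "not starting discv5 on the 1st heartbeat" could look like; the names here (onHeartbeat, runDiscoveryQuery) are hypothetical, not Lodestar's actual code:

```ts
// Hypothetical illustration: skip the discovery query on the very first
// heartbeat tick after startup, so the routing table has a chance to be
// seeded from the bootnodes before the first lookup fires.
let heartbeatCount = 0;

function onHeartbeat(runDiscoveryQuery: () => void): void {
  heartbeatCount += 1;
  if (heartbeatCount === 1) {
    return; // first heartbeat right after start: don't query yet
  }
  runDiscoveryQuery();
}
```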
@tuyennhv I've just tried it with the recent
@wemeetagain I think we could add a condition that, when contacting a node that returns 0 results, we just request all the buckets. If that's dangerous we could limit it to initial bootnode queries only.
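A rough sketch of that condition, assuming a hypothetical sendFindNode(peer, distances) helper rather than the real @chainsafe/discv5 API:

```ts
// Hypothetical sketch: if a query returns zero ENRs (likely in a tiny network
// where the requested bucket is empty), retry across all distances. Limiting
// the fallback to bootnode queries keeps the extra traffic bounded.
const ALL_DISTANCES: number[] = Array.from({length: 256}, (_, i) => i + 1);

async function findNodesWithFallback<ENR>(
  peer: ENR,
  distances: number[],
  isBootnode: boolean,
  sendFindNode: (peer: ENR, distances: number[]) => Promise<ENR[]>
): Promise<ENR[]> {
  const found = await sendFindNode(peer, distances);
  if (found.length > 0 || !isBootnode) {
    return found;
  }
  return sendFindNode(peer, ALL_DISTANCES);
}
```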
@tuyennhv @dapplion err.. I think the problem might still be around. Just gave it another try and I'm seeing nodes hanging again. I'll provide you with some logs in a couple of hours.
@tuyennhv here we go: I just started 6 lodestar nodes from scratch and here are all of their logs (2/6 didn't make it):
thanks @skylenet, this time it does show that we only issue FINDNODEs 30s after we start discv5, and it's still an issue. Again, the lodestar node sent an authentication packet but lighthouse did not respond. We are trying to set up the devnet ourselves. In the meantime, could you try again with this flag
@tuyennhv I've added
FYI: I also have other lighthouse nodes, prysm, nimbus and teku running on the same network and lodestar is currently the only one with this behavior.
thanks @skylenet for the log, but actually I don't see any lighthouse discv5 logs. Could you please also try
I double checked with lighthouse and they confirmed that's really the only thing needed to get discv5 logs. Here's the repo: https://github.com/sigp/discv5 , and I found that there's a lot of logging there, for example https://github.com/sigp/discv5/blob/06e8af6e80bb4cacb909ff8a4637187d7bf9f7df/src/service.rs#L1010
@tuyennhv I found out that I was missing the
New logs:
Looks like this is something related to rate limiting on the lighthouse side.
@skylenet I just talked to the Lighthouse team, we can disable rate limiting with
@tuyennhv Our current discv5 implementation sends a ton of FINDNODE requests to a node: during a single random node lookup, about 501 over 1 minute. Could this trigger Lighthouse's rate limiting? How many FINDNODE requests do they allow per time unit?
@dapplion this log shows the rate limit params per request
@skylenet also, one thing we can try is to not start all lodestar nodes at the same time, just one by one, in order to get around this total rate limit issue
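To illustrate the failure mode only (these are made-up parameters, not Lighthouse's actual limiter), here's a toy token-bucket rate limiter: once the bucket is drained, further requests are dropped, which from Lodestar's side looks like a peer that simply stops responding.

```ts
// Toy token-bucket limiter with made-up numbers, purely to illustrate why a
// burst of several hundred FINDNODE requests per minute can get dropped.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  allow(): boolean {
    const now = Date.now();
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.lastRefill) / 1000) * this.refillPerSec
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // request is processed
    }
    return false; // request is dropped
  }
}

// e.g. with a capacity of 10 and 1 token/s, most of ~500 FINDNODEs sent
// within one minute would be rejected by the receiving node.
const limiter = new TokenBucket(10, 1);
```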
@tuyennhv @wemeetagain Our node should still be able to recover from this though: realize that the handshake was not completed and start over, right?
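A sketch of the kind of recovery being suggested, with hypothetical names (this is not the actual discv5 session service): if the handshake does not complete within a timeout, drop the half-open session and retry, rather than leaving the request pending forever.

```ts
// Hypothetical sketch of handshake-timeout recovery. If no response arrives
// within timeoutMs, clear the pending session state and retry a limited
// number of times instead of leaving the lookup stuck.
async function requestWithHandshakeRetry<T>(
  sendRequest: () => Promise<T>,
  clearPendingSession: () => void,
  timeoutMs = 1000,
  maxRetries = 3
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await withTimeout(sendRequest(), timeoutMs);
    } catch (_e) {
      // Handshake never completed: forget the half-open session and start over.
      clearPendingSession();
    }
  }
  throw new Error("handshake failed after retries");
}

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); }
    );
  });
}
```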
@tuyennhv Disabling the rate limiting on lighthouse or starting the lodestar nodes slowly one after the other would kind of work as a workaround for my case, but I think the client should be able to recover from this situation.
I agree with this. And this is also what happens on other clients: they are eventually able to connect without me having to reboot them manually. Lodestar seems to be the only one right now that gets the session stuck and never manages to connect if lighthouse throttles it.
@skylenet lodestar is reworking discv5 in ChainSafe/discv5#155, I think this is a good test for that PR @wemeetagain
The new discv5 version from ChainSafe/discv5#155 has been merged into Lodestar master. Could you please check if the issue is still happening, @tuyennhv?
I'm not able to reproduce the issue, @skylenet could you help us verify this again? Thanks.
I think we can close this one. I haven't noticed this anymore. Thanks guys! 💯 🎉
Describe the bug
I have a private cluster and I'm running multiple lodestar nodes. I've noticed that sometimes the bootnode connectivity seems to work fine and lodestar gets peers. Other times it just doesn't and nothing else happens. See the following example of 2 nodes, where 1 is working fine and the other isn't discovering any peers, even though they have the same bootnode configured.
Expected behavior
The bootnode connection should work, or on failure I should see some logs explaining why it didn't work.
Steps to Reproduce
Config
I'm running many lodestar clients with this config. Note that each client has its own /data dir and it's not shared between them.

Working node that was able to discover peers
lodestar_works.log
Broken node that failed to discover peers
lodestar_fail.log