Auto reconnect fails with some peers #5887
Some more info after increasing debug level for SRVR and PEER to
After this about 3 seconds pass and I get:
The node in question is Boltz, they seem to use
Another misbehaving channel log:
Then after 2 seconds:
Thanks for the extra DBG logs @rkfg! Will look into this soon 👍
One more thing to add. I had a couple of cases where a peer (AcrossFIRE IIRC) got disconnected with the same EOF reason (so a random connection reset, I assume) and fell into the same unsuccessful reconnect loop, never being able to stay connected. I waited for a bit and watched the logs; they looked like the above, with no meaningful errors, just connect/disconnect. Then I forced connection using
I also noticed this sometimes happens while forwarding a payment, so the HTLC gets stuck at my node until I reconnect one of the peers. IIRC, it was the incoming channel peer. Another time an outgoing channel peer was disconnected during a rebalance; after I reconnected it manually, the attempt resolved quickly and proceeded as usual. A connection reset during a forward/payment isn't very rare, but the connection usually reestablishes quickly and the process resumes. Sometimes, however, it gets stuck like this.
As I understand it, there are already in-band pings that should prevent TCP connections from timing out. Otherwise that would explain why the connection reset only happens when there's actual data to send/receive to/from the peer in question.
Maybe it's important, but I use OpenVPN to my VPS and forward ports to my home node using
ok cool, I have managed to reproduce this. Seems like it has something to do with the combo of peers having both clearnet and onion addresses advertised plus using hybrid mode. Diving into this today
ok i think I found the issue (still need to figure out the best fix):
Before this PR, the above would also happen during the initial
The reason that manual connection works is that we also just attempt the one address during that.
So, I see 2 possible solutions:
I think option 2 seems cleaner. The tricky thing is that we let connMgr handle the reconnections, so we would need to add something that lets connMgr try for a bit and then, if there is still no successful conn and there is a different address to try, makes connMgr cancel the current connReq and add a new one, rather than making connMgr try them all at the same time.
For the record, can confirm that this issue is fixed for me. After running the version from master all alive peers connect successfully and stay connected. Thank you for your great work 🙏
yay! Glad to hear @rkfg! 🚀 thanks for the feedback!
I have a similar issue: it looks like sometimes my node can't reconnect to some peers. After a while, the channel gets force closed.
2022-05-29 23:04:10.077 [INF] PEER: unable to read message from 038fe1bd966b5cb0545963490c631eaa1924e2c4c0ea4e7dcb5d4582a1e7f2f1a5@167.235.3.234:9735: EOF
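Since the workaround reported in this thread is a manual `lncli connect <pubkey>@<host:port>`, it can help to pull the peer out of log lines like the one above. The sketch below is an illustration only: `parsePeerReadError` is a hypothetical helper, and the regular expression assumes the exact line shape quoted in this comment, which may differ between lnd versions.

```go
package main

import (
	"fmt"
	"regexp"
)

// peerReadErr matches lines of the shape quoted in the comment above:
//   ... PEER: unable to read message from <pubkey>@<host:port>: <reason>
// The pattern is an assumption based on that single line.
var peerReadErr = regexp.MustCompile(
	`PEER: unable to read message from ([0-9a-f]+)@(\S+): (.+)$`)

// parsePeerReadError extracts the pubkey, address and error reason from a
// matching log line. ok is false when the line has a different shape.
func parsePeerReadError(line string) (pubkey, addr, reason string, ok bool) {
	m := peerReadErr.FindStringSubmatch(line)
	if m == nil {
		return "", "", "", false
	}
	return m[1], m[2], m[3], true
}

func main() {
	line := "2022-05-29 23:04:10.077 [INF] PEER: unable to read message from " +
		"038fe1bd966b5cb0545963490c631eaa1924e2c4c0ea4e7dcb5d4582a1e7f2f1a5" +
		"@167.235.3.234:9735: EOF"

	if pubkey, addr, reason, ok := parsePeerReadError(line); ok {
		fmt.Println("reason:", reason)
		// Print the manual reconnect command for this peer.
		fmt.Printf("lncli connect %s@%s\n", pubkey, addr)
	}
}
```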
Background
After a restart, lnd doesn't connect to some nodes. I use hybrid mode, so Tor shouldn't be used for these. The channels are stuck in a reconnect loop with little to no information. However, if I manually issue the connect command (using lncli or RTL), they connect successfully and stay online until the next lnd restart. Maybe there are some significant differences between the auto reconnect loop and the manual one? Could be related to #5632
Your environment
lnd: healthcheck/v1.2.0-1-gcac8da819
uname -a (on *Nix): Raspbian 5.10.63-v7l+
btcd, bitcoind, or other backend: bitcoind v22.0
Steps to reproduce
Here's the log (there might be some unrelated messages for different channels here; the node lnd has trouble connecting to is 026165850492521f4ac8abd9bd8088123446d126f648ca35e60f88177dc149ceb2, the channel point is 4c53746b9e98fe6979583c391d5a027ec27071bb9baebba4dcb47da67544be7e:1):
After issuing lncli connect 026165850492521f4ac8abd9bd8088123446d126f648ca35e60f88177dc149ceb2@104.196.200.39:9735:
After that the peer is connected and stays connected. There are 4-5 peers like that that I'm connected to. The majority of them (about 40 peers) work fine. However, if it's possible to connect to them reliably at all, just not automatically, it means the issue is still on our side.
Expected behaviour
lnd should connect by itself with no manual intervention.
Actual behaviour
I have to connect manually to those peers.