-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
multi: refresh peer IP during reconnect #5538
Conversation
server.go
Outdated
s.mu.Lock() | ||
defer s.mu.Unlock() | ||
|
||
// Check if there are currently any connection requests for this node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something I don't fully understand is that isn't this inherently a race? As in if the address changes we only ever connect to that new address if we are right in the middle of connecting to the node anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok yeah is your question basically 'what happens if we somehow get the node announcement before we run peerTerminationWatcher
?
For some reason im having hard time recreating this situation in the itest (cant seem to force the NodeAnnouncement to be received before peerTerminationWatcher has run no matter where i add delays 🤔 )
To address this though, what do you think about:
- removing the
if p.Inbound()
here so that latest addresses are fetched from node announcement no matter if peer was inbound or outbound - also then fetching latest advertised addresses again after the backoff here
we would still need to keep the UpdateConnReqs though for the case where NodeAnnouncement is received after peerTerminationWatcher has run
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing the if p.Inbound() here so that latest addresses are fetched from node announcement no matter if peer was inbound or outbound
also then fetching latest advertised addresses again after the backoff here
Sounds good to me.
we would still need to keep the UpdateConnReqs though for the case where NodeAnnouncement is received after peerTerminationWatcher has run
Not sure that I follow. What is the problem if NodeAnnouncement
is received after peerTerminationWatcher
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
peerTerminationWatcher
creates the connection requests that will be used to reconnect to the peer and uses the currently stored addresses for that node. If the peer changes their address, but we only get their NodeAnnouncement
that includes the new addresses after we have run peerTerminationWatcher
, then the connection requests that we created will have the incorrect addresses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Then I think we need to fetch the node info from db again to get the latest address.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well yes, but those latest addresses could still be old. We could get a NodeAnnouncement after we have done that. That is why I added the: r.cfg.UpdateConnRequests(pk, msg.Addresses)
above so that if we do get a NodeAnnouncement after we have already fetched from the db and created connReqs, then things can be updated. I will also add refresh from db in peerTerminationWatcher, but i dont think I can remove the UpdateConnRequests
. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will also add refresh from db in peerTerminationWatcher
Yeah I think this would be good enough. As suggested in this comment, we should handle persistent connections on their own dedicated scope. Down the bottom,connmgr
handles the retries, but it does not refresh the address. Since peerTerminationWatcher
is a wrapper above connmgr
, maybe we could put some logic there to retrieve the updated address?
Unrelated to this PR, I think we need to someday extract all the connection management into its own package, saving us from the long server.go
file, put up some nice unit tests, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the PR and the PR description to explain why i think it is not enough to just refresh from the db in peerTerminationWatcher (the itest also fails if we only do this). Let me know what you think.
90c437a
to
534c16b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @bhandras :) I have fixed the nits and added a question to your one comment
server.go
Outdated
s.mu.Lock() | ||
defer s.mu.Unlock() | ||
|
||
// Check if there are currently any connection requests for this node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok yeah is your question basically 'what happens if we somehow get the node announcement before we run peerTerminationWatcher
?
For some reason im having hard time recreating this situation in the itest (cant seem to force the NodeAnnouncement to be received before peerTerminationWatcher has run no matter where i add delays 🤔 )
To address this though, what do you think about:
- removing the
if p.Inbound()
here so that latest addresses are fetched from node announcement no matter if peer was inbound or outbound - also then fetching latest advertised addresses again after the backoff here
we would still need to keep the UpdateConnReqs though for the case where NodeAnnouncement is received after peerTerminationWatcher has run
8ca183c
to
def0707
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think modifying peerTerminationWatcher
to fetch the latest net addr is good enough. In essence, a peer changing its IP is equivalent to a peer going offline then online. We need to properly handle the case where we need to clean old states (like disconnect peer, remove the links, etc), then attempt to reconnect using the new IP (also means updating states like the address in brontide's config). peerTerminationWatcher
has almost done everything already, except fetching the updated net addr.
// assertNumConnections asserts number current connections between two peers. | ||
func assertNumConnections(t *harnessTest, alice, bob *lntest.HarnessNode, | ||
expected int) { | ||
// assertConnected asserts that two peers are connected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice refactor!
lntest/itest/lnd_misc_test.go
Outdated
// that this bug exists and will be changes to assert that peers are able to | ||
// reconnect in the commit that fixes the bug. | ||
func testReconnectAfterIPChange(net *lntest.NetworkHarness, t *harnessTest) { | ||
// In this test, the following network will be set up. A single |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice docs!
lntest/itest/lnd_misc_test.go
Outdated
lntest.OpenChannelParams{ | ||
Amt: 1000000, | ||
}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we could defer closeChannelAndAssert(ctxt, t, net, net.Alice, chanPoint, false)
here for clarity, similar to how we defer shutdownAndAssert(net, t, charlie)
.
lntest/itest/lnd_misc_test.go
Outdated
// that Bob will send below after changing his listening port will be | ||
// younger than the timestamp Bob's NodeAnnouncement sent after the | ||
// channel open above. | ||
time.Sleep(time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry not sure that I follow, but why is this needed?🧐
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that if the 2 NodeAnnouncements about the same node are received in the same second then the second one is ignored as the node thinks it has received it already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ie: i think the timestamp resolution of the NodeAnnouncement is 1 second. but will double check 👍
routing/router.go
Outdated
// If we currently have any connection requests pending for | ||
// this node then update the requests if the node addresses | ||
// have changed. | ||
r.cfg.UpdateConnRequests(pk, msg.Addresses) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if router
should be handling this as it's already saving the node info to db. Some other service could just read the info from db and apply the necessary changes there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related to my comment below, yes sure another service could read from the DB but the issue is that the other service would not know when to refresh from the db
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though we haven't strictly defined the overall architecture, routing
package should not be worrying about net.Addr
. Its main purpose is to route HTLCs over the link layer. If other services don't know when to refresh the db, we could apply a publish-subscribe pattern, or, in this case, might just periodically refresh the db.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool I have moved this to the gossiper package.
server.go
Outdated
s.mu.Lock() | ||
defer s.mu.Unlock() | ||
|
||
// Check if there are currently any connection requests for this node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing the if p.Inbound() here so that latest addresses are fetched from node announcement no matter if peer was inbound or outbound
also then fetching latest advertised addresses again after the backoff here
Sounds good to me.
we would still need to keep the UpdateConnReqs though for the case where NodeAnnouncement is received after peerTerminationWatcher has run
Not sure that I follow. What is the problem if NodeAnnouncement
is received after peerTerminationWatcher
?
lntest/itest/assertions.go
Outdated
func assertNumConnections(t *harnessTest, alice, bob *lntest.HarnessNode, | ||
expected int) { | ||
// assertConnected asserts that two peers are connected. | ||
func assertConnected(t *harnessTest, alice, bob *lntest.HarnessNode, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A second thought on this function. It appears to me that this function only asserts that alice
is connected to bob
, but not the other way around? We might have bob
listed in alice
's call of ListPeers
but no alice
in bob
's. So we need to list peers for bob
too like the original func.
routing/router.go
Outdated
// If we currently have any connection requests pending for | ||
// this node then update the requests if the node addresses | ||
// have changed. | ||
r.cfg.UpdateConnRequests(pk, msg.Addresses) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though we haven't strictly defined the overall architecture, routing
package should not be worrying about net.Addr
. Its main purpose is to route HTLCs over the link layer. If other services don't know when to refresh the db, we could apply a publish-subscribe pattern, or, in this case, might just periodically refresh the db.
server.go
Outdated
s.mu.Lock() | ||
defer s.mu.Unlock() | ||
|
||
// Check if there are currently any connection requests for this node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will also add refresh from db in peerTerminationWatcher
Yeah I think this would be good enough. As suggested in this comment, we should handle persistent connections on their own dedicated scope. Down the bottom,connmgr
handles the retries, but it does not refresh the address. Since peerTerminationWatcher
is a wrapper above connmgr
, maybe we could put some logic there to retrieve the updated address?
Unrelated to this PR, I think we need to someday extract all the connection management into its own package, saving us from the long server.go
file, put up some nice unit tests, etc.
23b4993
to
2f3e5b4
Compare
Thanks for the review @yyforyongyu and @bhandras :) I have updated the PR and have written a much more detailed PR description. Please let me know what you guys think :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @yyforyongyu and @bhandras :) I have updated the PR and have written a much more detailed PR description. Please let me know what you guys think :)
Cool, thanks for the update! Left some comments, and thinking maybe it's easier to just put a loop and retry the connection inside peerTerminationWatcher
to avoid potential race, no need to worry about when the NodeAnnouncement
arrives, etc.
currnet DNS seeds when in SigNet | ||
mode](https://github.com/lightningnetwork/lnd/pull/5564). | ||
|
||
* [A bug has been fixed that would result in nodes not reconnecting to their | ||
persistent outbound peers if the peer's IP | ||
address changed](https://github.com/lightningnetwork/lnd/pull/5538) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: missing a dot in the end
discovery/gossiper.go
Outdated
// If we currently have any connection requests pending for | ||
// this node then update the requests if the node addresses | ||
// have changed. | ||
d.cfg.UpdateConnRequests(pk, msg.Addresses) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could msg.Addresses
be empty? Do we need to call UpdateConnRequests
for every node announcement?
server.go
Outdated
// addresses and update the connection request | ||
// accordingly. | ||
advertisedAddr, err := s.fetchNodeAdvertisedAddr(pubKey) | ||
if err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to log the error here if it's not nil.
server.go
Outdated
@@ -3564,6 +3564,14 @@ func (s *server) peerTerminationWatcher(p *peer.Brontide, ready chan struct{}) { | |||
|
|||
select { | |||
case <-time.After(backoff): | |||
// Once again attempt to refresh the node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we put the whole select
inside a for loop, so that we refresh the db to fetch the latest address after every backoff
? Seems like a simpler change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok cool yeah, good idea 👍
server.go
Outdated
return | ||
} | ||
|
||
connSet := make(map[string]bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add some docs to explain what is this for.
server.go
Outdated
Permanent: true, | ||
} | ||
|
||
s.persistentConnReqs[pubStr] = append( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about s.persistentPeersBackoff
and s.persistentRetryCancels
, do we need to update them too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
latest change does update s.persistentPeersBackoff
but i dont htink that s.persistentRetryCancels
needs to be updated
2f3e5b4
to
64d2343
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the idea @yyforyongyu :) I have updated 👍
server.go
Outdated
@@ -3564,6 +3564,14 @@ func (s *server) peerTerminationWatcher(p *peer.Brontide, ready chan struct{}) { | |||
|
|||
select { | |||
case <-time.After(backoff): | |||
// Once again attempt to refresh the node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok cool yeah, good idea 👍
server.go
Outdated
Permanent: true, | ||
} | ||
|
||
s.persistentConnReqs[pubStr] = append( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
latest change does update s.persistentPeersBackoff
but i dont htink that s.persistentRetryCancels
needs to be updated
64d2343
to
9000efb
Compare
9000efb
to
cb03a69
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM, much cleaner in the latest iteration. Great work @ellemouton! 🎉 (Left a few optional nits to consider.)
I think the PR is ready to get merged but could you please check those itest flakes if any of them are related?
lntest/itest/assertions.go
Outdated
) | ||
} | ||
if len(bNumPeers.Peers) != expected { | ||
|
||
for _, peer := range peers.Peers { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could use less copy-paste by using closures but since it's just test code it's not super important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea, updated 👍
lntest/harness.go
Outdated
@@ -671,7 +671,7 @@ func (n *NetworkHarness) EnsureConnected(ctx context.Context, | |||
// NOTE: This function may block for up to 15-seconds as it will not return | |||
// until the new connection is detected as being known to both nodes. | |||
func (n *NetworkHarness) ConnectNodes(ctx context.Context, t *testing.T, | |||
a, b *HarnessNode) { | |||
a, b *HarnessNode, perm bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: (no need to change) alternatively we could have ConnectNodes(...)
and ConnectNodesPerm(...)
both calling to connectNodes(..., perm bool)
.
lntest/itest/lnd_network_test.go
Outdated
}, | ||
) | ||
defer func() { | ||
ctxt, _ = context.WithTimeout(ctxb, channelCloseTimeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: do you need this context?
cb03a69
to
663f4ab
Compare
@Roasbeef, @yyforyongyu & @bhandras: I have created #5700 as an alternative to this PR. It uses an event driven approach. |
034fe95
to
5e986db
Compare
ok, putting #5700 (big refactor) on the back burner again until 0.15 and revamping this PR instead for now. |
server.go
Outdated
} | ||
// We'll only need to re-launch a connection request if one | ||
// isn't already currently pending. | ||
if _, ok := s.persistentConnReqs[pubStr]; ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solid change here, eliminates a bunch of nesting 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 💎
One small comment, but looking really good other than that. Excited to proceed w/ that greater refactor in the next major version, but pleased with how this PR progressed in either case.
Needs a rebase! |
9c0b0c1
to
4dcc866
Compare
The assertNumConnection function currently takes in an 'expected' number of connections argument and asserts that both alice and bob only each only have that number of connections. So this fails to be useful if say alice is also connected to charlie cause then if we call assertNumConnections between alice and bob it will fail saying there are 2 connections between them since all it does is count alice's total number of connections. This commit replaces this function with 2 new functions: assertConnected which asserts that at least one connection exists between two peers and assertNotConnected which asserts that no connections exists between the two peers.
This commit adds a ConnectNodesPerm function to the itest NetworkHarness so that persistent connections between nodes can be mocked.
This commit adds an itest to demonstrate that if a peer advertises multiplie external IP addresses, then they will not all be used to reconnect to the peer during reconnection. This will be fixed in a follow-up commit.
The point of this commit is to make future commits in the same PR easier to review. All that this commit does is exit early if the peer we are considering is not persistent instead of having a bunch of logic indented in an if-clause.
This commit just ensures that we fetch the lastest advertised addresses for a peer from the db for both inbound and outbound peers. The reason for seperating this into its own commit is to make future commits in this PR easier to review.
In this commit, all advertised addresses of a peer are used during reconnection. This fixes a bug previously demonstrated in an itest.
In this commit we demonstrate a bug to show that if an inbound peer changes their listening address to one not advertised in their original NodeAnnouncement then we will not be able to reconnect to them. This bug will be fixed in a follow-up commit.
In this commit, a subscription is made to topology updates. For any NodeAnnouncements received for our peristent peers, we store their newly advertised addresses. If at the time of receiving these new addresses there are any existing connection requests, these are updated to reflect the newly advertised addresses.
4dcc866
to
317acf7
Compare
Has this been thoroughly tested? I had another IP address change on my node and around 25 of my tor peers are still disconnected after 40 hours. The new IP was announced right after the change and 1ml picked it up around 15-30 minutes later. |
Fixes #5377
With this PR, we ensure that when a persistent outbound peer changes its IP address we can reconnect to it. This is done by periodically fetching the peer's latest advertised IP address from the DB and updating the connection request to the peer accordingly.
A new itest,
reconnect_after_ip_change
demonstrates that the original bug and shows how this PR fixes the bug