Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multi: refresh peer IP during reconnect #5538

Merged
merged 9 commits into from
Oct 4, 2021

Conversation

ellemouton
Copy link
Collaborator

@ellemouton ellemouton commented Jul 17, 2021

Fixes #5377

With this PR, we ensure that when a persistent outbound peer changes its IP address we can reconnect to it. This is done by periodically fetching the peer's latest advertised IP address from the DB and updating the connection request to the peer accordingly.

A new itest, reconnect_after_ip_change demonstrates that the original bug and shows how this PR fixes the bug

@Crypt-iQ Crypt-iQ added bug fix networking p2p Code related to the peer-to-peer behaviour labels Jul 19, 2021
@Roasbeef Roasbeef added this to the v0.14.0 milestone Jul 19, 2021
@ellemouton ellemouton changed the title mutli: refresh peer IP during reconnect multi: refresh peer IP during reconnect Jul 20, 2021
@Roasbeef Roasbeef requested review from yyforyongyu and bhandras July 20, 2021 17:39
lntest/itest/assertions.go Outdated Show resolved Hide resolved
lntest/itest/lnd_misc_test.go Outdated Show resolved Hide resolved
lntest/itest/lnd_misc_test.go Outdated Show resolved Hide resolved
routing/router.go Outdated Show resolved Hide resolved
server.go Outdated Show resolved Hide resolved
routing/router.go Outdated Show resolved Hide resolved
server.go Outdated
s.mu.Lock()
defer s.mu.Unlock()

// Check if there are currently any connection requests for this node.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I don't fully understand is that isn't this inherently a race? As in if the address changes we only ever connect to that new address if we are right in the middle of connecting to the node anyway?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok yeah is your question basically 'what happens if we somehow get the node announcement before we run peerTerminationWatcher?

For some reason im having hard time recreating this situation in the itest (cant seem to force the NodeAnnouncement to be received before peerTerminationWatcher has run no matter where i add delays 🤔 )

To address this though, what do you think about:

  1. removing the if p.Inbound() here so that latest addresses are fetched from node announcement no matter if peer was inbound or outbound
  2. also then fetching latest advertised addresses again after the backoff here

we would still need to keep the UpdateConnReqs though for the case where NodeAnnouncement is received after peerTerminationWatcher has run

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing the if p.Inbound() here so that latest addresses are fetched from node announcement no matter if peer was inbound or outbound
also then fetching latest advertised addresses again after the backoff here

Sounds good to me.

we would still need to keep the UpdateConnReqs though for the case where NodeAnnouncement is received after peerTerminationWatcher has run

Not sure that I follow. What is the problem if NodeAnnouncement is received after peerTerminationWatcher?

Copy link
Collaborator Author

@ellemouton ellemouton Jul 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

peerTerminationWatcher creates the connection requests that will be used to reconnect to the peer and uses the currently stored addresses for that node. If the peer changes their address, but we only get their NodeAnnouncement that includes the new addresses after we have run peerTerminationWatcher, then the connection requests that we created will have the incorrect addresses

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Then I think we need to fetch the node info from db again to get the latest address.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well yes, but those latest addresses could still be old. We could get a NodeAnnouncement after we have done that. That is why I added the: r.cfg.UpdateConnRequests(pk, msg.Addresses) above so that if we do get a NodeAnnouncement after we have already fetched from the db and created connReqs, then things can be updated. I will also add refresh from db in peerTerminationWatcher, but i dont think I can remove the UpdateConnRequests. Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will also add refresh from db in peerTerminationWatcher

Yeah I think this would be good enough. As suggested in this comment, we should handle persistent connections on their own dedicated scope. Down the bottom,connmgr handles the retries, but it does not refresh the address. Since peerTerminationWatcher is a wrapper above connmgr, maybe we could put some logic there to retrieve the updated address?

Unrelated to this PR, I think we need to someday extract all the connection management into its own package, saving us from the long server.go file, put up some nice unit tests, etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the PR and the PR description to explain why i think it is not enough to just refresh from the db in peerTerminationWatcher (the itest also fails if we only do this). Let me know what you think.

Copy link
Collaborator Author

@ellemouton ellemouton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @bhandras :) I have fixed the nits and added a question to your one comment

routing/router.go Outdated Show resolved Hide resolved
server.go Outdated
s.mu.Lock()
defer s.mu.Unlock()

// Check if there are currently any connection requests for this node.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok yeah is your question basically 'what happens if we somehow get the node announcement before we run peerTerminationWatcher?

For some reason im having hard time recreating this situation in the itest (cant seem to force the NodeAnnouncement to be received before peerTerminationWatcher has run no matter where i add delays 🤔 )

To address this though, what do you think about:

  1. removing the if p.Inbound() here so that latest addresses are fetched from node announcement no matter if peer was inbound or outbound
  2. also then fetching latest advertised addresses again after the backoff here

we would still need to keep the UpdateConnReqs though for the case where NodeAnnouncement is received after peerTerminationWatcher has run

@ellemouton ellemouton force-pushed the refreshPeerIP branch 2 times, most recently from 8ca183c to def0707 Compare July 22, 2021 09:12
Copy link
Member

@yyforyongyu yyforyongyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think modifying peerTerminationWatcher to fetch the latest net addr is good enough. In essence, a peer changing its IP is equivalent to a peer going offline then online. We need to properly handle the case where we need to clean old states (like disconnect peer, remove the links, etc), then attempt to reconnect using the new IP (also means updating states like the address in brontide's config). peerTerminationWatcher has almost done everything already, except fetching the updated net addr.

// assertNumConnections asserts number current connections between two peers.
func assertNumConnections(t *harnessTest, alice, bob *lntest.HarnessNode,
expected int) {
// assertConnected asserts that two peers are connected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice refactor!

lntest/itest/assertions.go Outdated Show resolved Hide resolved
lntest/itest/lnd_misc_test.go Outdated Show resolved Hide resolved
lntest/itest/lnd_misc_test.go Outdated Show resolved Hide resolved
// that this bug exists and will be changes to assert that peers are able to
// reconnect in the commit that fixes the bug.
func testReconnectAfterIPChange(net *lntest.NetworkHarness, t *harnessTest) {
// In this test, the following network will be set up. A single
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice docs!

lntest.OpenChannelParams{
Amt: 1000000,
},
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we could defer closeChannelAndAssert(ctxt, t, net, net.Alice, chanPoint, false) here for clarity, similar to how we defer shutdownAndAssert(net, t, charlie).

// that Bob will send below after changing his listening port will be
// younger than the timestamp Bob's NodeAnnouncement sent after the
// channel open above.
time.Sleep(time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry not sure that I follow, but why is this needed?🧐

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that if the 2 NodeAnnouncements about the same node are received in the same second then the second one is ignored as the node thinks it has received it already

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ie: i think the timestamp resolution of the NodeAnnouncement is 1 second. but will double check 👍

lntest/itest/lnd_misc_test.go Outdated Show resolved Hide resolved
// If we currently have any connection requests pending for
// this node then update the requests if the node addresses
// have changed.
r.cfg.UpdateConnRequests(pk, msg.Addresses)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if router should be handling this as it's already saving the node info to db. Some other service could just read the info from db and apply the necessary changes there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to my comment below, yes sure another service could read from the DB but the issue is that the other service would not know when to refresh from the db

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though we haven't strictly defined the overall architecture, routing package should not be worrying about net.Addr. Its main purpose is to route HTLCs over the link layer. If other services don't know when to refresh the db, we could apply a publish-subscribe pattern, or, in this case, might just periodically refresh the db.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool I have moved this to the gossiper package.

server.go Outdated
s.mu.Lock()
defer s.mu.Unlock()

// Check if there are currently any connection requests for this node.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing the if p.Inbound() here so that latest addresses are fetched from node announcement no matter if peer was inbound or outbound
also then fetching latest advertised addresses again after the backoff here

Sounds good to me.

we would still need to keep the UpdateConnReqs though for the case where NodeAnnouncement is received after peerTerminationWatcher has run

Not sure that I follow. What is the problem if NodeAnnouncement is received after peerTerminationWatcher?

func assertNumConnections(t *harnessTest, alice, bob *lntest.HarnessNode,
expected int) {
// assertConnected asserts that two peers are connected.
func assertConnected(t *harnessTest, alice, bob *lntest.HarnessNode,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A second thought on this function. It appears to me that this function only asserts that alice is connected to bob, but not the other way around? We might have bob listed in alice's call of ListPeers but no alice in bob's. So we need to list peers for bob too like the original func.

lntest/itest/lnd_misc_test.go Outdated Show resolved Hide resolved
// If we currently have any connection requests pending for
// this node then update the requests if the node addresses
// have changed.
r.cfg.UpdateConnRequests(pk, msg.Addresses)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though we haven't strictly defined the overall architecture, routing package should not be worrying about net.Addr. Its main purpose is to route HTLCs over the link layer. If other services don't know when to refresh the db, we could apply a publish-subscribe pattern, or, in this case, might just periodically refresh the db.

server.go Outdated
s.mu.Lock()
defer s.mu.Unlock()

// Check if there are currently any connection requests for this node.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will also add refresh from db in peerTerminationWatcher

Yeah I think this would be good enough. As suggested in this comment, we should handle persistent connections on their own dedicated scope. Down the bottom,connmgr handles the retries, but it does not refresh the address. Since peerTerminationWatcher is a wrapper above connmgr, maybe we could put some logic there to retrieve the updated address?

Unrelated to this PR, I think we need to someday extract all the connection management into its own package, saving us from the long server.go file, put up some nice unit tests, etc.

@ellemouton ellemouton force-pushed the refreshPeerIP branch 2 times, most recently from 23b4993 to 2f3e5b4 Compare August 6, 2021 14:08
@ellemouton
Copy link
Collaborator Author

Thanks for the review @yyforyongyu and @bhandras :) I have updated the PR and have written a much more detailed PR description. Please let me know what you guys think :)

Copy link
Member

@yyforyongyu yyforyongyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @yyforyongyu and @bhandras :) I have updated the PR and have written a much more detailed PR description. Please let me know what you guys think :)

Cool, thanks for the update! Left some comments, and thinking maybe it's easier to just put a loop and retry the connection inside peerTerminationWatcher to avoid potential race, no need to worry about when the NodeAnnouncement arrives, etc.

currnet DNS seeds when in SigNet
mode](https://github.com/lightningnetwork/lnd/pull/5564).

* [A bug has been fixed that would result in nodes not reconnecting to their
persistent outbound peers if the peer's IP
address changed](https://github.com/lightningnetwork/lnd/pull/5538)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing a dot in the end

// If we currently have any connection requests pending for
// this node then update the requests if the node addresses
// have changed.
d.cfg.UpdateConnRequests(pk, msg.Addresses)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could msg.Addresses be empty? Do we need to call UpdateConnRequests for every node announcement?

server.go Outdated
// addresses and update the connection request
// accordingly.
advertisedAddr, err := s.fetchNodeAdvertisedAddr(pubKey)
if err == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to log the error here if it's not nil.

server.go Outdated
@@ -3564,6 +3564,14 @@ func (s *server) peerTerminationWatcher(p *peer.Brontide, ready chan struct{}) {

select {
case <-time.After(backoff):
// Once again attempt to refresh the node
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we put the whole select inside a for loop, so that we refresh the db to fetch the latest address after every backoff? Seems like a simpler change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok cool yeah, good idea 👍

server.go Outdated
return
}

connSet := make(map[string]bool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add some docs to explain what is this for.

server.go Outdated
Permanent: true,
}

s.persistentConnReqs[pubStr] = append(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about s.persistentPeersBackoff and s.persistentRetryCancels, do we need to update them too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

latest change does update s.persistentPeersBackoff but i dont htink that s.persistentRetryCancels needs to be updated

Copy link
Collaborator Author

@ellemouton ellemouton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the idea @yyforyongyu :) I have updated 👍

server.go Outdated
@@ -3564,6 +3564,14 @@ func (s *server) peerTerminationWatcher(p *peer.Brontide, ready chan struct{}) {

select {
case <-time.After(backoff):
// Once again attempt to refresh the node
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok cool yeah, good idea 👍

server.go Outdated
Permanent: true,
}

s.persistentConnReqs[pubStr] = append(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

latest change does update s.persistentPeersBackoff but i dont htink that s.persistentRetryCancels needs to be updated

server.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@bhandras bhandras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM, much cleaner in the latest iteration. Great work @ellemouton! 🎉 (Left a few optional nits to consider.)

I think the PR is ready to get merged but could you please check those itest flakes if any of them are related?

)
}
if len(bNumPeers.Peers) != expected {

for _, peer := range peers.Peers {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could use less copy-paste by using closures but since it's just test code it's not super important.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, updated 👍

@@ -671,7 +671,7 @@ func (n *NetworkHarness) EnsureConnected(ctx context.Context,
// NOTE: This function may block for up to 15-seconds as it will not return
// until the new connection is detected as being known to both nodes.
func (n *NetworkHarness) ConnectNodes(ctx context.Context, t *testing.T,
a, b *HarnessNode) {
a, b *HarnessNode, perm bool) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: (no need to change) alternatively we could have ConnectNodes(...) and ConnectNodesPerm(...) both calling to connectNodes(..., perm bool).

},
)
defer func() {
ctxt, _ = context.WithTimeout(ctxb, channelCloseTimeout)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do you need this context?

lntest/itest/lnd_network_test.go Outdated Show resolved Hide resolved
@ellemouton
Copy link
Collaborator Author

@Roasbeef, @yyforyongyu & @bhandras: I have created #5700 as an alternative to this PR. It uses an event driven approach.

@ellemouton
Copy link
Collaborator Author

ok, putting #5700 (big refactor) on the back burner again until 0.15 and revamping this PR instead for now.

server.go Outdated
}
// We'll only need to re-launch a connection request if one
// isn't already currently pending.
if _, ok := s.persistentConnReqs[pubStr]; ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid change here, eliminates a bunch of nesting 👍

Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 💎

One small comment, but looking really good other than that. Excited to proceed w/ that greater refactor in the next major version, but pleased with how this PR progressed in either case.

server.go Outdated Show resolved Hide resolved
@Roasbeef
Copy link
Member

Roasbeef commented Oct 1, 2021

Needs a rebase!

@ellemouton ellemouton force-pushed the refreshPeerIP branch 2 times, most recently from 9c0b0c1 to 4dcc866 Compare October 1, 2021 06:48
The assertNumConnection function currently takes in an 'expected' number
of connections argument and asserts that both alice and bob only each
only have that number of connections. So this fails to be useful if say
alice is also connected to charlie cause then if we call
assertNumConnections between alice and bob it will fail saying there are
2 connections between them since all it does is count alice's total
  number of connections. This commit replaces this function with 2 new
functions: assertConnected which asserts that at least one connection
exists between two peers and assertNotConnected which asserts that no
connections exists between the two peers.
This commit adds a ConnectNodesPerm function to the itest NetworkHarness
so that persistent connections between nodes can be mocked.
This commit adds an itest to demonstrate that if a peer advertises
multiplie external IP addresses, then they will not all be used to
reconnect to the peer during reconnection. This will be fixed in a
follow-up commit.
The point of this commit is to make future commits in the same PR easier
to review. All that this commit does is exit early if the peer we are
considering is not persistent instead of having a bunch of logic
indented in an if-clause.
This commit just ensures that we fetch the lastest advertised addresses
for a peer from the db for both inbound and outbound peers. The reason
for seperating this into its own commit is to make future commits in
this PR easier to review.
In this commit, all advertised addresses of a peer are used during
reconnection. This fixes a bug previously demonstrated in an itest.
In this commit we demonstrate a bug to show that if an inbound peer
changes their listening address to one not advertised in their original
NodeAnnouncement then we will not be able to reconnect to them. This bug
will be fixed in a follow-up commit.
In this commit, a subscription is made to topology updates. For any
NodeAnnouncements received for our peristent peers, we store their newly
advertised addresses. If at the time of receiving these new addresses
there are any existing connection requests, these are updated to reflect
the newly advertised addresses.
@viaj3ro
Copy link

viaj3ro commented Jan 2, 2022

Has this been thoroughly tested? I had another IP address change on my node and around 25 of my tor peers are still disconnected after 40 hours. The new IP was announced right after the change and 1ml picked it up around 15-30 minutes later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug fix networking p2p Code related to the peer-to-peer behaviour
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Peers are unable to reconnect and stuck in a deadlock after IP address change
8 participants