Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix potential channel announcements missing #7186

Merged

Conversation

yyforyongyu
Copy link
Member

This failure has been seen in the refactored itest quite often, leading to an investigation over the gossip/routing package. This PR refactored the related code to the point where the potential bug finally manifested itself in the logs.

The actual fix was simple, as shown in the last commit. Basically, we may get a deny signal from handleChanAnnouncement when processing an announcement that is ignored by routing because the channel edge info was already created. Please check the commit message for details.

Another fix is to replace the pattern,

select {
case variable := <-channel:
    go func() {
        // Some actions here...
        // Now access the variable, which might be referencing to a different one.
        foo(variable)
        // More actions here...
    }()
}

with

select {
case variable := <-channel:
    go func(v variable) {
        // Some actions here...
        // Now access the variable passed from the function.
        foo(v)
        // More actions here...
    }(v)
}

To make sure the variable stays the same in the goroutine.

Copy link
Collaborator

@guggero guggero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a great find and amazing detective work 💯

Just a couple of nits and one unnecessary goroutine, otherwise this looks great!

discovery/gossiper.go Outdated Show resolved Hide resolved
routing/validation_barrier.go Outdated Show resolved Hide resolved
routing/validation_barrier.go Outdated Show resolved Hide resolved
routing/validation_barrier.go Outdated Show resolved Hide resolved
discovery/syncer.go Show resolved Hide resolved
discovery/gossiper.go Show resolved Hide resolved
@@ -1051,76 +1122,7 @@ func (r *ChannelRouter) networkHandler() {
validationBarrier.InitJobDependencies(update.msg)

r.wg.Add(1)
go func() {
Copy link
Collaborator

@Crypt-iQ Crypt-iQ Nov 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the loop scope bug only happens if there is a for loop iterating over something rather than receiving the variable from a chan: golang/go#56010

See also this toy example that mimics this code and doesn't redefine the variable: https://go.dev/play/p/wKd82-WVsR-

vs this example that uses a for loop and does redefine the variable: https://go.dev/play/p/BfBCi5k2LNw

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah you are right. I mistakenly thought it was the from receiving from the chan. Updated the description plus the commit message.

@yyforyongyu yyforyongyu force-pushed the fix-missing-channel-updates branch from 425683f to 54b7e1b Compare November 21, 2022 22:35
@yyforyongyu
Copy link
Member Author

re adding the waitingParentSignalTimeout cc @guggero

Some background context about the validation barrier.
To add a channel to our node, we want to see four gossip messages, ChannelAnnouncement, NodeAnnouncement, one ChannelUpdate from the origin node, and one ChannelUpdate from the origin node's peer. Ideally, we want them to come in this order,

  1. ChannelAnnouncement
  2. NodeAnnouncement
  3. two ChannelUpdates

But of course this won't happen, which means we can see a ChannelUpdate before receiving its corresponding ChannelAnnouncement. And this is where the validation barrier comes into play - we want to make sure the message's parent is being validated and processed first before we process the message itself. For ChannelUpdate, we need ChannelAnnouncement. For NodeAnnouncement, we also need ChannelAnnoucement.

We also apply similar logic to the database objects,

  • channeldb.EdgeInfo, db representation of lnwire.ChannelAnnouncement.
  • channeldb.LightningNode, db representation of lnwire.NodeAnnouncement.
  • channeldb.EdgePolicy, db representation of lnwire.ChannelUpdate.

Now come back to the timeout. I think what it means is, the time it takes for our internal system to process a message and signal complete. In particular, it's the discovery sending the message to routing, the latter adds it to the db, and signals back. There are no peer communications needed and it's just limiting the time taken for internal processing. So I don't think it's worth exposing this config value to users and, one minute should be enough to complete the jobs.

That said, the ValidationBarrier seems to be a bit outdated and may be replaced with something simpler. It was meant to serve two purposes, the first being limiting the number of concurrent jobs and the second, enforcing the validation hierarchy. I think the second purpose is somewhat already accomplished by discovery as we would save the ChannelUpdate for processing later if the edge cannot be found.

Finally, if the ChannelUpdate comes much earlier than ChannelAnnouncement, it will bypass the validation barrier as the chanEdgeDependecies is not initialized yet.

@guggero
Copy link
Collaborator

guggero commented Nov 22, 2022

Thanks a lot for the explanation, @yyforyongyu! That makes it very clear to me and I agree that one minute is probably more than enough for most scenarios.

@Roasbeef
Copy link
Member

Roasbeef commented Nov 29, 2022

Perhaps this is related to this #6531?

Here's a partial investigation of a related issue (potential variable scoping mutation issue): #7073 (comment)

Copy link
Collaborator

@Crypt-iQ Crypt-iQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one thing that can be improved in an isolated PR is that we interact with the validation barrier twice for each NodeAnnouncement / ChannelUpdate / ChannelAnnouncement. Once in the gossiper when calling processNetworkAnnouncement and once in the router when calling processUpdate. I think this can be improved so that when a message from the gossiper reaches the router, it can skip the extra validation barrier step. This would make the logic a bit easier to reason about from a race-condition perspective. I'd need to check old PRs to see if it's done like this for a reason, but at a glance it seems like we can remove the second step only for gossip-> router messages

@@ -1100,6 +1100,8 @@ func (r *ChannelRouter) networkHandler() {
update.msg, allowDependents,
)
if err != nil {
log.Debugf("process network updates "+
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: some of the logs in this commit can be log.Errorf instead of log.Debugf?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it's kinda tricky as some of the errors are somehow used as a return value than actual errors, like the ignored errors here, which means somehow it's expected? Anyway will log an error here if it's not expected.

// If allowDependents is false, it means there is an error and the
// error is neither ErrIgnored or ErrOutdated. In this case, we'll log
// an error and return.
if !allowDependents {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is this necessary since we'll already return if err != nil?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the logic branches and go docs. Basically we want to log an error if it's unexpected.

@@ -1401,6 +1401,8 @@ func newChanMsgStream(p *Brontide, cid lnwire.ChannelID) *msgStream {
// channel announcements.
func newDiscMsgStream(p *Brontide) *msgStream {
apply := func(msg lnwire.Message) {
// TODO(yy): `ProcessRemoteAnnouncement` returns an error chan
// and we need to process it.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: if we process the error chan, then I think we'll see a slowdown here. I think the only case it'd be useful is if we wanted to disconnect a peer for sending something that resulted in an error and even then, it can be accomplished in other ways

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't looked into the whole chain here, but commenting from a code health perspective. I think the returned value that is not checked can be potentially harmful in the long run, so if it's not useful, we can stop returning it. If it's used by other callsite, maybe we can make a new method to not return this err chan.

routing/validation_barrier.go Outdated Show resolved Hide resolved
discovery/gossiper.go Show resolved Hide resolved
@yyforyongyu yyforyongyu force-pushed the fix-missing-channel-updates branch from 54b7e1b to c431235 Compare December 5, 2022 17:43
@yyforyongyu
Copy link
Member Author

Once in the gossiper when calling processNetworkAnnouncement and once in the router when calling processUpdate. I think this can be improved so that when a message from the gossiper reaches the router, it can skip the extra validation barrier step.

Would love to do this in a follow-up PR! I think the current setup makes it difficult to reason and debug. Do you think it's possible that we just get rid of the validation barrier?

@joostjager
Copy link
Contributor

I was previously successful with fixing non-propagating updates with this one-liner. Now that a problem has been identified in this PR, does that explain why that one liner also fixed it (but probably in a round-about way)?

@yyforyongyu
Copy link
Member Author

I was previously successful with fixing non-propagating updates with this one-liner. Now that a problem has been identified in this PR, does that explain why that one liner also fixed it (but probably in a round-about way)?

I think it's a different issue? From this comment,

The missing channel updates never even left the node that sets the policy. The problem isn't propagation on the network, but the update not even being broadcast.

This is not the case in the current fix. I'd say this PR fixes a special case, where both Alice and Bob have a connection with Carol, when Alice and Bob have a channel open, its channel update messages can get ignored by Carol, causing Carol to have an incomplete graph.

@joostjager
Copy link
Contributor

Right. To be sure, I ran the repro scenario in #6531 with this PR, and indeed, it is not fixed.

This commit refactors the `networkHandler` to use the new method
`handleNetworkUpdate`. Because the `select` is called inside a for loop,
which is equivalent of firing goroutine inside range loop, it's possible
that a variable used inside a previous goroutine is referencing the
current one. This is now fixed by making the goroutine taking the params
used for network update.
This commit moves the `shouldBroadcast` logic closer to the execution
logic of deciding whether we want to broadcast the announcements. This
is a pure code refactor and should make no difference in announcing
message unless the `d.syncMgr.IsGraphSynced()` gives different results
inside the goroutine.
Also adds a `TODO` for checking the err chan.
This commit makes the `handleChanAnnouncement` always returning `true`
for messages processed but ignored by router, even when the extracted
announcements are empty.
Previously we'd return false when the announcements are empty, which
could cause `ChannelUpdate`s being ignored since the validation barrier
will signal deny for the jobs. This can easily be trigger using
following setup,
1. Alice connects to Bob and open a channel.
2. Alice connects to Carol, Bob connects to Carol.
3. Once the channel is open, Alice and Bob will both announce it to Carol.

At some point, we'd have the following messages in Carol's node,
- Alice's ChannelAnnouncement
- Alice's ChannelUpdates, for both directions
- Bob's ChannelAnnouncement
- Bob's ChannelUpdates, for both directions

And a bug could happen, if,
- Alice's ChannelAnnouncement is processed by router, hence added to db,
  but not reporting back to gossiper yet, so the validation barrier
  hasn't sent signal allow.
- Bob's ChannelAnnouncement is processed by router, and returned
  `ErrIgnored` as the edge info is already in db, and reported back to
  gossiper, the validation barrier will signal deny to all the
  ChannelUpdates jobs.
- Depending on how fast Alice's ChannelAnnouncement is processed, we may
  get zero to four denies to the above ChannelUpdates, causing a channel
  edge policy never being updated.
@yyforyongyu yyforyongyu force-pushed the fix-missing-channel-updates branch from c431235 to 1206174 Compare December 8, 2022 09:57
@guggero guggero merged commit d9febbb into lightningnetwork:master Dec 12, 2022
@yyforyongyu yyforyongyu deleted the fix-missing-channel-updates branch December 12, 2022 09:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants