-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix potential channel announcements missing #7186
Fix potential channel announcements missing #7186
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What a great find and amazing detective work 💯
Just a couple of nits and one unnecessary goroutine, otherwise this looks great!
@@ -1051,76 +1122,7 @@ func (r *ChannelRouter) networkHandler() { | |||
validationBarrier.InitJobDependencies(update.msg) | |||
|
|||
r.wg.Add(1) | |||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the loop scope bug only happens if there is a for loop iterating over something rather than receiving the variable from a chan: golang/go#56010
See also this toy example that mimics this code and doesn't redefine the variable: https://go.dev/play/p/wKd82-WVsR-
vs this example that uses a for loop and does redefine the variable: https://go.dev/play/p/BfBCi5k2LNw
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah you are right. I mistakenly thought it was the from receiving from the chan. Updated the description plus the commit message.
425683f
to
54b7e1b
Compare
re adding the Some background context about the validation barrier.
But of course this won't happen, which means we can see a We also apply similar logic to the database objects,
Now come back to the timeout. I think what it means is, the time it takes for our internal system to process a message and signal complete. In particular, it's the That said, the Finally, if the |
Thanks a lot for the explanation, @yyforyongyu! That makes it very clear to me and I agree that one minute is probably more than enough for most scenarios. |
Perhaps this is related to this #6531? Here's a partial investigation of a related issue (potential variable scoping mutation issue): #7073 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one thing that can be improved in an isolated PR is that we interact with the validation barrier twice for each NodeAnnouncement
/ ChannelUpdate
/ ChannelAnnouncement
. Once in the gossiper when calling processNetworkAnnouncement
and once in the router when calling processUpdate
. I think this can be improved so that when a message from the gossiper reaches the router, it can skip the extra validation barrier step. This would make the logic a bit easier to reason about from a race-condition perspective. I'd need to check old PRs to see if it's done like this for a reason, but at a glance it seems like we can remove the second step only for gossip-> router messages
routing/router.go
Outdated
@@ -1100,6 +1100,8 @@ func (r *ChannelRouter) networkHandler() { | |||
update.msg, allowDependents, | |||
) | |||
if err != nil { | |||
log.Debugf("process network updates "+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: some of the logs in this commit can be log.Errorf
instead of log.Debugf
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah it's kinda tricky as some of the errors are somehow used as a return value than actual errors, like the ignored errors here, which means somehow it's expected? Anyway will log an error here if it's not expected.
routing/router.go
Outdated
// If allowDependents is false, it means there is an error and the | ||
// error is neither ErrIgnored or ErrOutdated. In this case, we'll log | ||
// an error and return. | ||
if !allowDependents { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Is this necessary since we'll already return if err != nil
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the logic branches and go docs. Basically we want to log an error if it's unexpected.
@@ -1401,6 +1401,8 @@ func newChanMsgStream(p *Brontide, cid lnwire.ChannelID) *msgStream { | |||
// channel announcements. | |||
func newDiscMsgStream(p *Brontide) *msgStream { | |||
apply := func(msg lnwire.Message) { | |||
// TODO(yy): `ProcessRemoteAnnouncement` returns an error chan | |||
// and we need to process it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: if we process the error chan, then I think we'll see a slowdown here. I think the only case it'd be useful is if we wanted to disconnect a peer for sending something that resulted in an error and even then, it can be accomplished in other ways
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't looked into the whole chain here, but commenting from a code health perspective. I think the returned value that is not checked can be potentially harmful in the long run, so if it's not useful, we can stop returning it. If it's used by other callsite, maybe we can make a new method to not return this err chan.
54b7e1b
to
c431235
Compare
Would love to do this in a follow-up PR! I think the current setup makes it difficult to reason and debug. Do you think it's possible that we just get rid of the validation barrier? |
I was previously successful with fixing non-propagating updates with this one-liner. Now that a problem has been identified in this PR, does that explain why that one liner also fixed it (but probably in a round-about way)? |
I think it's a different issue? From this comment,
This is not the case in the current fix. I'd say this PR fixes a special case, where both Alice and Bob have a connection with Carol, when Alice and Bob have a channel open, its channel update messages can get ignored by Carol, causing Carol to have an incomplete graph. |
Right. To be sure, I ran the repro scenario in #6531 with this PR, and indeed, it is not fixed. |
This commit refactors the `networkHandler` to use the new method `handleNetworkUpdate`. Because the `select` is called inside a for loop, which is equivalent of firing goroutine inside range loop, it's possible that a variable used inside a previous goroutine is referencing the current one. This is now fixed by making the goroutine taking the params used for network update.
This commit moves the `shouldBroadcast` logic closer to the execution logic of deciding whether we want to broadcast the announcements. This is a pure code refactor and should make no difference in announcing message unless the `d.syncMgr.IsGraphSynced()` gives different results inside the goroutine.
Also adds a `TODO` for checking the err chan.
This commit makes the `handleChanAnnouncement` always returning `true` for messages processed but ignored by router, even when the extracted announcements are empty. Previously we'd return false when the announcements are empty, which could cause `ChannelUpdate`s being ignored since the validation barrier will signal deny for the jobs. This can easily be trigger using following setup, 1. Alice connects to Bob and open a channel. 2. Alice connects to Carol, Bob connects to Carol. 3. Once the channel is open, Alice and Bob will both announce it to Carol. At some point, we'd have the following messages in Carol's node, - Alice's ChannelAnnouncement - Alice's ChannelUpdates, for both directions - Bob's ChannelAnnouncement - Bob's ChannelUpdates, for both directions And a bug could happen, if, - Alice's ChannelAnnouncement is processed by router, hence added to db, but not reporting back to gossiper yet, so the validation barrier hasn't sent signal allow. - Bob's ChannelAnnouncement is processed by router, and returned `ErrIgnored` as the edge info is already in db, and reported back to gossiper, the validation barrier will signal deny to all the ChannelUpdates jobs. - Depending on how fast Alice's ChannelAnnouncement is processed, we may get zero to four denies to the above ChannelUpdates, causing a channel edge policy never being updated.
c431235
to
1206174
Compare
This failure has been seen in the refactored itest quite often, leading to an investigation over the gossip/routing package. This PR refactored the related code to the point where the potential bug finally manifested itself in the logs.
The actual fix was simple, as shown in the last commit. Basically, we may get a deny signal from
handleChanAnnouncement
when processing an announcement that is ignored by routing because the channel edge info was already created. Please check the commit message for details.Another fix is to replace the pattern,
with
To make sure the variable stays the same in the goroutine.