Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multi: set zero bandwidth hint for channels that do not have any available slots #5355

Merged
merged 3 commits into from
Jun 24, 2021

Conversation

carlaKC
Copy link
Collaborator

@carlaKC carlaKC commented Jun 3, 2021

This PR updates our bandwidth hints to report a zero balance when we're not allowed to add any more htlcs to the remote party's commitment. This prevents pathfinding from using the channel at all; previously it would endlessly try to send shard on the channel.

The first commit just demonstrates the endless cycle of htlcs by logging (couldn't think of a better way), and should be squashed with the second commit to include a test that asserts that we don't use full channels. Open to other testing options besides an itest, but it's difficult to do because this is a combination of multiple elements - pathfinding + payment lifecycle + link - so mocking this didn't feel like it would meaningfully add much (would be testing the mock rather than the code itself).

Fixes #5297
Fixes #4656

@Crypt-iQ Crypt-iQ self-requested a review June 3, 2021 14:16
lnwallet/channel.go Outdated Show resolved Hide resolved
@carlaKC carlaKC force-pushed the 4656-maxhtlcnumber branch 2 times, most recently from 1a7e5fd to 4a1d6a8 Compare June 11, 2021 07:05
@carlaKC carlaKC marked this pull request as ready for review June 11, 2021 07:06
@carlaKC carlaKC linked an issue Jun 11, 2021 that may be closed by this pull request
@Crypt-iQ Crypt-iQ added this to the v0.14.0 milestone Jun 11, 2021
Copy link
Collaborator

@Crypt-iQ Crypt-iQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think the first commit should be dropped like you said, o/w this patch looks good!

lnwallet/channel.go Show resolved Hide resolved
lntest/itest/lnd_max_htlcs_test.go Outdated Show resolved Hide resolved
@carlaKC
Copy link
Collaborator Author

carlaKC commented Jun 11, 2021

Think the first commit should be dropped like you said, o/w this patch looks good!

WDYT about keeping the itest? Was planning on updating to just add the itest in the last commit.

@Crypt-iQ
Copy link
Collaborator

Think the first commit should be dropped like you said, o/w this patch looks good!

WDYT about keeping the itest? Was planning on updating to just add the itest in the last commit.

Yeah I think the itest is good as it exercises previously uncovered behavior

@carlaKC carlaKC force-pushed the 4656-maxhtlcnumber branch 2 times, most recently from d290346 to 0013711 Compare June 14, 2021 10:02
@carlaKC
Copy link
Collaborator Author

carlaKC commented Jun 14, 2021

Yeah I think the itest is good as it exercises previously uncovered behavior

sgtm, squashed all itest changes to last commit where we add the fix.

@carlaKC
Copy link
Collaborator Author

carlaKC commented Jun 15, 2021

@Roasbeef is this something we could consider for 0.13? When this condition hits it really hammers the db with payment attempts, with 60 second timeout locally I saw 200-300 unnecessary attempts.

Copy link
Collaborator

@Crypt-iQ Crypt-iQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🏭 , just nits

lntest/itest/lnd_max_htlcs_test.go Outdated Show resolved Hide resolved
lntest/itest/lnd_max_htlcs_test.go Show resolved Hide resolved
lntest/itest/lnd_max_htlcs_test.go Show resolved Hide resolved
lntest/itest/lnd_max_htlcs_test.go Outdated Show resolved Hide resolved
@carlaKC
Copy link
Collaborator Author

carlaKC commented Jun 22, 2021

Rebased + addressed itest comments. Thanks for the review @Crypt-iQ !

@@ -145,6 +145,10 @@ type ChannelLink interface {
// will use this function in forwarding decisions accordingly.
EligibleToForward() bool

// MayAddOutgoingHtlc returns an error if we may not add an outgoing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this if we already have CheckHtlcTransit which calls canSendHTLC that eventually calls Bandwidth to ensure that the channel has enough idle bandwidth to carry the HTLC? In this case, one could say that if a channel has no free HTLC slots, then it effectively has zero bandwidth.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this a bit more, I think as is this addresses the case of backing off when we're the sender, but not the case where an outgoing link is fully saturated (in terms of idle HTLCs), and a new forward come across destined to that link. With the current control flow, we'll go all the way through and call AddHTLC on the link with that eventually failing and returning back a temp chan failure. However if we add this check to CheckHtlcForward, then we skip sending a message (with channel overhead, etc) to the link and can fail things back earlier at the forwarding site within the switch itself.

If we like this alternative direction, then "mock validation" via the new validateAddHtlc can be used within the link implementation (we don't need to extend the main interface) to apply this extra step of validation before the multi-hop related checks (fee and timelocks, etc).

One tradeoff with this alternative direction is that we'll allow an HTLC to proceed to the switch (it'll fail at the call to getLocalLink) vs it failing earlier at the router level (it'll get a nil policy as the bandwidth will show up as being zero). FWIW if it gets all the way here, then we'll emit a link failure event for the subscription RPC which might be useful for alerts to keep an eye on persistently full HTLC slots.

Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of moving this check into CheckHtlcForward to get the LinkFailure event, but it doesn't solve the issue of pathfinding becoming aware of the link failure and excluding the link as a possible route (which is what's making us spin out at present).

This endless storm of htlcs happens in the following case:

The queryBandwidth check felt like the best way to communicate this unavailability to RequestRoute? The alternative would be to update handleSendError's error processing to specifically understand this type of switch error and report it to mission control differently, since right now we're just logging it. This is more involved (previously discussed and decided against it iirc), but it does seem like the less hacky way to go.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also would just like to point out that Bandwidth() is only concerned with bandwidth, and doesn't take into account space on the commitment. So the commitment could be full, but not all of its bandwidth consumed, which would be misleading if we relied on canSendHtlc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This endless storm of htlcs happens in the following case:

Ah yeh this makes more sense, thanks for the breakdown.

The alternative would be to update handleSendError's error processing to specifically understand this type of switch error and report it to mission control differently, since right now we're just logging i

Don't think this is the correct route either as we'd end up "penalizing" our local channels using our existing observation system when we we're actually able to have an up-to-date view of the state since it's a local channel.

The thing that made be to a double take was the fact that we're extending a fundamental interface in the switch with a method that was effectively only put in place to solve a one-off issue. However looking at this addition from another angle, it forces callers that handle a link to be aware of that fact that we actually have two constraints w.r.t being able to forward HTLCs: the set of available bandwidth, and the current flow control constraints.

With that said, how about renaming the newly added method from MayAddOutgoingHtlc to something like func PaymentSlotAvailable(amt) bool (just a suggestion)? This reflects that the method is a best-effort attempt at claiming an available payment slots (which encapsulates things like the future dust limits, max htlc, max in flight, etc). It's best-effort as no strong guarantees are made as it's possible that a forwarded HTLC instead claims the slots, and the caller would need to be aware of that. With this interface change, we'd also start to thread through the amount as then we ensure we're able to factor in all the other flow control constraints. Actually without this, I think other infinite loops exist where adding an HTLC violates another constraint (aside from max HTLCs).

Doesn't look like we have the amount available here when we go to query for bandwidth hints, but with a small tweak we could have the bandwidthHintsMap map to a function closure (or interface) to allow it to accept an amount (and possibly later other parmaters). Looking at this section of the codebase with fresh eyes, it doesn't look like we attempt to adhere to the link level constraints enforced by the remote party w.r.t what type of HTLCs they'll accept (like the link level min HTLC param). Taking this final step to make the bandwidth hints (from the PoV of the router) aware (via direction) of these link level constraints should address this gap and any other lingering ones.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing that made be to a double take was the fact that we're extending a fundamental interface in the switch with a method that was effectively only put in place to solve a one-off issue.

yeah def, also didn't love this approach

Doesn't look like we have the amount available here when we go to query for bandwidth hints, but with a small tweak we could have the bandwidthHintsMap map to a function closure (or interface) to allow it to accept an amount (and possibly later other parmaters).

Took a look at this approach today and it looks like a good direction to go in - there's a working branch here, although I haven't fully tested it.

One thing I ran into is that there are two cases where we use our hints:

  • Local policy: here we want to include htlc amount
  • Outgoing balance: iiuc, we don't actually want to check flow constraints here? Since it's just a total balance check

Ended up using *lnwire.Millisatoshi to infer whether we want to check flow as well, could possibly be a different method in the bandwidthHints interface.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outgoing balance: iiuc, we don't actually want to check flow constraints here? Since it's just a total balance check

Yeah in that case, I'd expect that the map look up is !ok (we don't define a function for that edge) so we instead fall through to using the existing balance.

closeChannelAndAssert(ctxt, t, net, net.Alice, chanPoint, false)
}

type holdSubscription struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Roasbeef
Copy link
Member

Thought about this a bit more, and I think it's important we patch this issue in the short-term as it's affecting users in the wild. As I outlined above, we can generalize things a bit and also address other possible vectors for a tight HTLC loop like violating constraints like the max amount allowed in flight. So I'm thinking we move forward with what we have here, package it in 0.13.1, and shoot for the more fundamental fix so we can land it in 0.14.

Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🥜

@Roasbeef Roasbeef merged commit 6dd7321 into lightningnetwork:master Jun 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants