-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
htlcswitch: reliable HTLC tracking #2265
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@halseth did an initial pass, great commits!
routing/router.go
Outdated
|
||
// ErrRouterShuttingDown is returned if the router is in the process of | ||
// shutting down. | ||
ErrRouterShuttingDown = fmt.Errorf("router shutting down") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
routing/router.go
Outdated
case preImage = <-preImageChan: | ||
case sendError = <-errChan: | ||
case <-r.quit: | ||
return preImage, nil, fmt.Errorf("router shutting down") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use ErrRouterShuttingDown
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol, rebase fail
htlcswitch/switch.go
Outdated
// Before sending, double check that we don't already have 1) an | ||
// in-flight payment to this payment hash, or 2) a complete payment for | ||
// the same hash. | ||
s.pendingMutex.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to hold the lock around ClearForTakeoff
? this limits it to single writer, and we lose out on all benefits of using Batch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I did this was to handle the case where ClearForTakeoff
is called concurrently with a HTLC for the same payment has gets settled. Without the lock we risk getting ErrPaymentInFlight
returned, but before we add the pending payment to the map the HTLC is settled.
I do see the lost benefit of batch... Time to bring out the multimutex
? 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used a multimutex, was the only way to really ensure the CircuitMap
and ControlTower
would always stay in sync.
htlcswitch/control_tower.go
Outdated
@@ -79,8 +88,12 @@ func NewPaymentControl(strict bool, db *channeldb.DB) ControlTower { | |||
|
|||
// ClearForTakeoff checks that we don't already have an InFlight or Completed | |||
// payment identified by the same payment hash. | |||
func (p *paymentControl) ClearForTakeoff(htlc *lnwire.UpdateAddHTLC) error { | |||
var takeoffErr error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
htlcswitch/switch.go
Outdated
amount: htlc.Amount, | ||
deobfuscator: deobfuscator, | ||
} | ||
s.pendingPayments[paymentID] = payment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if this overwrites an existing entry? seems the prior caller would never receive anything on their channels. do we need to support multiple subscribers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thought here was only to handle the restart case, and otherwise make the caller responsible for not sending the HTLC more than once. But I definitely see the advantages of just handling duplicate entries, will see if can be easily handled!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and otherwise make the caller responsible for not sending the HTLC more than once
Isn't that hard to do if they have to call SendHTLC
to see if they've already sent/re-register for ntfns?
Can you describe the scenario we're trying to solve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went back to return results to all callers, so no overwriting would happen!
4b35216
to
8fa632c
Compare
htlcswitch/switch.go
Outdated
// the caller. | ||
s.wg.Add(1) | ||
go s.sendAsync( | ||
paymentID, firstHop, htlc, deobfuscator, preimageChan, errChan, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am i understanding this right, in that we may reuse paymentID
if one was found in ClearForTakeoff
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nvm we'd use the new one only if the payment was grounded which I think is okay
htlcswitch/control_tower.go
Outdated
takeoffErr = ErrAlreadyPaid | ||
// We've already completed a payment to this payment | ||
// hash, forbid the switch from sending another. | ||
preimage, err = channeldb.FetchPaymentPreimageTx( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the side-effects of this transaction are not idempotent, which means we may see invalid reads from any of these subcommands. specifically, pid
and preimage
should be set to zero values each time the function is invoked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't matter, since pid
is only used in case of ErrPaymentInFlight || nil
and preimage
only in case of ErrAlreadyPaid
. But probably doesn't hurt to do what you suggest.
htlcswitch/control_tower.go
Outdated
|
||
default: | ||
takeoffErr = ErrUnknownPaymentStatus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commit should be reverted, and leave the batch error handling as is. If the function returns an error, then each will be run in isolation (again). This is the intent behind capturing the errors, but allowing the transaction to return nil
https://github.com/etcd-io/bbolt/blob/master/db.go#L787
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see TIL! Will revert and maybe add a comment explaining this 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was kinda puzzled by why it was done this way tbh, probably should have asked before changing it :p
Discussed a bit offline with @cfromknecht, have some ideas on how to simplify this quite a bit, and will push an update in a few days. Stay tuned. |
31127d1
to
4aa084d
Compare
This is also true, but the new scenario joost is referring to is that we send the payment, then get used as an intermediary, restart, then report the payment succeeding even though we haven't heard back on the exact status of the payment.
Would be nice to find a way to avoid the redundant storage, while also retaining the strictness of the original proposal. |
@@ -1,4 +1,4 @@ | |||
package htlcswitch | |||
package routing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this code should really stay in the htlcswitch as it's properties are very intertwined with the operation of the circuit map. It is a helper object of the switch, which should be used by w/e packages need to interact with SendHTLC
@@ -504,12 +504,19 @@ func newServer(listenAddrs []net.Addr, chanDB *channeldb.DB, cc *chainControl, | |||
} | |||
s.currentNodeAnn = nodeAnn | |||
|
|||
// The router will get access to the payment ID sequencer, such that it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if it's not used by the switch directly, it is exposed so that users of the htlcswitch
package properly adhere to the semantics required by SendHTLC
.
It seems that what we want is the ability to distinguish between a previous attempt that completed and that we have the preimage in our cache for some other reason. This could be solved as in the original proposal with a persistent map We could reuse the To send a payment, we would get a new unique Sending the corresponding HTLC to the switch, it would commit the circuit, using the unique pid as key. Here we could add a check to the new " When the result comes back, we would tear down the circuit and store the preimage (in case of success) in the same db transaction. The outgoing payment would therefore be considered settled, since the preimage now is found in the bucket. In case the payment fails we could consider deleting the This approach is essentially what I did in an earlier branch (master...halseth:reliable-payments-reused-buckets), except for storing the preimages in their own bucket. That approach was awkward since the switch needed to be aware of the whole lmk what you think. tl;dr: We use the |
9f4c767
to
38a8610
Compare
htlcswitch/switch.go
Outdated
@@ -218,6 +219,7 @@ type Switch struct { | |||
// integer ID when it is created. | |||
pendingPayments map[uint64]*pendingPayment | |||
pendingMutex sync.RWMutex | |||
paymentIDMtx *multimutex.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the commits lookup preimage in cache
, ignore ErrDuplicateAdd
and use multimutex
should be squashed in a single commit.
The first two introduce functionality that doesn't really work reliable until the last is in place too. Checking out the intermediate commits doesn't give useful behaviour.
They are all small too and (at least to me) it would make review easier. To see a set of changes that together does something meaningful.
@@ -366,6 +371,20 @@ func (s *Switch) SendHTLC(firstHop lnwire.ShortChannelID, | |||
return zeroPreimage, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider moving the control tower up to the router
level already in this pr. To finalize the required switch changes for reliable payments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Note that there still might be some switch changes left (most notably making SendHTLC
async), but I figured this PR was big enough as is now.
htlcswitch/switch.go
Outdated
// To ensure atomicity between checking the PreimageCache and | ||
// forwarding the HTLC (committing the circuit), we aquire a multimutex | ||
// for this paymentID. | ||
s.paymentIDMtx.Lock(paymentID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the multi mutex be avoided by looking up the preimage after commiting the circuit? If it is present then, the circuit could be teared down again. Or left in half open state, where it will be cleaned up later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good idea, don!
Pasting my own notes used during review here for reference:
|
This commit moves the responsibility of generating a unique payment ID from the switch to the router. This will make it easier for the router to keep track of which HTLCs were successfully forwarded onto the network, as it can replay the same HTLCs as long as the paymentIDs are kept. The router is expected to maintain a map from paymentID->HTLC, such that they can be replayed on restart. This also lets the router check the status of a sent payment after a restart, simply by resending it.
This commit introduces a new method waitForPaymentResult, which handles the result of a payment sent on the network. This method encompasses a part of the logic that previously was located in handleLocalDispatch, such as error parsing. While doing this, we make the handleLocalDisaptch method write the result of the payment to the pending payment map, even if no pending payment is lingering, to handle the case where we after a restart would retrieve a payment result, before the pending payment was added to the map.
This commit extracts the logic from the forward-method into SendHTLC itself. This is done such that we later can inspect whether we already have a preimage before routing the packet onto the network.
As a step towards idempotent sends, we ignore any duplicate add message when forwarding the HTLC. After a restart the caller can resend the HTLC, and be sure that 1) if the payment is already in flight, we will go straight into waiting for a result comes back. 2) if this is the first time this HTLC is sent it will be routed onto the network as before. NOTE: The ControlTower currently doesn't allow us to re-send the HTLC, but this will later be allowed.
Give the switch access to the preimage cache, which we'll use to ensure we don't resend HTLCs that have already been settled.
As a step towards reliable tracking of preimages, we lookup the payment hash in the PreimageCache before attempting to route the HTLC onto the network. This lays the foundation for re-sends of the same HTLC, as the caller doesn't have to know about the result of the previous send. If the previous send is still in flight, we will get a DuplicateAdd error and go straight into waiting for the result. If the payment is not in flight and the preimage is not known the HTLC will be routed onto the network. If the payment did already succeed, the preimage will be in the cache, and we'll return early, avoiding routing the payment again. This is important to not forward a payment that already succeeded, since it might lead to loss of funds. NOTE: This assumes that the preimage is ALWAYS added to the PreimageCache before the circuit is torn down. NOTE: The ControlTower currently doesn't allow reapeated sends to the same hash, but this will later be allowed.
Since the only thing needed by the control tower to determine whether an HTLC is in flight is the payment hash, we take that instead of the whole HTLC.
This commit moves the check for in-flight payments to a given payment hash from the Switch to the Router. This is done as a preparation to let the router replay payments in flight after a restart, but currently the behavior is unchanged (since we don't persist active payments across restarts yet).
TestSwitchSendPaymentKnownPreimage checks that the switch immediately returns when we attempt to send a payment where the preimage is already found in the preimage cache. TestSwtichSendDuplicatePayment checks that if a payment is resent after a restart, it will still progress as if it was the first time it was sent. TestSwitchSendDuplicateSettledPayment makes sure the switch handle the case where it attempts to resend an HTLC after a restart, but concurrently a settle for this HTLC comes back. TestSwichSendDuplicateFailedPayment ensures that if we resend a payment that failed, the failure will be pending in case of a re-send.
…UniquePaymentID Documents the now new behaviour and expectations and SendHTLC and adds a test that exercises this behaviour. TestSwitchSendHTLCUniquePaymentID tests that the switch happily forwards the same HTLC as long as the paymentID used is unique.
38a8610
to
aa846b2
Compare
After discussion with @joostjager we came to a design that should be both simpler and easier to review. This is done by splitting up the |
htlcswitch/switch.go
Outdated
|
||
// Set the result packet and signal that to the callers it is ready. | ||
payment.result = pkt | ||
close(payment.ready) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this happen before teardown? To prevent this:
Teardown circuit
CommitCircuit (succeeds)
get or create pending payment in memory (create)
get or create pending payment in memory + signal result
async handoff to link if commit succeeded (pay twice)
wait for pending payment result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6ffe4eb
to
e07dc19
Compare
We wait for the circuit to be torn down before deleting the payment from the pending payment map. We do this to avoid a concurrent re-send of this pid being dropped after deleting the payment, causing it to never receive the result.
Blocked by #2501 |
Chatted with Johan offline, and I think we went down this rabbit hole due to one of my earlier comments that was a bit vague. We discussed a simpler version that hopefully should be easier to implement and also reason about during review as it involves much less shuffling around of the existing control flow and doesn't introduce any new dependancies. |
Closing in favor of #2762. |
This PR is a primary step towards fixing #2183
We handle a few problems regarding atomicity of payments within the switch. Since the writing of a payment hash to the
ControlTower
and switchCircuitMap
is not atomic, we risked these getting out of sync, and potentially leaving preimages in limbo. This PR attempts to fix a few of these issues by giving thehtlcswitch
the responsibility for storing the preimage a payment succeeds, and immediately returning them if we notice a replay of a payment. We do this by using theOutgoingPayment
bucket within the DB also for in-fligh payments (not only for completed payments as before).We ensure all outgoing payments are given a unique ID, since this is what is used to determine if a payment is already committed to the switch circuit.
To ensure the payment DB and
CircuitMap
stay in sync, we acquire a mutex tied to the particularpaymentID
before using them.This is also a step towards idempotent payments, since we'll now return a
pendingPayment
as normal even if the HTLC is already sent.TODO