-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[reliable payments] persist htlcswitch pending payments #2762
[reliable payments] persist htlcswitch pending payments #2762
Conversation
c22bf88
to
298a129
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this new approach much easier to follow than the prior iterations! I'm not sure we can get rid of the control tower as is though, due to the possibility of a user initiating a payment, then updating to this new version. Worst case they would re-try with the same payment hash and possibly lose that payment amount.
First pass review completed, will need some additional unit tests for the new biz logic in the pending payment store, and also the new functionality w.r.t passing back the same payment ID, etc.
htlcswitch/switch.go
Outdated
// network. By deleting those pending payments that are not found in | ||
// the circuit map, we ensure that the router won't wait for a result | ||
// to be returned for this payment ID. | ||
// NOTE: this assumes that the circuit map has already been trimmed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this last assumption is sound with the current implementation. IIRC, the circuit map as is, can contain circuits for payments they have either already been completed or failed on the network.
cc @cfromknecht
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass review
htlcswitch/pending_payment.go
Outdated
// fetchPendingPayments retrieves all payments from the store that are pending, | ||
// meaning no results are available yet, and can be re-forwarded on the | ||
// network. | ||
func (store *pendingPaymentStore) fetchPendingPayments() ([]*PendingPayment, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is test only code. Would look for a different way to assert behaviour (intercepting mock?)
htlcswitch/pending_payment.go
Outdated
|
||
// getPayment is used to query the store for a pending payment for the given | ||
// payment ID. If the pid is not found, ErrPaymentIDNotFound will be returned. | ||
func (store *pendingPaymentStore) getPayment(pid uint64) (*PendingPayment, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function used?
htlcswitch/pending_payment.go
Outdated
// syncWithCircuitMap should be called at startup to delete the pending | ||
// payments which are no longer found in the circuit map. This removes pending | ||
// payments that was not successfully forwarded to the link before shutdown. | ||
func (store *pendingPaymentStore) syncWithCircuitMap( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need to do this anyway, what is the use of persisting pending payments? Can't that list be reconstructed on startup from the circuit map?
So only persist the outcomes when they become available.
htlcswitch/pending_payment.go
Outdated
|
||
// completePayment stores the PaymentResult for the given paymentID, and | ||
// notifies any subscribers. | ||
func (store *pendingPaymentStore) completePayment(paymentID uint64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are those results left in forever?
298a129
to
cb790c4
Compare
aea2217
to
858ffd8
Compare
htlcswitch/switch.go
Outdated
// If the provided deobfuscator is nil, we have discarded the error | ||
// decryptor due to a restart. We'll return a fixed error and signal a | ||
// temporary channel failure to the router. | ||
case deobfuscator == nil: | ||
userErr := fmt.Sprintf("error decryptor for payment " + | ||
"could not be located, likely due to restart") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this still possible after this PR?
39263d4
to
e24ded1
Compare
@@ -0,0 +1,268 @@ | |||
package htlcswitch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename file to payment_result_store.go
, as we do with other stores too?
htlcswitch/payment_result.go
Outdated
|
||
// Otherwise, write 'true' and the error fields. | ||
msg := []byte(p.Error.ExtraMsg) | ||
err := channeldb.WriteElements(w, true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Saving the human readable error ExtraMsg
to the db is not ideal. There are only a few cases for ExtraMsg
, those can probably be saved in a structured way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon further inspection, there may be a few more. But still doubt whether we should persist data just to get a nice error message. Maybe just logging when the error occurs is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, looks like they are only used locally for passing the case of error to the RPC. Would it be useful to define a field for local error codes
that can be used to communicate local failures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that local internal errors (things that should never happen) don't need to be exposed over rpc. Logging and returning a generic internal error should be enough?
htlcswitch/payment_result.go
Outdated
// Otherwise, write 'true' and the error fields. | ||
msg := []byte(p.Error.ExtraMsg) | ||
err := channeldb.WriteElements(w, true, | ||
p.Error.ErrorSource, msg, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ErrorSource
could be nil
if the error source is unknown (for example after restart or when the onion decrypt fails). The store should probably prepare for that, to prevent a db migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using nil
can leads to problems down the line, what do you think about defining a constant UnknownPaymentSource
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of problems? nil
for unset fields is quite common? For serialize to the db we can of course write anything we want.
7c2f8d7
to
8294a42
Compare
0975df7
to
28401bb
Compare
7195371
to
e9c25ae
Compare
Rebased. |
First commit can be removed during the next rebase. |
var paymentIDBytes [8]byte | ||
binary.BigEndian.PutUint64(paymentIDBytes[:], paymentID) | ||
|
||
err := store.db.Batch(func(tx *bbolt.Tx) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May want to consider making this an update instead, if only for the sake of the integration tests. In the past, then extra 30ms or so jitter here really accumulated making many tests time out when we tried Batch
ing everywhere before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm def in favor of using Update
as long as we don't have any clear indication that it is a performance bottleneck.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we do decide to use Update
, then we probably don't need a multimutex since the db txns can't execute concurrently. storeResult
is called in separate goroutines, so under heavy load i suspect Batch
would help, but would need to benchmark to be certain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still using a Update
here, but haven't had a chance to attempt to ascertain the performance impact. I think we can just leave it as it is, then during the rc process possibly adjust it if we find that it makes a significant impact.
e9c25ae
to
bc77400
Compare
htlcswitch/switch.go
Outdated
@@ -989,6 +976,7 @@ func (s *Switch) parseFailedPayment(deobfuscator ErrorDecrypter, | |||
// the first hop. In this case, we'll report a permanent | |||
// channel failure as this means us, or the remote party had to | |||
// go on chain. | |||
// TODO: check reason length instead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is meant by this/when do we plan to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
t.Fatalf("unable to get payment result: %v", err) | ||
} | ||
|
||
select { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not blocking, but this segment is an exact copy of the logic above, maybe add a checkResult
closure?
Really happy with how this refactor turned out, this method of storing the payment results i think is much cleaner than some of the avenues we explored earlier! Will need rebase now that #3087 has landed |
lnd_test.go
Outdated
} | ||
|
||
// Wait for all the invoices to reach the OPEN state. | ||
for _, stream := range invoiceStreams { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at some point these helper routines for hodl invoices could be extracted, but not blocker on this PR
msg: pkt.htlc, | ||
unencrypted: unencrypted, | ||
isResolution: pkt.isResolution, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add timestamp to record when failure was received. This is one building block of black hole defense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preferably time.UnixNano
for sub-second resolution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, we don't store time stamps when the payment is sent, so what good will it to do store them here? Even then, I think we can store all the timestamps at the router level rather than the switch since although they're able to operate independently with the new design, they operate in near unison typically.
In any case, rather than piggy backing on the PR, I think this can be done as a distinct change once we start to center in on a more stable design.
Can now be updated to use the latest RPC calls in the new sub-server for the new itests. |
bc77400
to
88baef2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM ☄️
Two non-blocking follow up that I'd like to see extended as a PR soon after this lands:
- Additional itest for the payment failure case
- Clean up of the payment result once that router updates its state, and ACKs the message from the switch.
var paymentIDBytes [8]byte | ||
binary.BigEndian.PutUint64(paymentIDBytes[:], paymentID) | ||
|
||
err := store.db.Batch(func(tx *bbolt.Tx) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still using a Update
here, but haven't had a chance to attempt to ascertain the performance impact. I think we can just leave it as it is, then during the rc process possibly adjust it if we find that it makes a significant impact.
msg: pkt.htlc, | ||
unencrypted: unencrypted, | ||
isResolution: pkt.isResolution, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, we don't store time stamps when the payment is sent, so what good will it to do store them here? Even then, I think we can store all the timestamps at the router level rather than the switch since although they're able to operate independently with the new design, they operate in near unison typically.
In any case, rather than piggy backing on the PR, I think this can be done as a distinct change once we start to center in on a more stable design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 🔥
b4ea245
to
78dfb83
Compare
paymentResultStore is a persistent store where we keep track of all received payment results. This is used to ensure we don't lose results from payment attempts on restarts.
Used for logging in the switch, and when we remove the pending payments, only the router will have the hash stored across restarts.
testHoldInvoicePersistence tests that a sender to a hold-invoice, can be restarted before the payment gets settled, and still be able to receive the preimage.
TestNetworkResultStore tests that the networkResult store behaves as expected, and that we can store, get and subscribe to results.
TestSwitchGetPaymentResult tests that the switch interacts as expected with the circuit map and network result store when looking up the result of a payment ID. This is important for not to lose results under concurrent lookup and receiving results.
78dfb83
to
dd88015
Compare
Alright, itest using the new payment API is pushed. PTAL @joostjager |
This PR is a follow-up from #2761. It adds a persistent
pendingPaymentStore
to theSwitch
, needed to store payment results when therouter
is not available to handle them.Problem
After a restart, there is no connection between an HTLC in flight on the network and the
router
payment flow that initially sent the HTLC. Therouter
will eventually callGetPaymentResult
to retrieve the result of the HTLC, but we have no guarantees this will be done before the result is received. This can lead to the result getting dropped, which is not ideal.Solution
We introduce a
pendingPayementStore
which has two primary functions:PaymentResult
s when they are received, regardless of whether there are any subscribers (theChannelRouter
) to the corresponding HTLC. This ensures that therouter
gets handed the result when it callsGetPaymentResult
.chanID
andpaymentID
, used to identify the payment in the circuit map.CircuitMap
. We use the circuit map as the source of truth whether a payment is in flight on the network. Syncing it at startup, before we start handling HTLCs, is necessary to be able to decide whether an HTLC was successfully forwarded.Builds on #2761
Replaces #2265