Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[reliable payments] persist htlcswitch pending payments #2762

Merged

Conversation

halseth
Copy link
Contributor

@halseth halseth commented Mar 12, 2019

This PR is a follow-up from #2761. It adds a persistent pendingPaymentStore to the Switch, needed to store payment results when the router is not available to handle them.

Problem

After a restart, there is no connection between an HTLC in flight on the network and the router payment flow that initially sent the HTLC. The router will eventually call GetPaymentResult to retrieve the result of the HTLC, but we have no guarantees this will be done before the result is received. This can lead to the result getting dropped, which is not ideal.

Solution

We introduce a pendingPayementStore which has two primary functions:

  1. It stores PaymentResults when they are received, regardless of whether there are any subscribers (the ChannelRouter) to the corresponding HTLC. This ensures that the router gets handed the result when it calls GetPaymentResult.
  2. It stores the information necessary to identify and handle a result that comes back after a restart. This is the chanID and paymentID, used to identify the payment in the circuit map.
  3. On restarts it syncs the pending payments with the CircuitMap. We use the circuit map as the source of truth whether a payment is in flight on the network. Syncing it at startup, before we start handling HTLCs, is necessary to be able to decide whether an HTLC was successfully forwarded.

Builds on #2761
Replaces #2265

@halseth halseth added this to the 0.6 milestone Mar 12, 2019
@halseth halseth force-pushed the reliable-payments-lookup-circuitmap branch 6 times, most recently from c22bf88 to 298a129 Compare March 12, 2019 14:39
channeldb/payments.go Outdated Show resolved Hide resolved
Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this new approach much easier to follow than the prior iterations! I'm not sure we can get rid of the control tower as is though, due to the possibility of a user initiating a payment, then updating to this new version. Worst case they would re-try with the same payment hash and possibly lose that payment amount.

First pass review completed, will need some additional unit tests for the new biz logic in the pending payment store, and also the new functionality w.r.t passing back the same payment ID, etc.

htlcswitch/switch.go Outdated Show resolved Hide resolved
htlcswitch/switch.go Outdated Show resolved Hide resolved
htlcswitch/pending_payment.go Outdated Show resolved Hide resolved
htlcswitch/pending_payment.go Outdated Show resolved Hide resolved
htlcswitch/pending_payment.go Outdated Show resolved Hide resolved
htlcswitch/pending_payment.go Outdated Show resolved Hide resolved
htlcswitch/pending_payment.go Outdated Show resolved Hide resolved
htlcswitch/switch.go Outdated Show resolved Hide resolved
htlcswitch/pending_payment.go Outdated Show resolved Hide resolved
// network. By deleting those pending payments that are not found in
// the circuit map, we ensure that the router won't wait for a result
// to be returned for this payment ID.
// NOTE: this assumes that the circuit map has already been trimmed,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this last assumption is sound with the current implementation. IIRC, the circuit map as is, can contain circuits for payments they have either already been completed or failed on the network.

cc @cfromknecht

Copy link
Contributor

@joostjager joostjager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass review

// fetchPendingPayments retrieves all payments from the store that are pending,
// meaning no results are available yet, and can be re-forwarded on the
// network.
func (store *pendingPaymentStore) fetchPendingPayments() ([]*PendingPayment,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is test only code. Would look for a different way to assert behaviour (intercepting mock?)


// getPayment is used to query the store for a pending payment for the given
// payment ID. If the pid is not found, ErrPaymentIDNotFound will be returned.
func (store *pendingPaymentStore) getPayment(pid uint64) (*PendingPayment,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function used?

// syncWithCircuitMap should be called at startup to delete the pending
// payments which are no longer found in the circuit map. This removes pending
// payments that was not successfully forwarded to the link before shutdown.
func (store *pendingPaymentStore) syncWithCircuitMap(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need to do this anyway, what is the use of persisting pending payments? Can't that list be reconstructed on startup from the circuit map?

So only persist the outcomes when they become available.


// completePayment stores the PaymentResult for the given paymentID, and
// notifies any subscribers.
func (store *pendingPaymentStore) completePayment(paymentID uint64,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are those results left in forever?

@halseth halseth force-pushed the reliable-payments-lookup-circuitmap branch from 298a129 to cb790c4 Compare March 20, 2019 13:26
@Roasbeef Roasbeef removed this from the 0.6 milestone Mar 26, 2019
@halseth halseth force-pushed the reliable-payments-lookup-circuitmap branch 3 times, most recently from aea2217 to 858ffd8 Compare April 9, 2019 09:30
@cfromknecht cfromknecht added this to the 0.7 milestone Apr 11, 2019
// If the provided deobfuscator is nil, we have discarded the error
// decryptor due to a restart. We'll return a fixed error and signal a
// temporary channel failure to the router.
case deobfuscator == nil:
userErr := fmt.Sprintf("error decryptor for payment " +
"could not be located, likely due to restart")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this still possible after this PR?

@halseth halseth force-pushed the reliable-payments-lookup-circuitmap branch 2 times, most recently from 39263d4 to e24ded1 Compare May 6, 2019 10:19
@@ -0,0 +1,268 @@
package htlcswitch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename file to payment_result_store.go, as we do with other stores too?


// Otherwise, write 'true' and the error fields.
msg := []byte(p.Error.ExtraMsg)
err := channeldb.WriteElements(w, true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saving the human readable error ExtraMsg to the db is not ideal. There are only a few cases for ExtraMsg, those can probably be saved in a structured way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon further inspection, there may be a few more. But still doubt whether we should persist data just to get a nice error message. Maybe just logging when the error occurs is enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, looks like they are only used locally for passing the case of error to the RPC. Would it be useful to define a field for local error codes that can be used to communicate local failures?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that local internal errors (things that should never happen) don't need to be exposed over rpc. Logging and returning a generic internal error should be enough?

// Otherwise, write 'true' and the error fields.
msg := []byte(p.Error.ExtraMsg)
err := channeldb.WriteElements(w, true,
p.Error.ErrorSource, msg,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ErrorSource could be nil if the error source is unknown (for example after restart or when the onion decrypt fails). The store should probably prepare for that, to prevent a db migration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using nil can leads to problems down the line, what do you think about defining a constant UnknownPaymentSource?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of problems? nil for unset fields is quite common? For serialize to the db we can of course write anything we want.

@halseth halseth force-pushed the reliable-payments-lookup-circuitmap branch 2 times, most recently from 7c2f8d7 to 8294a42 Compare May 6, 2019 12:31
htlcswitch/payment_result.go Show resolved Hide resolved
htlcswitch/switch.go Outdated Show resolved Hide resolved
htlcswitch/switch.go Outdated Show resolved Hide resolved
@halseth halseth force-pushed the reliable-payments-lookup-circuitmap branch 4 times, most recently from 0975df7 to 28401bb Compare May 9, 2019 11:49
@halseth
Copy link
Contributor Author

halseth commented May 28, 2019

Rebased.

@Roasbeef
Copy link
Member

First commit can be removed during the next rebase.

htlcswitch/payment_result.go Show resolved Hide resolved
htlcswitch/payment_result.go Show resolved Hide resolved
htlcswitch/payment_result.go Show resolved Hide resolved
var paymentIDBytes [8]byte
binary.BigEndian.PutUint64(paymentIDBytes[:], paymentID)

err := store.db.Batch(func(tx *bbolt.Tx) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to consider making this an update instead, if only for the sake of the integration tests. In the past, then extra 30ms or so jitter here really accumulated making many tests time out when we tried Batching everywhere before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm def in favor of using Update as long as we don't have any clear indication that it is a performance bottleneck.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we do decide to use Update, then we probably don't need a multimutex since the db txns can't execute concurrently. storeResult is called in separate goroutines, so under heavy load i suspect Batch would help, but would need to benchmark to be certain

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still using a Update here, but haven't had a chance to attempt to ascertain the performance impact. I think we can just leave it as it is, then during the rc process possibly adjust it if we find that it makes a significant impact.

htlcswitch/switch.go Show resolved Hide resolved
@@ -989,6 +976,7 @@ func (s *Switch) parseFailedPayment(deobfuscator ErrorDecrypter,
// the first hop. In this case, we'll report a permanent
// channel failure as this means us, or the remote party had to
// go on chain.
// TODO: check reason length instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is meant by this/when do we plan to do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

t.Fatalf("unable to get payment result: %v", err)
}

select {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not blocking, but this segment is an exact copy of the logic above, maybe add a checkResult closure?

@cfromknecht
Copy link
Contributor

Really happy with how this refactor turned out, this method of storing the payment results i think is much cleaner than some of the avenues we explored earlier! Will need rebase now that #3087 has landed

lnd_test.go Outdated
}

// Wait for all the invoices to reach the OPEN state.
for _, stream := range invoiceStreams {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at some point these helper routines for hodl invoices could be extracted, but not blocker on this PR

msg: pkt.htlc,
unencrypted: unencrypted,
isResolution: pkt.isResolution,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add timestamp to record when failure was received. This is one building block of black hole defense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preferably time.UnixNano for sub-second resolution

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, we don't store time stamps when the payment is sent, so what good will it to do store them here? Even then, I think we can store all the timestamps at the router level rather than the switch since although they're able to operate independently with the new design, they operate in near unison typically.

In any case, rather than piggy backing on the PR, I think this can be done as a distinct change once we start to center in on a more stable design.

@Roasbeef
Copy link
Member

Roasbeef commented Jun 6, 2019

Can now be updated to use the latest RPC calls in the new sub-server for the new itests.

@halseth halseth force-pushed the reliable-payments-lookup-circuitmap branch from bc77400 to 88baef2 Compare June 6, 2019 15:01
Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ☄️

Two non-blocking follow up that I'd like to see extended as a PR soon after this lands:

  1. Additional itest for the payment failure case
  2. Clean up of the payment result once that router updates its state, and ACKs the message from the switch.

var paymentIDBytes [8]byte
binary.BigEndian.PutUint64(paymentIDBytes[:], paymentID)

err := store.db.Batch(func(tx *bbolt.Tx) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still using a Update here, but haven't had a chance to attempt to ascertain the performance impact. I think we can just leave it as it is, then during the rc process possibly adjust it if we find that it makes a significant impact.

msg: pkt.htlc,
unencrypted: unencrypted,
isResolution: pkt.isResolution,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, we don't store time stamps when the payment is sent, so what good will it to do store them here? Even then, I think we can store all the timestamps at the router level rather than the switch since although they're able to operate independently with the new design, they operate in near unison typically.

In any case, rather than piggy backing on the PR, I think this can be done as a distinct change once we start to center in on a more stable design.

Copy link
Contributor

@cfromknecht cfromknecht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🔥

@halseth halseth force-pushed the reliable-payments-lookup-circuitmap branch from b4ea245 to 78dfb83 Compare June 7, 2019 14:42
paymentResultStore is a persistent store where we keep track of all
received payment results. This is used to ensure we don't lose results
from payment attempts on restarts.
Used for logging in the switch, and when we remove the pending payments,
only the router will have the hash stored across restarts.
testHoldInvoicePersistence tests that a sender to a hold-invoice, can be
restarted before the payment gets settled, and still be able to receive
the preimage.
TestNetworkResultStore tests that the networkResult store behaves as
expected, and that we can store, get and subscribe to results.
TestSwitchGetPaymentResult tests that the switch interacts as expected
with the circuit map and network result store when looking up the result
of a payment ID. This is important for not to lose results under
concurrent lookup and receiving results.
@halseth halseth force-pushed the reliable-payments-lookup-circuitmap branch from 78dfb83 to dd88015 Compare June 7, 2019 14:53
@halseth
Copy link
Contributor Author

halseth commented Jun 7, 2019

Alright, itest using the new payment API is pushed. PTAL @joostjager

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants