Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ChannelManager Payment Retries #1916

Conversation

valentinewallace
Copy link
Contributor

@valentinewallace valentinewallace commented Dec 13, 2022

Adds automatic payment retry logic to ChannelManager.

It may look like a big diff, but it's really just +50 in ChannelManager and +200 in outbound_payment (mostly copied from InvoicePayer), the rest is largely tests.

Planned follow-up PRs (roughly in order):

  1. Automatic retry methods for spontaneous payments
  2. Implement the new version of payment path scoring if we decide that users shouldn't handle that themselves from PaymentPathFailed events (and migrate relevant tests from InvoicePayer)
  3. Replace the InvoicePayer with individual invoice-paying utilities + migrate remaining InvoicePayer tests
  4. Adjust ChannelManager's various send_payment methods (i.e. we probably to rename send_payment_with_retry -> send_payment and send_payment -> send_payment_with_route, etc)
  5. Ensure we don't generate duplicate PendingHTLCsForwardable events, see ChannelManager Payment Retries #1916 (comment)

And add a new APIError::RouteNotFound at some point

TODOs for this PR:

  • migrate the remaining 10-15 tests that are relevant to this PR

Seeking feedback on the plan for follow-up PRs above^

Based on #1923 + #1812

@codecov-commenter
Copy link

codecov-commenter commented Dec 13, 2022

Codecov Report

Base: 90.71% // Head: 91.02% // Increases project coverage by +0.31% 🎉

Coverage data is based on head (f5af99a) compared to base (153b048).
Patch coverage: 93.51% of modified lines in pull request are covered.

❗ Current head f5af99a differs from pull request most recent head 6d81979. Consider uploading reports for the commit 6d81979 to get more accurate results

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1916      +/-   ##
==========================================
+ Coverage   90.71%   91.02%   +0.31%     
==========================================
  Files          97       98       +1     
  Lines       50677    53579    +2902     
  Branches    50677    53579    +2902     
==========================================
+ Hits        45971    48772    +2801     
- Misses       4706     4807     +101     
Impacted Files Coverage Δ
lightning/src/ln/outbound_payment.rs 89.81% <90.29%> (+0.54%) ⬆️
lightning/src/ln/payment_tests.rs 97.24% <94.43%> (-1.50%) ⬇️
lightning/src/ln/channelmanager.rs 87.24% <100.00%> (+0.03%) ⬆️
lightning/src/util/test_utils.rs 73.93% <100.00%> (+1.88%) ⬆️
lightning/src/util/ser_macros.rs 85.58% <0.00%> (-1.45%) ⬇️
lightning/src/chain/onchaintx.rs 95.18% <0.00%> (-0.21%) ⬇️
lightning/src/ln/functional_tests.rs 96.94% <0.00%> (-0.18%) ⬇️
lightning/src/offers/merkle.rs 100.00% <0.00%> (ø)
lightning/src/offers/invoice.rs 94.79% <0.00%> (ø)
... and 11 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@valentinewallace valentinewallace force-pushed the 2022-11-chanman-payment-retries branch 6 times, most recently from 0bfa5ef to c7c1a5f Compare December 14, 2022 21:11
@jkczyz jkczyz self-requested a review December 15, 2022 18:07
@valentinewallace valentinewallace force-pushed the 2022-11-chanman-payment-retries branch from c7c1a5f to 979a30f Compare December 19, 2022 15:19
@valentinewallace valentinewallace force-pushed the 2022-11-chanman-payment-retries branch 2 times, most recently from 3b2aee7 to 2b3b00f Compare December 21, 2022 17:20
@valentinewallace valentinewallace force-pushed the 2022-11-chanman-payment-retries branch from 2b3b00f to da266f2 Compare December 21, 2022 18:15
@valentinewallace
Copy link
Contributor Author

I think I finished adding the difficult tests. Still have 9 easier tests to go, but gonna mark this as ready for review anyway.

Also updated the PR description to set expectations for review

@valentinewallace valentinewallace marked this pull request as ready for review December 21, 2022 20:57
@valentinewallace valentinewallace force-pushed the 2022-11-chanman-payment-retries branch from b70ba29 to 82364b2 Compare December 21, 2022 21:03
Copy link
Contributor

@jkczyz jkczyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

@@ -6887,6 +6959,8 @@ impl<'a, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref>
hash_map::Entry::Vacant(entry) => {
let path_fee = path.get_path_fees();
entry.insert(PendingOutboundPayment::Retryable {
retry_strategy: Retry::Attempts(0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have some reasonable default? Retry::Attempts(0) would mean no retries. In what situations would we hit this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We hit this while deserializing ChannelManager if a monitor has some htlcs that the ChannelManager forgot about due to being behind the monitor on shutdown, IIRC. Idea is that we never want to retry payments after restart, because the payments are probably stale anyway

Copy link
Contributor

@jkczyz jkczyz Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we chose not to retry after restart in InvoicePayer because we didn't want to have something else for the user to serialize. But now that the retry logic is in ChannelManager, that rational is no longer applicable. We can still use the no-retry behavior for older serializations, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's a consideration for follow-up, but it does seem like payments would be stale after restart? And even if they weren't, it seems like the sanest default would be to not retry 🤔 I think the invoice payer differed a bit since it has a static retry strategy config, which we don't have here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's a consideration for follow-up, but it does seem like payments would be stale after restart? And even if they weren't, it seems like the sanest default would be to not retry 🤔

Restart might just be failing over to a new server, FWIW. Or maybe for mobile switching back and forth between apps?

I think we should at least decided what we'd like to happen to avoid unnecessary churn in the serialization format.

I think the invoice payer differed a bit since it has a static retry strategy config, which we don't have here

It was mostly because we didn't want to serialize it. If we serialized, the static strategy would have been serialized along with it, I'd imagine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline, not gonna retry on startup due to some race conditions with abandon_payment (I'm forgetting the details..)

@valentinewallace valentinewallace force-pushed the 2022-11-chanman-payment-retries branch 2 times, most recently from 293862c to 271c257 Compare December 22, 2022 17:13
// TODO: do we need to generate a PendingHTLCsForwardable event?
if let Err(e) = self.retry_payment_auto_route(payment_id, params) {
log_trace!(self.logger, "Errored retrying payment, marking as abandoned: {:?}", e);
self.abandon_payment(payment_id);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TheBlueMatt Are we moving to requiring users to call abandon_payment? You referenced a race condition with this method offline, could you elaborate?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context was #1907

@valentinewallace valentinewallace force-pushed the 2022-11-chanman-payment-retries branch 2 times, most recently from cbea0a0 to aba3c4d Compare January 4, 2023 23:32
@valentinewallace
Copy link
Contributor Author

Finished migrating relevant tests from InvoicePayer and rebased on #1812.

@valentinewallace valentinewallace force-pushed the 2022-11-chanman-payment-retries branch from aba3c4d to 601dfad Compare January 5, 2023 22:16
@valentinewallace
Copy link
Contributor Author

Rebased after #1812 merge.

@valentinewallace valentinewallace force-pushed the 2022-11-chanman-payment-retries branch from d1445c4 to 7d7afec Compare January 24, 2023 04:01
lightning/src/ln/outbound_payment.rs Outdated Show resolved Hide resolved
lightning/src/ln/outbound_payment.rs Outdated Show resolved Hide resolved
@@ -752,8 +1021,15 @@ impl OutboundPayments {
// process_onion_failure we should close that channel as it implies our
// next-hop is needlessly blaming us!
if let Some(scid) = short_channel_id {
outbounds.get_mut(&payment_id).map(|pmt| pmt.insert_previously_failed_scid(scid));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bleh, we take the pending_outbound_payments lock at the top, then hold it while calling decode_onion_failure, which does iterative crypto and decoding. Its not new in this PR, but now you're adding an explicit dependency on the stuff calculated under the lock here. Worse, you're doing a duplicate outbounds lookup under the same lock that already did an outbounds lookup. Instead, can you (a) move the onion_error.decode_onion_failure and payment_is_probe calls outside the top outbounds lock, then you can do this update in the existing outbounds.entry block, and even better you can drop the outbounds lock before we create the events or take the pending_events lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It didn't seem to make sense to move the payment_is_probe call, but otherwise did this.

I had considered this approach but thought it might be worth the extra outbounds lock to do some cheap checks before decoding the onion failure, but those cheap checks look pretty rare anyway

lightning/src/ln/outbound_payment.rs Outdated Show resolved Hide resolved
lightning/src/ln/outbound_payment.rs Outdated Show resolved Hide resolved
} else {
self.retry_payment_with_route(&route, payment_id, entropy_source, node_signer, best_block_height, send_payment_along_path)
};
match res {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its kinda confusing to match res twice, can we not do it in one match and mem::drop(outbounds) before calling pay? eg in the current code it looks like we spuriously increment_attempts in the failed_paths_retry.is_none() case, which the comment talks about being due to monitor updates being done async.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good catch, fixed this

Also configure it such that in std tests, it will use SinceEpoch instead of
Instant so time can be manually advanced.
lightning/src/ln/outbound_payment.rs Outdated Show resolved Hide resolved
Comment on lines 445 to 478
loop {
let mut outbounds = self.pending_outbound_payments.lock().unwrap();
let mut retry_id_route_params = None;
for (pmt_id, pmt) in outbounds.iter_mut() {
if pmt.is_retryable() {
if let PendingOutboundPayment::Retryable { pending_amt_msat, total_msat, route_params: Some(params), .. } = pmt {
if pending_amt_msat < total_msat {
retry_id_route_params = Some((*pmt_id, params.clone()));
}
}
}
if retry_id_route_params.is_some() { pmt.increment_attempts(); break }
}
if let Some((payment_id, route_params)) = retry_id_route_params {
core::mem::drop(outbounds);
if let Err(e) = self.pay_internal(payment_id, route_params, router, first_hops(), inflight_htlcs(), entropy_source, node_signer, best_block_height, &send_payment_along_path) {
log_trace!(logger, "Errored retrying payment: {:?}", e);
}
} else { break }
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The outer loop and lock dropping makes this difficult to read. Isn't really clear to me why it is needed. Can't we simply loop through pending_outbound_payments once and call pay_internal as needed by passing a PendingOutboundPayment to it on each iteration?

Guessing it has something to do with locking in ChannelManager::send_payment_along_path, but I'm worried that with all the callbacks and locking this is becoming difficult to understand and maintain.

Copy link
Contributor Author

@valentinewallace valentinewallace Jan 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't hold the outbounds lock going into pay_internal because then we'd be holding it during pathfinding. I agree it's weird, we'll also have n^2 runtime here relative to the size of outbounds.

Previously I gathered all the retryable outbounds into a vec under one lock --> dropped the lock --> retried them one by one, which was a bit cleaner IMO at the expense of an allocation. Change was requested here: #1916 (comment). Open to other ways to clean this up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see why we wouldn't want to hold the lock during path finding, yeah. It doesn't seem the avoid the clones though? But no Vec allocation, I suppose.

@valentinewallace valentinewallace force-pushed the 2022-11-chanman-payment-retries branch 2 times, most recently from 169bb10 to f5af99a Compare January 24, 2023 20:40
@TheBlueMatt
Copy link
Collaborator

Feel free to squash, IMO.

Used in upcoming commit(s) to automatically retry HTLCs in ChannelManager
It's not ideal to do all this computation while the lock is held. We also want
to decode the failure *before* taking the lock, so we can store the failed scid
in the relevant outbound for retry in the next commit(s).
…stead

This makes it easier to retry payments if all paths fail on initial send, in
in which case we'll want to hold off on removing the pending payment
As part of migrating payment retries from InvoicePayer to ChannelManager,
several tests don't need a rewrite and can be pretty much copied and pasted.
@valentinewallace valentinewallace force-pushed the 2022-11-chanman-payment-retries branch from f5af99a to 6d81979 Compare January 25, 2023 19:44
@valentinewallace
Copy link
Contributor Author

Squashed

Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, all the comments should be addressed but it can/should happen in a followup, the layout here is good.

lightning/src/ln/outbound_payment.rs Show resolved Hide resolved
lightning/src/ln/outbound_payment.rs Show resolved Hide resolved
lightning/src/ln/outbound_payment.rs Show resolved Hide resolved
lightning/src/ln/outbound_payment.rs Show resolved Hide resolved
Copy link
Contributor

@jkczyz jkczyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing blocking, so feel free to leave anything to a follow-up.

lightning/src/ln/outbound_payment.rs Show resolved Hide resolved
lightning/src/ln/outbound_payment.rs Show resolved Hide resolved
lightning/src/ln/outbound_payment.rs Show resolved Hide resolved
lightning/src/ln/outbound_payment.rs Show resolved Hide resolved
Comment on lines +381 to +390
pub(super) fn send_payment<R: Deref, ES: Deref, NS: Deref, F>(
&self, payment_hash: PaymentHash, payment_secret: &Option<PaymentSecret>, payment_id: PaymentId,
retry_strategy: Retry, route_params: RouteParameters, router: &R,
first_hops: Vec<ChannelDetails>, inflight_htlcs: InFlightHtlcs, entropy_source: &ES,
node_signer: &NS, best_block_height: u32, send_payment_along_path: F
) -> Result<(), PaymentSendFailure>
where
R::Target: Router,
ES::Target: EntropySource,
NS::Target: NodeSigner,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to use references to a Deref. Just use Deref and update the call sites to do &* if needed. Then the tests don't need &&.

Likewise elsewhere.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't able to get this working. Let me know if you have a patch that works.

lightning/src/util/test_utils.rs Show resolved Hide resolved
};
let err = if on_retry {
outbound_payments.add_new_pending_payment(PaymentHash([0; 32]), None, PaymentId([0; 32]),
&Route { paths: vec![], payment_params: None }, Retry::Attempts(1), Some(route_params.clone()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation looks off.

@valentinewallace valentinewallace merged commit e2beaef into lightningdevkit:main Jan 26, 2023
k0k0ne pushed a commit to bitlightlabs/rust-lightning that referenced this pull request Sep 30, 2024
0.0.114 - Mar 3, 2023 - "Faster Async BOLT12 Retries"

API Updates
===========

 * `InvoicePayer` has been removed and its features moved directly into
   `ChannelManager`. As such it now requires a simplified `Router` and supports
   `send_payment_with_retry` (and friends). `ChannelManager::retry_payment` was
   removed in favor of the automated retries. Invoice payment utilities in
   `lightning-invoice` now call the new code (lightningdevkit#1812, lightningdevkit#1916, lightningdevkit#1929, lightningdevkit#2007, etc).
 * `Sign`/`BaseSign` has been renamed `ChannelSigner`, with `EcdsaChannelSigner`
   split out in anticipation of future schnorr/taproot support (lightningdevkit#1967).
 * The catch-all `KeysInterface` was split into `EntropySource`, `NodeSigner`,
   and `SignerProvider`. `KeysManager` implements all three (lightningdevkit#1910, lightningdevkit#1930).
 * `KeysInterface::get_node_secret` is now `KeysManager::get_node_secret_key`
   and is no longer required for external signers (lightningdevkit#1951, lightningdevkit#2070).
 * A `lightning-transaction-sync` crate has been added which implements keeping
   LDK in sync with the chain via an esplora server (lightningdevkit#1870). Note that it can
   only be used on nodes that *never* ran a previous version of LDK.
 * `Score` is updated in `BackgroundProcessor` instead of via `Router` (lightningdevkit#1996).
 * `ChainAccess::get_utxo` (now `UtxoAccess`) can now be resolved async (lightningdevkit#1980).
 * BOLT12 `Offer`, `InvoiceRequest`, `Invoice` and `Refund` structs as well as
   associated builders have been added. Such invoices cannot yet be paid due to
   missing support for blinded path payments (lightningdevkit#1927, lightningdevkit#1908, lightningdevkit#1926).
 * A `lightning-custom-message` crate has been added to make combining multiple
   custom messages into one enum/handler easier (lightningdevkit#1832).
 * `Event::PaymentPathFailure` is now generated for failure to send an HTLC
   over the first hop on our local channel (lightningdevkit#2014, lightningdevkit#2043).
 * `lightning-net-tokio` no longer requires an `Arc` on `PeerManager` (lightningdevkit#1968).
 * `ChannelManager::list_recent_payments` was added (lightningdevkit#1873).
 * `lightning-background-processor` `std` is now optional in async mode (lightningdevkit#1962).
 * `create_phantom_invoice` can now be used in `no-std` (lightningdevkit#1985).
 * The required final CLTV delta on inbound payments is now configurable (lightningdevkit#1878)
 * bitcoind RPC error code and message are now surfaced in `block-sync` (lightningdevkit#2057).
 * Get `historical_estimated_channel_liquidity_probabilities` was added (lightningdevkit#1961).
 * `ChannelManager::fail_htlc_backwards_with_reason` was added (lightningdevkit#1948).
 * Macros which implement serialization using TLVs or straight writing of struct
   fields are now public (lightningdevkit#1823, lightningdevkit#1976, lightningdevkit#1977).

Backwards Compatibility
=======================

 * Any inbound payments with a custom final CLTV delta will be rejected by LDK
   if you downgrade prior to receipt (lightningdevkit#1878).
 * `Event::PaymentPathFailed::network_update` will always be `None` if an
   0.0.114-generated event is read by a prior version of LDK (lightningdevkit#2043).
 * `Event::PaymentPathFailed::all_paths_removed` will always be false if an
   0.0.114-generated event is read by a prior version of LDK. Users who rely on
   it to determine payment retries should migrate to `Event::PaymentFailed`, in
   a separate release prior to upgrading to LDK 0.0.114 if downgrading is
   supported (lightningdevkit#2043).

Performance Improvements
========================

 * Channel data is now stored per-peer and channel updates across multiple
   peers can be operated on simultaneously (lightningdevkit#1507).
 * Routefinding is roughly 1.5x faster (lightningdevkit#1799).
 * Deserializing a `NetworkGraph` is roughly 6x faster (lightningdevkit#2016).
 * Memory usage for a `NetworkGraph` has been reduced substantially (lightningdevkit#2040).
 * `KeysInterface::get_secure_random_bytes` is roughly 200x faster (lightningdevkit#1974).

Bug Fixes
=========

 * Fixed a bug where a delay in processing a `PaymentSent` event longer than the
   time taken to persist a `ChannelMonitor` update, when occurring immediately
   prior to a crash, may result in the `PaymentSent` event being lost (lightningdevkit#2048).
 * Fixed spurious rejections of rapid gossip sync data when the graph has been
   updated by other means between gossip syncs (lightningdevkit#2046).
 * Fixed a panic in `KeysManager` when the high bit of `starting_time_nanos`
   is set (lightningdevkit#1935).
 * Resolved an issue where the `ChannelManager::get_persistable_update_future`
   future would fail to wake until a second notification occurs (lightningdevkit#2064).
 * Resolved a memory leak when using `ChannelManager::send_probe` (lightningdevkit#2037).
 * Fixed a deadlock on some platforms at least when using async `ChannelMonitor`
   updating (lightningdevkit#2006).
 * Removed debug-only assertions which were reachable in threaded code (lightningdevkit#1964).
 * In some cases when payment sending fails on our local channel retries no
   longer take the same path and thus never succeed (lightningdevkit#2014).
 * Retries for spontaneous payments have been fixed (lightningdevkit#2002).
 * Return an `Err` if `lightning-persister` fails to read the directory listing
   rather than panicing (lightningdevkit#1943).
 * `peer_disconnected` will now never be called without `peer_connected` (lightningdevkit#2035)

Security
========

0.0.114 fixes several denial-of-service vulnerabilities which are reachable from
untrusted input from channel counterparties or in deployments accepting inbound
connections or channels. It also fixes a denial-of-service vulnerability in rare
cases in the route finding logic.
 * The number of pending un-funded channels as well as peers without funded
   channels is now limited to avoid denial of service (lightningdevkit#1988).
 * A second `channel_ready` message received immediately after the first could
   lead to a spurious panic (lightningdevkit#2071). This issue was introduced with 0conf
   support in LDK 0.0.107.
 * A division-by-zero issue was fixed in the `ProbabilisticScorer` if the amount
   being sent (including previous-hop fees) is equal to a channel's capacity
   while walking the graph (lightningdevkit#2072). The division-by-zero was introduced with
   historical data tracking in LDK 0.0.112.

In total, this release features 130 files changed, 21457 insertions, 10113
deletions in 343 commits from 18 authors, in alphabetical order:
 * Alec Chen
 * Allan Douglas R. de Oliveira
 * Andrei
 * Arik Sosman
 * Daniel Granhão
 * Duncan Dean
 * Elias Rohrer
 * Jeffrey Czyz
 * John Cantrell
 * Kurtsley
 * Matt Corallo
 * Max Fang
 * Omer Yacine
 * Valentine Wallace
 * Viktor Tigerström
 * Wilmer Paulino
 * benthecarman
 * jurvis
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants