-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ChannelManager
Payment Retries
#1916
ChannelManager
Payment Retries
#1916
Conversation
Codecov ReportBase: 90.71% // Head: 91.02% // Increases project coverage by
📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more Additional details and impacted files@@ Coverage Diff @@
## main #1916 +/- ##
==========================================
+ Coverage 90.71% 91.02% +0.31%
==========================================
Files 97 98 +1
Lines 50677 53579 +2902
Branches 50677 53579 +2902
==========================================
+ Hits 45971 48772 +2801
- Misses 4706 4807 +101
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
0bfa5ef
to
c7c1a5f
Compare
c7c1a5f
to
979a30f
Compare
3b2aee7
to
2b3b00f
Compare
2b3b00f
to
da266f2
Compare
I think I finished adding the difficult tests. Still have 9 easier tests to go, but gonna mark this as ready for review anyway. Also updated the PR description to set expectations for review |
b70ba29
to
82364b2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK
@@ -6887,6 +6959,8 @@ impl<'a, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref> | |||
hash_map::Entry::Vacant(entry) => { | |||
let path_fee = path.get_path_fees(); | |||
entry.insert(PendingOutboundPayment::Retryable { | |||
retry_strategy: Retry::Attempts(0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have some reasonable default? Retry::Attempts(0)
would mean no retries. In what situations would we hit this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We hit this while deserializing ChannelManager
if a monitor has some htlcs that the ChannelManager
forgot about due to being behind the monitor on shutdown, IIRC. Idea is that we never want to retry payments after restart, because the payments are probably stale anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we chose not to retry after restart in InvoicePayer
because we didn't want to have something else for the user to serialize. But now that the retry logic is in ChannelManager
, that rational is no longer applicable. We can still use the no-retry behavior for older serializations, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's a consideration for follow-up, but it does seem like payments would be stale after restart? And even if they weren't, it seems like the sanest default would be to not retry 🤔 I think the invoice payer differed a bit since it has a static retry strategy config, which we don't have here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's a consideration for follow-up, but it does seem like payments would be stale after restart? And even if they weren't, it seems like the sanest default would be to not retry 🤔
Restart might just be failing over to a new server, FWIW. Or maybe for mobile switching back and forth between apps?
I think we should at least decided what we'd like to happen to avoid unnecessary churn in the serialization format.
I think the invoice payer differed a bit since it has a static retry strategy config, which we don't have here
It was mostly because we didn't want to serialize it. If we serialized, the static strategy would have been serialized along with it, I'd imagine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline, not gonna retry on startup due to some race conditions with abandon_payment
(I'm forgetting the details..)
293862c
to
271c257
Compare
lightning/src/ln/channelmanager.rs
Outdated
// TODO: do we need to generate a PendingHTLCsForwardable event? | ||
if let Err(e) = self.retry_payment_auto_route(payment_id, params) { | ||
log_trace!(self.logger, "Errored retrying payment, marking as abandoned: {:?}", e); | ||
self.abandon_payment(payment_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TheBlueMatt Are we moving to requiring users to call abandon_payment
? You referenced a race condition with this method offline, could you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The context was #1907
cbea0a0
to
aba3c4d
Compare
Finished migrating relevant tests from |
aba3c4d
to
601dfad
Compare
Rebased after #1812 merge. |
601dfad
to
c6d2a54
Compare
d1445c4
to
7d7afec
Compare
lightning/src/ln/outbound_payment.rs
Outdated
@@ -752,8 +1021,15 @@ impl OutboundPayments { | |||
// process_onion_failure we should close that channel as it implies our | |||
// next-hop is needlessly blaming us! | |||
if let Some(scid) = short_channel_id { | |||
outbounds.get_mut(&payment_id).map(|pmt| pmt.insert_previously_failed_scid(scid)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bleh, we take the pending_outbound_payments
lock at the top, then hold it while calling decode_onion_failure
, which does iterative crypto and decoding. Its not new in this PR, but now you're adding an explicit dependency on the stuff calculated under the lock here. Worse, you're doing a duplicate outbounds
lookup under the same lock that already did an outbounds
lookup. Instead, can you (a) move the onion_error.decode_onion_failure
and payment_is_probe
calls outside the top outbounds
lock, then you can do this update in the existing outbounds.entry
block, and even better you can drop the outbounds
lock before we create the events or take the pending_events
lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It didn't seem to make sense to move the payment_is_probe
call, but otherwise did this.
I had considered this approach but thought it might be worth the extra outbounds
lock to do some cheap checks before decoding the onion failure, but those cheap checks look pretty rare anyway
} else { | ||
self.retry_payment_with_route(&route, payment_id, entropy_source, node_signer, best_block_height, send_payment_along_path) | ||
}; | ||
match res { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its kinda confusing to match res twice, can we not do it in one match and mem::drop(outbounds)
before calling pay? eg in the current code it looks like we spuriously increment_attempts
in the failed_paths_retry.is_none()
case, which the comment talks about being due to monitor updates being done async.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good catch, fixed this
Also configure it such that in std tests, it will use SinceEpoch instead of Instant so time can be manually advanced.
lightning/src/ln/outbound_payment.rs
Outdated
loop { | ||
let mut outbounds = self.pending_outbound_payments.lock().unwrap(); | ||
let mut retry_id_route_params = None; | ||
for (pmt_id, pmt) in outbounds.iter_mut() { | ||
if pmt.is_retryable() { | ||
if let PendingOutboundPayment::Retryable { pending_amt_msat, total_msat, route_params: Some(params), .. } = pmt { | ||
if pending_amt_msat < total_msat { | ||
retry_id_route_params = Some((*pmt_id, params.clone())); | ||
} | ||
} | ||
} | ||
if retry_id_route_params.is_some() { pmt.increment_attempts(); break } | ||
} | ||
if let Some((payment_id, route_params)) = retry_id_route_params { | ||
core::mem::drop(outbounds); | ||
if let Err(e) = self.pay_internal(payment_id, route_params, router, first_hops(), inflight_htlcs(), entropy_source, node_signer, best_block_height, &send_payment_along_path) { | ||
log_trace!(logger, "Errored retrying payment: {:?}", e); | ||
} | ||
} else { break } | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The outer loop and lock dropping makes this difficult to read. Isn't really clear to me why it is needed. Can't we simply loop through pending_outbound_payments
once and call pay_internal
as needed by passing a PendingOutboundPayment
to it on each iteration?
Guessing it has something to do with locking in ChannelManager::send_payment_along_path
, but I'm worried that with all the callbacks and locking this is becoming difficult to understand and maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't hold the outbounds
lock going into pay_internal
because then we'd be holding it during pathfinding. I agree it's weird, we'll also have n^2 runtime here relative to the size of outbounds
.
Previously I gathered all the retryable outbounds into a vec under one lock --> dropped the lock --> retried them one by one, which was a bit cleaner IMO at the expense of an allocation. Change was requested here: #1916 (comment). Open to other ways to clean this up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see why we wouldn't want to hold the lock during path finding, yeah. It doesn't seem the avoid the clones though? But no Vec
allocation, I suppose.
169bb10
to
f5af99a
Compare
Feel free to squash, IMO. |
Used in upcoming commit(s) to automatically retry HTLCs in ChannelManager
It's not ideal to do all this computation while the lock is held. We also want to decode the failure *before* taking the lock, so we can store the failed scid in the relevant outbound for retry in the next commit(s).
…stead This makes it easier to retry payments if all paths fail on initial send, in in which case we'll want to hold off on removing the pending payment
As part of migrating payment retries from InvoicePayer to ChannelManager, several tests don't need a rewrite and can be pretty much copied and pasted.
f5af99a
to
6d81979
Compare
Squashed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine, all the comments should be addressed but it can/should happen in a followup, the layout here is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing blocking, so feel free to leave anything to a follow-up.
pub(super) fn send_payment<R: Deref, ES: Deref, NS: Deref, F>( | ||
&self, payment_hash: PaymentHash, payment_secret: &Option<PaymentSecret>, payment_id: PaymentId, | ||
retry_strategy: Retry, route_params: RouteParameters, router: &R, | ||
first_hops: Vec<ChannelDetails>, inflight_htlcs: InFlightHtlcs, entropy_source: &ES, | ||
node_signer: &NS, best_block_height: u32, send_payment_along_path: F | ||
) -> Result<(), PaymentSendFailure> | ||
where | ||
R::Target: Router, | ||
ES::Target: EntropySource, | ||
NS::Target: NodeSigner, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to use references to a Deref
. Just use Deref
and update the call sites to do &*
if needed. Then the tests don't need &&
.
Likewise elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't able to get this working. Let me know if you have a patch that works.
}; | ||
let err = if on_retry { | ||
outbound_payments.add_new_pending_payment(PaymentHash([0; 32]), None, PaymentId([0; 32]), | ||
&Route { paths: vec![], payment_params: None }, Retry::Attempts(1), Some(route_params.clone()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation looks off.
0.0.114 - Mar 3, 2023 - "Faster Async BOLT12 Retries" API Updates =========== * `InvoicePayer` has been removed and its features moved directly into `ChannelManager`. As such it now requires a simplified `Router` and supports `send_payment_with_retry` (and friends). `ChannelManager::retry_payment` was removed in favor of the automated retries. Invoice payment utilities in `lightning-invoice` now call the new code (lightningdevkit#1812, lightningdevkit#1916, lightningdevkit#1929, lightningdevkit#2007, etc). * `Sign`/`BaseSign` has been renamed `ChannelSigner`, with `EcdsaChannelSigner` split out in anticipation of future schnorr/taproot support (lightningdevkit#1967). * The catch-all `KeysInterface` was split into `EntropySource`, `NodeSigner`, and `SignerProvider`. `KeysManager` implements all three (lightningdevkit#1910, lightningdevkit#1930). * `KeysInterface::get_node_secret` is now `KeysManager::get_node_secret_key` and is no longer required for external signers (lightningdevkit#1951, lightningdevkit#2070). * A `lightning-transaction-sync` crate has been added which implements keeping LDK in sync with the chain via an esplora server (lightningdevkit#1870). Note that it can only be used on nodes that *never* ran a previous version of LDK. * `Score` is updated in `BackgroundProcessor` instead of via `Router` (lightningdevkit#1996). * `ChainAccess::get_utxo` (now `UtxoAccess`) can now be resolved async (lightningdevkit#1980). * BOLT12 `Offer`, `InvoiceRequest`, `Invoice` and `Refund` structs as well as associated builders have been added. Such invoices cannot yet be paid due to missing support for blinded path payments (lightningdevkit#1927, lightningdevkit#1908, lightningdevkit#1926). * A `lightning-custom-message` crate has been added to make combining multiple custom messages into one enum/handler easier (lightningdevkit#1832). * `Event::PaymentPathFailure` is now generated for failure to send an HTLC over the first hop on our local channel (lightningdevkit#2014, lightningdevkit#2043). * `lightning-net-tokio` no longer requires an `Arc` on `PeerManager` (lightningdevkit#1968). * `ChannelManager::list_recent_payments` was added (lightningdevkit#1873). * `lightning-background-processor` `std` is now optional in async mode (lightningdevkit#1962). * `create_phantom_invoice` can now be used in `no-std` (lightningdevkit#1985). * The required final CLTV delta on inbound payments is now configurable (lightningdevkit#1878) * bitcoind RPC error code and message are now surfaced in `block-sync` (lightningdevkit#2057). * Get `historical_estimated_channel_liquidity_probabilities` was added (lightningdevkit#1961). * `ChannelManager::fail_htlc_backwards_with_reason` was added (lightningdevkit#1948). * Macros which implement serialization using TLVs or straight writing of struct fields are now public (lightningdevkit#1823, lightningdevkit#1976, lightningdevkit#1977). Backwards Compatibility ======================= * Any inbound payments with a custom final CLTV delta will be rejected by LDK if you downgrade prior to receipt (lightningdevkit#1878). * `Event::PaymentPathFailed::network_update` will always be `None` if an 0.0.114-generated event is read by a prior version of LDK (lightningdevkit#2043). * `Event::PaymentPathFailed::all_paths_removed` will always be false if an 0.0.114-generated event is read by a prior version of LDK. Users who rely on it to determine payment retries should migrate to `Event::PaymentFailed`, in a separate release prior to upgrading to LDK 0.0.114 if downgrading is supported (lightningdevkit#2043). Performance Improvements ======================== * Channel data is now stored per-peer and channel updates across multiple peers can be operated on simultaneously (lightningdevkit#1507). * Routefinding is roughly 1.5x faster (lightningdevkit#1799). * Deserializing a `NetworkGraph` is roughly 6x faster (lightningdevkit#2016). * Memory usage for a `NetworkGraph` has been reduced substantially (lightningdevkit#2040). * `KeysInterface::get_secure_random_bytes` is roughly 200x faster (lightningdevkit#1974). Bug Fixes ========= * Fixed a bug where a delay in processing a `PaymentSent` event longer than the time taken to persist a `ChannelMonitor` update, when occurring immediately prior to a crash, may result in the `PaymentSent` event being lost (lightningdevkit#2048). * Fixed spurious rejections of rapid gossip sync data when the graph has been updated by other means between gossip syncs (lightningdevkit#2046). * Fixed a panic in `KeysManager` when the high bit of `starting_time_nanos` is set (lightningdevkit#1935). * Resolved an issue where the `ChannelManager::get_persistable_update_future` future would fail to wake until a second notification occurs (lightningdevkit#2064). * Resolved a memory leak when using `ChannelManager::send_probe` (lightningdevkit#2037). * Fixed a deadlock on some platforms at least when using async `ChannelMonitor` updating (lightningdevkit#2006). * Removed debug-only assertions which were reachable in threaded code (lightningdevkit#1964). * In some cases when payment sending fails on our local channel retries no longer take the same path and thus never succeed (lightningdevkit#2014). * Retries for spontaneous payments have been fixed (lightningdevkit#2002). * Return an `Err` if `lightning-persister` fails to read the directory listing rather than panicing (lightningdevkit#1943). * `peer_disconnected` will now never be called without `peer_connected` (lightningdevkit#2035) Security ======== 0.0.114 fixes several denial-of-service vulnerabilities which are reachable from untrusted input from channel counterparties or in deployments accepting inbound connections or channels. It also fixes a denial-of-service vulnerability in rare cases in the route finding logic. * The number of pending un-funded channels as well as peers without funded channels is now limited to avoid denial of service (lightningdevkit#1988). * A second `channel_ready` message received immediately after the first could lead to a spurious panic (lightningdevkit#2071). This issue was introduced with 0conf support in LDK 0.0.107. * A division-by-zero issue was fixed in the `ProbabilisticScorer` if the amount being sent (including previous-hop fees) is equal to a channel's capacity while walking the graph (lightningdevkit#2072). The division-by-zero was introduced with historical data tracking in LDK 0.0.112. In total, this release features 130 files changed, 21457 insertions, 10113 deletions in 343 commits from 18 authors, in alphabetical order: * Alec Chen * Allan Douglas R. de Oliveira * Andrei * Arik Sosman * Daniel Granhão * Duncan Dean * Elias Rohrer * Jeffrey Czyz * John Cantrell * Kurtsley * Matt Corallo * Max Fang * Omer Yacine * Valentine Wallace * Viktor Tigerström * Wilmer Paulino * benthecarman * jurvis
Adds automatic payment retry logic to
ChannelManager
.It may look like a big diff, but it's really just +50 in
ChannelManager
and +200 inoutbound_payment
(mostly copied fromInvoicePayer
), the rest is largely tests.Planned follow-up PRs (roughly in order):
PaymentPathFailed
events (and migrate relevant tests fromInvoicePayer
)InvoicePayer
with individual invoice-paying utilities + migrate remainingInvoicePayer
testsChannelManager
's varioussend_payment
methods (i.e. we probably to renamesend_payment_with_retry -> send_payment
andsend_payment -> send_payment_with_route
, etc)PendingHTLCsForwardable
events, seeChannelManager
Payment Retries #1916 (comment)And add a new
APIError::RouteNotFound
at some pointTODOs for this PR:
Seeking feedback on the plan for follow-up PRs above^
Based on #1923 + #1812