Process monitor update events in block_[dis]connected asynchronously #808
Conversation
Codecov Report
@@ Coverage Diff @@
## main #808 +/- ##
==========================================
+ Coverage 90.97% 91.02% +0.05%
==========================================
Files 48 48
Lines 26452 26547 +95
==========================================
+ Hits 24064 24165 +101
+ Misses 2388 2382 -6
Force-pushed fa7df00 to 367174b
Pushed a test; the issue I thought might be there wasn't an issue.
BackgroundManagerEvent::ClosingMonitorUpdate((funding_txo, update)) => {
    // The channel has already been closed, so no use bothering to care about the
    // monitor updating completing.
    let _ = self.chain_monitor.update_channel(funding_txo, update);
Hmm. Seems a bit weird to just let the update die silently. Should we log? Also, should we put it back in the event queue and give it 3 chances to succeed, or so? Seems like a loose end to trust the counterparty to get the commit tx confirmed.
Hmm, we don't really handle it anywhere else - in the case of a TemporaryFailure the API requires users to have stored it somewhere (as we never provide duplicate/old monitor updates). In the case of a permanent failure, indeed, we're a little hosed, but that isn't an issue specific to this case - in any permanent failure case if the final force-closure monitor update fails to be delivered the user will need to manually intervene and call the relevant method to get the latest commitment transaction.
Can't the update succeed but persistence fail? Would this be a problem to ignore?
What do you mean "the update succeed but persistence fail"? Monitor Update success includes persistence, but I'm not sure what exactly you mean.
Oh, what I meant was Watch::update_channel includes both updating the channel monitor (i.e., ChannelMonitor::update_monitor) and persisting it (i.e., Persist::update_persisted_channel). Though, I suppose the errors are already logged in the case of ChainMonitor.
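For readers following the thread, a minimal sketch of that relationship, using hypothetical stub types rather than rust-lightning's real traits (the names echo the discussion, but the signatures and the NoopPersister are simplified assumptions): a single update_channel call only reports success if both the in-memory update and the persistence step succeed.

```rust
// Hypothetical sketch - stub types only, not rust-lightning's actual API.
struct OutPoint;
struct ChannelMonitorUpdate;
struct Monitor;

impl Monitor {
    fn update_monitor(&mut self, _update: &ChannelMonitorUpdate) -> Result<(), ()> { Ok(()) }
}

trait Persist {
    fn update_persisted_channel(&self, funding_txo: &OutPoint, monitor: &Monitor) -> Result<(), ()>;
}

struct NoopPersister;
impl Persist for NoopPersister {
    fn update_persisted_channel(&self, _f: &OutPoint, _m: &Monitor) -> Result<(), ()> { Ok(()) }
}

struct ChainMonitorStub<P: Persist> {
    monitor: Monitor,
    persister: P,
}

impl<P: Persist> ChainMonitorStub<P> {
    // "Monitor update success includes persistence": the in-memory update and the
    // persistence step are both part of one call, so there is no separate
    // "updated but not persisted" success case for the caller to worry about.
    fn update_channel(&mut self, funding_txo: OutPoint, update: ChannelMonitorUpdate) -> Result<(), ()> {
        self.monitor.update_monitor(&update)?;
        self.persister.update_persisted_channel(&funding_txo, &self.monitor)
    }
}

fn main() {
    let mut chain_monitor = ChainMonitorStub { monitor: Monitor, persister: NoopPersister };
    assert!(chain_monitor.update_channel(OutPoint, ChannelMonitorUpdate).is_ok());
}
```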
Right, I don't think this is a unique case there - whatever applies here applies already elsewhere. If anything, this callsite is much less error-prone because the only thing it does is broadcast our latest transaction.
I think we still have a pathological case when funding_tx is disconnected and we try to force-close the channel with a holder commitment. It won't propagate or confirm. If the latest user balance is substantial, even manual intervention won't solve the issue.
Ideally, as soon as we see a counterparty funding transaction we should cache it. If the funding is reorged-out later, we should attach the funding_tx to our holder commitment, and why not a high-feerate CPFP as well. This could be implemented by either ChannelManager or ChannelMonitor. Though a bit complex and beyond the scope of this PR...
Right, that's definitely a lot for this PR (also because it's not a regression - we have the same issue today). That said, I don't really think it's worth it - in theory the funding transaction that was just un-confirmed should be in the mempool, as nodes re-add disconnected transactions to their mempool. If we want to add a high-feerate CPFP on top of the commitment, that stands alone as a change to ChannelMonitor in handling the ChannelForceClosed event.
I feel the same. Users playing with substantial amounts should scale their funding transaction confirmation requirement (minimum_depth) high enough for this to never happen. For low-conf channels (1-2 blocks), I don't think that's a concern for now.
// Note that we MUST NOT end up calling methods on self.chain_monitor here - we're called
// during initialization prior to the chain_monitor being fully configured in some cases.
// See the docs for `ChannelManagerReadArgs` for more.
Hm, is there a way to put some asserts on chain_monitor state that it's the same at the beginning and end of these functions? Seems safer than just leaving comments.
Yea, I thought about that, we would have to wrap the entire chain_monitor in a wrapper struct that will check a boolean before calling the inner method. I figured it wasn't worth the complexity, but I'm open to other opinions.
This does seem to indicate a need to refactor the code such that chain_monitor cannot be called in certain scenarios. I don't have a concrete suggestion, however.
If we wanted to go there, we could have deserialization build some intermediate object which you can only connect/disconnect blocks on, then you can tell it you're done and get the full ChannelManager. Not sure how that would interact with #800, though.
No need to address now, but as part of #800 we may want to catalog all that ChannelManager is "managing" and identify suitable abstractions where possible. :)
Added a comment describing this on #800.
> Yea, I thought about that, we would have to wrap the entire chain_monitor in a wrapper struct that will check a boolean before calling the inner method. I figured it wasn't worth the complexity, but I'm open to other opinions.

Hm, could you retrieve the ChainMonitor's list of ChannelMonitor outpoints and latest update IDs at the beginning of the function, then ensure those are the same at the end of the function?
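A minimal sketch of that snapshot-and-compare idea, with hypothetical stub types standing in for the real ChainMonitor/ChannelManager (the field names and the snapshot() helper are assumptions for illustration, not rust-lightning's API):

```rust
use std::collections::BTreeMap;

// Hypothetical stubs - not rust-lightning's actual types.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct OutPoint { txid: [u8; 32], index: u16 }

struct ChainMonitorStub {
    // funding outpoint -> latest monitor update id
    monitors: BTreeMap<OutPoint, u64>,
}

impl ChainMonitorStub {
    fn snapshot(&self) -> Vec<(OutPoint, u64)> {
        self.monitors.iter().map(|(k, v)| (*k, *v)).collect()
    }
}

struct ManagerStub { chain_monitor: ChainMonitorStub }

impl ManagerStub {
    fn block_connected(&mut self /*, header, txdata, height */) {
        let before = self.chain_monitor.snapshot();

        // ... block-connection logic which must NOT call into self.chain_monitor ...

        // Trips in debug builds if any code path slipped in a monitor update.
        debug_assert_eq!(before, self.chain_monitor.snapshot());
    }
}

fn main() {
    let mut mgr = ManagerStub { chain_monitor: ChainMonitorStub { monitors: BTreeMap::new() } };
    mgr.block_connected();
}
```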
Force-pushed 367174b to 0bd920b
In the spirit of smaller PRs to review, couldn't the first two commits be a separate PR? The following commits don't depend on it in any way, right?
Strike that last part. I missed a couple commits. :P
I could move the indentation fix commit. The first commit could also be moved, but at least it's tangentially related :). If you think we'd improve velocity with tighter PR sizes, I'm happy to do it.
> I could move the indentation fix commit. The first commit could also be moved, but at least it's tangentially related :). If you think we'd improve velocity with tighter PR sizes, I'm happy to do it.

I guess my feeling is that if it could be reviewed in parallel by another reviewer, then it would be quicker to push these through, as it would be easier to carve out review time.
lightning/src/ln/channelmanager.rs (Outdated)
// It looks like our counterparty went on-chain. We cannot broadcast our latest local
// state via monitor update (as Channel::force_shutdown tries to make us do) as we may
// still be in initialization, so we track the update internally and handle it when the
// user next calls timer_chan_freshness_every_min, guaranteeing we're running normally.
I'm seeing a pattern here possibly related to #800 where ChannelManager's functionality is limited by the mode it's in. Do you think a refactoring is necessary to enforce this rather than relying on comments? Maybe that issue could be repurposed to be broader.
Hmm, that's definitely a good point. I think we'd really want to figure out exactly what the API looks like in #800, though I have a feeling a lot of it is going to end up being deep in the Channel state machine more than limits on the external API (we can always shove outbound things into channel holding cells anyway).
/// 5) Move the ChannelMonitors into your local chain::Watch.
/// 6) Disconnect/connect blocks on the ChannelManager.
/// 5) Disconnect/connect blocks on the ChannelManager.
/// 6) Move the ChannelMonitors into your local chain::Watch.
///
/// Note that the ordering of #4-6 is not of importance, however all three must occur before you
/// call any other methods on the newly-deserialized ChannelManager.
Is the note in (2) still accurate?
Not sure what you mean, do you mean the part below? If so, yes, that behavior is unchanged - we call broadcast_latest_holder_commitment_txn directly on the passed monitors instead of relying on an update object.
/// This may result in closing some Channels if the ChannelMonitor is newer than the stored
/// ChannelManager state to ensure no loss of funds. Thus, transactions may be broadcasted.
Yeah, that part.
Force-pushed 0bd920b to 3654169
Replying to comments and will take another pass a little later today.
nodes[0].node.test_process_background_events(); // Required to free the pending background monitor update
check_added_monitors!(nodes[0], 1);
Hm, maybe just for a tad more testing on deserialization, could assert that the funding tx and monitor update are as expected or something like that.
lightning/src/ln/reorg_tests.rs (Outdated)
@@ -180,3 +188,185 @@ fn test_onchain_htlc_claim_reorg_remote_commitment() {
fn test_onchain_htlc_timeout_delay_remote_commitment() {
	do_test_onchain_htlc_reorg(false, false);
}

fn do_test_unconf_chan(reload_node: bool) {
Don't think it makes too much of a difference, but we could add a test that's closer to a real-world scenario if we tested deserializing and then syncing all the new blocks to tip (thus generating a BackgroundEvent). (And then watch_channel'd and process_background_events'd, etc.)
I kinda cheated and expanded the test to either do the disconnect before or after the reload.
Force-pushed 3654169 to 8a9088f
Code looks good. Need to review the tests.
lightning/src/ln/channelmanager.rs (Outdated)
/// 3) Register all relevant ChannelMonitor outpoints with your chain watch mechanism using
/// ChannelMonitor::get_outputs_to_watch() and ChannelMonitor::get_funding_txo().
/// 3) If you are not fetching full blocks, register all relevant ChannelMonitor outpoints with
/// your chain watch mechanism using ChannelMonitor::get_outputs_to_watch() and
nit: chain::Watch
Changed it to reference chain::Filter.
Force-pushed 0eac67b to b7b895c
Rebased (and included a commit which fixed the rebase-introduced error).
Just some minor comments on clarifying the tests.
functional_tests.rs is huge, so anything we can do to split it up some is helpful. This also exposes a somewhat glaring lack of reorgs in our existing tests.
See diff for more details
Channel::force_shutdown previously always returned a `ChannelMonitorUpdate`, but expected it to be applied only in the case that it *also* returned a Some for the funding transaction output. This is confusing; instead we move the `ChannelMonitorUpdate` inside the Option, making it hold a tuple instead.
Force-pushed f4a70e6 to 1869f3d
Rebased on upstream and squashed.
///
/// Expects the caller to have a total_consistency_lock read lock.
fn process_background_events(&self) {
    let mut background_events = Vec::new();
You can assert the lock consistency requirement assert(&self.total_consistency_lock.try_write().is_some())?
Hmm, it only needs a read lock, not a write lock, and there's no way to assert that the current thread holds one - we could assert that no thread holds a write lock, but that's not quite sufficient.
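To make the trade-off concrete, here is a sketch of what such a best-effort check could look like, on a hypothetical stub type rather than the real ChannelManager (the assert and its limitation are the point, not the surrounding code):

```rust
use std::sync::RwLock;

// Hypothetical stub - not the real ChannelManager.
struct ManagerStub {
    total_consistency_lock: RwLock<()>,
}

impl ManagerStub {
    fn process_background_events(&self) {
        // Best effort only: try_write() fails whenever *any* read or write lock is
        // outstanding, so this catches a caller that forgot the lock entirely (at
        // least in single-threaded tests), but it cannot prove that *this* thread
        // is the one holding the read lock - hence "not quite sufficient".
        debug_assert!(self.total_consistency_lock.try_write().is_err());

        // ... drain and process the pending background events here ...
    }

    fn timer_chan_freshness_every_min(&self) {
        let _read_guard = self.total_consistency_lock.read().unwrap();
        self.process_background_events();
    }
}

fn main() {
    let mgr = ManagerStub { total_consistency_lock: RwLock::new(()) };
    mgr.timer_chan_freshness_every_min();
}
```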
pub(crate) fn test_process_background_events(&self) {
    self.process_background_events();
}

/// If a peer is disconnected we mark any channels with that peer as 'disabled'.
/// After some time, if channels are still disabled we need to broadcast a ChannelUpdate
/// to inform the network about the uselessness of these channels.
///
/// This method handles all the details, and must be called roughly once per minute.
pub fn timer_chan_freshness_every_min(&self) {
If you don't update the timer_chan_freshness_every_min name, at least update its documentation to mention the channel monitor update event flush.
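As a purely illustrative suggestion (hypothetical wording on a stub type, not the actual rust-lightning doc-comment), the documentation update could look something like this:

```rust
struct ManagerStub;

impl ManagerStub {
    /// If a peer is disconnected we mark any channels with that peer as 'disabled'.
    /// After some time, if channels are still disabled we need to broadcast a ChannelUpdate
    /// to inform the network about the uselessness of these channels.
    ///
    /// This method also flushes any internally-queued channel monitor updates (e.g.
    /// force-close updates generated by block_[dis]connected while the chain::Watch
    /// could not yet be called), applying them now that we are running normally.
    ///
    /// This method handles all the details, and must be called roughly once per minute.
    pub fn timer_chan_freshness_every_min(&self) {
        // process background events, then run the existing disabled-channel rebroadcast logic
    }
}

fn main() {
    ManagerStub.timer_chan_freshness_every_min();
}
```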
/// always deserialize only the latest version of a ChannelManager and ChannelMonitors available to
/// you. If you deserialize an old ChannelManager (during which force-closure transactions may be
/// broadcast), and then later deserialize a newer version of the same ChannelManager (which will
/// not force-close the same channels but consider them live), you may end up revoking a state for
I wonder if this is really possible. Let's say the 1st deserialized ChannelManager force-closes the channel due to some off-chain violation from our counterparty (e.g. an HTLC under minimum_msat). The force-close is dutifully sent to ChannelMonitor and lockdown_from_offchain latches to true. Later, the 2nd deserialized ChannelManager should receive the same on-chain block sequence but effectively not the off-chain one, so it won't close the channel again. But any attempt to update channel state should be rejected by ChannelMonitor, assuming it's the same version between ChannelManager deserializations.
Or do you envision a different scenario?
The force-close doesn't generate a ChannelMonitorUpdate event, which implies the user is not required to re-persist the ChannelMonitor. So they could deserialize again with the original. We could change that, but I don't think it's worth it.
Force-pushed ff8b0f9 to 1a74a2a
Code Review ACK 1a74a2a modulo squash!
The instructions for `ChannelManagerReadArgs` indicate that you need to connect blocks on a newly-deserialized `ChannelManager` in a separate pass from the newly-deserialized `ChannelMonitors`, as the `ChannelManager` assumes the ability to update the monitors during block_[dis]connected events, saying that users need to:

```
4) Reconnect blocks on your ChannelMonitors
5) Move the ChannelMonitors into your local chain::Watch.
6) Disconnect/connect blocks on the ChannelManager.
```

This is fine for `ChannelManager`'s purpose, but is very awkward for users. Notably, our new `lightning-block-sync` implemented on-load reconnection in the most obvious (and performant) way - connecting the blocks all at once, violating the `ChannelManagerReadArgs` API.

Luckily, the events in question really don't need to be processed with the same urgency as most channel monitor updates. The only two monitor updates which can occur in block_[dis]connected are either a) in block_connected, we identify a now-confirmed commitment transaction, closing one of our channels, or b) in block_disconnected, the funding transaction is reorganized out of the chain, making our channel no longer funded.

In the case of (a), sending a monitor update which broadcasts a conflicting holder commitment transaction is far from time-critical, though we should still ensure we do it. In the case of (b), we should try to broadcast our holder commitment transaction when we can, but within a few minutes is fine on the scale of block mining anyway.

Note that in both cases we cannot simply move the logic to ChannelMonitor::block_[dis]connected, as this could result in us broadcasting a commitment transaction from ChannelMonitor, then revoking the now-broadcasted state, and only then receiving the block_[dis]connected event in the ChannelManager.

Thus, we move both events into an internal event queue and process them in timer_chan_freshness_every_min().
As suggested by Val.
Force-pushed 1a74a2a to 93a7572
Squashed with no changes:
The instructions for ChannelManagerReadArgs indicate that you need to connect blocks on a newly-deserialized ChannelManager in a separate pass from the newly-deserialized ChannelMonitors, as the ChannelManager assumes the ability to update the monitors during block_[dis]connected events, saying that users need to:

```
4) Reconnect blocks on your ChannelMonitors
5) Move the ChannelMonitors into your local chain::Watch.
6) Disconnect/connect blocks on the ChannelManager.
```

This is fine for ChannelManager's purpose, but is very awkward for users. Notably, our new lightning-block-sync implemented on-load reconnection in the most obvious (and performant) way - connecting the blocks all at once, violating the ChannelManagerReadArgs API.

Luckily, the events in question really don't need to be processed with the same urgency as most channel monitor updates. The only two monitor updates which can occur in block_[dis]connected are either a) in block_connected, we identify a now-confirmed commitment transaction, closing one of our channels, or b) in block_disconnected, the funding transaction is reorganized out of the chain, making our channel no longer funded.

In the case of (a), sending a monitor update which broadcasts a conflicting holder commitment transaction is far from time-critical, though we should still ensure we do it. In the case of (b), we should try to broadcast our holder commitment transaction when we can, but within a few minutes is fine on the scale of block mining anyway.

Note that in both cases we cannot simply move the logic to ChannelMonitor::block_[dis]connected, as this could result in us broadcasting a commitment transaction from ChannelMonitor, then revoking the now-broadcasted state, and only then receiving the block_[dis]connected event in the ChannelManager.

Thus, we move both events into an internal event queue and process them in timer_chan_freshness_every_min().
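A compact sketch of the queue-and-flush pattern described above, using hypothetical stub types (the field names, signatures, and trivial main are assumptions for illustration; the real ChannelManager, enum, and chain_monitor interaction differ):

```rust
use std::sync::Mutex;

// Hypothetical stubs - not rust-lightning's actual types or signatures.
struct OutPoint;
struct ChannelMonitorUpdate;

enum BackgroundManagerEvent {
    // A force-close-style monitor update we could not apply immediately
    // (e.g. generated by block_[dis]connected during deserialization).
    ClosingMonitorUpdate((OutPoint, ChannelMonitorUpdate)),
}

struct ManagerStub {
    pending_background_events: Mutex<Vec<BackgroundManagerEvent>>,
}

impl ManagerStub {
    // Called from block_connected / block_disconnected: instead of pushing the
    // update into the chain monitor right away, queue it internally.
    fn queue_closing_update(&self, funding_txo: OutPoint, update: ChannelMonitorUpdate) {
        self.pending_background_events.lock().unwrap()
            .push(BackgroundManagerEvent::ClosingMonitorUpdate((funding_txo, update)));
    }

    // Called from timer_chan_freshness_every_min(), once we are guaranteed to be
    // running normally: drain the queue and apply each update, ignoring failures
    // since the channel is already closed.
    fn process_background_events(&self) {
        let mut events = Vec::new();
        std::mem::swap(&mut events, &mut *self.pending_background_events.lock().unwrap());
        for event in events {
            match event {
                BackgroundManagerEvent::ClosingMonitorUpdate((_funding_txo, _update)) => {
                    // Here the real code would do something like:
                    // let _ = self.chain_monitor.update_channel(funding_txo, update);
                }
            }
        }
    }

    fn timer_chan_freshness_every_min(&self) {
        self.process_background_events();
        // ... followed by the existing disabled-channel ChannelUpdate rebroadcast logic ...
    }
}

fn main() {
    let mgr = ManagerStub { pending_background_events: Mutex::new(Vec::new()) };
    mgr.queue_closing_update(OutPoint, ChannelMonitorUpdate);
    mgr.timer_chan_freshness_every_min();
    assert!(mgr.pending_background_events.lock().unwrap().is_empty());
}
```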
Plus a few random commits I had lying around that are nice to include. I need to fix one bug in deserialization and add a test for it, but otherwise this should be good.