Persist ChannelMonitors after new blocks are connected #1108
Conversation
/// Failures here do not imply the channel will be force-closed, however any future calls to
/// [`update_persisted_channel`] after an error is returned here MUST either persist the full,
/// updated [`ChannelMonitor`] provided to [`update_persisted_channel`] or return
/// [`ChannelMonitorUpdateErr::PermanentFailure`], force-closing the channel. In other words,
/// any future calls to [`update_persisted_channel`] after an error here MUST NOT persist the
/// [`ChannelMonitorUpdate`] alone.
Could we simplify this by making `update_persisted_channel` take an `Option<&ChannelMonitorUpdate>` and forgoing adding a `sync_persisted_channel` method to the trait? Then when the user sees `None` they must write the full `ChannelMonitor`.
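For concreteness, the kind of shape being proposed — a sketch only, with the import paths and the generic signer parameter approximated from the crate at the time rather than copied from the actual trait definition:

```rust
// Sketch only; paths and signatures approximate the existing `Persist` trait.
// `None` means "no delta available, write the whole ChannelMonitor".
use lightning::chain::channelmonitor::{ChannelMonitor, ChannelMonitorUpdate, ChannelMonitorUpdateErr};
use lightning::chain::keysinterface::Sign;
use lightning::chain::transaction::OutPoint;

pub trait PersistSketch<ChannelSigner: Sign> {
    fn persist_new_channel(
        &self, id: OutPoint, data: &ChannelMonitor<ChannelSigner>,
    ) -> Result<(), ChannelMonitorUpdateErr>;

    /// `update` is `None` when there is no `ChannelMonitorUpdate` to apply
    /// (e.g. a chain-sync re-persist, or a retry after a previously failed
    /// sync), in which case the full `data` must be written.
    fn update_persisted_channel(
        &self, id: OutPoint, update: Option<&ChannelMonitorUpdate>,
        data: &ChannelMonitor<ChannelSigner>,
    ) -> Result<(), ChannelMonitorUpdateErr>;
}
```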
The return value distinction is the harder part to capture - users don't really have the option to return a `PermanentFailure` here as the call site (the `ChainMonitor`) doesn't have the ability to force-close a channel.
Ok, I think I understand the subtlety here. But shouldn't we not call `update_persisted_channel` once `sync_persisted_channel` returns an error, rather than rely on the user to implement `update_persisted_channel` correctly? Or at the very least never pass them a `ChannelMonitorUpdate` there (making it an `Option`), even if we still need the extra method.
Yea, that's a good point, I guess I hadn't thought about making the update an option and skipping it if we've marked a channel as "previous sync didn't persist". The "don't call `update_persisted_channel`" change, though, feels wrong - technically, if the user complies with the API docs, we can continue operating fine after a restart, but if we don't call `update_persisted_channel` at all we have to force-close the channel immediately.
What if, after a block comes in, we set a bool in `ChannelMonitor` indicating there's been a new block? Then we can change `update_persisted_channel` to pass in whether a full sync is needed, to avoid a big chain resync. Can also check the new bool in `timer_tick` and call `update_persisted_channel` then, if we also make the `ChannelMonitorUpdate` an `Option`, maybe.
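A sketch of that idea — purely illustrative; the struct and the `needs_chain_persist`/`timer_tick` names are made up to show the shape, not existing fields or methods:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical sketch: track "a block connected since the last full persist"
// per-monitor, and let a periodic timer retry the full persistence instead of
// forcing it on the block-connection path.
struct MonitorPersistState {
    needs_chain_persist: AtomicBool,
}

impl MonitorPersistState {
    fn on_block_connected(&self) {
        self.needs_chain_persist.store(true, Ordering::Release);
    }

    /// Called from a periodic timer; returns true if a full re-persist of the
    /// ChannelMonitor should be attempted now.
    fn timer_tick(&self) -> bool {
        self.needs_chain_persist.swap(false, Ordering::AcqRel)
    }
}
```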
We still need to sync the monitor before we can hand any chain-generated events back to the ChannelManager, so I don't think we can avoid a sync at all.
Codecov Report
@@            Coverage Diff             @@
##             main    #1108      +/-   ##
==========================================
- Coverage   90.58%   90.40%   -0.19%
==========================================
  Files          66       67       +1
  Lines       34459    34658     +199
==========================================
+ Hits        31215    31332     +117
- Misses       3244     3326      +82

Continue to review full report at Codecov.
Force-pushed from 29c7f1e to 60a8a70
Also shoved in two commits to move …
Force-pushed from 322bc11 to bde998c
lightning/src/chain/chainmonitor.rs
Outdated
let mut ev_lock = self.event_mutex.lock().unwrap();
txn_outputs = process(monitor, txdata);
log_trace!(self.logger, "Syncing Channel Monitor for channel {}", log_funding_info!(monitor));
if let Err(()) = self.persister.sync_persisted_channel(*funding_outpoint, monitor) {
Do you have plans to aggregate updates across multiple blocks (in case of a restart)? Or maybe that is also problematic if offline for a while and wouldn't persist until the end of the sync. 😕
I think we're "just fine" here, somewhat due to an API quirk. The on-load replay is supposed to be per-ChannelMonitor, and not via the chain::Watch/ChainMonitor, or at least we recommend it. Technically I think a user could use ChainMonitor during that time, but at least we don't?
Re: the first commit, would it be possible to separate out the fix for the redundant chain sync from the fix for the duplicate PaymentSent scenario? Seems that might ease review.
Force-pushed from bde998c to 96d9895
Making this a draft for now, I think it'll end up depending on a new test-refactor/ChainMonitor-api-refactor PR.
Force-pushed from 96d9895 to 66cc962
This is now based on #1112 and should have largely addressed the feedback.
Force-pushed from 374e622 to d3f0772
Force-pushed from d3f0772 to 33c37f0
Force-pushed from 895d4ce to f73999d
Tested as a part of #1130.
Force-pushed from 4725103 to f0aa603
monitor_state.last_chain_persist_height.load(Ordering::Acquire) + LATENCY_GRACE_PERIOD_BLOCKS as usize
    > self.highest_chain_height.load(Ordering::Acquire)
Can there be an edge case around reorgs here where `highest_chain_height` is reduced while blocks are disconnected, `release_pending_monitor_events` is called, and then the new blocks are connected?
Hmm, yea, was thinking it wasn't worth worrying about, but it's easy enough to just `fetch_max`, so I did that.
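For reference, the monotonic-maximum semantics in question, via `AtomicUsize::fetch_max` (available on newer MSRVs; a compare-exchange loop gives the same result otherwise). The function name here is illustrative:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// The stored height only ever moves forward, so a best_block_updated call
// carrying a lower (reorged) height can't roll it back under us.
fn bump_highest_chain_height(highest_chain_height: &AtomicUsize, new_height: usize) {
    highest_chain_height.fetch_max(new_height, Ordering::AcqRel);
}
```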
Thanks for adding the testing! Almost ready to sign off
entry.insert(MonitorHolder {
    monitor,
    pending_monitor_updates: Mutex::new(pending_monitor_updates),
    channel_perm_failed: AtomicBool::new(false),
Could we be a bit smarter and check `monitor.locked_from_offchain`? Or is there any way to query the monitor rather than defaulting to false?
I don't think we care? If we get `locked_from_offchain` that means ChannelManager told us the channel is closed, so we're probably not gonna get into `watch_channel` at that point. The only case where it can be perm-failed at this point is if we failed to persist above, but then we'd have returned early already.
We also take this opportunity to drop byte_utils::le64_to_array, as our MSRV now supports the native to_le_bytes() call.
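For reference, the standard-library replacement in question, shown in a trivial wrapper:

```rust
// Native replacement for the old byte_utils::le64_to_array helper.
fn serialize_u64_le(v: u64) -> [u8; 8] {
    v.to_le_bytes()
}
```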
Squashed all fixups except for new ones from the latest round of feedback.
Force-pushed from 3eb7879 to 2a22a16
lightning/src/chain/chainmonitor.rs
Outdated
let mut old_height = self.highest_chain_height.load(Ordering::Relaxed);
while self.highest_chain_height
    .compare_exchange(old_height, height as usize, Ordering::AcqRel, Ordering::Relaxed).is_err()
{
    old_height = self.highest_chain_height.load(Ordering::Acquire);
}
So taking a closer look, we actually never call `process_chain_data` when blocks are disconnected, so `highest_chain_height` won't be reduced. I suppose a user could call `best_block_updated` with a smaller height, though, and there's some logic in `ChannelMonitor` to handle this case. But when given a smaller height, this wouldn't be equivalent to `fetch_max`, would it?
I believe it's still an issue in 2-block reorgs - you may disconnect two blocks and then connect one and see a height that is one lower than the target.
And, oops, yes, added the requisite compare :)
Force-pushed from 8f47cf8 to c1c5462
This should be pretty much good to go, a couple comments outstanding.
lightning/src/chain/chainmonitor.rs
Outdated
/// it is up to you to maintain a correct mapping between the outpoint and the
/// stored channel data). Note that you **must** persist every new monitor to
/// disk. See the `Persist` trait documentation for more details.
/// Persist a new channel's data. The data can be stored any way you want, but the identifier
Just wanted to circle back on this. We could possibly rename `persist_new_channel` to `persist_channel` and pass it an enum stating the type of persistence: new channel, re-persistence, etc. Then get rid of the `Option` in the `update_persisted_channel` parameter. But that may involve piping the enum through `chain::Watch`? I don't feel strongly about it, but the docs may need to be clarified here.
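For concreteness, the kind of shape being suggested — every name here (`PersistenceKind`, `persist_channel`, `PersistAlt`) is hypothetical and not in the crate, and the import paths only approximate the crate at the time:

```rust
// Hypothetical alternative shape; not proposing exact names.
use lightning::chain::channelmonitor::{ChannelMonitor, ChannelMonitorUpdate, ChannelMonitorUpdateErr};
use lightning::chain::keysinterface::Sign;
use lightning::chain::transaction::OutPoint;

enum PersistenceKind<'a> {
    NewChannel,
    Update(&'a ChannelMonitorUpdate),
    FullResync, // e.g. chain-sync persistence, or retry after a failed sync
}

trait PersistAlt<ChannelSigner: Sign> {
    fn persist_channel(
        &self, id: OutPoint, kind: PersistenceKind<'_>,
        data: &ChannelMonitor<ChannelSigner>,
    ) -> Result<(), ChannelMonitorUpdateErr>;
}
```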
lightning/src/chain/chainmonitor.rs
Outdated
/// it is up to you to maintain a correct mapping between the outpoint and the
/// stored channel data). Note that you **must** persist every new monitor to
/// disk. See the `Persist` trait documentation for more details.
/// Persist a new channel's data. The data can be stored any way you want, but the identifier
Right, my concern is also that the `chain::Watch` should (and should be able to) reject new ChannelMonitors for the same outpoint twice. I believe we actually rely on this behavior somewhat to avoid some attacks. If we have an enum that controls whether the user is supposed to return an error or not, I'm very worried they just won't do it.
Good point. Happy to leave it as is.
Which docs do you think need to be updated?
Ah, just that this is not necessarily a new channel, since it may be called by `watch_channel` upon restart as per `ChannelManagerReadArgs`' docs.
lightning/src/chain/chainmonitor.rs
Outdated
let mut old_height = self.highest_chain_height.load(Ordering::Relaxed);
let new_height = height as usize;
while new_height > old_height && self.highest_chain_height
    .compare_exchange(old_height, new_height, Ordering::AcqRel, Ordering::Relaxed).is_err()
{
    old_height = self.highest_chain_height.load(Ordering::Acquire);
}
I'd find this simpler/more readable:

let old_height = ..
let new_height = ..
if new_height > old_height {
    self.highest_chain_height.store(new_height)
}

maybe I'm missing something here?
Duh, yes, can do that as long as it's under the lock. I've moved the code down and simplified.
ACK 240fc03
Force-pushed from 240fc03 to a65686c
Squashed without changes, diff from Val's ACK (will land after CI):
In the next commit we'll need ChainMonitor to "see" when a monitor persistence completes, which means `monitor_updated` needs to move to `ChainMonitor`. The simplest way to then communicate that information to `ChannelManager` is via `MonitorEvent`s, which seems to line up ok, even if they're now constructed in multiple different places.
In the next commit, we'll be originating monitor updates both from the ChainMonitor and from the ChannelManager, making simple sequential update IDs impossible. Further, the existing async monitor update API was somewhat hard to work with - instead of being able to generate monitor_updated callbacks whenever a persistence process finishes, you had to ensure you only did so at least once all previous updates had also been persisted. Here we eat the complexity for the user by moving to an opaque type for monitor updates, tracking which updates are in-flight for the user and only generating monitor-persisted events once all pending updates have been committed.
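The bookkeeping this describes amounts to roughly the following — a simplified sketch, not the actual `ChainMonitor` code; the real code uses an opaque update-id type rather than the stand-in shown here:

```rust
use std::collections::HashSet;

// Simplified sketch of the in-flight tracking: updates are tracked by an
// opaque id, and a "monitor persisted" event is only surfaced once the set of
// pending ids drains to empty.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct UpdateId(u64); // stand-in for the real opaque identifier

#[derive(Default)]
struct PendingUpdates {
    in_flight: HashSet<UpdateId>,
}

impl PendingUpdates {
    fn persistence_started(&mut self, id: UpdateId) {
        self.in_flight.insert(id);
    }

    /// Returns true if a monitor-persisted event should now be generated.
    fn persistence_completed(&mut self, id: UpdateId) -> bool {
        self.in_flight.remove(&id);
        self.in_flight.is_empty()
    }
}
```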
Force-pushed from a65686c to a3d07e6
Oops, added missing doclinks and dropped a link in private docs that needed linking, will land after CI:
This resolves several user complaints (and issues in the sample node) where startup is substantially delayed as we're always waiting for the chain data to sync. Further, in an upcoming PR, we'll be reloading pending payments from ChannelMonitors on restart, at which point we'll need the change here which avoids handling events until after the user has confirmed the `ChannelMonitor` has been persisted to disk. It will avoid a race where we

* send a payment/HTLC (persisting the monitor to disk with the HTLC pending),
* force-close the channel, removing the channel entry from the ChannelManager entirely,
* persist the ChannelManager,
* connect a block which contains a fulfill of the HTLC, generating a claim event,
* handle the claim event while the `ChannelMonitor` is being persisted,
* persist the ChannelManager (before the ChannelMonitor is persisted fully),
* restart, reloading the HTLC as a pending payment in the ChannelManager, which now has no references to it except from the ChannelMonitor which still has the pending HTLC,
* replay the block connection, generating a duplicate PaymentSent event.
ChannelMonitors now require that they be re-persisted before MonitorEvents are provided to the ChannelManager, the exact thing that test_dup_htlc_onchain_fails_on_reload was testing for when it *didn't* happen. As such, test_dup_htlc_onchain_fails_on_reload is now testing that we behave correctly when the API guarantees are not met, something we don't need to do. Here, we adapt it to test the new API requirements through ChainMonitor's calls to the Persist trait instead.
If we have a `ChannelMonitor` update from an on-chain event which returns a `TemporaryFailure`, we block `MonitorEvent`s from that `ChannelMonitor` until the update is persisted. This prevents duplicate payment send events to the user after payments get reloaded from monitors on restart. However, if the event being avoided isn't going to generate a PaymentSent, but instead results in us claiming an HTLC from an upstream channel (i.e. the HTLC was forwarded), then the result of a user delaying the event is that we delay getting our money, not a duplicate event. Because user persistence may take an arbitrary amount of time, we need to bound the amount of time we can possibly wait to return events, which we do here by bounding it to 3 blocks. Thanks to Val for catching this in review.
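Concretely, the bound works out to the check quoted in the diff earlier; a self-contained sketch of that condition, with the function name illustrative and the 3-block constant taken from the text above:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Value per the "bounding it to 3 blocks" description above.
const LATENCY_GRACE_PERIOD_BLOCKS: u32 = 3;

// Sketch: while a chain-sync persistence is still pending, events stay held
// back only as long as this returns true.
fn still_within_grace_period(
    last_chain_persist_height: &AtomicUsize, highest_chain_height: &AtomicUsize,
) -> bool {
    last_chain_persist_height.load(Ordering::Acquire) + LATENCY_GRACE_PERIOD_BLOCKS as usize
        > highest_chain_height.load(Ordering::Acquire)
}
```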
It's somewhat confusing that `persist_new_channel` is called on startup for an existing channel in common deployments, so we call it out explicitly.
Force-pushed from a3d07e6 to 6fb5bd3