Persist ChannelMonitors after new blocks are connected #1108


Merged

Conversation

TheBlueMatt
Collaborator

This resolves several user complaints (and issues in the sample
node) where startup is substantially delayed as we're always
waiting for the chain data to sync.

Further, in an upcoming PR, we'll be reloading pending payments
from ChannelMonitors on restart, at which point we'll need the
change here which avoids handling events until after the user
has confirmed the ChannelMonitor has been persisted to disk.
It will avoid a race where we

  • send a payment/HTLC (persisting the monitor to disk with the
    HTLC pending),
  • force-close the channel, removing the channel entry from the
    ChannelManager entirely,
  • persist the ChannelManager,
  • connect a block which contains a fulfill of the HTLC, generating
    a claim event,
  • handle the claim event while the ChannelMonitor is being
    persisted,
  • persist the ChannelManager (before the ChannelMonitor is
    persisted fully),
  • restart, reloading the HTLC as a pending payment in the
    ChannelManager, which now has no references to it except from
    the ChannelMonitor which still has the pending HTLC,
  • replay the block connection, generating a duplicate PaymentSent
    event.

Comment on lines 2942 to 2947
/// Failures here do not imply the channel will be force-closed, however any future calls to
/// [`update_persisted_channel`] after an error is returned here MUST either persist the full,
/// updated [`ChannelMonitor`] provided to [`update_persisted_channel`] or return
/// [`ChannelMonitorUpdateErr::PermanentFailure`], force-closing the channel. In other words,
/// any future calls to [`update_persisted_channel`] after an error here MUST NOT persist the
/// [`ChannelMonitorUpdate`] alone.
Contributor

Could we simplify this by making update_persisted_channel take an Option<&ChannelMonitorUpdate> and forgoing adding a sync_persisted_channel method to the trait? Then when the user sees None they must write the full ChannelMonitor.
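For illustration, a minimal sketch of the suggested shape, using stand-in types rather than LDK's real generics and signatures so it stands alone (the names here are placeholders, not the actual API):

// Stand-in types so the sketch is self-contained; LDK's real trait is
// generic over a signer and uses its own outpoint/monitor/update types.
pub struct OutPoint;
pub struct ChannelMonitor;
pub struct ChannelMonitorUpdate;
pub enum ChannelMonitorUpdateErr { TemporaryFailure, PermanentFailure }

pub trait Persist {
    // Persist a brand-new channel's full monitor.
    fn persist_new_channel(&self, funding_txo: OutPoint, monitor: &ChannelMonitor)
        -> Result<(), ChannelMonitorUpdateErr>;

    // `update` is None when no incremental update is available (e.g. a
    // chain-sync re-persist); the implementor must then write out the full
    // `monitor` rather than an update delta.
    fn update_persisted_channel(&self, funding_txo: OutPoint,
        update: Option<&ChannelMonitorUpdate>, monitor: &ChannelMonitor)
        -> Result<(), ChannelMonitorUpdateErr>;
}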

Collaborator Author

The return value distinction is the harder part to capture - users don't really have the option to return a PermanentFailure here as the call site (the ChainMonitor) doesn't have the ability to force-close a channel.

Contributor

Ok, I think I understand the subtlety here. But shouldn't we not call update_persisted_channel once sync_persisted_channel returns an error rather than rely on the user to implement update_persisted_channel correctly? Or at the very least never pass them a ChannelMonitorUpdate there (making it an Option) even if we still need the extra method.

Collaborator Author

Yea, that's a good point, I guess I hadn't thought about making the update an Option and skipping it if we've marked a channel as "previous sync didn't persist". The "don't call update_persisted_channel" change, though, feels wrong - technically, if the user complies with the API docs we can continue operating fine after a restart, but if we don't call update_persisted_channel at all we have to force-close the channel immediately.

Contributor

What if after a block comes in, we set a bool in ChannelMonitor indicating there’s been a new block? Then we can change update_persisted_channel to pass in whether a full sync is needed to avoid a big chain resync.

Can also check the new bool in timer_tick and call update_persisted_channel then, if we also make the ChannelMonitorUpdate an Option maybe

Collaborator Author

We still need to sync the monitor before we can hand any chain-generated events back to the ChannelManager, so I don't think we can avoid a sync at all.

@codecov

codecov bot commented Oct 5, 2021

Codecov Report

Merging #1108 (240fc03) into main (dda86a0) will decrease coverage by 0.18%.
The diff coverage is 90.27%.

❗ Current head 240fc03 differs from pull request most recent head 6fb5bd3. Consider uploading reports for the commit 6fb5bd3 to get more accurate results

@@            Coverage Diff             @@
##             main    #1108      +/-   ##
==========================================
- Coverage   90.58%   90.40%   -0.19%     
==========================================
  Files          66       67       +1     
  Lines       34459    34658     +199     
==========================================
+ Hits        31215    31332     +117     
- Misses       3244     3326      +82     
Impacted Files Coverage Δ
lightning/src/chain/mod.rs 61.11% <ø> (ø)
lightning/src/util/byte_utils.rs 100.00% <ø> (ø)
lightning/src/chain/channelmonitor.rs 90.94% <64.70%> (-0.32%) ⬇️
lightning/src/ln/channelmanager.rs 83.61% <77.77%> (-1.52%) ⬇️
lightning/src/chain/chainmonitor.rs 90.87% <85.71%> (-5.49%) ⬇️
lightning-persister/src/lib.rs 94.30% <100.00%> (+0.09%) ⬆️
lightning/src/ln/chanmon_update_fail_tests.rs 97.65% <100.00%> (ø)
lightning/src/ln/functional_tests.rs 97.30% <100.00%> (-0.03%) ⬇️
lightning/src/ln/peer_handler.rs 45.67% <100.00%> (-0.21%) ⬇️
lightning/src/util/atomic_counter.rs 100.00% <100.00%> (ø)
... and 9 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-persist-mon-blocks branch from 29c7f1e to 60a8a70 Compare October 5, 2021 18:11
@TheBlueMatt
Collaborator Author

Also shoved in two commits to move Persist and ChannelMonitorUpdateErr because their locations were just wrong.

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-persist-mon-blocks branch 2 times, most recently from 322bc11 to bde998c Compare October 5, 2021 18:38

let mut ev_lock = self.event_mutex.lock().unwrap();
txn_outputs = process(monitor, txdata);
log_trace!(self.logger, "Syncing Channel Monitor for channel {}", log_funding_info!(monitor));
if let Err(()) = self.persister.sync_persisted_channel(*funding_outpoint, monitor) {
Contributor

Do you have plans to aggregate updates across multiple blocks (in case of a restart)? Or maybe that is also problematic if offline for awhile and wouldn't persist until the end of the sync. 😕

Collaborator Author

I think we're "just fine" here, somewhat due to an API quirk. The on-load replay is supposed to be per-channelmonitor, and not via the chain::Watch/ChainMonitor, or at least we recommend it. Technically I think a user could use ChainMontior during that time, but at least we dont?

@valentinewallace
Contributor

Re: the first commit, would it be possible to separate out the fix for the redundant chain sync from the fix for the duplicate PaymentSent scenario? Seems that might ease review

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-persist-mon-blocks branch from bde998c to 96d9895 Compare October 7, 2021 20:38
@TheBlueMatt TheBlueMatt marked this pull request as draft October 8, 2021 18:57
@TheBlueMatt
Collaborator Author

Making this a draft for now; I think it'll end up depending on a new test-refactor/ChainMonitor-api-refactor PR.

@TheBlueMatt
Collaborator Author

This is now based on #1112 and should have largely addressed the feedback.

@TheBlueMatt TheBlueMatt marked this pull request as ready for review October 8, 2021 23:35
@TheBlueMatt TheBlueMatt force-pushed the 2021-10-persist-mon-blocks branch 6 times, most recently from 374e622 to d3f0772 Compare October 10, 2021 23:34
@TheBlueMatt TheBlueMatt force-pushed the 2021-10-persist-mon-blocks branch from d3f0772 to 33c37f0 Compare October 13, 2021 18:44
@TheBlueMatt TheBlueMatt force-pushed the 2021-10-persist-mon-blocks branch 2 times, most recently from 895d4ce to f73999d Compare October 13, 2021 20:07
@TheBlueMatt
Collaborator Author

Tested as a part of #1130.

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-persist-mon-blocks branch from 4725103 to f0aa603 Compare October 17, 2021 22:59
Comment on lines +663 to +677
monitor_state.last_chain_persist_height.load(Ordering::Acquire) + LATENCY_GRACE_PERIOD_BLOCKS as usize
> self.highest_chain_height.load(Ordering::Acquire)
Contributor

Can there be an edge case around reorgs here where highest_chain_height is reduced while blocks are disconnected, release_pending_monitor_events is called, and then the new blocks are connected?

Collaborator Author

Hmm, yea, was thinking it wasn't worth worrying about, but it's easy enough to just fetch_max, so I did that.
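For reference, the std-library form of the "just fetch_max" approach looks roughly like this (a standalone sketch; as the later diff notes, AtomicUsize::fetch_max wasn't stabilized until Rust 1.45, so the PR ends up emulating it manually):

use std::sync::atomic::{AtomicUsize, Ordering};

// fetch_max atomically keeps the larger of the stored and supplied values,
// so a stale (lower) height seen mid-reorg can never overwrite a newer one.
fn record_best_height(highest_chain_height: &AtomicUsize, height: u32) {
    highest_chain_height.fetch_max(height as usize, Ordering::AcqRel);
}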

@valentinewallace valentinewallace removed the MOAR TEST PLZ NEEDS MOAR TEST label Oct 18, 2021
Contributor

@valentinewallace valentinewallace left a comment

Thanks for adding the testing! Almost ready to sign off

entry.insert(MonitorHolder {
monitor,
pending_monitor_updates: Mutex::new(pending_monitor_updates),
channel_perm_failed: AtomicBool::new(false),
Contributor

Could we be a bit smarter and check monitor.locked_from_offchain? Or, is there any way to query the monitor rather than defaulting to false?

Collaborator Author

I don't think we care? If we get locked_from_offchain that means ChannelManager told us the channel is closed, so we're probably not gonna get into watch_channel at that point. The only case where it can be perm-failed at this point is if we failed to persist above, but then we'd have returned early already.

We also take this opportunity to drop byte_utils::le64_to_array, as
our MSRV now supports the native to_le_bytes() call.
@TheBlueMatt
Collaborator Author

Squashed all fixups except for new ones from the latest round of feedback.

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-persist-mon-blocks branch 2 times, most recently from 3eb7879 to 2a22a16 Compare October 18, 2021 23:27
Comment on lines 266 to 271
let mut old_height = self.highest_chain_height.load(Ordering::Relaxed);
while self.highest_chain_height
.compare_exchange(old_height, height as usize, Ordering::AcqRel, Ordering::Relaxed).is_err()
{
old_height = self.highest_chain_height.load(Ordering::Acquire);
}
Contributor

So taking a closer look, we actually never call process_chain_data when blocks are disconnected, so highest_chain_height won't be reduced. I suppose a user could call best_block_updated with a smaller height, though, and there's some logic in ChannelMonitor to handle this case.

But when given a smaller height, won't this not be equivalent to fetch_max?

Collaborator Author

I believe it's still an issue in 2-block reorgs - you may disconnect two blocks and then connect one and see a height that is one lower than the target.

And, oops, yes, added the requisite compare :)

Contributor

@valentinewallace valentinewallace left a comment

Looks good (mod Jeff's outstanding comments)! 🚀

If anything, I wonder if testing is a touch light given the edge case-yness, but fine to chat more about that in #1130.

I'd be happy to check out #1104 next if that can get a rebase and CI fix :)

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-persist-mon-blocks branch from 8f47cf8 to c1c5462 Compare October 19, 2021 17:52
Contributor

@jkczyz jkczyz left a comment

This should be pretty much good to go, a couple comments outstanding.

/// it is up to you to maintain a correct mapping between the outpoint and the
/// stored channel data). Note that you **must** persist every new monitor to
/// disk. See the `Persist` trait documentation for more details.
/// Persist a new channel's data. The data can be stored any way you want, but the identifier
Contributor

Just wanted to circle back on this. We could possibly rename persist_new_channel to persist_channel and pass it an enum stating the type of persistence: new channel, re-persistence, etc. Then get rid of the Option in the update_persisted_channel parameter. But that may involve piping the enum through chain::Watch? I don't feel strongly about it, but the docs may need to be clarified here.
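Sketched out, the floated alternative might have looked something like this (hypothetical names only; this shape was not adopted, and stand-in types keep the sketch self-contained):

pub struct OutPoint;
pub struct ChannelMonitor;
pub enum ChannelMonitorUpdateErr { TemporaryFailure, PermanentFailure }

// Why the full monitor is being written out.
pub enum PersistenceKind {
    NewChannel,     // first persistence of a freshly opened channel
    RePersistence,  // e.g. re-persisted on startup or after a chain sync
}

pub trait Persist {
    // One entry point instead of persist_new_channel plus an Option-carrying
    // update_persisted_channel; the kind tells the implementor what happened.
    fn persist_channel(&self, funding_txo: OutPoint, kind: PersistenceKind,
        monitor: &ChannelMonitor) -> Result<(), ChannelMonitorUpdateErr>;
}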

Contributor

Right, my concern is also that the chain::Watch should, or at least should be able to, reject new ChannelMonitors for the same outpoint twice. I believe we actually rely on this behavior somewhat to avoid some attacks. If we have an enum that controls whether the user is supposed to return an error or not, I'm very worried they just won't do it.

Good point. Happy to leave it as is.

Which docs do you think need to be updated?

Ah, just that this is not necessarily a new channel since it may be called by watch_channel upon restart as per ChannelManagerReadArgs' docs.

Comment on lines 268 to 274
let mut old_height = self.highest_chain_height.load(Ordering::Relaxed);
let new_height = height as usize;
while new_height > old_height && self.highest_chain_height
.compare_exchange(old_height, new_height, Ordering::AcqRel, Ordering::Relaxed).is_err()
{
old_height = self.highest_chain_height.load(Ordering::Acquire);
}
Contributor

I'd find this simpler/more readable:

let old_height = ..
let new_height = ..
if new_height > old_height {
    self.highest_chain_height.store(new_height)
}

maybe I'm missing something here?

Collaborator Author

Duh, yes, can do that as long as it's under the lock. I've moved the code down and simplified.

Contributor

@jkczyz jkczyz left a comment

ACK 240fc03

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-persist-mon-blocks branch from 240fc03 to a65686c Compare October 19, 2021 22:32
@TheBlueMatt
Collaborator Author

Squashed without changes, diff from Val's ACK (will land after CI):

$ git diff-tree -U1 1816e2b96884961e04e1041a9d0f88ac347b6a33 a65686cc4
diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs
index 0e66c352d..b52eb20a7 100644
--- a/lightning/src/chain/chainmonitor.rs
+++ b/lightning/src/chain/chainmonitor.rs
@@ -264,15 +264,14 @@ where C::Target: chain::Filter,
 		let mut dependent_txdata = Vec::new();
-		if let Some(height) = best_height {
-			// Sadly AtomicUsize::fetch_max wasn't stabilized until 1.45, so until then we have to
-			// manually CAS.
-			let mut old_height = self.highest_chain_height.load(Ordering::Relaxed);
-			let new_height = height as usize;
-			while new_height > old_height && self.highest_chain_height
-				.compare_exchange(old_height, new_height, Ordering::AcqRel, Ordering::Relaxed).is_err()
-			{
-				old_height = self.highest_chain_height.load(Ordering::Acquire);
-			}
-		}
 		{
 			let monitor_states = self.monitors.write().unwrap();
+			if let Some(height) = best_height {
+				// If the best block height is being updated, update highest_chain_height under the
+				// monitors write lock.
+				let old_height = self.highest_chain_height.load(Ordering::Acquire);
+				let new_height = height as usize;
+				if new_height > old_height {
+					self.highest_chain_height.store(new_height, Ordering::Release);
+				}
+			}
+
 			for (funding_outpoint, monitor_state) in monitor_states.iter() {
$

In the next commit we'll need ChainMonitor to "see" when a monitor
persistence completes, which means `monitor_updated` needs to move
to `ChainMonitor`. The simplest way to then communicate that
information to `ChannelManager` is via `MonitorEvent`s, which seems
to line up ok, even if they're now constructed by multiple
different places.
In the next commit, we'll be originating monitor updates both from
the ChainMonitor and from the ChannelManager, making simple
sequential update IDs impossible.

Further, the existing async monitor update API was somewhat hard to
work with - instead of being able to generate monitor_updated
callbacks whenever a persistence process finishes, you had to
ensure you only did so at least once all previous updates had also
been persisted.

Here we eat the complexity for the user by moving to an opaque
type for monitor updates, tracking which updates are in-flight for
the user and only generating monitor-persisted events once all
pending updates have been committed.
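A minimal sketch of that bookkeeping, with hypothetical names and none of the real locking or ordering concerns: each in-flight persistence is tagged with an opaque id, and the "all persisted" signal fires only once the last pending id for a monitor clears.

use std::collections::HashSet;
use std::sync::Mutex;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct UpdateId(u64);

struct MonitorHolder {
    // Ids of persistences started but not yet confirmed complete.
    pending_updates: Mutex<HashSet<UpdateId>>,
}

impl MonitorHolder {
    fn persistence_started(&self, id: UpdateId) {
        self.pending_updates.lock().unwrap().insert(id);
    }

    // Returns true only when this completion drains the last in-flight
    // update, i.e. the point at which a monitor-persisted event may be
    // handed back to the ChannelManager.
    fn persistence_completed(&self, id: UpdateId) -> bool {
        let mut pending = self.pending_updates.lock().unwrap();
        pending.remove(&id);
        pending.is_empty()
    }
}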
@TheBlueMatt TheBlueMatt force-pushed the 2021-10-persist-mon-blocks branch from a65686c to a3d07e6 Compare October 19, 2021 23:49
@TheBlueMatt
Collaborator Author

TheBlueMatt commented Oct 19, 2021

Oops, added missing doclinks and dropped a link in private docs that would otherwise have needed a link definition. Will land after CI:

$ git diff-tree -U1 a65686cc 6fb5bd36a
diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs
index b52eb20a7..71b0b3e50 100644
--- a/lightning/src/chain/chainmonitor.rs
+++ b/lightning/src/chain/chainmonitor.rs
@@ -50,3 +50,3 @@ use core::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
 enum UpdateOrigin {
-	/// An update that was generated by the [`ChannelManager`] (via our `chain::Watch`
+	/// An update that was generated by the `ChannelManager` (via our `chain::Watch`
 	/// implementation). This corresponds to an actual [`ChannelMonitorUpdate::update_id`] field
@@ -107,2 +107,3 @@ pub trait Persist<ChannelSigner: Sign> {
 	///
+	/// [`ChannelManager`]: crate::ln::channelmanager::ChannelManager
 	/// [`Writeable::write`]: crate::util::ser::Writeable::write
diff --git a/lightning/src/chain/mod.rs b/lightning/src/chain/mod.rs
index fbe22e6ed..25e5a97d2 100644
--- a/lightning/src/chain/mod.rs
+++ b/lightning/src/chain/mod.rs
@@ -216,2 +216,4 @@ pub enum ChannelMonitorUpdateErr {
 	/// updates will return TemporaryFailure until the remote copies could be updated.
+	///
+	/// [`ChainMonitor::channel_monitor_updated`]: chainmonitor::ChainMonitor::channel_monitor_updated
 	TemporaryFailure,
$

This resolves several user complaints (and issues in the sample
node) where startup is substantially delayed as we're always
waiting for the chain data to sync.

Further, in an upcoming PR, we'll be reloading pending payments
from ChannelMonitors on restart, at which point we'll need the
change here which avoids handling events until after the user
has confirmed the `ChannelMonitor` has been persisted to disk.
It will avoid a race where we
 * send a payment/HTLC (persisting the monitor to disk with the
   HTLC pending),
 * force-close the channel, removing the channel entry from the
   ChannelManager entirely,
 * persist the ChannelManager,
 * connect a block which contains a fulfill of the HTLC, generating
   a claim event,
 * handle the claim event while the `ChannelMonitor` is being
   persisted,
 * persist the ChannelManager (before the ChannelMonitor is
   persisted fully),
 * restart, reloading the HTLC as a pending payment in the
   ChannelManager, which now has no references to it except from
   the ChannelMonitor which still has the pending HTLC,
 * replay the block connection, generating a duplicate PaymentSent
   event.
ChannelMonitors now require that they be re-persisted before
MonitorEvents are provided to the ChannelManager, the exact thing
that test_dup_htlc_onchain_fails_on_reload was testing for when it
*didn't* happen. As such, test_dup_htlc_onchain_fails_on_reload is
now testing that we behave correctly when the API guarantees are not
met, something we don't need to do.

Here, we adapt it to test the new API requirements through
ChainMonitor's calls to the Persist trait instead.
If we have a `ChannelMonitor` update from an on-chain event which
returns a `TemporaryFailure`, we block `MonitorEvent`s from that
`ChannelMonitor` until the update is persisted. This prevents
duplicate payment send events to the user after payments get
reloaded from monitors on restart.

However, if the event being avoided isn't going to generate a
PaymentSent, but instead result in us claiming an HTLC from an
upstream channel (ie the HTLC was forwarded), then the result of a
user delaying the event is that we delay getting our money, not a
duplicate event.

Because user persistence may take an arbitrary amount of time, we
need to bound the amount of time we can possibly wait to return
events, which we do here by bounding it to 3 blocks.

Thanks to Val for catching this in review.
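In sketch form, the release rule described in that commit message (a simplification under the stated assumptions; the real ChainMonitor tracks these heights per-monitor with atomics under its monitor lock):

// Grace period in blocks; the commit above bounds the wait to 3 blocks.
const LATENCY_GRACE_PERIOD_BLOCKS: usize = 3;

// Hold a monitor's events back only while a chain-triggered persistence is
// still in flight *and* we are within the grace window; once the window has
// passed, the wait is bounded and events are released regardless.
fn should_release_events(chain_sync_persist_pending: bool,
    last_chain_persist_height: usize, highest_chain_height: usize) -> bool {
    let within_grace =
        last_chain_persist_height + LATENCY_GRACE_PERIOD_BLOCKS > highest_chain_height;
    !(chain_sync_persist_pending && within_grace)
}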
It's somewhat confusing that `persist_new_channel` is called on
startup for an existing channel in common deployments, so we call
it out explicitly.
@TheBlueMatt TheBlueMatt force-pushed the 2021-10-persist-mon-blocks branch from a3d07e6 to 6fb5bd3 Compare October 20, 2021 00:06
@TheBlueMatt TheBlueMatt merged commit 107c6c7 into lightningdevkit:main Oct 20, 2021