ChannelManager persistence #752


Merged

Conversation

valentinewallace
Contributor

@valentinewallace valentinewallace commented Nov 17, 2020

Closes #743.

Current todos:

  • abstract out utilities for writing to disk/forming platform-agnostic filepaths, etc
  • rename FilesystemPersister to MonitorPersister? Or somehow align the names of the two persisters
  • make the manager persister test Windows-friendly
  • add more unit tests, currently there's just one integration test
  • make sure the ChannelManager signals for a new update whenever it needs to, and doesn't signal when it doesn't need to

@codecov

codecov bot commented Nov 17, 2020

Codecov Report

Merging #752 (a368093) into main (13e990e) will increase coverage by 0.00%.
The diff coverage is 95.69%.


@@           Coverage Diff           @@
##             main     #752   +/-   ##
=======================================
  Coverage   90.79%   90.80%           
=======================================
  Files          44       45    +1     
  Lines       24466    24547   +81     
=======================================
+ Hits        22215    22290   +75     
- Misses       2251     2257    +6     
Impacted Files Coverage Δ
lightning-persister/src/lib.rs 92.23% <83.33%> (-4.07%) ⬇️
lightning-persister/src/util.rs 95.83% <95.83%> (ø)
lightning/src/ln/channelmanager.rs 85.24% <97.02%> (+0.33%) ⬆️
lightning/src/util/macro_logger.rs 88.33% <100.00%> (ø)
lightning/src/ln/functional_tests.rs 97.01% <0.00%> (+0.03%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jkczyz jkczyz self-requested a review November 17, 2020 23:27
@valentinewallace valentinewallace force-pushed the chanman-persistence branch 3 times, most recently from 3352f40 to cb20f3b Compare November 19, 2020 18:23
@valentinewallace valentinewallace marked this pull request as ready for review November 19, 2020 21:20
@jkczyz
Contributor

jkczyz commented Nov 20, 2020

  • abstract out utilities for writing to disk/forming platform-agnostic filepaths, etc
  • rename FilesystemPersister to MonitorPersister? Or somehow align the names of the two persisters

I thought a bit more about this after our offline discussion and reviewing the code. My idea would be to expand FilesystemPersister to cover this functionality but do so by implementing a separate channelmanager::Persist trait. This trait would consist of a provided method and a required method. The provided method would implement any logic common to any persister pertaining to interaction with ChannelManager (e.g., starting a thread and interacting with the condition variable). And then users would implement the required method for the actual persistence (e.g., FilesystemPersister would implement this to persist to disk). The provided method would call the required method as needed.
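A rough sketch of the template-method shape being proposed here (all names and signatures below are illustrative, not the PR's actual code; `ChannelManager` is an empty stand-in for the real type):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Illustrative stand-in for the real ChannelManager.
pub struct ChannelManager;

// Hypothetical trait: one provided method (the common wake-up loop) and one
// required method (the actual persistence).
pub trait Persist: Send + Sized + 'static {
    // Required: write the serialized manager wherever the implementor likes.
    fn persist_manager(&self, manager: &ChannelManager) -> Result<(), std::io::Error>;

    // Provided: spawn a background thread that waits on a condition variable
    // and calls persist_manager each time the manager signals an update.
    fn start(self, manager: Arc<ChannelManager>, notifier: Arc<(Mutex<bool>, Condvar)>) {
        let _ = thread::spawn(move || loop {
            let (lock, cvar) = (&notifier.0, &notifier.1);
            let mut dirty = lock.lock().unwrap();
            while !*dirty {
                dirty = cvar.wait(dirty).unwrap();
            }
            *dirty = false;
            drop(dirty);
            if self.persist_manager(&manager).is_err() {
                break;
            }
        });
    }
}
```

The provided `start` owns the wake-up loop; implementors only supply `persist_manager`.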

Thoughts?

@TheBlueMatt
Collaborator

The provided method would implement any logic common to any persister pertaining to interaction with ChannelManager (e.g., starting a thread and interacting with the condition variable).

Hmm, what's the value of trait-ing it, then? I admit it's definitely nice to have it be a similar API as the Channel/ChainMonitor stuff, but it just seems like additional re-entrancy to user code for what could be a utility function. Maybe if we were moving towards ChannelManager calling "write me to disk now" on some trait the way we do monitor stuff it'd be more symmetric, but I'm not sure we'll ever want to do that so explicitly given there's a lot of reason to do ChannelManager disk writing in the background.

@TheBlueMatt TheBlueMatt added this to the 0.0.13 milestone Nov 21, 2020
@jkczyz
Contributor

jkczyz commented Nov 22, 2020

The provided method would implement any logic common to any persister pertaining to interaction with ChannelManager (e.g., starting a thread and interacting with the condition variable).

Hmm, what's the value of trait-ing it, then? I admit it's definitely nice to have it be a similar API as the Channel/ChainMonitor stuff, but it just seems like additional re-entrancy to user code for what could be a utility function.

The trait is a means of providing a template method. I'm presuming we (a) don't want users to write the provided logic on their own and (b) want to provide a means for users to customize how the data is persisted.

A utility function that is parameterized generically by something implementing this trait (i.e., without the provided method) would also be a reasonable approach that I'd be happy with. Is there a different way of defining a utility function that you were thinking of? Seems you would still need a trait unless you pass a closure.
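For contrast, the two parameterization options mentioned here (a one-method trait vs. a closure) look roughly like this; all names and signatures are illustrative, not the PR's actual API:

```rust
use std::io;

// Illustrative stand-in for the real ChannelManager.
struct ChannelManager;

// Option 1: a one-method trait; the utility function is generic over it.
trait Persist {
    fn persist_manager(&self, manager: &ChannelManager) -> Result<(), io::Error>;
}

fn persist_via_trait<P: Persist>(persister: &P, manager: &ChannelManager) -> Result<(), io::Error> {
    persister.persist_manager(manager)
}

// Option 2: a closure; no trait impl needed, the caller passes any matching Fn.
fn persist_via_closure<F>(callback: F, manager: &ChannelManager) -> Result<(), io::Error>
where
    F: Fn(&ChannelManager) -> Result<(), io::Error>,
{
    callback(manager)
}
```

The closure form spares callers a trait impl; the trait form gives the persister a place to carry state and a nameable bound.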

Regarding reentrancy, I don't see either of these approaches as being reentrant. Though I suppose it depends on how you are defining reentrancy.

Maybe if we were moving towards ChannelManager calling "write me to disk now" on some trait the way we do monitor stuff it'd be more symmetric, but I'm not sure we'll ever want to do that so explicitly given there's a lot of reason to do ChannelManager disk writing in the background.

That is reentrancy as I would have thought of it: ChannelManager calling back into user code. A user directly calling either a utility function or a provided trait method, which in turn calls a required trait method, is not, IMO. The same as how calling a provided method of an Iterator, which in turn calls a user-defined next method, would not be.

Am I missing something about how reentrancy is involved in the current use case? Perhaps because a thread is involved?

@valentinewallace
Contributor Author

A utility function that is parameterized generically by something implementing this trait (i.e., without the provided method) would also be a reasonable approach that I'd be happy with.

I like this approach. IIUC: so there'd be a channelmanager::Persist trait, with only one method (write_manager or so). Then we'd keep the current ManagerPersister struct (probably renamed) with its start() method, only the start() method would take in an implementer of channelmanager::Persist, and call into that in its background thread when it gets a signal to persist. Is this accurate to what you're suggesting @jkczyz?

Since the current ManagerPersister struct is gonna have other responsibilities besides persisting the manager (i.e. calling timer_chan_freshness_every_min() and possibly persisting router data), it might not have the best name right now. Maybe BackgroundProcessor or something like that?

@TheBlueMatt
Collaborator

Am I missing something about how reentrancy is involved in the current use case? Perhaps because a thread is involved?

I think I didn't really understand your proposal, then. I understood it as "move the thread-spawn logic and all of the 'start thinking about keeping the chanman on disk' logic into a trait with a template method which users could override". That, afaiu, would basically imply just ChannelManager calling that method immediately on startup and...that's it? That would seem needlessly reentrant.

@jkczyz
Contributor

jkczyz commented Nov 23, 2020

A utility function that is parameterized generically by something implementing this trait (i.e., without the provided method) would also be a reasonable approach that I'd be happy with.

I like this approach. IIUC: so there'd be a channelmanager::Persist trait, with only one method (write_manager or so). Then we'd keep the current ManagerPersister struct (probably renamed) with its start() method, only the start() method would take in an implementer of channelmanager::Persist, and call into that in its background thread when it gets a signal to persist. Is this accurate to what you're suggesting @jkczyz?

Almost.

Correct as far as the trait definition. However, the aforementioned utility function would be essentially what start is except it would be parameterized with P: channelmanager::Persist (or something that dereferences to it). It would accept a persister: P instead of filename and filepath and use it here in place of calling utils::write_to_file.

Then FilesystemPersister would implement channelmanager::Persist to write to a file. Users could also implement the trait to persist elsewhere (e.g., a database), if they prefer.
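A hedged sketch of that refined design (illustrative names, not the PR's actual code; an mpsc channel stands in here for the real "manager has updates" notification mechanism):

```rust
use std::sync::mpsc::Receiver;
use std::sync::Arc;
use std::thread::{self, JoinHandle};

// Illustrative stand-in for the real ChannelManager.
pub struct ChannelManager;

// Single-method persistence trait, per the suggestion above.
pub trait ManagerPersist: Send + 'static {
    fn persist_manager(&self, manager: &ChannelManager) -> Result<(), std::io::Error>;
}

pub struct BackgroundProcessor {
    handle: Option<JoinHandle<()>>,
}

impl BackgroundProcessor {
    // start is parameterized by the persister rather than a filename/path;
    // the background loop calls it whenever a persist signal arrives.
    pub fn start<P: ManagerPersist>(
        persister: P,
        manager: Arc<ChannelManager>,
        signals: Receiver<()>,
    ) -> Self {
        let handle = thread::spawn(move || {
            while signals.recv().is_ok() {
                let _ = persister.persist_manager(&manager);
            }
        });
        BackgroundProcessor { handle: Some(handle) }
    }

    pub fn stop(mut self) {
        if let Some(handle) = self.handle.take() {
            let _ = handle.join();
        }
    }
}
```

A FilesystemPersister-style implementor would write to a file in `persist_manager`; a database-backed one would implement the same trait differently.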

Since the current ManagerPersister struct is gonna have other responsibilities besides persisting the manager (i.e. calling timer_chan_freshness_every_min() and possibly persisting router data), it might not have the best name right now. Maybe BackgroundProcessor or something like that?

Yeah, I'd imagine this would be named differently. Note that the utility function mentioned above could instead be a BackgroundProcessor struct or such (presumably because it needs to hold the thread_terminator state). Regardless of how it is implemented, the takeaway is start would be parameterized by channelmanager::Persist.

@jkczyz
Contributor

jkczyz commented Nov 23, 2020

Am I missing something about how reentrancy is involved in the current use case? Perhaps because a thread is involved?

I think I didn't really understand your proposal, then. I understood it as "move the thread-spawn logic and all of the 'start thinking about keeping the chanman on disk' logic into a trait with a template method which users could override". That, afaiu, would basically imply just ChannelManager calling that method immediately on startup and...that's it? That would seem needlessly reentrant.

Let me know if my explanation in #752 (comment) makes it any clearer.

@jkczyz
Contributor

jkczyz commented Nov 23, 2020

A utility function that is parameterized generically by something implementing this trait (i.e., without the provided method) would also be a reasonable approach that I'd be happy with.

I like this approach. IIUC: so there'd be a channelmanager::Persist trait, with only one method (write_manager or so). Then we'd keep the current ManagerPersister struct (probably renamed) with its start() method, only the start() method would take in an implementer of channelmanager::Persist, and call into that in its background thread when it gets a signal to persist. Is this accurate to what you're suggesting @jkczyz?

Almost.

Correct as far as the trait definition. However, the aforementioned utility function would be essentially what start is except it would be parameterized with P: channelmanager::Persist (or something that dereferences to it). It would accept a persister: P instead of filename and filepath and use it here in place of calling utils::write_to_file.

Actually, I think I just repeated what you said. :) So what you said was accurate. I misread the part about start taking an implementer of channelmanager::Persist as implementing channelmanager::Persist.

@TheBlueMatt
Collaborator

Let me know if my explanation in #752 (comment) makes it any clearer.

Yep! Thanks, that's much clearer. Such a design sounds good to me.

@valentinewallace valentinewallace force-pushed the chanman-persistence branch 3 times, most recently from 20fd381 to 28237a9 Compare November 24, 2020 23:44
@valentinewallace
Contributor Author

valentinewallace commented Nov 25, 2020

Making some changes, I'll update this comment when this is good for review again.

Edit: should be good for another review

@valentinewallace valentinewallace force-pushed the chanman-persistence branch 2 times, most recently from d10cabf to 4ad0f94 Compare November 25, 2020 19:49
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Looks pretty good. Mostly nitpicking.

K: 'static + KeysInterface<ChanKeySigner=ChanSigner>,
F: 'static + FeeEstimator,
L: 'static + Logger,
PM: 'static + Send + Fn(Arc<ChannelManager<ChanSigner, Arc<M>, Arc<T>, Arc<K>, Arc<F>, Arc<L>>>) -> Result<(), std::io::Error>,
Collaborator

Is there any reason for this function to take an Arc by ownership instead of a &ChannelManager?
nit: ideally we wouldn't require a ChannelManager parameterized by Arcs for every type, though I'm not sure how you'd meet the required 'static lifetime without it, so it's not really a big deal.

Contributor Author

It's giving me this super helpful error atm: https://i.imgur.com/sD3T8A0.png

Collaborator

Heh, yea, I'd seen that too. I believe it's because the callback referenced is expecting the Arc still ('cause what it's calling is expecting the Arc).

Contributor Author

@valentinewallace valentinewallace Feb 12, 2021

Hum, even when I make FilesystemPersister::persist_manager expect a reference, I still get this error. I think what I need to do is specify the type of node in the callback, so now I'm trying to figure out how to specify type parameters in the callback closure...

Edit: figured it out

Collaborator

@TheBlueMatt TheBlueMatt left a comment

One more comment, then looks good.

}

/// Blocks until ChannelManager needs to be persisted.
pub fn wait(&self) {
Collaborator

May be worth noting here and above that only one waiter is woken at a time for each event (because wait re-sets the guard bool to false in the PersistenceNotifier).
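A minimal sketch of the bool-plus-Condvar behavior being described (not the actual PersistenceNotifier code; names are illustrative):

```rust
use std::sync::{Condvar, Mutex};

// Simplified notifier: a guard bool plus a condition variable.
pub struct Notifier {
    persistence_needed: Mutex<bool>,
    cvar: Condvar,
}

impl Notifier {
    pub fn new() -> Self {
        Notifier { persistence_needed: Mutex::new(false), cvar: Condvar::new() }
    }

    // Blocks until an event fires, then swaps the guard back to false. Because
    // each returning waiter consumes the flag, one event is only guaranteed to
    // wake one waiter; the rest re-check the (now false) flag and sleep again.
    pub fn wait(&self) {
        let mut guard = self.persistence_needed.lock().unwrap();
        while !*guard {
            guard = self.cvar.wait(guard).unwrap();
        }
        *guard = false;
    }

    pub fn notify(&self) {
        *self.persistence_needed.lock().unwrap() = true;
        self.cvar.notify_all();
    }
}
```

Even with `notify_all`, only the first waiter to re-acquire the lock sees the flag set; later waiters find it false again and go back to waiting.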

Contributor Author

@valentinewallace valentinewallace Feb 11, 2021

Hm I tested with multiple listeners just now and all were notified, but maybe it's unreliable and it's just not showing up in testing? I'll add a comment...

Collaborator

Right, on a multi-core host you may get lucky and the threads both step basically at the same rate, but it's not guaranteed.

Collaborator

Wait, no, huh? It swaps the wakeup bool back to false before returning. Can you share your test?

Contributor Author

I added this to test_background_processor:

diff --git a/background-processor/src/lib.rs b/background-processor/src/lib.rs
index 5d975eb9..7bc76343 100644
--- a/background-processor/src/lib.rs
+++ b/background-processor/src/lib.rs
@@ -203,8 +203,11 @@ mod tests {

                // Initiate the background processors to watch each node.
                let data_dir = nodes[0].persister.get_data_dir();
+               let data_dir1 = nodes[1].persister.get_data_dir();
                let callback = move |node| FilesystemPersister::persist_manager(data_dir.clone(), node);
+               let callback1 = move |node| FilesystemPersister::persist_manager(data_dir1.clone(), node);
                BackgroundProcessor::start(callback, nodes[0].node.clone(), nodes[0].logger.clone());
+               BackgroundProcessor::start(callback1, nodes[0].node.clone(), nodes[0].logger.clone());

                // Go through the channel creation process until each node should have something persisted.
                let tx = open_channel!(nodes[0], nodes[1], 100000);
@@ -239,6 +242,10 @@ mod tests {
                        if !nodes[0].node.get_persistence_condvar_value() { break }
                }

+               let filepath1 = get_full_filepath("test_background_processor_persister_1".to_string(), "manager".to_string());
+               let mut expected_bytes = Vec::new();
+               check_persisted_data!(nodes[0].node, filepath1.clone(), expected_bytes);
+
                // Force-close the channel.
                nodes[0].node.force_close_channel(&OutPoint { txid: tx.txid(), index: 0 }.to_channel_id()).unwrap();

Collaborator

I believe an open_channel!() call will result in a number of calls to notify, and there's a good chance some of them don't necessarily imply a change in the serialized version of the channelmanager (or, maybe more likely, the delay between a notify wakeup and the write getting the write lock may be enough that the next update runs first).

macro_rules! log_internal {
($logger: expr, $lvl:expr, $($arg:tt)+) => (
$logger.log(&::util::logger::Record::new($lvl, format_args!($($arg)+), module_path!(), file!(), line!()));
$logger.log(&$crate::util::logger::Record::new($lvl, format_args!($($arg)+), module_path!(), file!(), line!()));
Collaborator

Wait, how does this work? Is $crate resolved with reference to the place the macro is defined? I guess if it works, it works, but that surprises me.
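For reference, `$crate` in a `macro_rules!` macro resolves relative to the crate where the macro is defined, not where it is invoked, which is why the change works for downstream crates. A toy sketch (not the PR's log_internal! macro):

```rust
// $crate expands to the root of the crate where the macro is *defined*, so an
// exported macro keeps resolving its helpers even when invoked from a
// downstream crate that never imported them.
pub mod util {
    pub fn level_name(lvl: u8) -> &'static str {
        match lvl {
            0 => "ERROR",
            1 => "WARN",
            _ => "INFO",
        }
    }
}

#[macro_export]
macro_rules! log_level {
    ($lvl:expr) => {
        $crate::util::level_name($lvl)
    };
}
```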


Contributor

@jkczyz jkczyz left a comment

No other major concerns. Just a couple nits.

@TheBlueMatt
Collaborator

Hmm, looks like a test hung on windows:

test tests::test_background_processor ... test tests::test_background_processor has been running for over 60 seconds

@valentinewallace
Contributor Author

Hmm, looks like a test hung on windows:

test tests::test_background_processor ... test tests::test_background_processor has been running for over 60 seconds

I'm somewhat baffled, never seen this problem before. Best I can guess from reading up about Rust threads a bit is that we were leaving a few threads in detached states in tests, and this can cause weirdness, especially on low-resourced machines such as those that GH is presumably using for CI? Let me know if this seems like a completely implausible explanation.

@jkczyz
Contributor

jkczyz commented Feb 13, 2021

Hmm, looks like a test hung on windows:
test tests::test_background_processor ... test tests::test_background_processor has been running for over 60 seconds

I'm somewhat baffled, never seen this problem before. Best I can guess from reading up about Rust threads a bit is that we were leaving a few threads in detached states in tests, and this can cause weirdness, especially on low-resourced machines such as those that GH is presumably using for CI? Let me know if this seems like a completely implausible explanation.

Not sure if this is related to threads. I don't think spawned threads need to terminate for a test to complete. More likely it's related to busy-looping on something that never occurs. Perhaps the FilesystemPersister failed?

Rather than having this test actually persist data and busy wait, could we instead use a channel to coordinate whether the callback method was called? The sender end would be in the callback and the test would use the receiver end to be notified of the callback.

See as an example: https://github.com/rust-lang/rust/blob/32cbc65e6bf793d99dc609d11f4a4c93176cdbe2/library/std/src/sync/barrier/tests.rs

That way we remove all dependencies on FilesystemPersister from this test, which is tested independently.
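The suggested channel-based coordination might look roughly like this (illustrative helper names, not the actual test):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Stand-in for BackgroundProcessor::start: spawns a thread that invokes the
// persistence callback once.
fn spawn_background<F: FnOnce() + Send + 'static>(callback: F) -> thread::JoinHandle<()> {
    thread::spawn(callback)
}

// The sender end lives in the callback; the test blocks on the receiver
// instead of busy-waiting on files written by a FilesystemPersister.
fn callback_was_invoked() -> bool {
    let (tx, rx) = mpsc::channel();
    let handle = spawn_background(move || tx.send(()).unwrap());
    let invoked = rx.recv_timeout(Duration::from_secs(10)).is_ok();
    handle.join().unwrap();
    invoked
}
```

`recv_timeout` gives the test a bounded wait, so a broken callback fails the test instead of hanging it.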

@valentinewallace
Contributor Author

valentinewallace commented Feb 18, 2021

Not sure if this is related to threads. I don't think spawned threads need to terminate for a test to complete. More likely it's related to busy-looping on something that never occurs. Perhaps the FilesystemPersister failed?

Rather than having this test actually persist data and busy wait, could we instead use a channel to coordinate whether the callback method was called? The sender end would be in the callback and the test would use the receiver end to be notified of the callback.

See as an example: https://github.com/rust-lang/rust/blob/32cbc65e6bf793d99dc609d11f4a4c93176cdbe2/library/std/src/sync/barrier/tests.rs

That way we remove all dependencies on FilesystemPersister from this test, which is tested independently.

It did end up being an issue with the FilesystemPersister. I still think it's worth having an integration test, since we wouldn't have found the problem (which is fixed in commit Fix Windows ... -- thanks @TheBlueMatt for the debugging assist!) without it.

let src = PathBuf::from(tmp_filename.clone());
let dst = PathBuf::from(filename_with_path.clone());
if Path::new(&filename_with_path.clone()).exists() {
unsafe {winapi::um::winbase::ReplaceFileW(
Collaborator

Does ReplaceFileW work even if the destination file doesn't exist? If so, maybe it's worth asking if rust upstream should be using that instead of MoveFileExW?

Contributor Author

@valentinewallace valentinewallace Feb 18, 2021

Does ReplaceFileW work even if the destination file doesn't exist?

I don't think so, or at least I get background_processor tests that never terminate: https://github.com/valentinewallace/rust-lightning/runs/1931376827

Collaborator

@TheBlueMatt TheBlueMatt left a comment

LGTM. Stupid Windowz.

@valentinewallace valentinewallace force-pushed the chanman-persistence branch 3 times, most recently from 94ee55b to 5c6b319 Compare February 19, 2021 18:17
These will be used in upcoming commits for the BackgroundProcessor
to log.

Windows started giving 'Access is denied' errors after
a few rounds of persistence. This seems to fix it.

This will allow the ChannelManager to signal when it has new
updates to persist, and adds a way for ChannelManager persisters
to be notified when they should re-persist the ChannelManager
to disk/backups.

Feature-gate the wait_timeout function because the core
lightning crate shouldn't depend on wallclock time unless
users opt into it.

Others include calling timer_chan_freshness_every_minute() and, in the
future, possibly persisting channel graph data.

This struct is suitable for things that need to happen periodically and
can happen in the background.
@valentinewallace
Contributor Author

Rebased!

@jkczyz
Contributor

jkczyz commented Feb 19, 2021

It did end up being an issue with the FilesystemPersister. I still think it's worth having an integration test, since we wouldn't have found the problem (which is fixed in commit Fix Windows ... -- thanks @TheBlueMatt for the debugging assist!) without it.

Ah, I should have asked if the failure was deterministic. :)

My general feeling is if it can be reproduced by a unit test, then that would be preferable to using an integration test for testing two behaviors that could be unit tested separately in different modules. Not sure if that is the case here, but I'm fine with leaving it for a follow-up if so.

@TheBlueMatt TheBlueMatt merged commit 041d7aa into lightningdevkit:main Feb 19, 2021
@jkczyz
Contributor

jkczyz commented Feb 19, 2021

ACK a368093
