ChannelManager persistence #752
Conversation
Force-pushed from 77a29b3 to 1d49404
Codecov Report
@@            Coverage Diff            @@
##              main     #752   +/-   ##
=========================================
  Coverage    90.79%   90.80%   +0.01%
=========================================
  Files           44       45       +1
  Lines        24466    24547      +81
=========================================
+ Hits         22215    22290      +75
- Misses        2251     2257       +6
Continue to review full report at Codecov.
Force-pushed from 3352f40 to cb20f3b
I thought a bit more about this after our offline discussion and reviewing the code. My idea would be to expand … Thoughts?
Hmm, what's the value of trait-ing it, then? I admit it's definitely nice to have it be a similar API as the …
The trait is a means of providing a template method. I'm presuming we (a) don't want users to write the provided logic on their own and (b) want to provide a means for users to customize how the data is persisted. A utility function that is parameterized generically by something implementing this trait (i.e., without the provided method) would also be a reasonable approach that I'd be happy with. Is there a different way of defining a utility function that you were thinking of? Seems you would still need a trait unless you pass a closure. Regarding reentrancy, I don't see either of these approaches as being reentrant. Though I suppose it depends on how you are defining reentrancy.
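For concreteness, a minimal sketch of the two shapes under discussion, with entirely hypothetical names (nothing below is from this PR): a trait whose provided method serves as the template method, versus a utility function parameterized generically by a smaller trait.

```rust
use std::io;

/// Hypothetical persister trait. Option (a): `persist_manager` is the
/// user-supplied hook, and `run` is a provided (template) method holding
/// the shared logic users shouldn't have to write themselves.
trait ManagerPersister {
    /// User-defined: durably write the serialized ChannelManager bytes.
    fn persist_manager(&self, manager_bytes: &[u8]) -> Result<(), io::Error>;

    /// Provided method: the shared "wait for a signal, then re-persist"
    /// logic would live here (elided), delegating only the write.
    fn run(&self, manager_bytes: &[u8]) -> Result<(), io::Error> {
        self.persist_manager(manager_bytes)
    }
}

/// Option (b): the same shared logic as a free utility function that is
/// generic over the trait (which then has no provided method).
fn run_persister<P: ManagerPersister>(persister: &P, manager_bytes: &[u8]) -> Result<(), io::Error> {
    persister.persist_manager(manager_bytes)
}
```

Either way the customization point is the same; the trait-with-provided-method bundles the shared logic with the hook, while the utility function keeps the trait minimal.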
That is reentrancy in how I would have thought: … Am I missing something about how reentrancy is involved in the current use case? Perhaps because a thread is involved?
I like this approach. IIUC: so there'd be a … Since the current …
I think I didn't really understand your proposal, then. I understood it as "move the thread-spawn logic and all of the 'start thinking about keeping the chanman on disk' logic into a trait with a template method, but which users could override". That, afaiu, would basically imply just …
Almost. Correct as far as the trait definition. However, the aforementioned utility function would be essentially what … Then …
Yeah, I'd imagine this would be named differently. Note that the utility function mentioned above could instead be a …
Let me know if my explanation in #752 (comment) makes it any clearer.
Actually, I think I just repeated what you said. :) So what you said was accurate. I misread the part about …
Yep! Thanks, that's much clearer. Such a design sounds good to me.
Force-pushed from 20fd381 to 28237a9
Making some changes, I'll update this comment when this is good for review again. Edit: should be good for another review.
Force-pushed from d10cabf to 4ad0f94
Looks pretty good. Mostly nitpicking.
background-processor/src/lib.rs (Outdated)
K: 'static + KeysInterface<ChanKeySigner=ChanSigner>,
F: 'static + FeeEstimator,
L: 'static + Logger,
PM: 'static + Send + Fn(Arc<ChannelManager<ChanSigner, Arc<M>, Arc<T>, Arc<K>, Arc<F>, Arc<L>>>) -> Result<(), std::io::Error>,
Is there any reason for this function to take an Arc by ownership instead of a &ChannelManager?
nit: ideally we wouldn't require a ChannelManager parameterized by Arcs for every type, though I'm not sure how you'd meet the required 'static lifetime without it, so it's not really a big deal.
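To illustrate the suggestion with toy stand-in types (not the real rust-lightning signatures): a bound of Fn(&ChannelManager) lets the caller pass a borrow out of its one Arc, while Fn(Arc<ChannelManager>) forces a clone per call, even though the closure itself stays 'static in both cases.

```rust
use std::sync::Arc;

struct ChannelManager; // toy stand-in for the real type

// Taking an owned Arc: each invocation needs its own clone of the Arc.
fn start_owned<PM>(persist: PM, manager: Arc<ChannelManager>)
where
    PM: 'static + Send + Fn(Arc<ChannelManager>) -> Result<(), std::io::Error>,
{
    let _ = persist(Arc::clone(&manager));
}

// Borrowing instead: the callback is still 'static, but callers just pass
// `&*manager` without cloning.
fn start_borrowed<PM>(persist: PM, manager: Arc<ChannelManager>)
where
    PM: 'static + Send + Fn(&ChannelManager) -> Result<(), std::io::Error>,
{
    let _ = persist(&*manager);
}

fn main() {
    start_owned(|_node| Ok(()), Arc::new(ChannelManager));
    start_borrowed(|_node| Ok(()), Arc::new(ChannelManager));
}
```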
It's giving me this super helpful error atm: https://i.imgur.com/sD3T8A0.png
Heh, yea, I'd seen that too. I believe it's because the callback referenced is expecting the Arc still (because what it's calling is expecting the Arc).
Hmm, even when I make FilesystemPersister::persist_manager expect a reference, I still get this error. I think what I need to do is specify the type of node in callback, so now I'm trying to figure out how to specify type parameters in the callback closure...
Edit: figured it out
One more comment, then looks good.
}

/// Blocks until ChannelManager needs to be persisted.
pub fn wait(&self) {
May be worth noting here and above that only one waiter is woken at a time for each event (because wait re-sets the guard bool to false in the PersistenceNotifier).
Hmm, I tested with multiple listeners just now and all were notified, but maybe it's unreliable and it's just not showing up in testing? I'll add a comment...
Right, on a multi-core host you may get lucky and the threads both step basically at the same rate, but it's not guaranteed.
Wait, no, huh? It swaps the wakeup bool back to false before returning. Can you share your test?
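For reference, a minimal sketch (field and method names assumed) of the wait/notify pattern being described: because wait swaps the flag back to false before returning, each notification is consumed by whichever waiter wins the race, and any other waiter blocks again until the next notify.

```rust
use std::sync::{Condvar, Mutex};

struct PersistenceNotifier {
    lock: Mutex<bool>,
    cvar: Condvar,
}

impl PersistenceNotifier {
    fn new() -> Self {
        PersistenceNotifier { lock: Mutex::new(false), cvar: Condvar::new() }
    }

    fn wait(&self) {
        let mut guard = self.lock.lock().unwrap();
        while !*guard {
            guard = self.cvar.wait(guard).unwrap();
        }
        // Consuming the flag here is why only one waiter is guaranteed to
        // observe each event, even though notify_all wakes every thread.
        *guard = false;
    }

    fn notify(&self) {
        let mut guard = self.lock.lock().unwrap();
        *guard = true;
        self.cvar.notify_all();
    }
}
```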
I added this to test_background_processor:
diff --git a/background-processor/src/lib.rs b/background-processor/src/lib.rs
index 5d975eb9..7bc76343 100644
--- a/background-processor/src/lib.rs
+++ b/background-processor/src/lib.rs
@@ -203,8 +203,11 @@ mod tests {
// Initiate the background processors to watch each node.
let data_dir = nodes[0].persister.get_data_dir();
+ let data_dir1 = nodes[1].persister.get_data_dir();
let callback = move |node| FilesystemPersister::persist_manager(data_dir.clone(), node);
+ let callback1 = move |node| FilesystemPersister::persist_manager(data_dir1.clone(), node);
BackgroundProcessor::start(callback, nodes[0].node.clone(), nodes[0].logger.clone());
+ BackgroundProcessor::start(callback1, nodes[0].node.clone(), nodes[0].logger.clone());
// Go through the channel creation process until each node should have something persisted.
let tx = open_channel!(nodes[0], nodes[1], 100000);
@@ -239,6 +242,10 @@ mod tests {
if !nodes[0].node.get_persistence_condvar_value() { break }
}
+ let filepath1 = get_full_filepath("test_background_processor_persister_1".to_string(), "manager".to_string());
+ let mut expected_bytes = Vec::new();
+ check_persisted_data!(nodes[0].node, filepath1.clone(), expected_bytes);
+
// Force-close the channel.
nodes[0].node.force_close_channel(&OutPoint { txid: tx.txid(), index: 0 }.to_channel_id()).unwrap();
I believe an open_channel!() call will result in a number of calls to notify, and there's a good chance some of them don't necessarily imply a change in the serialized version of the channelmanager (or, maybe more likely, the delay between a notify wakeup and the write getting the write lock may be enough that the next update runs first).
macro_rules! log_internal {
	($logger: expr, $lvl:expr, $($arg:tt)+) => (
-		$logger.log(&::util::logger::Record::new($lvl, format_args!($($arg)+), module_path!(), file!(), line!()));
+		$logger.log(&$crate::util::logger::Record::new($lvl, format_args!($($arg)+), module_path!(), file!(), line!()));
Wait, how does this work? Is $crate resolved with reference to the place the macro is defined? I guess if it works, it works, but that surprises me.
I don't understand it 100% but I think it's the special magic explained here: http://web.mit.edu/rust-lang_v1.25/arch/amd64_ubuntu1404/share/doc/rust/html/book/first-edition/macros.html#the-variable-crate
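A toy illustration of that behavior (not rust-lightning code): $crate expands to a path rooted at the crate that defines the macro, so names in the expansion resolve even when the macro is invoked from a downstream crate that never imported them.

```rust
pub fn helper() -> u32 { 42 }

// `$crate::helper` resolves relative to the crate defining this macro, so a
// downstream crate can call `call_helper!()` without importing `helper`.
#[macro_export]
macro_rules! call_helper {
    () => {
        $crate::helper()
    };
}

fn main() {
    assert_eq!(call_helper!(), 42);
}
```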
Force-pushed from d9459e0 to 3c95a48
No other major concerns. Just a couple nits.
Force-pushed from 3c95a48 to 4ddc992
Hmm, looks like a test hung on Windows: …
Force-pushed from 4ddc992 to 25e40d5
I'm somewhat baffled, I've never seen this problem before. Best I can guess from reading up about Rust threads a bit is that we were leaving a few threads in …
Not sure if this is related to threads. I don't think spawned threads need to terminate for a test to complete. More likely it's related to busy looping on something that never occurs. Perhaps the …

Rather than having this test actually persist data and busy wait, could we instead use a channel to coordinate whether the callback method was called? The sender end would be in the callback, and the test would use the receiver end to be notified of the callback. See as an example: https://github.com/rust-lang/rust/blob/32cbc65e6bf793d99dc609d11f4a4c93176cdbe2/library/std/src/sync/barrier/tests.rs That way we remove all dependencies on …
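A minimal sketch of that suggestion (all names here are placeholders, not the PR's actual test): the callback sends on an mpsc channel instead of persisting, and the test blocks on the receiver with a timeout rather than busy-waiting.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let (sender, receiver) = mpsc::channel();
    // Stand-in for the persistence callback: signal instead of writing to disk.
    let callback = move || {
        sender.send(()).unwrap();
    };

    // Stand-in for the background processor invoking the callback.
    thread::spawn(callback);

    // The test fails fast instead of hanging forever if the callback never fires.
    receiver
        .recv_timeout(Duration::from_secs(5))
        .expect("callback was never called");
}
```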
Force-pushed from 25e40d5 to f9832b2
It did end up being an issue with the …
let src = PathBuf::from(tmp_filename.clone());
let dst = PathBuf::from(filename_with_path.clone());
if Path::new(&filename_with_path.clone()).exists() {
	unsafe {winapi::um::winbase::ReplaceFileW(
Does ReplaceFileW work even if the destination file doesn't exist? If so, maybe it's worth asking if Rust upstream should be using that instead of MoveFileExW?
> Does ReplaceFileW work even if the destination file doesn't exist?

I don't think so, or at least I get background_processor tests that never terminate: https://github.com/valentinewallace/rust-lightning/runs/1931376827
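For illustration, a hedged Windows-only sketch of the shape this discussion points at (function and helper names are hypothetical; assumes the winapi crate with the winbase feature enabled): use ReplaceFileW only when the destination already exists, falling back to MoveFileExW otherwise. Error handling elided.

```rust
#[cfg(windows)]
fn atomic_replace(src: &std::path::Path, dst: &std::path::Path) {
    use std::os::windows::ffi::OsStrExt;
    // Convert a path to a NUL-terminated wide string for the Win32 API.
    let to_wide = |p: &std::path::Path| -> Vec<u16> {
        p.as_os_str().encode_wide().chain(std::iter::once(0)).collect()
    };
    let (src_w, dst_w) = (to_wide(src), to_wide(dst));
    unsafe {
        if dst.exists() {
            // ReplaceFileW requires the destination to exist.
            winapi::um::winbase::ReplaceFileW(
                dst_w.as_ptr(), src_w.as_ptr(), std::ptr::null(),
                winapi::um::winbase::REPLACEFILE_IGNORE_MERGE_ERRORS,
                std::ptr::null_mut(), std::ptr::null_mut());
        } else {
            // First write: no destination yet, so fall back to MoveFileExW.
            winapi::um::winbase::MoveFileExW(
                src_w.as_ptr(), dst_w.as_ptr(),
                winapi::um::winbase::MOVEFILE_WRITE_THROUGH
                    | winapi::um::winbase::MOVEFILE_REPLACE_EXISTING);
        }
    }
}
```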
LGTM. Stupid Windowz.
Force-pushed from 94ee55b to 5c6b319
These will be used in upcoming commits for the BackgroundProcessor to log.
Windows started giving 'Access is denied' errors after a few rounds of persistence. This seems to fix it.
This allows the ChannelManager to signal when it has new updates to persist, and adds a way for ChannelManager persisters to be notified when they should re-persist the ChannelManager to disk/backups. The wait_timeout function is feature-gated because the core lightning crate shouldn't depend on wallclock time unless users opt into it.
Other tasks include calling timer_chan_freshness_every_minute() and, in the future, possibly persisting channel graph data. This struct is suitable for things that need to happen periodically and can happen in the background.
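A rough sketch of the loop such a struct might run (the closure parameters below are placeholders; only timer_chan_freshness_every_minute() and the persistence-wait concept come from the commits above):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

// Illustrative only: `wait_for_update` stands in for the ChannelManager's
// persistence wait, `timer_tick` for timer_chan_freshness_every_minute().
fn background_loop(
    persist: impl Fn() -> Result<(), std::io::Error>,
    wait_for_update: impl Fn(Duration) -> bool,
    timer_tick: impl Fn(),
    stop: Arc<AtomicBool>,
) {
    let mut last_tick = Instant::now();
    while !stop.load(Ordering::Relaxed) {
        // Re-persist whenever the manager signals an update; use a timeout
        // so the loop can still observe `stop` and the timer.
        if wait_for_update(Duration::from_millis(100)) {
            let _ = persist();
        }
        if last_tick.elapsed() >= Duration::from_secs(60) {
            timer_tick();
            last_tick = Instant::now();
        }
    }
}
```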
Force-pushed from 5c6b319 to a368093
Rebased!
Ah, I should have asked if the failure was deterministic. :) My general feeling is if it can be reproduced by a unit test, then that would be preferable to using an integration test for testing two behaviors that could be unit tested separately in different modules. Not sure if that is the case here, but I'm fine with leaving it for a follow-up if so.
ACK a368093
Closes #743.
Current todos:
- Rename FilesystemPersister to MonitorPersister? Or somehow align the names of the two persisters
- Make sure the ChannelManager signals for a new update whenever it needs to, and doesn't signal when it doesn't need to