Stop decaying liquidity information during scoring #2656
Conversation
Codecov Report

Attention: Your organization needs to install the Codecov GitHub app to enable full functionality.

@@ Coverage Diff @@
##             main    #2656     +/-   ##
==========================================
+ Coverage   88.64%   88.93%   +0.29%
==========================================
  Files         115      115
  Lines       91894    93489    +1595
  Branches    91894    93489    +1595
==========================================
+ Hits        81458    83145    +1687
+ Misses       7953     7885      -68
+ Partials     2483     2459      -24

View full report in Codecov by Sentry.
Force-pushed from 8d184b7 to f4fac7c.
Fixed the no-std bug, should be good to go now.
Marking this 0.0.119; it turns out we're not decaying our historical buckets properly in at least two cases. While both could be fixed directly, I'd like to at least consider this first.
 match event {
 	Event::PaymentPathFailed { ref path, short_channel_id: Some(scid), .. } => {
 		let mut score = scorer.write_lock();
-		score.payment_path_failed(path, *scid);
+		score.payment_path_failed(path, *scid, duration_since_epoch);
 	},
 	Event::PaymentPathFailed { ref path, payment_failed_permanently: true, .. } => {
 		// Reached if the destination explicitly failed it back. We treat this as a successful probe
 		// because the payment made it all the way to the destination with sufficient liquidity.
 		let mut score = scorer.write_lock();
-		score.probe_successful(path);
+		score.probe_successful(path, duration_since_epoch);
 	},
 	Event::PaymentPathSuccessful { path, .. } => {
 		let mut score = scorer.write_lock();
-		score.payment_path_successful(path);
+		score.payment_path_successful(path, duration_since_epoch);
 	},
 	Event::ProbeSuccessful { path, .. } => {
 		let mut score = scorer.write_lock();
-		score.probe_successful(path);
+		score.probe_successful(path, duration_since_epoch);
 	},
 	Event::ProbeFailed { path, short_channel_id: Some(scid), .. } => {
 		let mut score = scorer.write_lock();
-		score.probe_failed(path, *scid);
+		score.probe_failed(path, *scid, duration_since_epoch);
 	},
 	_ => return false,
 }
Won't this mean channels along recently used paths will have their offsets decayed but other channels will not?
Rather the opposite - by the end of the patchset, we only decay in the timer method. When updating we just set the last-update to `duration_since_epoch`. In theory if a channel is updated in between each timer tick it won't be materially decayed, but I think that's kinda okay, I mean it's not a lot of time anyway. If we want to be more pedantically correct I could decay the old data before update.
Maybe I'm confused, but it looks like we only decay once per hour in the background processor.
Plus once on startup. I'm not understanding the issue you're raising - are you saying we should reduce the hour to something less?
Yeah, I was pointing out that we are left in a state of partial decay. Added a comment elsewhere, but if you modify `last_updated` and set, say, the max offset, then you need to decay the min offset. Otherwise, it won't be properly decayed on the timer tick. So, after fixing that, you'll end up with recently used channels decayed while the others are not.
All that said, I'm not really convinced either is a super critical issue, at least if we decay more often, at max we'd be off by a small part of a half-life.
Hmm... if one offset is updated frequently, you'll get into a state where the other offset is only ever partially decayed even though it may have been given that value many half-lives ago. So it would really depend on both payment and decay frequency.
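To make the partial-decay concern concrete, here is a toy model (purely illustrative, not LDK code) where both offsets share one `last_updated`: refreshing the timestamp while touching only the max offset robs the min offset of the decay it was owed.

```rust
// Toy model of the concern above: both offsets share one timestamp, and
// updating only the max offset refreshes last_updated without decaying min.
struct Liquidity { min_offset: u64, max_offset: u64, last_updated: u64 }

const HALF_LIFE_SECS: u64 = 3600;

fn decay(l: &mut Liquidity, now: u64) {
    let half_lives = ((now - l.last_updated) / HALF_LIFE_SECS).min(63);
    l.min_offset >>= half_lives;
    l.max_offset >>= half_lives;
    l.last_updated = now;
}

fn set_max(l: &mut Liquidity, value: u64, now: u64) {
    // The bug being discussed: min_offset is not decayed before the reset.
    l.max_offset = value;
    l.last_updated = now;
}

fn main() {
    let mut l = Liquidity { min_offset: 1024, max_offset: 0, last_updated: 0 };
    // The max offset is touched every 30 minutes for four hours; each touch
    // resets the shared clock before a full half-life elapses.
    for i in 1..=8 { set_max(&mut l, 500, i * 1800); }
    decay(&mut l, 18_000);
    // Five hours have passed, but min_offset decayed by only one half-life
    // (1024 -> 512) instead of five (1024 -> 32).
    assert_eq!(l.min_offset, 512);
}
```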
If we're regularly sending some sats over a channel successfully, so we're constantly reducing our upper bound by the amount we're sending, I think it's fine to not decay the lower bound? We'll eventually pick some other channel to send over because we ran out of estimated liquidity, and we'll decay at that point.
FWIW, that's not the only scenario. Failures at a channel and downstream from it adjust its upper and lower bounds, respectively. So if you fail downstream with increasing amounts, the upper bound may not be properly decayed.
Right, but presumably repeatedly failing downstream of a channel with higher and higher amounts isn't super likely.
Not necessarily for the same payment or at the same downstream channel. From the perspective of the scored channel, it's simply the normal case of learning a more accurate lower bound on its liquidity as a consequence of knowing that a payment routed through it but failed downstream.
There was some discussion of an alternative approach where we fetch the time at the start of a routefinding session, store it on the stack, and pass it through to the scorer as we go. I opted not to do this because (a) bindings can't map unbounded generics, which this would need, (b) this avoids actually doing the decay during scoring at all, which probably saves a ms or two, though certainly not a ton, and (c) this leads to a much nicer/simpler API - we can remove `Time`, which we either need to remove or make public (see #2497), and can drop the excess type alias, both of which are much nicer than the alternative. I'm open to more discussion here, but the cost of having one more thing to call as time moves forward doesn't seem high enough to outweigh (a), (b), and (c) here.
Hmmm... (a) can be avoided if we use a `Duration`, since we have `Time::duration_since_epoch`. (b) seems negligible. And I'm not entirely convinced on (c) regarding the API, as now `Duration` is used in the mutable but not the non-mutable interface, which isn't very intuitive in places (see comments). Also, there's the risk of adding new bugs.
lightning/src/routing/scoring.rs
Outdated
+	*self.last_updated = duration_since_epoch;
+	*self.offset_history_last_updated = duration_since_epoch;
If you change `last_updated`, `max_liquidity_offset_msat` needs to be decayed. Likewise for the buckets when changing `offset_history_last_updated`, right?
Yea, mostly, let's discuss on your first comment at #2656 (comment)
lightning/src/routing/scoring.rs
Outdated
 fn set_max_liquidity_msat(&mut self, amount_msat: u64, duration_since_epoch: Duration) {
 	*self.max_liquidity_offset_msat = self.capacity_msat.checked_sub(amount_msat).unwrap_or(0);
-	*self.min_liquidity_offset_msat = if amount_msat < self.min_liquidity_msat() {
-		0
-	} else {
-		self.decayed_offset_msat(*self.min_liquidity_offset_msat)
-	};
-	*self.last_updated = self.now;
+	if amount_msat < *self.min_liquidity_offset_msat {
+		*self.min_liquidity_offset_msat = 0;
+	}
+	*self.last_updated = duration_since_epoch;
+	*self.offset_history_last_updated = duration_since_epoch;
Likewise for `min_liquidity_offset_msat`.
Yea, mostly, let's discuss on your first comment at #2656 (comment)
 let existing_max_msat = self.max_liquidity_msat();
 if amount_msat < existing_max_msat {
It's a bit unintuitive that we compare against an un-decayed value even though we have the time.
Similarly, I think we can consider the undecayed values canonical for channels we're updating often, but we can discuss more on your first thread at #2656 (comment)
 /// Adjusts the channel liquidity balance bounds when failing to route `amount_msat`.
-fn failed_at_channel<Log: Deref>(&mut self, amount_msat: u64, chan_descr: fmt::Arguments, logger: &Log) where Log::Target: Logger {
+fn failed_at_channel<Log: Deref>(
+	&mut self, amount_msat: u64, duration_since_epoch: Duration, chan_descr: fmt::Arguments, logger: &Log
Gotta say, I really don't like the incongruity of passing the current time in the mutable interface but not in the non-mutable one, which can't be avoided with this approach. That makes uses of the non-mutable interface from the mutable interface harder to reason about. I much prefer the approach of passing the current time to `DirectedChannelLiquidity`.
Yea, I see why it's annoying to have another parameter, but I kinda disagree about it belonging in `DirectedChannelLiquidity`. Everything else in `DirectedChannelLiquidity` is just a reference to liquidity data for a single channel, plus a reference to the decay settings of the overall scorer. The current time doesn't fit into either of those; it's information about the failed payment, which is otherwise all arguments to the failed/success methods.
lightning/src/routing/scoring.rs
Outdated
 	*self.max_liquidity_offset_msat = 0;
 }
+*self.last_updated = duration_since_epoch;
+*self.offset_history_last_updated = duration_since_epoch;
Should we update `offset_history_last_updated` in `update_history_buckets` instead?
Yes, good catch, fixed.
lightning/src/routing/scoring.rs
Outdated
-let half_lives = self.now.duration_since(*self.last_updated).as_secs()
+fn update_history_buckets(&mut self, bucket_offset_msat: u64, duration_since_epoch: Duration) {
+	let half_lives =
+		duration_since_epoch.checked_sub(*self.offset_history_last_updated)
This isn't accurate given that `offset_history_last_updated` is updated in `set_min_liquidity_msat` and `set_max_liquidity_msat`, which could (but may not) be called prior to calling `update_history_buckets`. Do we have tests to catch this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we don't have really good testing of these kinds of issues, as evidenced also by your bugfix at #2530. Luckily, doing the decaying in the background means this isn't actually a concern anymore - we only care about this in the rare case that we need to decay the buckets now, but haven't run the decayer yet, and then we get a new datapoint. But that doesn't really matter, because it's no different from just increasing the half-life by a few minutes, which shouldn't really matter at all.
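As an aside, the `checked_sub(..)` in the new `update_history_buckets` guards against a time source that moves backwards. A small sketch of that pattern (the helper name is mine, not LDK's):

```rust
use std::time::Duration;

// Sketch of an elapsed-half-lives computation guarded by checked_sub:
// a clock that went backwards yields zero elapsed time instead of panicking.
fn elapsed_half_lives(now: Duration, last_updated: Duration, half_life: Duration) -> u64 {
    let elapsed = now.checked_sub(last_updated).unwrap_or(Duration::ZERO);
    if half_life.as_secs() == 0 { 0 } else { elapsed.as_secs() / half_life.as_secs() }
}

fn main() {
    let hl = Duration::from_secs(3600);
    // Two hours elapsed against a one-hour half-life.
    assert_eq!(elapsed_half_lives(Duration::from_secs(7200), Duration::ZERO, hl), 2);
    // Backwards clock: no underflow, just no decay.
    assert_eq!(elapsed_half_lives(Duration::ZERO, Duration::from_secs(60), hl), 0);
}
```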
Actually, just went ahead and removed the half-life-based decay here; there's really no reason for it and we should just rely on the one in `decay_liquidity_certainty`.
Force-pushed from 0498a13 to 60be6f9.
Did an initial pass (finally, excuse the delay).
@@ -274,7 +274,7 @@ macro_rules! define_run_body {
 	$channel_manager: ident, $process_channel_manager_events: expr,
 	$gossip_sync: ident, $peer_manager: ident, $logger: ident, $scorer: ident,
 	$loop_exit_check: expr, $await: expr, $get_timer: expr, $timer_elapsed: expr,
-	$check_slow_await: expr)
+	$check_slow_await: expr, $time_fetch: expr)
Rather than just making this a closure, could we introduce a `TimeSource` trait and add a default impl based on `SystemTime`? It seems that we regularly need to retrieve the time in some no-std compatible way, and it might be nice to make this generic (e.g., we'd also want something like that in lightning-liquidity, see https://github.com/lightningdevkit/lightning-liquidity/issues/54)?
In general I'd really prefer not to have a million copies of "CurrentTimeFetcher" in the API everywhere, and for the most part we don't need it - we can mostly just use the latest block timestamp and call it a day (and in a few places use timer ticks if we want more granular expiry). Part of the goal of this PR is to move towards removing the `Time` trait, which I think is really unnecessary (and also no longer depending on it being blazing fast for routing perf).
Not sure why having a trait is directly connected to using it in scoring? We could still have a trait-based impl that is called in the background processor? Just brought it up as in `lightning-liquidity` we probably want to store a reference to the time source the user will give us upon setup, and it might make sense to have that generic cross-compatible with other LDK crates that require the same functionality?
Right, I think my point is mostly "why can't `lightning-liquidity` use the block timestamp rather than the time?".
Addressed feedback and rebased.
Force-pushed from 60be6f9 to 507b842.
This unfortunately needs a rebase now.
Force-pushed from 2338237 to 2dce56d.
Rebased on top of #2774, since it needs to go anyway.
@@ -773,7 +796,10 @@ impl BackgroundProcessor {
 		handle_network_graph_update(network_graph, &event)
 	}
 	if let Some(ref scorer) = scorer {
 		if update_scorer(scorer, &event) {
+			use std::time::SystemTime;
Since this function uses system time now, it should probably be `#[cfg(all(feature = "std", not(feature = "no-std")))]` to handle when you'll be able to use both flags together in the future.
I think just `feature = "std"` is correct. If two crates depend on LDK, with one setting `std` and another setting `no-std`, LDK should build with all features. Otherwise, the crate relying on std features will fail to compile because of an unrelated crate also in the dependency tree.
Force-pushed from 2dce56d to 3711994.
Rebased now that #2774 landed.
Force-pushed from 3711994 to 506c068.
Yeah, decaying more often helps. For me it's more about consistency within our model - `last_updated` no longer has a well-defined meaning, as it may only be accurate for one offset. So we have to choose between internal consistency for a channel and consistency across channels with this approach.
@@ -114,7 +114,7 @@ const ONION_MESSAGE_HANDLER_TIMER: u64 = 1;
 const NETWORK_PRUNE_TIMER: u64 = 60 * 60;

 #[cfg(not(test))]
-const SCORER_PERSIST_TIMER: u64 = 60 * 60;
+const SCORER_PERSIST_TIMER: u64 = 60 * 5;
Not sure if we should use a constant here. It should be no more than the user-defined half-life, ideally such that the half-life is divisible by it.
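One hypothetical way to implement this suggestion (the helper and its constants are mine, not the PR's): derive the timer interval from the configured half-life rather than using a fixed constant, clamped to sane bounds.

```rust
use std::time::Duration;

// Hypothetical sketch of deriving the decay timer from the configured
// half-life instead of a constant: a twelfth of the half-life, clamped
// between one second and one hour. Not LDK's actual code.
fn decay_interval(half_life: Duration) -> Duration {
    (half_life / 12)
        .max(Duration::from_secs(1))
        .min(Duration::from_secs(60 * 60))
}

fn main() {
    // Six-hour half-life: decay every 30 minutes.
    assert_eq!(decay_interval(Duration::from_secs(6 * 3600)), Duration::from_secs(1800));
    // Aggressive 12-minute half-life: decay every minute.
    assert_eq!(decay_interval(Duration::from_secs(12 * 60)), Duration::from_secs(60));
}
```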
Hmm, I guess? If a user sets an aggressive half-life I'm not entirely convinced we want to spin their CPU trying to decay liquidity bounds. Doing it a bit too often when they set a super high decay also seems fine-ish? I agree it'd be a bit nicer to switch to some function of the configured half-life, but I'm not sure it's worth adding some accessor to `ScoreUpdate`.
@@ -1700,7 +1732,7 @@ mod tests {
 			_ = exit_receiver.changed() => true,
 		}
 	})
-}, false,
+}, false, || Some(Duration::from_secs(1696300000)),
What's behind the choice of this number?
It's, basically, when I wrote the patch.
Any reason why it can't be `Duration::ZERO` like in the other tests?
Not really, it just seemed a bit more realistic.
Given the value doesn't affect the test, it's just curious to the reader to see something different from all the other places.
Ah, I tried to switch to ZERO but the test fails - it expects to prune entries from the network graph against a static RGS snapshot that has a timestamp in it.
On a Skylake system, 10k samples in bench gives me these changes for this branch. There's quite a bit of noise, as usual, but it does look like a non-zero win.
More in the sense that its use in decaying isn't well defined. We should at least note that in the `decay_liquidity_certainty` implementation.
let half_life = decay_params.historical_no_updates_half_life.as_secs_f64();
if half_life != 0.0 {
	let divisor = powf64(2048.0, elapsed_time.as_secs_f64() / half_life) as u64;
	for bucket in liquidity.min_liquidity_offset_history.buckets.iter_mut() {
		*bucket = ((*bucket as u64) * 1024 / divisor) as u16;
	}
	for bucket in liquidity.max_liquidity_offset_history.buckets.iter_mut() {
		*bucket = ((*bucket as u64) * 1024 / divisor) as u16;
	}
	liquidity.offset_history_last_updated = duration_since_epoch;
}
IIUC, this means we'll decay partial half-lives but only after decaying one full half-life. Why bother with using `2048.0` and `1024` here if this is happening in the background?
Those multipliers are just to get reasonable precision. We could cast the bucket to a float and then do the whole thing in float math, but it seems easier to just keep the buckets as ints.
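Concretely, for exactly one full half-life the divisor is 2048 and the 1024 numerator makes the net factor 1/2, all while the buckets stay integers. A standalone restatement of the snippet above, using std's `f64::powf` in place of the no-std `powf64` helper:

```rust
// Restatement of the bucket decay above, with f64::powf standing in for
// the no-std powf64 helper: bucket * 1024 / 2048^(elapsed / half_life).
fn decay_bucket(bucket: u16, elapsed_secs: f64, half_life_secs: f64) -> u16 {
    let divisor = 2048f64.powf(elapsed_secs / half_life_secs) as u64;
    ((bucket as u64) * 1024 / divisor) as u16
}

fn main() {
    // Exactly one half-life: divisor = 2048, so each bucket is halved.
    assert_eq!(decay_bucket(400, 3600.0, 3600.0), 200);
    // Two half-lives: divisor = 2048^2, which empties small buckets entirely.
    assert_eq!(decay_bucket(400, 7200.0, 3600.0), 0);
}
```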
But why not do partial decays when less than one half-life has passed?
It kinda goes against the model of the historical buckets - they're intended to be "time-free", only using a decay parameter if we really haven't seen that channel in a long time. Now, I wouldn't be against revisiting that idea; it's quite possible we over-corrected from having too much of a time parameter in the non-historical data, but I'd like to think about that separately.
I see, SGTM.
In the next commits we'll need `f64`'s `powf`, which is only available in `std`. For `no-std`, here we depend on `libm` (a `rust-lang` org project), which we can use for `powf`.
In the coming commits, we'll stop relying on fetching the time during routefinding, preferring to decay score data in the background instead. The first step towards this - passing the current time through into the scorer when updating.
Force-pushed from 506c068 to bbaf6b8.
Force-pushed from d9b0f37 to 238c459.
Rather than relying on fetching the current time during routefinding, here we introduce a new trait method to `ScoreUpdate` to do so. This largely mirrors what we do with the `NetworkGraph`, and allows us to take on much more expensive operations (floating point exponentiation) in our decaying.
In the next commit, we'll start to use the new `ScoreUpdate::decay_liquidity_certainty` to decay our bounds in the background. This will result in the `last_updated` field getting updated regularly on decay, rather than only on update. While this isn't an issue for the regular liquidity bounds, it poses a problem for the historical liquidity buckets, which are decayed on a separate (and by default much longer) timer. If we didn't move to tracking their decays separately, we'd never let the `last_updated` field get old enough for the historical buckets to decay at all. Instead, here we introduce a new `Duration` in the `ChannelLiquidity` which tracks the last time the historical liquidity buckets were last updated. We initialize it to a copy of `last_updated` on deserialization if it is missing.
This implements decaying in the `ProbabilisticScorer`'s `ScoreUpdate::decay_liquidity_certainty` implementation, using floats for accuracy since we're no longer particularly time-sensitive. Further, it (finally) removes score entries which have decayed to zero.
Because scoring is an incredibly performance-sensitive operation, doing liquidity information decay (and especially fetching the current time!) during scoring isn't really a great idea. Now that we decay liquidity information in the background, we don't have any reason to decay during scoring, and we remove the historical bucket liquidity decaying here.
Because scoring is an incredibly performance-sensitive operation, doing liquidity information decay (and especially fetching the current time!) during scoring isn't really a great idea. Now that we decay liquidity information in the background, we don't have any reason to decay during scoring, and we ultimately remove it entirely here.
Now that we aren't decaying during scoring, when we set the last_updated time in the history bucket logic doesn't matter, so we should just update it when we've just updated the history buckets.
In the coming commits, the `T: Time` bound on `ProbabilisticScorer` will be removed. In order to enable that, we need to pass the current time (as a `Duration` since the unix epoch) through the score updating pipeline, allowing us to keep the `*last_updated_time` fields up-to-date as we go.
In the coming commits, the `T: Time` bound on `ProbabilisticScorer` will be removed. In order to enable that, we need to switch over to using the `ScoreUpdate`-provided current time (as a `Duration` since the unix epoch), making the `T` bound entirely unused.
Now that we don't access time via the `Time` trait in `ProbabilisticScorer`, we can finally drop the `Time` bound entirely, removing the `ProbabilisticScorerUsingTime` and type alias indirection and replacing it with a simple struct.
As we now no longer decay bounds information when fetching them, there is no need to have a decaying-fetching helper utility.
This is a good gut-check to ensure we don't end up taking a ton of time decaying channel liquidity info. It currently clocks in at around 1.25ms on an i7-1360P.
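A rough sketch of this kind of gut-check (the channel count and data layout here are made up; the real benchmark exercises the actual scorer): decay a large set of liquidity offsets and time the pass with `Instant`.

```rust
use std::time::Instant;

// Apply one precomputed decay factor to every (scid, offset) pair, then
// prune entries which have decayed to zero.
fn decay_offsets(offsets: &mut Vec<(u64, u64)>, factor: f64) {
    for (_, offset) in offsets.iter_mut() {
        *offset = (*offset as f64 * factor) as u64;
    }
    offsets.retain(|&(_, offset)| offset > 0);
}

fn main() {
    let mut offsets: Vec<(u64, u64)> = (0..100_000u64).map(|i| (i, i * 1_000)).collect();
    let start = Instant::now();
    decay_offsets(&mut offsets, 0.5);
    println!("decayed {} channels in {:?}", offsets.len(), start.elapsed());
}
```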
Now that the serialization format of `no-std` and `std` `ProbabilisticScorer`s both just use `Duration` since UNIX epoch and don't care about time except when decaying, we don't need to warn users to not mix the scorers across `no-std` and `std` flags. Fixes lightningdevkit#2539
There are some edge cases in our scoring where information really should have been decayed but hasn't yet been prior to an update. Rather than trying to fix them exactly, we instead decay the scorer a bit more often, which largely resolves them and also gives us somewhat more accurate bounds on our channels, allowing us to immediately retry a channel at an amount similar to the one that just failed, albeit at a substantial penalty.
Because we decay the bucket information in the background, there's not much reason to try to decay them immediately prior to updating, and in removing that we can also clean up a good bit of dead code, which we do here.
Now that we use explicit times passed to decay methods, there's no reason to make calls to `SinceEpoch::advance` in scoring tests.
Squashed the fixups, without any changes. |
Code overall looks good. I'm not opposed to the approach, necessarily. I think there can still be an issue with decaying in the case of a node with a large payment volume (see #2656 (comment)). Otherwise, I don't have any other concerns. @tnull Could you take another look? |
Yea, I mean it's definitely not "right", just not clear to me it's "wrong" either. The only thing we could really do to address it is split the last_updated fields into two. I'm totally fine doing that if you think it's worth it. |
Did another pass.
LGTM I think. ACKing in case we want to land this soon, but might come back for yet another pass next week.
I guess it just can come across as surprising to an outside observer. They can tell that a bound has been adjusted, but one of them doesn't seem to be decaying as expected based on the configuration and the observed time of adjustment. Maybe it's not horrible in practice? |
I guess to an outside observer it just looks like both ends got updated? Which isn't true, but not crazy broken either (at least in the sense that I'm not sure what outside observer would be looking at both their failures and their success/failure stream and somehow caring about the inconsistent decays). I'm happy with either solution, though - as-is or splitting the last-updated tracking. We could also split the last-updated tracking in a followup, it'd be a new commit either way. |
Was thinking in terms of something like But, yeah, a follow-up is fine. Seems like we can always do it later if we find problems, too, given the serialization format. Just seems more correct to use separate timestamps. |
Alright, gonna merge this. I have a large pile of performance tweaks to the router and scorer up next, and I can incorporate a separate decay for the two bounds in some of that work. |
usage.inflight_htlc_msat = 0; | ||
assert_eq!(scorer.channel_penalty_msat(&candidate, usage, ¶ms), 866); |
Was just fixing a build warning and noticed this. Why did this check need to be removed? Deleted in 35b4964.
The previous test relied on the behavior where we used undecayed data in the buckets when scoring, and only considered the decay when deciding whether to score at all. We now actually decay the data, so the undecayed data is no longer available.
Because scoring is an incredibly performance-sensitive operation,
doing liquidity information decay (and especially fetching the
current time!) during scoring isn't really a great idea. Instead, this PR moves to handling decaying in a background processor job.
This should fix #2311, which apparently is still an issue for some users as of 0.0.116.
There was some discussion of an alternative approach where we fetch the time at the start of a routefinding session, store it on the stack, and pass it through to the scorer as we go. I opted not to do this because (a) bindings can't map unbounded generics, which this would need, (b) this avoids actually doing the decay during scoring at all, which probably saves an ms or two, though certainly not a ton, (c) this leads to a much nicer/simpler API - we can remove `Time`, which we either need to remove or make public (see #2497), and can drop the excess type alias, both of which are much nicer than the alternative. I'm open to more discussion here, but the cost of having one more thing to call as time moves forward doesn't seem high enough to outweigh a, b, and c here.
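A minimal sketch (hypothetical types, not LDK's real `ScoreUpdate` trait) of the approach taken here: callers read the clock once and pass a `Duration` since the Unix epoch into each score-update call, so the scorer itself never fetches the time and needs no `Time` generic.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Toy scorer: it only records the caller-provided clock reading, rather
// than reading a clock itself via a `Time` trait bound.
struct Scorer { last_updated: Duration }

impl Scorer {
    fn payment_path_failed(&mut self, _scid: u64, duration_since_epoch: Duration) {
        self.last_updated = duration_since_epoch;
    }
}

fn main() {
    // The caller fetches the time once, outside the scorer.
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before the Unix epoch");
    let mut scorer = Scorer { last_updated: Duration::ZERO };
    scorer.payment_path_failed(42, now);
    assert_eq!(scorer.last_updated, now);
    println!("scorer last updated {} secs after epoch", scorer.last_updated.as_secs());
}
```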