Stop decaying liquidity information during scoring #2656
Conversation
Codecov Report

Attention: Your organization needs to install the Codecov GitHub app to enable full functionality.

@@ Coverage Diff @@
##             main    #2656     +/-   ##
==========================================
+ Coverage   88.64%   88.93%   +0.29%
==========================================
  Files         115      115
  Lines       91894    93489    +1595
  Branches    91894    93489    +1595
==========================================
+ Hits        81458    83145    +1687
+ Misses       7953     7885      -68
+ Partials     2483     2459      -24

View full report in Codecov by Sentry.
Force-pushed from 8d184b7 to f4fac7c.
Fixed the no-std bug, should be good to go now.
Marking this 0.0.119; it turns out we're not decaying our historical buckets properly in at least two cases. While both could be fixed directly, I'd like to at least consider this first.
 match event {
 	Event::PaymentPathFailed { ref path, short_channel_id: Some(scid), .. } => {
 		let mut score = scorer.write_lock();
-		score.payment_path_failed(path, *scid);
+		score.payment_path_failed(path, *scid, duration_since_epoch);
 	},
 	Event::PaymentPathFailed { ref path, payment_failed_permanently: true, .. } => {
 		// Reached if the destination explicitly failed it back. We treat this as a successful probe
 		// because the payment made it all the way to the destination with sufficient liquidity.
 		let mut score = scorer.write_lock();
-		score.probe_successful(path);
+		score.probe_successful(path, duration_since_epoch);
 	},
 	Event::PaymentPathSuccessful { path, .. } => {
 		let mut score = scorer.write_lock();
-		score.payment_path_successful(path);
+		score.payment_path_successful(path, duration_since_epoch);
 	},
 	Event::ProbeSuccessful { path, .. } => {
 		let mut score = scorer.write_lock();
-		score.probe_successful(path);
+		score.probe_successful(path, duration_since_epoch);
 	},
 	Event::ProbeFailed { path, short_channel_id: Some(scid), .. } => {
 		let mut score = scorer.write_lock();
-		score.probe_failed(path, *scid);
+		score.probe_failed(path, *scid, duration_since_epoch);
 	},
 	_ => return false,
 }
Won't this mean channels along recently used paths will have their offsets decayed but other channels will not?
Rather the opposite - by the end of the patchset, we only decay in the timer method. When updating we just set the last-update to `duration_since_epoch`. In theory if a channel is updated in between each timer tick it won't be materially decayed, but I think that's kinda okay, I mean it's not a lot of time anyway. If we want to be more pedantically correct I could decay the old data before update.
Maybe I'm confused, but it looks like we only decay once per hour in the background processor.
Plus once on startup. I'm not understanding the issue you're raising - are you saying we should reduce the hour to something less?
Yeah, I was pointing out that we are left in a state of partial decay. Added a comment elsewhere, but if you modify `last_updated` and set, say, the max offset, then you need to decay the min offset. Otherwise, it won't be properly decayed on the timer tick. So, after fixing that, you'll end up with recently used channels decayed while the others are not.
All that said, I'm not really convinced either is a super critical issue, at least if we decay more often, at max we'd be off by a small part of a half-life.
Hmm... if one offset is updated frequently, you'll get into a state where the other offset is only ever partially decayed even though it may have been given that value many half-lives ago. So it would really depend on both payment and decay frequency.
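To make the partial-decay concern concrete, here is a toy model (purely illustrative, not LDK code) where both offsets share one `last_updated`: refreshing the timestamp while touching only the max offset robs the min offset of the decay it was owed.

```rust
// Toy model of the concern above: both offsets share one timestamp, and
// updating only the max offset refreshes last_updated without decaying min.
struct Liquidity { min_offset: u64, max_offset: u64, last_updated: u64 }

const HALF_LIFE_SECS: u64 = 3600;

fn decay(l: &mut Liquidity, now: u64) {
    let half_lives = ((now - l.last_updated) / HALF_LIFE_SECS).min(63);
    l.min_offset >>= half_lives;
    l.max_offset >>= half_lives;
    l.last_updated = now;
}

fn set_max(l: &mut Liquidity, value: u64, now: u64) {
    // The bug being discussed: min_offset is not decayed before the reset.
    l.max_offset = value;
    l.last_updated = now;
}

fn main() {
    let mut l = Liquidity { min_offset: 1024, max_offset: 0, last_updated: 0 };
    // The max offset is touched every 30 minutes for four hours; each touch
    // resets the shared clock before a full half-life elapses.
    for i in 1..=8 { set_max(&mut l, 500, i * 1800); }
    decay(&mut l, 18_000);
    // Five hours have passed, but min_offset decayed by only one half-life
    // (1024 -> 512) instead of five (1024 -> 32).
    assert_eq!(l.min_offset, 512);
}
```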
If we're regularly sending some sats over a channel successfully, so we're constantly reducing our upper bound by the amount we're sending, I think it's fine to not decay the lower bound? We'll eventually pick some other channel to send over because we ran out of estimated liquidity, and we'll decay at that point.
FWIW, that's not the only scenario. Failures at a channel and downstream from it adjust its upper and lower bounds, respectively. So if you fail downstream with increasing amounts, the upper bound may not be properly decayed.
Right, but presumably repeatedly failing downstream of a channel with higher and higher amounts isn't super likely.
Not necessarily for the same payment or at the same downstream channel. From the perspective of the scored channel, it's simply the normal case of learning a more accurate lower bound on its liquidity as a consequence of knowing that a payment routed through it but failed downstream.
There was some discussion of an alternative approach where we fetch the time at the start of a routefinding session, store it on the stack, and pass it through to the scorer as we go. I opted not to do this because (a) bindings can't map unbounded generics, which this would need, (b) this avoids actually doing the decay during scoring at all, which probably saves a ms or two, though certainly not a ton, and (c) this leads to a much nicer/simpler API - we can remove `Time`, which we either need to remove or make public (see #2497), and can drop the excess type alias, both of which are much nicer than the alternative. I'm open to more discussion here, but the cost of having one more thing to call as time moves forward doesn't seem high enough to outweigh (a), (b), and (c) here.
Hmmm... (a) can be avoided if we use a `Duration`, since we have `Time::duration_since_epoch`. (b) seems negligible. And I'm not entirely convinced on (c) regarding the API, as now `Duration` is used in the mutable but not the non-mutable interface, which isn't very intuitive in places (see comments). Also, there's the risk of adding new bugs.
lightning/src/routing/scoring.rs
Outdated
+	*self.last_updated = duration_since_epoch;
+	*self.offset_history_last_updated = duration_since_epoch;
If you change `last_updated`, `max_liquidity_offset_msat` needs to be decayed. Likewise for the buckets when changing `offset_history_last_updated`, right?
Yea, mostly, let's discuss on your first comment at #2656 (comment)
lightning/src/routing/scoring.rs
Outdated
 fn set_max_liquidity_msat(&mut self, amount_msat: u64, duration_since_epoch: Duration) {
 	*self.max_liquidity_offset_msat = self.capacity_msat.checked_sub(amount_msat).unwrap_or(0);
-	*self.min_liquidity_offset_msat = if amount_msat < self.min_liquidity_msat() {
-		0
-	} else {
-		self.decayed_offset_msat(*self.min_liquidity_offset_msat)
-	};
-	*self.last_updated = self.now;
+	if amount_msat < *self.min_liquidity_offset_msat {
+		*self.min_liquidity_offset_msat = 0;
+	}
+	*self.last_updated = duration_since_epoch;
+	*self.offset_history_last_updated = duration_since_epoch;
Likewise for `min_liquidity_offset_msat`.
Yea, mostly, let's discuss on your first comment at #2656 (comment)
 let existing_max_msat = self.max_liquidity_msat();
 if amount_msat < existing_max_msat {
It's a bit unintuitive that we compare against an un-decayed value even though we have the time.
Similarly, I think we can consider the undecayed values canonical for channels we're updating often, but we can discuss more on your first thread at #2656 (comment)
 /// Adjusts the channel liquidity balance bounds when failing to route `amount_msat`.
-fn failed_at_channel<Log: Deref>(&mut self, amount_msat: u64, chan_descr: fmt::Arguments, logger: &Log) where Log::Target: Logger {
+fn failed_at_channel<Log: Deref>(
+	&mut self, amount_msat: u64, duration_since_epoch: Duration, chan_descr: fmt::Arguments, logger: &Log
Gotta say, I really don't like the incongruity of passing the current time in the mutable interface but not in the non-mutable one, which can't be avoided with this approach. That makes uses of the non-mutable interface from the mutable interface harder to reason about. I much prefer the approach of passing the current time to `DirectedChannelLiquidity`.
Yea, I see why it's annoying to have another parameter, but I kinda disagree about it belonging in `DirectedChannelLiquidity`. Everything else in `DirectedChannelLiquidity` is just a reference to liquidity data for a single channel, plus a reference to the decay settings of the overall scorer. The current time doesn't fit into either of those; it's information about the failed payment, which is otherwise all arguments to the failed/success methods.
lightning/src/routing/scoring.rs
Outdated
 	*self.max_liquidity_offset_msat = 0;
 }
+*self.last_updated = duration_since_epoch;
+*self.offset_history_last_updated = duration_since_epoch;
Should we update `offset_history_last_updated` in `update_history_buckets` instead?
Yes, good catch, fixed.
lightning/src/routing/scoring.rs
Outdated
-let half_lives = self.now.duration_since(*self.last_updated).as_secs()
+fn update_history_buckets(&mut self, bucket_offset_msat: u64, duration_since_epoch: Duration) {
+	let half_lives =
+		duration_since_epoch.checked_sub(*self.offset_history_last_updated)
This isn't accurate given that `offset_history_last_updated` is updated in `set_min_liquidity_msat` and `set_max_liquidity_msat`, which could (but may not) be called prior to calling `update_history_buckets`. Do we have tests to catch this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we don't have really good testing of these kinds of issues, as evidenced also by your bugfix at #2530. Luckily, doing the decaying in the background means this isn't actually a concern anymore - we only care about this in the rare case that we need to decay the buckets now, but haven't run the decayer yet, and then we get a new datapoint. But that doesn't really matter, because it's no different from just increasing the half-life by a few minutes, which shouldn't really matter at all.
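As an aside, the `checked_sub(..)` in the new `update_history_buckets` guards against a time source that moves backwards. A small sketch of that pattern (the helper name is mine, not LDK's):

```rust
use std::time::Duration;

// Sketch of an elapsed-half-lives computation guarded by checked_sub:
// a clock that went backwards yields zero elapsed time instead of panicking.
fn elapsed_half_lives(now: Duration, last_updated: Duration, half_life: Duration) -> u64 {
    let elapsed = now.checked_sub(last_updated).unwrap_or(Duration::ZERO);
    if half_life.as_secs() == 0 { 0 } else { elapsed.as_secs() / half_life.as_secs() }
}

fn main() {
    let hl = Duration::from_secs(3600);
    // Two hours elapsed against a one-hour half-life.
    assert_eq!(elapsed_half_lives(Duration::from_secs(7200), Duration::ZERO, hl), 2);
    // Backwards clock: no underflow, just no decay.
    assert_eq!(elapsed_half_lives(Duration::ZERO, Duration::from_secs(60), hl), 0);
}
```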
Actually, just went ahead and removed the half-life-based decay here; there's really no reason for it and we should just rely on the one in `decay_liquidity_certainty`.
Force-pushed from 0498a13 to 60be6f9.
Did an initial pass (finally, excuse the delay).
@@ -274,7 +274,7 @@ macro_rules! define_run_body {
 	$channel_manager: ident, $process_channel_manager_events: expr,
 	$gossip_sync: ident, $peer_manager: ident, $logger: ident, $scorer: ident,
 	$loop_exit_check: expr, $await: expr, $get_timer: expr, $timer_elapsed: expr,
-	$check_slow_await: expr)
+	$check_slow_await: expr, $time_fetch: expr)
Rather than just making this a closure, could we introduce a `TimeSource` trait and add a default impl based on `SystemTime`? It seems that we regularly need to retrieve the time in some no-std compatible way, and it might be nice to make this generic (e.g., we'd also want something like that in lightning-liquidity, see https://github.com/lightningdevkit/lightning-liquidity/issues/54)?
In general I'd really prefer not to have a million copies of "CurrentTimeFetcher" in the API everywhere, and for the most part we don't need it - we can mostly just use the latest block timestamp and call it a day (and in a few places use timer ticks if we want more granular expiry). Part of the goal of this PR is to move towards removing the `Time` trait, which I think is really unnecessary (and also no longer depending on it being blazing fast for routing perf).
Not sure why having a trait is directly connected to using it in scoring? We could still have a trait-based impl that is called in the background processor? Just brought it up as in `lightning-liquidity` we probably want to store a reference to the time source the user will give us upon setup, and it might make sense to have that generic cross-compatible with other LDK crates that require the same functionality?
Right, I think my point is mostly "why can't `lightning-liquidity` use the block timestamp rather than the time?".
Addressed feedback and rebased.
Force-pushed from 60be6f9 to 507b842.
This unfortunately needs a rebase now.
Force-pushed from 2338237 to 2dce56d.
Rebased on top of #2774, since it needs to go anyway.
@@ -773,7 +796,10 @@ impl BackgroundProcessor {
 		handle_network_graph_update(network_graph, &event)
 	}
 	if let Some(ref scorer) = scorer {
 		if update_scorer(scorer, &event) {
+			use std::time::SystemTime;
Since this function uses system time now, it should probably be `#[cfg(all(feature = "std", not(feature = "no-std")))]` to handle when you'll be able to use both flags together in the future.
I think just `feature = "std"` is correct. If two crates depend on LDK, with one setting `std` and another setting `no-std`, LDK should build with all features. Otherwise, the crate relying on std features will fail to compile because of an unrelated crate also in the dependency tree.
Force-pushed from 2dce56d to 3711994.
Rebased now that #2774 landed.
Force-pushed from 3711994 to 506c068.
Yeah, decaying more often helps. For me it's more about consistency within our model - `last_updated` no longer has a well-defined meaning, as it may only be accurate for one offset. So we have to choose between internal consistency for a channel and consistency across channels with this approach.
@@ -114,7 +114,7 @@ const ONION_MESSAGE_HANDLER_TIMER: u64 = 1;
 const NETWORK_PRUNE_TIMER: u64 = 60 * 60;

 #[cfg(not(test))]
-const SCORER_PERSIST_TIMER: u64 = 60 * 60;
+const SCORER_PERSIST_TIMER: u64 = 60 * 5;
Not sure if we should use a constant here. It should be no more than the user-defined half-life, ideally such that the half-life is divisible by it.
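One hypothetical way to implement this suggestion (the helper and its constants are mine, not the PR's): derive the timer interval from the configured half-life rather than using a fixed constant, clamped to sane bounds.

```rust
use std::time::Duration;

// Hypothetical sketch of deriving the decay timer from the configured
// half-life instead of a constant: a twelfth of the half-life, clamped
// between one second and one hour. Not LDK's actual code.
fn decay_interval(half_life: Duration) -> Duration {
    (half_life / 12)
        .max(Duration::from_secs(1))
        .min(Duration::from_secs(60 * 60))
}

fn main() {
    // Six-hour half-life: decay every 30 minutes.
    assert_eq!(decay_interval(Duration::from_secs(6 * 3600)), Duration::from_secs(1800));
    // Aggressive 12-minute half-life: decay every minute.
    assert_eq!(decay_interval(Duration::from_secs(12 * 60)), Duration::from_secs(60));
}
```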
Hmm, I guess? If a user sets an aggressive half-life I'm not entirely convinced we want to spin their CPU trying to decay liquidity bounds. Doing it a bit too often when they set a super high decay also seems fine-ish? I agree it'd be a bit nicer to switch to some function of the configured half-life, but I'm not sure it's worth adding some accessor to `ScoreUpdate`.
@@ -1700,7 +1732,7 @@ mod tests {
 			_ = exit_receiver.changed() => true,
 		}
 	})
-}, false,
+}, false, || Some(Duration::from_secs(1696300000)),
What's behind the choice of this number?
It's, basically, when I wrote the patch.
Any reason why it can't be `Duration::ZERO` like in the other tests?
Not really, it just seemed a bit more realistic.
Given the value doesn't affect the test, it's just curious to the reader to see something different from all the other places.
Ah, I tried to switch to ZERO but the test fails - it expects to prune entries from the network graph against a static RGS snapshot that has a timestamp in it.
On a Skylake system, 10k samples in bench gives me these changes for this branch. There's quite a bit of noise, as usual, but it does look like a non-zero win.
More in the sense that its use in decaying isn't well defined. We should at least note that in the `decay_liquidity_certainty` implementation.
let half_life = decay_params.historical_no_updates_half_life.as_secs_f64();
if half_life != 0.0 {
	let divisor = powf64(2048.0, elapsed_time.as_secs_f64() / half_life) as u64;
	for bucket in liquidity.min_liquidity_offset_history.buckets.iter_mut() {
		*bucket = ((*bucket as u64) * 1024 / divisor) as u16;
	}
	for bucket in liquidity.max_liquidity_offset_history.buckets.iter_mut() {
		*bucket = ((*bucket as u64) * 1024 / divisor) as u16;
	}
	liquidity.offset_history_last_updated = duration_since_epoch;
}
IIUC, this means we'll decay partial half-lives but only after decaying one full half-life. Why bother with using `2048.0` and `1024` here if this is happening in the background?
Those multipliers are just to get reasonable precision. We could cast the bucket to a float and then do the whole thing in float math, but it seems easier to just keep the buckets as ints.
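Concretely, for exactly one full half-life the divisor is 2048 and the 1024 numerator makes the net factor 1/2, all while the buckets stay integers. A standalone restatement of the snippet above, using std's `f64::powf` in place of the no-std `powf64` helper:

```rust
// Restatement of the bucket decay above, with f64::powf standing in for
// the no-std powf64 helper: bucket * 1024 / 2048^(elapsed / half_life).
fn decay_bucket(bucket: u16, elapsed_secs: f64, half_life_secs: f64) -> u16 {
    let divisor = 2048f64.powf(elapsed_secs / half_life_secs) as u64;
    ((bucket as u64) * 1024 / divisor) as u16
}

fn main() {
    // Exactly one half-life: divisor = 2048, so each bucket is halved.
    assert_eq!(decay_bucket(400, 3600.0, 3600.0), 200);
    // Two half-lives: divisor = 2048^2, which empties small buckets entirely.
    assert_eq!(decay_bucket(400, 7200.0, 3600.0), 0);
}
```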
But why not do partial decays when less than one half-life has passed?
It kinda goes against the model of the historical buckets - they're intended to be "time-free", only using a decay parameter if we really haven't seen that channel in a long time. Now, I wouldn't be against revisiting that idea; it's quite possible we over-corrected from having too much of a time parameter in the non-historical data, but I'd like to think about that separately.
I see, SGTM.
In the next commits we'll need `f64`'s `powf`, which is only available in `std`. For `no-std`, here we depend on `libm` (a `rust-lang` org project), which we can use for `powf`.
In the coming commits, we'll stop relying on fetching the time during routefinding, preferring to decay score data in the background instead. The first step towards this - passing the current time through into the scorer when updating.
Force-pushed from 506c068 to bbaf6b8.
Force-pushed from d9b0f37 to 238c459.
Rather than relying on fetching the current time during routefinding, here we introduce a new trait method to `ScoreUpdate` to do so. This largely mirrors what we do with the `NetworkGraph`, and allows us to take on much more expensive operations (floating point exponentiation) in our decaying.
In the next commit, we'll start to use the new `ScoreUpdate::decay_liquidity_certainty` to decay our bounds in the background. This will result in the `last_updated` field getting updated regularly on decay, rather than only on update. While this isn't an issue for the regular liquidity bounds, it poses a problem for the historical liquidity buckets, which are decayed on a separate (and by default much longer) timer. If we didn't move to tracking their decays separately, we'd never let the `last_updated` field get old enough for the historical buckets to decay at all. Instead, here we introduce a new `Duration` in the `ChannelLiquidity` which tracks the last time the historical liquidity buckets were last updated. We initialize it to a copy of `last_updated` on deserialization if it is missing.
This implements decaying in the `ProbabilisticScorer`'s `ScoreUpdate::decay_liquidity_certainty` implementation, using floats for accuracy since we're no longer particularly time-sensitive. Further, it (finally) removes score entries which have decayed to zero.
Because scoring is an incredibly performance-sensitive operation, doing liquidity information decay (and especially fetching the current time!) during scoring isn't really a great idea. Now that we decay liquidity information in the background, we don't have any reason to decay during scoring, and we remove the historical bucket liquidity decaying here.
Because scoring is an incredibly performance-sensitive operation, doing liquidity information decay (and especially fetching the current time!) during scoring isn't really a great idea. Now that we decay liquidity information in the background, we don't have any reason to decay during scoring, and we ultimately remove it entirely here.
Now that we aren't decaying during scoring, when we set the last_updated time in the history bucket logic doesn't matter, so we should just update it when we've just updated the history buckets.
In the coming commits, the `T: Time` bound on `ProbabilisticScorer` will be removed. In order to enable that, we need to pass the current time (as a `Duration` since the unix epoch) through the score updating pipeline, allowing us to keep the `*last_updated_time` fields up-to-date as we go.
In the coming commits, the `T: Time` bound on `ProbabilisticScorer` will be removed. In order to enable that, we need to switch over to using the `ScoreUpdate`-provided current time (as a `Duration` since the unix epoch), making the `T` bound entirely unused.
Now that we don't access time via the `Time` trait in `ProbabilisticScorer`, we can finally drop the `Time` bound entirely, removing the `ProbabilisticScorerUsingTime` and type alias indirection and replacing it with a simple struct.
As we now no longer decay bounds information when fetching them, there is no need to have a decaying-fetching helper utility.
This is a good gut-check to ensure we don't end up taking a ton of time decaying channel liquidity info. It currently clocks in at around 1.25ms on an i7-1360P.
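A rough sketch of this kind of gut-check (the channel count and data layout here are made up; the real benchmark exercises the actual scorer): decay a large set of liquidity offsets and time the pass with `Instant`.

```rust
use std::time::Instant;

// Apply one precomputed decay factor to every (scid, offset) pair, then
// prune entries which have decayed to zero.
fn decay_offsets(offsets: &mut Vec<(u64, u64)>, factor: f64) {
    for (_, offset) in offsets.iter_mut() {
        *offset = (*offset as f64 * factor) as u64;
    }
    offsets.retain(|&(_, offset)| offset > 0);
}

fn main() {
    let mut offsets: Vec<(u64, u64)> = (0..100_000u64).map(|i| (i, i * 1_000)).collect();
    let start = Instant::now();
    decay_offsets(&mut offsets, 0.5);
    println!("decayed {} channels in {:?}", offsets.len(), start.elapsed());
}
```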
Now that the serialization format of `no-std` and `std` `ProbabilisticScorer`s both just use `Duration` since UNIX epoch and don't care about time except when decaying, we don't need to warn users to not mix the scorers across `no-std` and `std` flags. Fixes lightningdevkit#2539
There are some edge cases in our scoring where information really should have been decayed but hasn't yet been prior to an update. Rather than trying to fix them exactly, we instead decay the scorer a bit more often, which largely resolves them and also gives us somewhat more accurate bounds on our channels, allowing us to immediately retry a channel at an amount similar to the one that just failed, albeit at a substantial penalty.
Because we decay the bucket information in the background, there's not much reason to try to decay them immediately prior to updating, and in removing that we can also clean up a good bit of dead code, which we do here.
Now that we use explicit times passed to decay methods, there's no reason to make calls to `SinceEpoch::advance` in scoring tests.
Squashed the fixups, without any changes. |
Code overall looks good. I'm not opposed to the approach, necessarily. I think there can still be an issue with decaying in the case of a node with a large payment volume (see #2656 (comment)). Otherwise, I don't have any other concerns. @tnull Could you take another look? |
Yea, I mean it's definitely not "right", just not clear to me it's "wrong" either. The only thing we could really do to address it is split the last_updated fields into two. I'm totally fine doing that if you think it's worth it. |
Did another pass.
LGTM I think. ACKing in case we want to land this soon, but might come back for yet another pass next week.
I guess it just can come across as surprising to an outside observer. They can tell that a bound has been adjusted, but one of them doesn't seem to be decaying as expected based on the configuration and the observed time of adjustment. Maybe it's not horrible in practice? |
I guess to an outside observer it just looks like both ends got updated? Which isn't true, but not crazy broken either (at least in the sense that I'm not sure what outside observer would be looking at both their failures and their success/failure stream and somehow caring about the inconsistent decays). I'm happy with either solution, though - as-is or splitting the last-updated tracking. We could also split the last-updated tracking in a followup, it'd be a new commit either way. |
Was thinking in terms of something like But, yeah, a follow-up is fine. Seems like we can always do it later if we find problems, too, given the serialization format. Just seems more correct to use separate timestamps. |
Alright, gonna merge this. I have a large pile of performance tweaks to the router and scorer up next, and I can incorporate a separate decay for the two bounds in some of that work. |
usage.inflight_htlc_msat = 0; | ||
assert_eq!(scorer.channel_penalty_msat(&candidate, usage, ¶ms), 866); |
Was just fixing a build warning and noticed this. Why did this check need to be removed? Deleted in 35b4964.
The previous test relied on the behavior where we used undecayed data in the buckets when scoring, and only considered the decay when deciding whether to score at all. We now actually decay the data, so the undecayed data is no longer available.
Because scoring is an incredibly performance-sensitive operation,
doing liquidity information decay (and especially fetching the
current time!) during scoring isn't really a great idea. Instead, this PR moves to handling decaying in a background processor job.
This should fix #2311, which apparently is still an issue for some users as of 0.0.116.
There was some discussion of an alternative approach where we fetch the time at the start of a routefinding session, store it on the stack, and pass it through to the scorer as we go. I opted not to do this because (a) bindings can't map unbounded generics, which this would need, (b) this avoids actually doing the decay during scoring at all, which probably saves an ms or two, though certainly not a ton, (c) this leads to a much nicer/simpler API - we can remove `Time`, which we either need to remove or make public (see #2497), and can drop the excess type alias, both of which are much nicer than the alternative. I'm open to more discussion here, but the cost of having one more thing to call as time moves forward doesn't seem high enough to outweigh a, b, and c here.
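A minimal sketch (hypothetical types, not LDK's real `ScoreUpdate` trait) of the approach taken here: callers read the clock once and pass a `Duration` since the Unix epoch into each score-update call, so the scorer itself never fetches the time and needs no `Time` generic.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Toy scorer: it only records the caller-provided clock reading, rather
// than reading a clock itself via a `Time` trait bound.
struct Scorer { last_updated: Duration }

impl Scorer {
    fn payment_path_failed(&mut self, _scid: u64, duration_since_epoch: Duration) {
        self.last_updated = duration_since_epoch;
    }
}

fn main() {
    // The caller fetches the time once, outside the scorer.
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before the Unix epoch");
    let mut scorer = Scorer { last_updated: Duration::ZERO };
    scorer.payment_path_failed(42, now);
    assert_eq!(scorer.last_updated, now);
    println!("scorer last updated {} secs after epoch", scorer.last_updated.as_secs());
}
```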