Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4/5]sweep: deepen Aggregator interface #8148

Merged

Conversation

yyforyongyu
Copy link
Member

@yyforyongyu yyforyongyu commented Nov 3, 2023

This PR prepares for #8424 by providing two main changes,

  • Timestamp-based batch ticker is replaced by blockstamp-based batch ticker. We want to batch as many transactions as possible without degrading the security. Previously our ticker is timestamp-based, with a default value of 30s. This 30s provides little utility in terms of batching, especially when the inputs still have CLTV that's hundreds of blocks away. Thus, it's safe to change it and use blockstamp instead, which makes our fee bumper easier to implement. For time sensitive inputs, callers need to set the Force flag and UtxoSweeper will now sweep them immediately, which is also different as previously this Force flag would only trigger the ticker to start.
  • the Aggregator interface is deepened and now return the InputSet interface. This InputSet gives a set of inputs to be included in the final sweeping transaction. It also tells UtxoSweeper whether the current set needs wallet inputs and, if so, provides a method to add them to the set. On the other hand, making this new interface makes it easier to write unit tests.

In the final PR, new implementations of Aggregator and InputSet are created to provide the final fee bumper.

TODO

  • fix existing unit tests
  • add unit tests for new code
  • fix existing itests

Fixes #8391

@saubyk
Copy link
Collaborator

saubyk commented Nov 28, 2023

cc: @Roasbeef for approach ack here

@morehouse
Copy link
Collaborator

IIUC, this PR does a few things:

  • moves fee estimator to an interface, allowing future alternative estimators
  • moves InputSet to an interface (I'm not sure why)
  • extracts input clustering logic into SimpleAggregator and creates the UtxoAggregator interface, allowing future alternative aggregators
  • changes sweep frequency to once per block, instead of every 30 seconds
  • adds RBF info (fee and feerate) to pending inputs and uses it to:
    • not include inputs in the sweep tx when the fee rate hasn't been increased
    • do a minimal fee bump for forced sweeps

Am I misunderstanding or missing anything?


I would find this easier to review if it was broken into smaller PRs. I think we could peel off some straightforward refactoring PRs and leave this PR for the new stuff.

@morehouse
Copy link
Collaborator

For the RBF info, why load it from the DB when we can extract it directly from the spend tx in the mempool? I imagine this would be more resilient in situations where someone besides us could broadcast a competing transaction (e.g., HTLC sweeps, anchors).

IIUC, the current code just "gives up" if someone else attempts a sweep of the inputs first. What's the reasoning here?

@yyforyongyu
Copy link
Member Author

@morehouse nice summary! Yeah can def create smaller PRs, but want to have a concept ACK first.

For the RBF info, why load it from the DB when we can extract it directly from the spend tx in the mempool?

We def can, like a combination of gettxspendingprevout and testmempoolaccept should give us enough info to create a tx that's accepted by the mempool. But it won't work for neutrino backend as there's no mempool. So we still need some ways, tho not guaranteed working, to enable RBF when using neutrino.

@morehouse
Copy link
Collaborator

* moves fee estimator to an interface, allowing future alternative estimators

Concept ACK.

* moves `InputSet` to an interface (I'm not sure why)

Is this change needed? What alternative InputSets do we plan to implement in the future?

* extracts input clustering logic into `SimpleAggregator` and creates the `UtxoAggregator` interface, allowing future alternative aggregators

Concept ACK.

* changes sweep frequency to once per block, instead of every 30 seconds

Concept ACK.

* adds RBF info (fee and feerate) to pending inputs and uses it to:
  
  * _not_ include inputs in the sweep tx when the fee rate hasn't been increased
  * do a minimal fee bump for forced sweeps

Concept ACK. I'm not 100% sure about the best approach for implementing this yet -- will think some more.

 But it won't work for neutrino backend as there's no mempool. So we still need some ways, tho not guaranteed working, to enable RBF when using neutrino.

I think the current implementation also doesn't work for neutrino then, since it relies on checking the mempool to decide whether to load the RBF info from the DB.

@yyforyongyu yyforyongyu mentioned this pull request Dec 5, 2023
4 tasks
@Crypt-iQ
Copy link
Collaborator

I think we should try to adhere to RBF as closely as possible so we can ensure that we can get our sweep transactions in the mempool. There are some things missing in this PR, I think due to the sweeper's current architecture that prevent us from aligning with Core's RBF policy (https://github.com/bitcoin/bitcoin/blob/master/doc/policy/mempool-replacements.md). This might take several PRs and a bit of a rework to get right.

  • Rule 2: The replacement transaction only include an unconfirmed input if that input was included in one of the directly conflicting transactions. An unconfirmed input spends an output from a currently-unconfirmed transaction.

    • Currently the sweeper could broadcast Tx 1 and later decide to RBF it with Replacement Tx that includes an anchor spend of an unconfirmed Commit Tx.
      RBF anchors
    • Since the original transaction (Tx 1) doesn't have the anchor input, this Replacement Tx would be rejected from the mempool.
    • Following rule 2 means treating unconfirmed inputs (think this only includes anchor spends) separately and not trying to RBF with them
  • Rule 3: The replacement transaction pays an absolute fee of at least the sum paid by the original transactions.

    • This requires the sweeper to track which transactions its replacing and sum them. In cases where we compete with the counterparty, this would require us to scan the mempool via gettxspendingprevout.
  • Rule 4: The additional fees (difference between absolute fee paid by the replacement transaction and the sum paid by the original transactions) pays for the replacement transaction's bandwidth at or above the rate set by the node's incremental relay feerate. For example, if the incremental relay feerate is 1 satoshi/vB and the replacement transaction is 500 virtual bytes total, then the replacement pays a fee at least 500 satoshis higher than the sum of the original transactions.

    • Like rule 3, this requires tracking which transactions are getting replaced.
  • Rule 6: The replacement transaction's feerate is greater than the feerates of all directly conflicting transactions.

    • This should already be handled by this PR which ensures that the replacement tx has a higher feerate than what it's replacing.

Copy link
Contributor

coderabbitai bot commented Jan 25, 2024

Important

Auto Review Skipped

Auto reviews are disabled on base/target branches other than the default branch. Please add the base/target branch pattern to the list of additional branches to be reviewed in the settings.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository.

To trigger a single review, invoke the @coderabbitai review command.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit-tests for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit tests for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit tests.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

CodeRabbit Discord Community

Join our Discord Community to get help, request features, and share feedback.

@yyforyongyu yyforyongyu changed the title sweep: make sweeping inputs RBF-aware [4/5]sweep: deepen Aggregator interface Jan 25, 2024
@yyforyongyu yyforyongyu requested review from morehouse and ziggie1984 and removed request for morehouse and ziggie1984 January 25, 2024 01:45
@yyforyongyu yyforyongyu force-pushed the sweeper-rbf-aware branch 2 times, most recently from 50d7bf7 to 1156e5c Compare January 29, 2024 06:26
@yyforyongyu yyforyongyu added the llm-review add to a PR to have an LLM bot review it label Jan 30, 2024
@ProofOfKeags ProofOfKeags self-requested a review February 8, 2024 22:24
Copy link
Collaborator

@ProofOfKeags ProofOfKeags left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another banger. This helps clarify some of my confusion from some previous PRs. I like that we only have the InputSet now instead of both Cluster and InputSet. Had some questions and comments to look at but it's small.

sweep/aggregator.go Outdated Show resolved Hide resolved
@@ -86,7 +224,19 @@ func (s *SimpleAggregator) ClusterInputs(inputs pendingInputs) []inputCluster {
// Since the inputs that we clustered by fee rate don't commit to a
// specific locktime, we can try to merge a locktime cluster with a fee
// cluster.
return zipClusters(lockTimeClusters, feeClusters)
clusters := zipClusters(lockTimeClusters, feeClusters)
Copy link
Collaborator

@ProofOfKeags ProofOfKeags Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the assumed order of these cluster lists? It's not obvious why I should be able to zip these clusters together this way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function could be renamed to make it a bit more clear, but what it does today is merged the two lists, preserving the fee rate ordering from the first. In other words, lockTimeClusters will have merged into the items from feeClusters that are as high as the element in the lock time cluster.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to be careful here as a zip operation (pairwise combination) needs to make sure there is proper index alignment.

Diving into the implementation details it seems that this zipClusters operation isn't actually a zip (at least with the terminology I'm familiar with) because it does two independent sorts by feerate on each vector of inputs. What concerns me further here is that since feerates are not enumerable it means that we can't really reason about the relative densities between the two slices. If we don't have some sort of guarantee on how dense each slice is wrt feerate gain, then a single advancement of the j cursor per i cursor advancement may not make sense. I'm still trying to wrap my head around what the cluster operations are. Maybe a high level pseudocode description without loops and cursors is needed.

My understanding here is that clusterByLocktime allocates a bunch of locktime buckets, into which we are distributing the non-locktime inputs provided they meet some sort of fee distance criteria (still trying to figure out exactly what criteria that is).

that are as high as the element in the lock time cluster

I don't see any guarantee that there is a single element in the locktime cluster, so maybe it is if the input is at least as high as the cluster average feerate? That would make sense to me because we wouldn't want to drag down the feerate of a cluster that had time sensitivity.

@yyforyongyu can you confirm this interpretation for me and then maybe we can work on making it more obvious in code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the current clustering logic is unnecessarily complicated, I'm in favor of removing SimpleAggregator once the BudgetAggregator is in place.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right but can we confirm or correct my above interpretation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I've learnt from TestZipClusters, think once talked about this in a study group - my understanding is that, once clustered by locktimes and fee rate buckets, these clusters are merged if the timelock clusters have a smaller fee rate than the fee rate clusters.

The effect is, for locktime clusters, their fee rates will be increased if the non-locktime clusters have higher ones.


// TestCreateInputSets checks that ...
func TestCreateInputSets(t *testing.T) {
// TODO(yy): add it.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙃

sweep/tx_input_set.go Show resolved Hide resolved
sweep/sweeper.go Outdated Show resolved Hide resolved
sweep/sweeper.go Outdated
Comment on lines 1561 to 1562
if err != nil {
log.Errorf("input cluster sweep: %v", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we ought to do more than logging here?

sweep/tx_input_set.go Show resolved Hide resolved
sweep/sweeper.go Show resolved Hide resolved
sweep/sweeper.go Outdated Show resolved Hide resolved
sweep/sweeper.go Outdated Show resolved Hide resolved
sweep/sweeper.go Outdated Show resolved Hide resolved
sweep/sweeper.go Outdated
// Reschedule the inputs that we just tried to sweep. This is done in
// case the following publish fails, we'd like to update the inputs'
// publish attempts and rescue them in the next sweep.
s.rescheduleInputs(tx.TxIn)
err = s.markInputsPendingPublish(tr, tx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that only the tx record is stored on disk (and not al information for all the inputs), what use is the pending vs published state? We don't attempt to recover the past published inputs from disk on start up, so knowing if something was pending published vs actually published (given if we fail, we'll mark the publish as fail) isn't necessarily an actionable item.


args := m.Called(op)

if args.Get(0) == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can it be nil if a concrete outpoint is passed in?

sweep/aggregator.go Outdated Show resolved Hide resolved
sweep/sweeper_test.go Outdated Show resolved Hide resolved
sweep/sweeper.go Outdated Show resolved Hide resolved
sweep/tx_input_set.go Show resolved Hide resolved
sweep/sweeper.go Show resolved Hide resolved
This commit changes the source that drives the state changes in the
sweeper. Previously we used a ticker with default interval of 30s to
trigger sweepings periodically. The assumption is, within this 30s we'd
batch multiple inputs into one transaction to maximize profits. However,
the efficacy of this batch is questionable.

At a high level, we can put our inputs into two categories - one that's
forced, and one that's not. For forced inputs, we should sweep them
immediately as the force flag indicates they are very urgent, eg,
CPFPing the force closing tx. For non-forced inputs, such as anchors
or HTLCs with CLTV that's far away, we can wait to sweep them till a new
block comes in and triggers the sweeping process.

Eventually, all inputs will be deadline-aware, and the sweeper will
consult our fee bumper about the most economical fee rate to be used for
a given deadline. Since the deadlines here are blockstamp, it's also
easier to manage them if the sweeper is also using blockstamp instead of
timestamp.
This commit adds a new interface `Cluster` to manage cluster-level
inputs grouping. This new interface replaces the `inputCluster` and will
be futher refactored so the sweeper can use a much smaller coin
selection lock.
Previously the fee rate is tracked at cluster level, which may not be
accurate as each cluster is then divided into input sets. And these sets
are what's actually included in the final sweeping tx. To properly
reflect the final fee rate used in the sweeping tx, `InputSet` is added
so more customized clustering logic can be implemented in the future.
For intance, atm it's clustered by fee rates, while we may also cluster
by deadlines, urgencies, etc.
This commit makes the `ClusterInputs` directly returning the `InputSet`
so the sweeper doesn't know about the existence of `Cluster` interface.
This way we can have a deeper interface as the sweeper only needs to
interact with `Aggregator` only to get the final input sets, leaving the
implementation details being managed by `SimpleAggregator` and future
aggregators.
This commit changes how `WithCoinSelectLock` is used - previously the
lock is held when creating the input sets, now it's only be held after
the input sets have been created and explicitly signal they need wallet
inputs.
@Roasbeef Roasbeef merged commit af9826c into lightningnetwork:elle-sweeper-rbf-2 Mar 27, 2024
24 of 27 checks passed
@@ -320,7 +320,7 @@ func TestClusterByLockTime(t *testing.T) {
)

// Create a test aggregator.
s := NewSimpleUtxoAggregator(nil, maxFeeRate)
s := NewSimpleUtxoAggregator(nil, maxFeeRate, 100)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use constant here: DefaultMaxInputsPerTx ?

@@ -1609,8 +1609,9 @@

[sweeper]

; Duration of the sweep batch window. The sweep is held back during the batch
; window to allow more inputs to be added and thereby lower the fee per input.
; DEPRECATED: Duration of the sweep batch window. The sweep is held back during
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would even delete it from the sample.conf because it has no effect. DEPRECATED in my understand suggests that it should not be used but will still work correctly ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah the linter won't pass if it's not removed, I think in 0.19 I'll remove lots of the deprecated options, just need to figure out how to not break user's scripts.

Copy link
Collaborator

@morehouse morehouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a security concern about reducing the sweeping interval to once per block -- see inline comment.

Perhaps we should keep the 30s sweep ticker.

"sweeping %d inputs", epoch.Height, len(inputs))

// Attempt to sweep any pending inputs.
s.sweepPendingInputs(inputs)
Copy link
Collaborator

@morehouse morehouse Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By not broadcasting new inputs until the next block confirms, we effectively reduce all confirmation windows by 1. So if the deadline is set to 1 (next block), we will automatically miss it.

Obviously the goal is to offer the sweeper inputs much earlier, but if the node is late for some reason (e.g., was offline) this could be the difference between recovering funds and losing them.

This also may make it significantly easier to steal funds via DoS attacks -- after each restart the victim will wait ~10 minutes (1 block) before attempting to broadcast the sweep transaction instead of 30 seconds. This gives the attacker a lot of time to crash the victim again. Consider the following attack:

  1. Mallory routes an HTLC through Alice to Bob.
  2. Bob claims the HTLC from to Alice.
  3. Mallory ignores Alice's attempt to claim the HTLC.
  4. 10 blocks before the HTLC times out, Alice's contractcourt decides to go onchain and offers the anchor spend to the sweeper. The sweeper starts waiting for the next block to broadcast.
  5. Mallory crashes Alice using some unpatched DoS vector.
  6. Every time Alice comes back online, Mallory starts DoSing her again. As long as Mallory can keep crashing Alice before the next block confirms, Alice's sweeper will never broadcast any transactions.
  7. After 10 blocks, Mallory claims the HTLC timeout path on chain, successfully stealing the HTLC from Alice.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I can crash you every block why couldn't I crash you every 30s?

I don't see an issue to moving to a timer, though. If you think that makes a material difference, we can probably change that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I can crash you every block why couldn't I crash you every 30s?

Maybe, maybe not. Some DoS attacks cause crashes instantly (e.g., nil dereference, OOB), others overload the victim more slowly and take longer to cause crashes (e.g., OOM, disk-based DoS).

I still think our best defense against DoS is #8166. But even if we have that, we still need to ensure that sweep transactions get broadcast immediately-ish on startup. If we have to wait 10 minutes, #8166 doesn't help.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm don't we have the Force Flag in place for time-sensitive sweeps which would let us sweep the inputs during restarts as soon as the resolvers register it with the sweeper ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should fix the DoS and prevent it from happening, rather than trying to build a system assuming there's one - I don't think there's much you can do in that case.

Some users don't use the default 30s, our docs recommend 10m, and some may choose hours. And in the given example, if I'm the attacker, I'd crash Alice's node before step 4, rendering her contractcourt useless.

after each restart the victim will wait ~10 minutes (1 block) before attempting to broadcast the sweep transaction instead of 30 seconds.

I think what's missing from our sweeper is the ability to recover from restart - which has been discussed multiple times in these PR series. The sweeper is almost stateless atm, and it should pick up the existing pending inputs during restart, and offer them to the fee bumper immediately.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Designing the system to protect funds under DoS is within reach, so let's do it.

Is it? I am not so sure I agree.

#8166 seems well within reach to me. I think significantly less work is needed than has been done on this set of sweeper PRs already.

The premise you're starting with is that there are DoS vectors that are abundant, and if DoS can take the node down, and there exist deadline based mechanisms in LN, funds will always be at stake. What am I missing?

Define abundant? All an attacker needs is 1 DoS vector to cause mayhem with current LND nodes. Are you confident there are 0 DoS vectors in LND today and forever into the future?

The goal is to keep funds safe even if the node is repeatedly crashed.

I definitely agree that we should begin sweeping on startup. There is no reason to delay arbitrarily. This seems like an orthogonal issue to whether subsequent sweep operations are block driven or some other heartbeat driven.

Yes, if we set up the contractcourt and sweep on startup before allowing any peer connections, then #8166 has essentially been accomplished. That's the goal.

Until then, a 30s delay before sweeping on startup seems better than 10 minutes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All an attacker needs is 1 DoS vector to cause mayhem with current LND nodes. Are you confident there are 0 DoS vectors in LND today and forever into the future?

Absolutely not :).

I'm also not really contesting this. What I'm contesting is what can be done about it. I think #8166 is a generally good idea but it ignores the risks incurred by being disconnected from our peers. I think the assumption that we have to be theoretically crash proof at all times is both unrealistic and a distraction from more pressing issues. I think if you want to make things more crash proof, then decoupling the subsystems is yet another way to handle it. This way crashing the peer interface doesn't take down the whole node, just that particular system.

My main question is that if we're starting from having failure assumptions "someone can crash my node", how broken is reasonable for us to design for? Why not design things in a way where each subsystem can't rely on any assumptions from any other system? It quickly gets into territory that doesn't make any sense.

I think fault tolerance is a good goal but I also think we need to be careful to not drive ourselves into spending lots of effort on the ever rarer tail events and not accomplish anything while we try to cover edge cases that don't materialize.

I'm open to having my mind changed here, I just generally don't find myself compelled by the argument. One thing that could help me sway is for you to explain why the line in the sand you've happened to draw is unique, and not just an arbitrary point on the continuum towards maximally fault-tolerant software that doesn't consider other dimensions of the problem. Otherwise the slope to me seems rather slippery.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mallory crashes Alice using some unpatched DoS vector.
Every time Alice comes back online, Mallory starts DoSing her again. As long as Mallory can keep crashing Alice before the next block confirms, Alice's sweeper will never broadcast any transactions.

So let's say that Alice had a trigger to not accept any connections until all contracts have been resolved. How do we define "resolved" here? Will Alice wait until ultimate confirmation of her sweep? If so, then we also need to be ready for the possibility that there's some crazy fee spike, and confirmation takes much longer than anticipated. If we don't accept any connections during this interval, then it's possible that other active HTLCs we had with other peers, or even other channels with Alice also times out.

In short, we also need to weigh the opportunity costs of time outs and force closes of other channels if we don't reconnect to any peers until all on chain contracts have been resolved.

I think what we need to move forward here is a concrete proposal w.r.t what the trigger is to resume reconnections re #8166. I think in order to fully isolate we shouldn't even create the listener until we've resolved everything on chain. Of course, kernel level network DoS would still apply, but we only control the application layer in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can re-add the timer in addition to the block based model, but in itests, we'd essentially disbale the timer all together (as they were all updated to be block based after this change).

Before this series of changes, we had the timer as we would at times knowingly broadcast garbage. So we had to keep retrying in the hop that we actually broadcasted something that would confirm, which would then let us retarget to remove the other garbage. With the new approach, we'll no longer broadcast anything that wouldn't be accepted into our node's mempool, so the move to block based (the only time the UTXO state can change) made more sense.

Copy link
Collaborator

@morehouse morehouse Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a more concrete explanation to #8166 (comment). Let's move that discussion over there.

Assuming the new sweeper will be merged before any work on #8166, I think it's worth adding the sweep timer back in for now. It can be removed once properly do sweeping on startup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
llm-review add to a PR to have an LLM bot review it rbf refactoring utxo sweeping
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

7 participants