
Feature request: Graceful shutdown #4842

Open
whitslack opened this issue Oct 6, 2021 · 14 comments

@whitslack
Collaborator

@ZmnSCPxj explained:

HTLCs being in-flight are a normal part of operation and that state can persist for days; we cannot delay daemon shutdown until there are no in-flight HTLCs when in-flight HTLCs can persist for days at a time.

It is certainly true that daemon shutdown cannot be deferred until all HTLCs are cleared, but a significant improvement in Lightning UX could be achieved by implementing a graceful shutdown that would block the addition of new HTLCs and wait for up to a specified timeout for all HTLCs to clear.

Feature Request

  • Add an optional timeout parameter to the stop RPC to specify a graceful shutdown timeout.
  • When stop begins executing, begin refusing all requests to add HTLCs to channels (both from peers and from local commands).
  • When there are no in-flight HTLCs, terminate the daemon.
  • When the timeout expires, terminate the daemon.

Implementing this feature would reduce the occurrence of slow payment attempts for users of the Lightning Network. Rebooting a C-Lightning server can take several minutes, during which time any users with in-flight HTLCs must wait. This is bad UX and is not helping Lightning adoption. We can't easily fix HTLCs that go out to lunch and never return, but we can avoid dropping fresh HTLCs on the floor while we go out to lunch.
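
For illustration only, here is a rough sketch (not part of C-Lightning) of the "wait up to a timeout, then stop" half of this request, written against pyln-client; the RPC socket path and timeout are placeholder values. It cannot implement the other half, refusing new HTLCs, which is exactly the part that needs support inside lightningd:

#!/usr/bin/env python3
# Hypothetical helper: poll for in-flight HTLCs and call `stop` once they
# clear or a timeout expires. It does NOT block new HTLCs.
import time
from pyln.client import LightningRpc

RPC_PATH = "/run/lightning/lightning-rpc"  # placeholder, adjust to your setup
TIMEOUT = 600                              # seconds to wait before giving up

def inflight_htlcs(rpc):
    """Count HTLCs currently in flight across all channels."""
    total = 0
    for peer in rpc.listpeers()["peers"]:
        for channel in peer.get("channels", []):
            total += len(channel.get("htlcs", []))
    return total

def main():
    rpc = LightningRpc(RPC_PATH)
    deadline = time.time() + TIMEOUT
    while time.time() < deadline and inflight_htlcs(rpc) > 0:
        time.sleep(5)
    rpc.stop()  # terminate lightningd whether or not all HTLCs cleared

if __name__ == "__main__":
    main()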

@whitslack
Collaborator Author

N.B.: @ajpwahqgbi has attempted to write a C-Lightning graceful shutdown plugin, but it doesn't work correctly, as it apparently blocks the daemon's event loop and prevents any HTLCs from clearing out at all, thus actually exacerbating the problem it's trying to remedy. Also, it lacks a timeout.

@ZmnSCPxj
Collaborator

ZmnSCPxj commented Oct 6, 2021

This may not be (perfectly) possible in terms of the BOLT protocol.

In BOLT#2, there is an implicit ordering requirement. Suppose you are doing (or want to do, ha) a graceful shutdown, and I am your peer. There is an HTLC you already offered to me. Now, suppose I send the following messages to you, in the specified order:

  • update_add_htlc of a brand-new HTLC.
  • update_fulfill_htlc of the previous HTLC you offered to me.

You can either respond:

  • To neither message.
  • To the first message.
  • To the first and second message.

However, note that the first message adds an HTLC, which is further from your goal of reducing the number of HTLCs in-flight before you finalize shutdown.

You cannot respond to just the second message and remove an HTLC without adding the first message HTLC.

The first option (respond to neither message) is equivalent to just shutting down with a SIGKILL immediately then and there instead of trying for a timed-out graceful shutdown. The third option is not a marked improvement: you remove one HTLC but gain another anyway.

I think that properly speaking we may need a spec change, where a node tells its peers "no more update_add_htlcs until I reconnect" but will still accept fulfilling and failing HTLCs to reduce in-flight HTLCs.

--

However, we can (maybe?) offer a partial solution.

We can have channeld accept a graceful_shutdown message, which puts it in graceful_shutdown mode. It should probably also get that flag in its init message.

In this mode, channeld will handle update_fulfill_htlc and update_fail_htlc as normal. However, if it receives update_add_htlc it will immediately exit, which automatically closes the peer connection and appears to the peer as a sudden shutdown of our node.

This at least lets us have some chance to reduce the number of HTLCs without gaining more HTLCs, without having to modify the spec (which would take longer).
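
To make that concrete, here is a minimal, self-contained sketch of the decision logic being proposed (illustrative Python only; channeld itself is C, and the function and set names are made up for this example):

# Illustrative decision logic only: what the proposed graceful-shutdown
# mode would do with each BOLT#2 peer message type.
REMOVE_OR_COMMIT = {"update_fulfill_htlc", "update_fail_htlc",
                    "update_fail_malformed_htlc",
                    "commitment_signed", "revoke_and_ack"}

def graceful_shutdown_action(msg_type: str) -> str:
    if msg_type in REMOVE_OR_COMMIT:
        # Removing HTLCs (and the commit/revoke dance) only settles or
        # shrinks the in-flight set, so keep processing it as normal.
        return "process"
    if msg_type == "update_add_htlc":
        # There is no in-protocol way to refuse a single update, so drop
        # the connection; to the peer it looks like a sudden shutdown.
        return "disconnect"
    return "process"

# Example:
print(graceful_shutdown_action("update_fulfill_htlc"))  # process
print(graceful_shutdown_action("update_add_htlc"))      # disconnect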

@whitslack
Collaborator Author

@ZmnSCPxj: In your example, why not reply to the update_add_htlc with a temporary failure 0x2002 (temporary_node_failure) and process the update_fulfill_htlc normally? We should endeavor to keep the channeld running if there are any HTLCs in flight. The channeld processes for channels with no HTLCs in flight can be terminated immediately (i.e., sooner than lightningd times out and terminates).

@ZmnSCPxj
Collaborator

ZmnSCPxj commented Oct 6, 2021

You cannot reply to update_add_htlc with a routing failure --- that is done with a later update_fail_htlc. Unfortunately, that means accepting the update_add_htlc now, and only after the HTLC has been irrevocably committed (i.e. we have exchanged signatures and added the new HTLC as in-flight) can we update_fail_htlc with temporary_node_failure. The counterparty might not reply immediately (we have no control over their code), and our graceful-shutdown timer might time out (or your systemd gets bored and decides to SIGKILL you) before they reply, meaning we "gracefully" shut down with one more HTLC in flight than when the grace period started.

So no, your only recourse is to drop the connection so that neither you nor the counterparty ever instantiate the new incoming HTLC. There is still the chance that they reconnect within the grace period and then decide not to update_add_htlc but still update_fulfill_htlc the existing HTLC, depending on how their implementation is coded (maybe?).

(In fact I believe we do this now: if an HTLC has just arrived and is to be forwarded, but we get shut down before we have actually sent out the update_add_htlc to the next hop, we will temporary_node_failure the HTLC; though @rustyrussell probably knows the HTLC state machine much better than I do.)

@whitslack
Collaborator Author

Damn. Who the hell designed this protocol? It has so many novice mistakes. Anyway, thanks for your explanations.

@ZmnSCPxj
Collaborator

ZmnSCPxj commented Oct 6, 2021

N00b LN devs 4->5 years ago (all of us were n00bs, LOL).

Mostly the concern then was to ensure that we could always have a fallback safely on-disk before doing potentially funds-losing operations, which is why the protocol was designed that way. Graceful shutdowns were not a consideration, but sudden unexpected I-tripped-over-the-power-cord-sorry shutdowns were, so the protocol was designed so that sudden disconnections are perfectly fine (which is why I suggest just disconnecting: the protocol is designed to survive that, and that code path is well-tested in all implementations). It is also much easier to implement and test your implementation if you always process messages in the order you receive them, rather than supporting an "I can respond to that message but ignore this other message", especially if you need to save stuff on disk and then re-establish later with the peer from your saved on-disk state. Graceful shutdowns were not a concern; ungraceful kick-the-cord shutdowns were.

@whitslack
Collaborator Author

For what it's worth, I don't mean that responding to protocol messages in order is a mistake. I only meant that having no mechanism for rejecting protocol messages (other than disconnecting) seems like an oversight. Similarly with the whole thing where you can't send an "error" to your channel peer without triggering them to force-close your channel.

@ZmnSCPxj
Collaborator

ZmnSCPxj commented Oct 6, 2021

Well, I suppose I now have to detail it exactly, and the rationale. You can try reading through BOLT#2 too.

Basically, I lied when I said your options are to respond to none of the updates, one of them, or both. In reality, multiple update_* messages are effectively batched into a single Poon-Dryja update (i.e. signature followed by revocation). And it is only the entire batch you can ignore or reject, because it is the signing and revocation which are "the real" events that commit everyone to a new state.

So your only real options, if the counterparty gave you update_add_htlc followed by update_fulfill_htlc, are:

  • Take the entire batch (which does not reduce the number of HTLCs in flight).
  • Ignore the entire batch (which keeps your current number of HTLCs in flight).

You ignore the batch by ignoring their commitment_signed and not giving back a revoke_and_ack, but in practice most implementations will stall waiting on your revoke_and_ack and that is no better than just disconnecting to reject the entire batch (and disconnecting is just plain cleaner and is less likely to trigger edge-case bugs).

The reason to batch is that some of the low-level networking protocols are much more amenable to having as few turnarounds as possible, meaning you would prefer to have one side do multiple message sends, then the other side respond. For example, the link might use a shared medium (Ethernet, WiFi), or the transport might have high latency but high bandwidth (Tor, long-distance cable). For my example it would look like:

You                         Me
 |                          |
 |<-----update_add_htlc-----|
 |<---update_fulfill_htlc---|
 |<----commitment_signed----|
 |                          |
 |------revoke_and_ack----->|
 |-----commitment_signed--->|
 |                          |
 |<-----revoke_and_ack------|
 |                          |

Now, what you want to do would be something like this:

You                         Me
 |                          |
 |<-----update_add_htlc-----|
 |                          |
 |-------accept_update----->|
 |                          |
 |<---update_fulfill_htlc---|
 |                          |
 |-------accept_update----->| 
 |                          |
 |<----commitment_signed----|
 |                          |
 |------revoke_and_ack----->|
 |-----commitment_signed--->|
 |                          |
 |<-----revoke_and_ack------|
 |                          |

Because of the need to turn around and accept each individual update_ message, the latency is worsened; there are now multiple turnarounds, which is bad for shared-media network transports and for transports with high latency. We could do some amount of batching negotiation, i.e. there could be an "okay, that is the batch of changes I want to add" which should be replied to with a "hmm, I only accept these particular changes" before signing and revoking, but that still adds one more turnaround.

And a good part of the slowness in forwarding is due precisely to these turnarounds or roundtrips. Already with the current design (which is already optimized as well as we can figure out, barring stuff like my weird Fast Forwards idea that you can read about on the ML) you get the 1.5 roundtrips, and that is repeated at each hop in a payment route. Even with an "I only accept these particular changes in that batch" you still add 1 more roundtrip, meaning 2.5 roundtrips per hop on the forwarding of a payment.

It gets worse when a failure happens at a later stage. Failures need to wait for the outgoing HTLC to be removed before we can consider asking the incoming HTLC to be removed, so the 1.5 roundtrips is again repeated at each hop going back. If we allow for filtering which messages we accept that rises to 2.5 roundtrips.

And forwarding payments and returning failures happen a lot more often than controlled shutdown of forwarding nodes does (because forwarding nodes want to be online as much as possible), so it makes sense to optimize the protocol for that, at the expense of controlled shutdown. The 1.5 roundtrips are the minimum safe possible if we want to remain robust against uncontrolled shutdown, too.
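
To put rough numbers on that, a small back-of-the-envelope calculation (the hop count and per-hop RTT below are made-up illustrative figures):

hops = 5
rtt_ms = 200  # assumed round-trip time per hop

current = 1.5 * rtt_ms * hops       # today's update/commit/revoke flow
with_filter = 2.5 * rtt_ms * hops   # if each update needed its own accept/reject

print(f"current protocol:   {current:.0f} ms end-to-end")      # 1500 ms
print(f"with update filter: {with_filter:.0f} ms end-to-end")  # 2500 ms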

So it seems to me that you want some kind of protocol message to say "okay, I am in graceful shutdown mode and will reject batches of updates that increase HTLCs". The counterparty might already have sent out some updates that increase the number of HTLCs before it gets that message, because of latency, so we need a code path to handle that somehow. Then the counterparty has to ensure that it does not make a batch of updates that increases the number of HTLCs --- and the easiest way to implement that, with fewer edge cases for bugs in implementation, is to simply stop all updates, which is not much different from, well, disconnecting.

@whitslack
Collaborator Author

What I'm suggesting is this:

You                         Me
 |                          |
 |<-----update_add_htlc-----|
 |<---update_fulfill_htlc---|
 |<----commitment_signed----|
 |                          |
 |------refuse_update------>|
 |------accept_update------>|
 |----commitment_signed---->|
 |                          |
 |<-----revoke_and_ack------|
 |<----commitment_signed----|
 |                          |
 |------revoke_and_ack----->|
 |                          |

Yes, it's an additional one-way trip but only in this exceptional case. The common case would still be 1.5 round trips.

@ZmnSCPxj
Collaborator

ZmnSCPxj commented Oct 7, 2021

Well, you would probably not want to send commitment_signed if you refuse_update in response to a commitment_signed. In general we do not apply updates from the counterparty to the counterparty's commitment unless they have been applied to our own commitment first (which is what the commitment_signed from me gives, but you are rejecting it).

Also note that since I already gave the commitment_signed before you send refuse_update, your commitment with all my proposed updates is potentially valid and can be confirmed onchain, and your assertion that you are "refusing" it is only an assertion by you --- you could be lying to me, claiming to refuse, but in reality fully intending to use my signature after I accept your refusal and have deleted db information on my side, which might screw me over later on. Thus, the proposed protocol may hide CVE-level bugs if not implemented correctly, and probably needs a lot more thinking about to ensure that does not happen.

So your refuse_update needs to also include a revocation of the commitment I signed, and since commitments have a monotonic counter (and implementations might rely on that counter for other uses), this may be problematic, since the next commitment might no longer be n+1 where n is the current commitment being held.

In the case of disconnection, we have a channel_reestablish protocol, where we re-exchange the last signatures we sent, so it is safer to disconnect than to try to build yet another commitment at n+2 (or n+3, if you refuse_update yet again) while n is valid but n+1 has been revoked.

In short, your proposal hides a fair amount of complexity in the update state machine. I would suggest splitting this issue into two parts:

@whitslack
Collaborator Author

Thank you for all the explanation. I was aware that you'd be committing to HTLCs that I'd subsequently be refusing, but my expectation is that I'd revoke that commitment after you send me a new one that includes only the HTLC that I accepted. That does imply that I'd be revoking two commitments at once, and you would have to assume that either of them could be used against you until I give you that revocation.

while n is valid but n+1 has been revoked.

Is that even possible? I thought the way SHA chains work is that a revocation of a particular state gives you the ability to derive revocations of all preceding states.

Anyway, yes, splitting this into two requests is clearly the way forward. The first half seems trivial to implement correctly, and it would go a long way toward reducing the adverse effects of taking down a popular routing node for maintenance.

@ZmnSCPxj
Collaborator

ZmnSCPxj commented Oct 7, 2021

while n is valid but n+1 has been revoked.

Is that even possible? I thought the way SHA chains work is that a revocation of a particular state gives you the ability to derive revocations of all preceding states.

Right right, that is a brain fart on my end, sorry.

Your proposal seems a workable one, but probably needs more eyes and brains on it. I would personally prefer an i_am_going_to_graceful_shutdown_soon and an acknowledging go_ahead_and_die (maybe not needed?), with a sort of "soft" request that the peer avoid update_add_htlc (i.e. the peer can respond with go_ahead_and_die but still keep sending update_add_htlc and we would just shrug and say "oh well", maybe with an UNUSUAL-level log, but a proper implementation would avoid sending update_add_htlc), as that probably causes fewer problems / changes to existing code, especially considering multiple implementations.

As a soft request, i_am_going_to_graceful_shutdown_soon might even be implemented as an odd message without needing feature-bit negotiation; we would just optimistically send it to all peers we have channels with and hope they get the message.
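
As a sketch of what that optimistic odd message could look like from a plugin (assumptions: the sendcustommsg RPC, called dev-sendcustommsg on older releases, the custommsg hook, a made-up odd type number 49153 that a real proposal would have to reserve, and a payload taken to be just the hex-encoded 2-byte type with no body):

#!/usr/bin/env python3
# Sketch only: optimistically send an odd-typed "graceful shutdown soon"
# message to every connected peer, and log it when a peer sends us one.
from pyln.client import Plugin

plugin = Plugin()
SHUTDOWN_SOON_TYPE = 49153  # odd => peers that don't understand it may ignore it

@plugin.method("announce_shutdown_soon")
def announce_shutdown_soon(plugin):
    """Ask every connected peer (softly) to stop sending update_add_htlc."""
    payload = SHUTDOWN_SOON_TYPE.to_bytes(2, "big").hex()  # type only, no body
    for peer in plugin.rpc.listpeers()["peers"]:
        if peer.get("connected"):
            # RPC name has varied across releases (dev-sendcustommsg / sendcustommsg).
            plugin.rpc.call("sendcustommsg", {"node_id": peer["id"], "msg": payload})
    return {"notified": True}

@plugin.hook("custommsg")
def on_custommsg(peer_id, payload, plugin, **kwargs):
    if int(payload[:4], 16) == SHUTDOWN_SOON_TYPE:
        plugin.log("peer {} says it will shut down gracefully soon".format(peer_id))
    return {"result": "continue"}

plugin.run()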

@whitslack
Collaborator Author

whitslack commented Oct 8, 2021

@ZmnSCPxj: Would the proposed Quiescence Protocol be of use here?

Hmm, maybe not, since that protocol, as proposed at least, prevents removing HTLCs as well as adding them. But maybe the proposal could be amended so "add" updates and "remove" updates are independently toggleable?

@ZmnSCPxj
Collaborator

ZmnSCPxj commented Oct 8, 2021

Hmm, maybe not, since that protocol, as proposed at least, prevents removing HTLCs as well as adding them. But maybe the proposal could be amended so "add" updates and "remove" updates are independently toggleable?

Correct, not directly useful.

The quiescence protocol currently does not specify how the quiescence period ends, and it is intended for changing the commitment protocol (i.e. switching from the current Poon-Dryja to a different variant, or adding an offchain "conversion transaction" to switch to e.g. Decker-Russell-Osuntokun, or maybe just allowing implementations on both sides to mutate their database for an "allow PTLCs" flag safely with as little chance of unexpected stuff happening). But maybe bits of its implementation can be used.

SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Oct 25, 2021
… in shutdown

Here important-plugin implies `important hook`.

Before this commit, when in shutdown:
- existing in-flight hooks were abandoned, cutting the hook chain and never calling hook_final_cb
- hooks were removed when their plugin died, even for an important-plugin, because `shutdown` overrules
- but hook events can still be called while waiting for plugins to self-terminate (up to 30s) and while
  subdaemons are still alive, and it then looks as if no plugin ever registered the hook.

After this commit, when in shutdown:
- existing in-flight hook (chains) are honoured and can finalize, same semantics as LD_STATE_RUNNING
- important-plugins are kept alive until after shutdown_subdaemons, so they don't miss hooks
- JSON-RPC commands are functional, but anything related to non-important plugins cannot be relied on

TODO:
- Run tests -> hangs forever on test_closing, so skip them

- Q. Does this open a can of worms or races when (normal) plugins with hooks die randomly?
  A. Yes; for example, an htlc_accepted call triggers the invoice_payment hook, but the plugin (fetchinvoice?) has already died

**
* CONCLUSION: If you want to give more control over shutdown, I think there could be
* a plugin `shutdown_clean.py` with RPC method `shutdown_clean`. When called, that
* plugin starts additional (important) plugin(s) that register relevant hooks and, for example, hold off
* new htlc's and wait for existing in-flight htlc's to resolve ... and finally call RPC `stop`.
*
* Note: --important-plugin only seems to work at start, not via `plugin start shutdown_clean.py`;
* maybe we can add that? Or do something with disable?
*
* Some parts of this commit are still good, i.e. the hook semantics of important plugins should be consistent
* until the very last potential hook call.
**

- What if an important-plugin dies unexpectedly and lightningd_exit() calls io_break()? Is that bad?
- What are the benefits? Add example where on shutdown inflight htlc's are resolved/cleared and
  new htlc's blocked, see ElementsProject#4842

- Split commit into hook-related stuff and others, for clarity of reasoning
- Q. How does this relate (hook-wise) to db_write plugins? A. Looks like this hook is treated like
  any other hook: when plugin dies, hook is removed, so to be safe backup needs to be `important`.
  Hook documentation does not mention `important-plugin` but BACKUP.md does.
  TODO: Tested this -> `plugin stop backup.py` -> "plugin-backup.py: Killing plugin: exited during normal operation"
  In fact, running current backup.py with current master misses a couple of writes in
  shutdown (because its hook is removed, see issue ElementsProject#4785).
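
For completeness, here is a minimal sketch of the `shutdown_clean.py` idea from the conclusion above (assumptions: the htlc_accepted hook, a polling loop run on a background thread so the plugin does not block its own event loop, and a hypothetical shutdown_clean RPC method; this is not the commit's actual code). Note the caveat discussed earlier in the thread: by the time htlc_accepted fires, the incoming HTLC is already irrevocably committed, so failing it here only bounces it back quickly rather than preventing the peer from adding it:

#!/usr/bin/env python3
# Sketch: bounce new incoming HTLCs with temporary_node_failure while
# waiting (up to a timeout) for existing ones to clear, then call `stop`.
import threading
import time
from pyln.client import Plugin

plugin = Plugin()
draining = threading.Event()

@plugin.hook("htlc_accepted")
def on_htlc_accepted(onion, htlc, plugin, **kwargs):
    if draining.is_set():
        # 0x2002 = temporary_node_failure
        return {"result": "fail", "failure_message": "2002"}
    return {"result": "continue"}

def wait_then_stop(timeout):
    deadline = time.time() + timeout
    while time.time() < deadline:
        peers = plugin.rpc.listpeers()["peers"]
        inflight = sum(len(c.get("htlcs", []))
                       for p in peers for c in p.get("channels", []))
        if inflight == 0:
            break
        time.sleep(5)
    plugin.rpc.stop()

@plugin.method("shutdown_clean")
def shutdown_clean(plugin, timeout=600):
    """Hold off new HTLCs, wait for in-flight ones (up to timeout), then stop."""
    draining.set()
    threading.Thread(target=wait_then_stop, args=(timeout,), daemon=True).start()
    return {"draining": True, "timeout": timeout}

plugin.run()
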
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Nov 15, 2021
…ess to db

    because:
    - shutdown_subdaemons can trigger db write, comments in that function say so at least
    - resurrecting the main event loop with subdaemons still running is counterproductive
      to shutting down activity (such as htlc's, hook_calls etc.)
    - custom behavior injected by plugins via hooks should be consistent, see test
      in previous commit

    IDEA:

    in shutdown_plugins, when starting new io_loop:

    - A plugin that is still running can return a jsonrpc_request response, which triggers
      response_cb; that cannot be handled because subdaemons are gone -> so any response_cb should be blocked/aborted

    - jsonrpc is still there, so users (such as plugins) can make new jsonrpc_request's which
      cannot be handled because subdaemons are gone -> so new rpc_request should also be blocked

    - But we do want to send/receive notifications and log messages (handled in jsonrpc as jsonrpc_notification)
      as these do not trigger subdaemon calls or db_write's.
      Log messages and notifications do not have an "id" field, whereas jsonrpc_requests *do* have an "id" field

    PLAN (hypothesis):
    - hack into plugin_read_json_one OR plugin_response_handle to filter out json with
      an "id" field; this should
      block/abandon any jsonrpc_request responses (and new jsonrpc_requests for plugins?)

  Q. Can internal (so not via plugin) jsonrpc_requests *break* over an io_loop cycle? And if yes, can
     the response of a request done in one io_loop, be returned in the next?

TODO:
- Investigate solving the hang-issue with rpc_command hook. If that can be fixed, then hooking "json_stop"
  can maybe act as an alternative for the "shutdown" notification for plugins to do their
  shutdown/cleanup things (datastore etc.) see also issue ElementsProject#4842
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Nov 17, 2021
…ess to db
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Nov 18, 2021
…ess to db