Feature request: Graceful shutdown #4842
N.B.: @ajpwahqgbi has attempted to write a C-Lightning graceful shutdown plugin, but it doesn't work correctly, as it apparently blocks the daemon's event loop and prevents any HTLCs from clearing out at all, thus actually exacerbating the problem it's trying to remedy. Also, it lacks a timeout.
This may not be (perfectly) possible in terms of the BOLT protocol. In BOLT#2, there is an implicit ordering requirement. Suppose you are doing (or want to do, ha) a graceful shutdown, and I am your peer. There is an HTLC you already offered to me. Now, suppose I send the following messages to you, in the specified order:

- `update_add_htlc`, offering you a new HTLC;
- `update_fulfill_htlc`, removing the HTLC you already offered me.

You can either respond:

- to neither message;
- to only one of them; or
- to both.

However, note that the first message adds an HTLC, which is further from your goal of reducing the number of HTLCs in-flight before you finalize shutdown. You cannot respond to just the second message and remove an HTLC without also adding the HTLC from the first message. The first option (respond to neither message) is equivalent to just shutting down with a plain disconnection.

I think that, properly speaking, we may need a spec change, where a node tells its peers "no more `update_add_htlc`s, please".

However, we can (maybe?) offer a partial solution. We can have `lightningd` enter a mode where it stops offering new HTLCs of its own and simply drops the connection to any peer that tries to add one. In this mode, existing HTLCs can still be fulfilled or failed, but no new ones get instantiated. This at least lets us have some chance to reduce the number of HTLCs without gaining more HTLCs, without having to modify the spec (which would take longer).
@ZmnSCPxj: In your example, why not reply to the `update_fulfill_htlc` but not the `update_add_htlc`?
You cannot reply to just one of them; the protocol gives you no way to selectively acknowledge updates. So no, your only recourse is to drop the connection so that neither you nor the counterparty ever instantiate the new incoming HTLC. There is still the chance they reconnect within the grace period and then decide not to re-offer the HTLC. (In fact I believe we do this now, if we have a just-arrived HTLC that is to be forwarded, but before we actually have sent out the corresponding outgoing `update_add_htlc`.)
Damn. Who the hell designed this protocol? It has so many novice mistakes. Anyway, thanks for your explanations.
N00b LN devs 4->5 years ago (all of us were n00bs, LOL). Mostly the concern then was to ensure that we could always have a fallback safely on-disk before doing potentially funds-losing operations, which is why the protocol was designed that way; graceful shutdowns were not a consideration, but sudden unexpected I-tripped-over-the-power-cord-sorry shutdowns were, and the protocol was designed so that sudden disconnections are perfectly fine (which is why I suggest just disconnecting: the protocol is designed to survive that, and that code path is well-tested in all implementations).

It is much easier to implement and test your implementation if you always process messages in the order you receive them rather than supporting "I can respond to that message but ignore this other one", especially if you need to save stuff on disk and then re-establish later with the peer from your saved on-disk state. Graceful shutdowns were not a concern; ungraceful kick-the-cord shutdowns were.
For what it's worth, I don't mean that responding to protocol messages in order is a mistake. I only meant that having no mechanism for rejecting protocol messages (other than disconnecting) seems like an oversight. Similarly with the whole thing where you can't send an "error" to your channel peer without triggering them to force-close your channel.
Well, I suppose I now have to detail it exactly, and the rationale. You can try reading through BOLT#2 too.

Basically, I lied when I said your options are to respond to none of the updates, one of them, or both. In reality, multiple `update_` messages are batched together and only take effect when the sender follows them with a `commitment_signed` covering the whole batch. So your only real options, if the counterparty gave you such a batch, are:

- accept the whole batch; or
- ignore the whole batch.

You ignore the batch by ignoring their `commitment_signed` (in practice, by disconnecting and never sending a `revoke_and_ack` for it).

The reason to batch is that some of the low-level networking protocols are much more amenable to having as few turnarounds as possible. Meaning you would prefer to have one side do multiple message sends, then the other side. For example it might use a shared medium (Ethernet, WiFi), or the protocol might have high latency but high bandwidth (Tor, long-distance cable). For my example it would look like:

- them -> you: `update_add_htlc`
- them -> you: `update_fulfill_htlc`
- them -> you: `commitment_signed`
- you -> them: `revoke_and_ack`
- you -> them: `commitment_signed`
- them -> you: `revoke_and_ack`

Now, what you want to do would be something like this:

- them -> you: `update_add_htlc`
- them -> you: `update_fulfill_htlc`
- them -> you: `commitment_signed`
- you -> them: "I accept only the `update_fulfill_htlc`, not the `update_add_htlc`"
- them -> you: `commitment_signed` covering only the accepted update
- you -> them: `revoke_and_ack`
- you -> them: `commitment_signed`
- them -> you: `revoke_and_ack`

Because of the need to turn around and accept each individual update, this adds turnarounds. And a good part of the slowness in forwarding is due precisely to these turnarounds or roundtrips. Already with the current design (which is already optimized as well as we can figure out, barring stuff like my weird Fast Forwards idea that you can read about on the ML) you get the 1.5 roundtrips, and that is repeated at each hop in a payment route. Even with an "I only accept these particular changes in that batch" you still add 1 more roundtrip, meaning that is 2.5 roundtrips per hop, on the forwarding of a payment.

It gets worse when a failure happens at a later stage. Failures need to wait for the outgoing HTLC to be removed before we can consider asking the incoming HTLC to be removed, so the 1.5 roundtrips is again repeated at each hop going back. If we allow for filtering which messages we accept, that rises to 2.5 roundtrips. And forwarding payments and returning failures happen a lot more often than controlled shutdown of forwarding nodes does (because forwarding nodes want to be online as much as possible), so it makes sense to optimize the protocol for that, at the expense of controlled shutdown. The 1.5 roundtrips are the minimum safe possible if we want to remain robust against uncontrolled shutdown, too.

So it seems to me that you want some kind of protocol message to say "okay, I am in graceful shutdown mode and will reject batches of updates that increase HTLCs". The counterparty might already have sent out some updates that increase the number of HTLCs before it gets that message, because latency, so we need a code path to handle that somehow. Then the counterparty has to ensure that it does not make a batch of updates that increases the number of HTLCs --- and the easiest way to implement that, with fewer edge cases for bugs in implementation, is to simply stop all updates, which is not much different from, well, disconnecting.
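To put rough numbers on the turnaround argument above, here is a toy Python calculation. The half-trip counts per hop (3 today, 5 with selective acceptance of batched updates) come straight from the discussion; the one-way latency and hop count are made-up illustrative values, not measurements.

```python
# Toy estimate of end-to-end forwarding latency from per-hop turnarounds.
# ONE_WAY_LATENCY_S and HOPS are illustrative assumptions, not measured values.
ONE_WAY_LATENCY_S = 0.3   # e.g. a Tor-ish one-way delay
HOPS = 5

def forwarding_latency(half_trips_per_hop: int) -> float:
    """Total time spent on the commitment dance, done hop by hop."""
    return HOPS * half_trips_per_hop * ONE_WAY_LATENCY_S

print(f"current protocol (1.5 roundtrips/hop): {forwarding_latency(3):.1f} s")  # 4.5 s
print(f"with filtering   (2.5 roundtrips/hop): {forwarding_latency(5):.1f} s")  # 7.5 s
# A failure that must unwind back along the route roughly doubles both figures,
# since the same dance is repeated per hop on the way back.
```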
What I'm suggesting is this:

- You send me your batch of updates and the `commitment_signed` as usual.
- If I am shutting down, instead of a `revoke_and_ack` I reply with a message refusing the updates that add HTLCs.
- You then send a new `commitment_signed` that includes only the updates I accepted.
- I revoke the superseded commitments and we proceed as normal.

Yes, it's an additional one-way trip, but only in this exceptional case. The common case would still be 1.5 round trips.
Well, you would probably not want to do that in the general case. Also note that, since I already gave the `commitment_signed`, you already hold a valid commitment that includes the new HTLC, and I have to assume it could be used against me until you revoke it. So your refusal cannot simply pretend the HTLC never existed; I still need a revocation for that commitment eventually, which means extra states to track in the update state machine. In the case of disconnection, we have a `channel_reestablish` exchange at reconnection that lets both sides agree on which updates and commitments actually took effect, and that code path is already well-tested.

In short, your proposal hides a fair amount of complexity in the update state machine. I would suggest splitting this issue into two parts:

- the part that needs no spec change: on shutdown, stop offering new HTLCs ourselves (and drop the connection when a peer tries to add one), and wait up to a timeout for the remaining HTLCs to resolve; and
- the part that does need a spec change: a protocol message telling peers that we are in graceful shutdown and will not accept batches of updates that add HTLCs.
|
Thank you for all the explanation. I was aware that you'd be committing to HTLCs that I'd subsequently be refusing, but my expectation is that I'd revoke that commitment after you send me a new one that includes only the HTLC that I accepted. That does imply that I'd be revoking two commitments at once, and you would have to assume that either of them could be used against you until I give you that revocation.
Is that even possible? I thought the way SHA chains work is that a revocation of a particular state gives you the ability to derive revocations of all preceding states. Anyway, yes, splitting this into two requests is clearly the way forward. The first half seems trivial to implement correctly, and it would go a long way toward reducing the adverse effects of taking down a popular routing node for maintenance.
Right right, that is a brain fart on my end, sorry. Your proposal seems a workable one, but probably needs more eyes and brains on it; I would personally prefer a spec-level mechanism in the longer run. As a soft request, please do split this into the two issues discussed above.
@ZmnSCPxj: Would the proposed Quiescence Protocol be of use here? Hmm, maybe not, since that protocol, as proposed at least, prevents removing HTLCs as well as adding them. But maybe the proposal could be amended so "add" updates and "remove" updates are independently toggleable?
Correct, not directly useful. The quiescence protocol currently does not specify how the quiescence period ends, and it is intended for changing the commitment protocol (i.e. switching from the current Poon-Dryja to a different variant, or adding an offchain "conversion transaction" to switch to e.g. Decker-Russell-Osuntokun, or maybe just allowing implementations on both sides to mutate their database for an "allow PTLCs" flag safely with as little chance of unexpected stuff happening). But maybe bits of its implementation can be used.
… in shutdown

Here important-plugin implies `important hook`.

Before this commit, when in shutdown:
- existing in-flight hooks were abandoned, cutting the hook-chain and never calling hook_final_cb
- hooks were removed when their plugin died, even for an important-plugin, because `shutdown` overrules
- but hook events can be called while waiting for plugins to self-terminate (up to 30s) and subdaemons are still alive, and it looks as if no plugin ever registered the hook

After this commit, when in shutdown:
- existing in-flight hook (chains) are honoured and can finalize, same semantics as LD_STATE_RUNNING
- important-plugins are kept alive until after shutdown_subdaemons, so they don't miss hooks
- JSON RPC commands are functional, but anything unimportant-plugin related cannot be relied on

TODO:
- Run tests -> hangs forever on test_closing, so skip them
- Q. Does this open a can of worms or races when (normal) plugins with hooks die randomly?
  A. Yes, for example htlc_accepted calls trigger the invoice_payment hook, but the plugin (fetchinvoice?) already died

CONCLUSION: If you want to give more control over shutdown, I think there could be a plugin `shutdown_clean.py` with RPC method `shutdown_clean`. When called, that plugin starts additional (important) plugin(s) that register relevant hooks and, for example, hold off new htlc's and wait for existing in-flight htlc's to resolve ... and finally call RPC `stop`.

Note: --important-plugin only seems to work at start, not via `plugin start shutdown_clean.py`; maybe we can add that? Or do something with disable? Some parts of this commit are still good, i.e. hook semantics of important plugins should be consistent until the very last potential hook call.

- What if an important-plugin dies unexpectedly and lightningd_exit() calls io_break(), is that bad?
- What are the benefits? Add an example where on shutdown in-flight htlc's are resolved/cleared and new htlc's blocked, see ElementsProject#4842
- Split commit into hook-related stuff and others, for clarity of reasoning
- Q. How does this relate (hook-wise) to db_write plugins?
  A. Looks like this hook is treated like any other hook: when the plugin dies, the hook is removed, so to be safe backup needs to be `important`. Hook documentation does not mention `important-plugin` but BACKUP.md does.

TODO: Tested this -> `plugin stop backup.py` -> "plugin-backup.py: Killing plugin: exited during normal operation". In fact, running the current backup.py with current master misses a couple of writes in shutdown (because its hook is removed, see issue ElementsProject#4785).
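Below is a rough sketch (in Python, using pyln-client) of the `shutdown_clean.py` idea from the CONCLUSION above: refuse new HTLCs via the `htlc_accepted` hook, wait for in-flight HTLCs to drain, then call `stop`. The method name, timeout handling, and the `listpeers` output shape (per-channel `htlcs` lists) are assumptions based on the plugin API as it stood around this issue; this is untested and only meant to make the idea concrete.

```python
#!/usr/bin/env python3
"""shutdown_clean.py -- illustrative sketch only, not a tested plugin."""
import threading
import time

from pyln.client import Plugin

plugin = Plugin()
plugin.shutting_down = False


@plugin.hook("htlc_accepted")
def on_htlc_accepted(onion, htlc, plugin, **kwargs):
    # Once a clean shutdown has been requested, bounce new HTLCs back with
    # temporary_node_failure (0x2002) instead of forwarding them.
    if plugin.shutting_down:
        return {"result": "fail", "failure_message": "2002"}
    return {"result": "continue"}


def inflight_htlcs(plugin):
    # Count HTLCs still pending on any channel. Assumes the listpeers layout
    # with per-channel "htlcs" lists, as at the time of this issue.
    return sum(len(channel.get("htlcs", []))
               for peer in plugin.rpc.listpeers()["peers"]
               for channel in peer.get("channels", []))


def drain_then_stop(plugin, timeout):
    # Runs in a background thread so the plugin's main loop keeps answering
    # htlc_accepted hook calls, avoiding the event-loop blocking problem
    # mentioned at the top of this issue.
    deadline = time.time() + timeout
    while time.time() < deadline and inflight_htlcs(plugin) > 0:
        time.sleep(1)
    plugin.rpc.stop()


@plugin.method("shutdown_clean")
def shutdown_clean(plugin, timeout=60):
    """Refuse new HTLCs, wait up to `timeout` seconds for the rest, then stop."""
    plugin.shutting_down = True
    threading.Thread(target=drain_then_stop, args=(plugin, timeout),
                     daemon=True).start()
    return "draining HTLCs, will stop within {} seconds".format(timeout)


plugin.run()
```

The draining loop deliberately runs off the main plugin thread; doing the wait inside the RPC handler would stall the plugin's own hook processing, which is exactly the failure mode described for the earlier plugin attempt.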
…ess to db because:
- shutdown_subdaemons can trigger a db write, comments in that function say so at least
- resurrecting the main event loop with subdaemons still running is counterproductive in shutting down activity (such as htlc's, hook_calls etc.)
- custom behavior injected by plugins via hooks should be consistent, see test in previous commit

IDEA: in shutdown_plugins, when starting a new io_loop:
- A plugin that is still running can return a jsonrpc_request response; this triggers response_cb, which cannot be handled because subdaemons are gone -> so any response_cb should be blocked/aborted
- jsonrpc is still there, so users (such as plugins) can make new jsonrpc_request's, which cannot be handled because subdaemons are gone -> so new rpc_requests should also be blocked
- But we do want to send/receive notifications and log messages (handled in jsonrpc as jsonrpc_notification), as these do not trigger subdaemon calls or db_write's

Log messages and notifications do not have an "id" field, whereas jsonrpc_requests *do* have an "id" field.

PLAN (hypothesis):
- hack into plugin_read_json_one OR plugin_response_handle to filter out json with an "id" field; this should block/abandon any jsonrpc_request responses (and new jsonrpc_requests from plugins?)

Q. Can internal (so not via plugin) jsonrpc_requests *break* over an io_loop cycle? And if yes, can the response of a request done in one io_loop be returned in the next?

TODO:
- Investigate solving the hang-issue with the rpc_command hook. If that can be fixed, then hooking "json_stop" can maybe act as an alternative to the "shutdown" notification for plugins to do their shutdown/cleanup things (datastore etc.), see also issue ElementsProject#4842
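A minimal sketch of the filtering idea in the PLAN above, written in Python rather than the C of `plugin_read_json_one`/`plugin_response_handle`: JSON-RPC requests and responses both carry an `id` field while notifications and log messages do not, so dropping anything with an `id` during shutdown blocks request/response traffic but still lets notifications and log messages through. The function name is illustrative, not lightningd's.

```python
import json

def should_process_during_shutdown(raw: str) -> bool:
    """Return True if this JSON-RPC message may be handled while shutting down.

    Requests and responses both carry an "id" field; notifications and log
    messages do not, so only the latter are let through.
    """
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return "id" not in msg

# Example: a log notification passes, a request/response does not.
assert should_process_during_shutdown('{"jsonrpc":"2.0","method":"log","params":{}}')
assert not should_process_during_shutdown('{"jsonrpc":"2.0","id":7,"result":{}}')
```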
@ZmnSCPxj explained:
It is certainly true that daemon shutdown cannot be deferred until all HTLCs are cleared, but a significant improvement in Lightning UX could be achieved by implementing a graceful shutdown that would block the addition of new HTLCs and wait for up to a specified timeout for all HTLCs to clear.
Feature Request

- Add a `timeout` parameter to the `stop` RPC to specify a graceful shutdown timeout.
- When `stop` begins executing, begin refusing all requests to add HTLCs to channels (both from peers and from local commands).

Implementing this feature would reduce the occurrence of slow payment attempts for users of the Lightning Network. Rebooting a C-Lightning server can take several minutes, during which time any users with in-flight HTLCs must wait. This is bad UX and is not helping Lightning adoption. We can't easily fix HTLCs that go out to lunch and never return, but we can avoid dropping fresh HTLCs on the floor while we go out to lunch.
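For concreteness, this is how the requested interface might be driven from a script, assuming the proposed `timeout` parameter were added to `stop`; the parameter does not exist today, and the socket path is illustrative.

```python
# Hypothetical usage of the requested feature: `stop` with a graceful-shutdown
# timeout. The `timeout` parameter is the one proposed above, not an existing one.
from pyln.client import LightningRpc

rpc = LightningRpc("/home/user/.lightning/bitcoin/lightning-rpc")  # illustrative path

# Proposed behaviour: refuse to add new HTLCs immediately, wait up to 600 seconds
# for in-flight HTLCs to clear, then shut the daemon down.
rpc.call("stop", {"timeout": 600})
```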