-
-
Notifications
You must be signed in to change notification settings - Fork 325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Graceful controller shutdown #573
Graceful controller shutdown #573
Conversation
- BREAKING: `controller::applier` now starts a graceful shutdown when the `queue` terminates | ||
- BREAKING: `scheduler` now shuts down immediately when `requests` terminates, rather than waiting for the pending reconciliations to drain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect these might have some impact on people's tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have approved because it's ultimately clean, and ready to use in the current form.
left some comments here and there, nothing block-worthy. but maybe there is a better default-path that we can take with letting users auto-install a sigterm handler.
examples/configmapgen_controller.rs
Outdated
Controller::new(cmgs, ListParams::default()) | ||
.owns(cms, ListParams::default()) | ||
.reconcile_all_on(reload_rx.map(|_| ())) | ||
.graceful_shutdown_on(graceful_shutdown_rx.map(|_| ())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while I think it's great that we are able to expose this, it's also a bit boilerplatey for the standard use case. if we are targetting a kubernetes deployed controller then the shutdown signal is always going to be SIGTERM
.
i think the the method here makes complete sense for configurability, but maybe we could also have a Controller::install_sigterm_handler()
method that sets this up the same thing under the hood using tokio::signal
inside the building step of the Controller
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about shutdown_on_sigint
or shutdown_on_ctrl_c
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i feel ctrl_c
and sigint
naming is strange if we are putting it inside kubernetes which sends SIGTERM
.
but it looks like we need to use SIGINT
locally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, in-cluster (or when running in systemd or similar) we want SIGTERM, when running from cargo we want SIGINT (or whatever the Windows equivalent is). We should be safe if we just treat the two as equivalent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, if we can get tokio::signal to listen on both, then that would be ideal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about Controller::manage_termination_signals()
and make it trigger on either SIGINT
or SIGTERM
?
Interestingly, actix-web has an opt-out for this scenario. That might also be somewhere to go to, but that's probably too early.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO installing signal handlers without asking for permission is a bit presumptuous if we aren't sure that we own the whole process. If we had a #[kube_runtime::main]
or similar I'd be for that doing it, but not in the current state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a helper now for managing the signal handling for you.
examples/configmapgen_controller.rs
Outdated
} | ||
}) | ||
.boxed(), | ||
forceful_shutdown.boxed(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
although you do seem to need a select on the controller future vs. the shutdown future. That feels a bit subtle. Do you not want the controller to complete its outstanding items?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the difference. The first ctrl+c
initiates the graceful shutdown (by resolving graceful_shutdown_rx
), the second means that we just want the process to die die die ASAP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to codify the latter into the app? Kubernetes will give us SIGTERM
, wait 30s, then send a SIGKILL
. We're don't really get to do anything after the SIGKILL
, so it might not be worth trying to handle it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When running in-cluster, you're correct (though that actually raises the problem that iirc the example only listens for SIGINT, not SIGTERM).
When running locally, well, it's annoying to have to switch tab and pkill
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, we don't have a way of propagating a forceful termination within the applier. got it. so for the example we need the full setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, we don't have a way of propagating a forceful termination within the applier. got it. so for the example we need the full setup.
Yeah, the whole point of the forceful termination would be to bypass everything and burn it to the ground.
wait, actually, do we need the
forceful_shutdown
future? the async block that creates it is doingprocess::exit
on the secondctrl-c
? wouldn't that just stop everything?
We need something that waits for waits for the second signal, but doesn't trigger for the first. That means that we need to keep them in the same "path", so to speak.
The std::process::exit
is also not strictly necessary, there's a difference between forceful and forceful. Essentially, you could say that we have six potential "levels" of grace that we could potentially implement:
- The Kronblom shutdown: when we initiate a shutdown, stop taking new scheduling requests, but let all currently scheduled reconciliations run and finish
- This is what
scheduler
currently implements inmaster
(before this PR), but due toapplier
's circular nature this isn't actually usable inapplier
anyway (since it doesn't have the cutoff that this PR implements) - Depending on whether we still allow retries to be scheduled, this may never terminate
- This is what
- The slightly overcooked shutdown: like the above, but only let currently pending reconciliations run (that is, they have already expired, but haven't started yet for whatever reason) while dropping reconciliations that are scheduled into the future
- The graceful shutdown: wait for all running reconciliations to finish, but do not start any new ones
- This is what this PR calls a "graceful" shutdown
- The forceful shutdown: abort all currently running reconciliations, but wait for them to cancel orderly (essentially: wait for them to hit the next
.await
) - The Brütal shutdown:
std::process::exit
- This is what the example calls a "forceful" shutdown
- The Spın̈al shutdown: you didn't need that computer anyway, did you?
From this list, this PR adds support for the graceful shutdown, while the forceful, Brütal, and Spın̈al shutdowns were already supported (kind of unavoidably :P) but undocumented. The example currently uses a Brütal shutdown (which was mostly a vestige from tokio::io::stdin
using an uncancellable background worker task), but could be downgraded to a forceful shutdown.
The Kronblom and overcooked shutdowns would (IMO) mostly be useful for testing runtime internals, and this PR replaces those cases with sleeps (which are collapsed by tokio's testing mode anyway).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hah. That's a solid classification 🤘
Yeah, that sounds sensible. I think the graceful and the brutal ones are likely the most useful ones for us (maybe forceful as well, and as you say, overcooked ones for testing).
Everything in the PR so far looks sensible to me. But I'm still a bit unsure about the main example here:
If we are currently in a brutal scenario, what good would does the last select!
in main
do? If you removed the process::exit
, and instead defer to the forceful_shutdown
which i assume is intended to trigger at the end of the async double-ctrl-c wait scope, then that's just immediately triggered in that last select!
instead, right? i don't think it would functionally cause any different behaviour to avoid process::exit
. Or am i misunderstanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think it would functionally cause any different behaviour to avoid process::exit
Controller
tries to abort all reconciliations when dropped, and #[tokio::main]
waits for all spawned tasks to finish before exiting the process after the main function returns.
This combines to give you a forceful shutdown, rather than brutal (according to the previous chart :P).
we are currently in a brutal scenario, what good would does the last select! in main do?
Regardless of whether we call std::process::exit
in it, something needs to poll it for it to actually do anything. And we can't just spawn
it, since that would keep the graceful shutdown waiting for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh. It's because futures does nothing unless polled. Bahhh. Sorry, I was being thick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't we all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
Looks great. Thanks so much. Great default path available and a custom path that's well documented. Awesome PR! Tiny nit: info messages from kube-runtime might not be super popular. I would personally downgrade those to debug. But feel free to merge at your leisure. |
…ntroller-shutdown
The new info messages are intended towards operators, and developers don't really have a good way to hook in there atm. That said, anyone who does want to silence them can just set their a kube-runtime-specific logging level. |
Ok. Let's leave it as is 👍 |
released in 0.58 :-) |
Fixes #552