
Graceful controller shutdown #573

Merged (4 commits) on Jul 2, 2021

Conversation

@nightkr (Member) commented Jun 30, 2021

Fixes #552

@nightkr nightkr added the runtime controller runtime related label Jun 30, 2021
@nightkr nightkr requested review from clux and kazk June 30, 2021 00:06
Comment on lines +8 to +9
- BREAKING: `controller::applier` now starts a graceful shutdown when the `queue` terminates
- BREAKING: `scheduler` now shuts down immediately when `requests` terminates, rather than waiting for the pending reconciliations to drain
Member Author:

I suspect these might have some impact on people's tests

@clux (Member) left a comment:

Have approved because it's ultimately clean and ready to use in its current form.

Left some comments here and there, nothing block-worthy. But maybe there is a better default path we can take by letting users auto-install a SIGTERM handler.

```rust
Controller::new(cmgs, ListParams::default())
    .owns(cms, ListParams::default())
    .reconcile_all_on(reload_rx.map(|_| ()))
    .graceful_shutdown_on(graceful_shutdown_rx.map(|_| ()))
```
Member:

While I think it's great that we are able to expose this, it's also a bit boilerplatey for the standard use case. If we are targeting a Kubernetes-deployed controller, then the shutdown signal is always going to be SIGTERM.

I think the method here makes complete sense for configurability, but maybe we could also have a Controller::install_sigterm_handler() method that sets up the same thing under the hood using tokio::signal inside the Controller's building step?

Member Author:

How about shutdown_on_sigint or shutdown_on_ctrl_c?

Member:

I feel the ctrl_c and sigint naming is strange if we are putting this inside Kubernetes, which sends SIGTERM. But it looks like we need to use SIGINT locally?

Member Author:

Yeah, in-cluster (or when running under systemd or similar) we want SIGTERM; when running from cargo we want SIGINT (or whatever the Windows equivalent is). We should be safe if we just treat the two as equivalent.
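Treating the two signals as equivalent can be wired up with tokio::signal. This is a sketch only: it assumes the tokio crate with its "signal" and "macros" features enabled, and the `shutdown_signal` helper name is ours, not a kube-runtime API.

```rust
/// Resolves when either SIGINT (Ctrl-C) or SIGTERM arrives, so the same
/// future works both for `cargo run` locally and for in-cluster pods.
async fn shutdown_signal() {
    #[cfg(unix)]
    {
        use tokio::signal::unix::{signal, SignalKind};
        let mut sigterm =
            signal(SignalKind::terminate()).expect("failed to install SIGTERM handler");
        tokio::select! {
            _ = tokio::signal::ctrl_c() => {} // SIGINT, e.g. Ctrl-C from `cargo run`
            _ = sigterm.recv() => {}          // SIGTERM, as Kubernetes sends on pod shutdown
        }
    }
    #[cfg(not(unix))]
    {
        // Windows has no SIGTERM; Ctrl-C is the only signal we can wait on.
        let _ = tokio::signal::ctrl_c().await;
    }
}
```

The resulting future could then be fed to something like the `graceful_shutdown_on` hook shown earlier.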

Member:

Yeah, if we can get tokio::signal to listen on both, then that would be ideal.

Member:

How about Controller::manage_termination_signals(), making it trigger on either SIGINT or SIGTERM?

Interestingly, actix-web has an opt-out for this scenario. That might also be a direction to take eventually, but it's probably too early.

Member Author (@nightkr), Jul 1, 2021:

IMO installing signal handlers without asking for permission is a bit presumptuous if we aren't sure that we own the whole process. If we had a #[kube_runtime::main] or similar I'd be for that doing it, but not in the current state.

Member Author:

Added a helper now for managing the signal handling for you.

examples/configmapgen_controller.rs (outdated, resolved)
```rust
        }
    })
    .boxed(),
    forceful_shutdown.boxed(),
```
Member:

Although you do seem to need a select on the controller future vs. the shutdown future. That feels a bit subtle. Do you not want the controller to complete its outstanding items?

Member Author:

That's the difference. The first ctrl+c initiates the graceful shutdown (by resolving graceful_shutdown_rx), the second means that we just want the process to die die die ASAP.
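The two-stage behaviour can be modeled as a tiny state machine. This is a toy illustration only, not kube-runtime API; `Mode` and `handle_signal` are names we made up:

```rust
/// Toy model of the example's Ctrl-C handling: the first signal requests
/// a graceful shutdown, any further signal means "die ASAP".
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Mode {
    Running,
    Graceful,
    Forceful,
}

fn handle_signal(mode: Mode) -> Mode {
    match mode {
        // First Ctrl-C: stop taking new work, let in-flight reconciliations finish.
        Mode::Running => Mode::Graceful,
        // Second (or later) Ctrl-C: the real example calls std::process::exit here.
        _ => Mode::Forceful,
    }
}

fn main() {
    let mut mode = Mode::Running;
    mode = handle_signal(mode);
    assert_eq!(mode, Mode::Graceful);
    mode = handle_signal(mode);
    assert_eq!(mode, Mode::Forceful);
    println!("final mode: {:?}", mode);
}
```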

Member:

Do we need to codify the latter into the app? Kubernetes will give us SIGTERM, wait 30s, then send a SIGKILL. We don't really get to do anything after the SIGKILL, so it might not be worth trying to handle it.

Member Author:

When running in-cluster, you're correct (though that actually raises the problem that iirc the example only listens for SIGINT, not SIGTERM).

When running locally, well, it's annoying to have to switch tab and pkill.

Member:

Right, we don't have a way of propagating a forceful termination within the applier. Got it. So for the example we need the full setup.

Member Author:

> Right, we don't have a way of propagating a forceful termination within the applier. Got it. So for the example we need the full setup.

Yeah, the whole point of the forceful termination would be to bypass everything and burn it to the ground.

> Wait, actually, do we need the forceful_shutdown future? The async block that creates it is doing process::exit on the second Ctrl-C? Wouldn't that just stop everything?

We need something that waits for the second signal, but doesn't trigger for the first. That means that we need to keep them in the same "path", so to speak.

The std::process::exit is also not strictly necessary; there's a difference between one kind of "forceful" and another. Essentially, you could say that there are six "levels" of grace that we could potentially implement:

  1. The Kronblom shutdown: when we initiate a shutdown, stop taking new scheduling requests, but let all currently scheduled reconciliations run and finish
    • This is what scheduler currently implements in master (before this PR), but due to applier's circular nature this isn't actually usable in applier anyway (since it doesn't have the cutoff that this PR implements)
    • Depending on whether we still allow retries to be scheduled, this may never terminate
  2. The slightly overcooked shutdown: like the above, but only let currently pending reconciliations run (that is, they have already expired, but haven't started yet for whatever reason) while dropping reconciliations that are scheduled into the future
  3. The graceful shutdown: wait for all running reconciliations to finish, but do not start any new ones
    • This is what this PR calls a "graceful" shutdown
  4. The forceful shutdown: abort all currently running reconciliations, but wait for them to cancel orderly (essentially: wait for them to hit the next .await)
  5. The Brütal shutdown: std::process::exit
    • This is what the example calls a "forceful" shutdown
  6. The Spın̈al shutdown: you didn't need that computer anyway, did you?

From this list, this PR adds support for the graceful shutdown, while the forceful, Brütal, and Spın̈al shutdowns were already supported (kind of unavoidably :P) but undocumented. The example currently uses a Brütal shutdown (which was mostly a vestige from tokio::io::stdin using an uncancellable background worker task), but could be downgraded to a forceful shutdown.

The Kronblom and overcooked shutdowns would (IMO) mostly be useful for testing runtime internals, and this PR replaces those cases with sleeps (which are collapsed by tokio's testing mode anyway).
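For illustration, the taxonomy above could be captured as an ordered enum. This is purely illustrative; none of these names exist in kube-runtime:

```rust
/// The six "levels of grace" from the comment above, ordered from most
/// patient to most destructive.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum ShutdownLevel {
    Kronblom,           // finish everything scheduled, even future retries
    SlightlyOvercooked, // run only already-pending reconciliations
    Graceful,           // finish running reconciliations, start no new ones (this PR)
    Forceful,           // cancel running reconciliations at the next .await
    Bruetal,            // std::process::exit
    Spinal,             // you didn't need that computer anyway
}

/// Which levels the runtime can express after this PR: Graceful is newly
/// wired into applier; the harsher levels were always (unavoidably) possible.
fn supported(level: ShutdownLevel) -> bool {
    level >= ShutdownLevel::Graceful
}

fn main() {
    assert!(ShutdownLevel::Kronblom < ShutdownLevel::Spinal);
    assert!(supported(ShutdownLevel::Graceful));
    assert!(!supported(ShutdownLevel::SlightlyOvercooked));
    println!("graceful supported: {}", supported(ShutdownLevel::Graceful));
}
```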

Member:

Hah. That's a solid classification 🤘

Yeah, that sounds sensible. I think the graceful and the brutal ones are likely the most useful ones for us (maybe forceful as well, and as you say, overcooked ones for testing).

Everything in the PR so far looks sensible to me. But I'm still a bit unsure about the main example here:

If we are currently in a brutal scenario, what good does the last select! in main do? If you removed the process::exit and instead deferred to the forceful_shutdown, which I assume is intended to trigger at the end of the async double-Ctrl-C wait scope, then that's just immediately triggered in that last select! instead, right? I don't think it would functionally cause any different behaviour to avoid process::exit. Or am I misunderstanding?

Member Author:

> I don't think it would functionally cause any different behaviour to avoid process::exit

Controller tries to abort all reconciliations when dropped, and #[tokio::main] waits for all spawned tasks to finish before exiting the process after the main function returns.

This combines to give you a forceful shutdown, rather than brutal (according to the previous chart :P).

> If we are currently in a brutal scenario, what good does the last select! in main do?

Regardless of whether we call std::process::exit in it, something needs to poll it for it to actually do anything. And we can't just spawn it, since that would keep the graceful shutdown waiting for it.
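The point that a future's side effects only happen when something polls it can be demonstrated with std alone, by polling a hand-rolled future with a no-op waker (illustrative sketch; all names here are ours):

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A future whose side effect (setting a flag) happens only when polled.
struct SideEffect {
    fired: Rc<Cell<bool>>,
}

impl Future for SideEffect {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        self.fired.set(true);
        Poll::Ready(())
    }
}

// Minimal no-op waker so we can poll by hand without an executor.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

/// Returns (flag before polling, flag after polling).
fn run_lazily() -> (bool, bool) {
    let fired = Rc::new(Cell::new(false));
    let fut = SideEffect { fired: Rc::clone(&fired) };
    let before = fired.get(); // constructing the future did nothing
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    let _ = fut.as_mut().poll(&mut cx); // the side effect happens only here
    (before, fired.get())
}

fn main() {
    let (before, after) = run_lazily();
    assert_eq!((before, after), (false, true));
    println!("before poll: {before}, after poll: {after}");
}
```

The same laziness is why the example's forceful_shutdown future must live inside a polled select! rather than being spawned off to the side.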

Member:

Ahhh. It's because futures do nothing unless polled. Bahhh. Sorry, I was being thick.

Member Author:

Aren't we all?

Resolved review threads on:
kube-runtime/src/controller/mod.rs (one thread outdated)
kube-runtime/src/scheduler.rs (two threads)
kube-runtime/src/utils.rs
kube-runtime/src/watcher.rs (outdated)
@kazk (Member) left a comment:

Looks great!

@nightkr nightkr requested a review from clux July 1, 2021 17:08
@clux (Member) commented Jul 1, 2021

Looks great. Thanks so much. Great default path available and a custom path that's well documented. Awesome PR!

Tiny nit: info messages from kube-runtime might not be super popular. I would personally downgrade those to debug. But feel free to merge at your leisure.

@nightkr (Member Author) commented Jul 2, 2021

The new info messages are intended for operators, and developers don't really have a good way to hook in there atm. That said, anyone who does want to silence them can just set a kube-runtime-specific logging level.

@clux (Member) commented Jul 2, 2021

Ok. Let's leave it as is 👍

@nightkr nightkr merged commit c84110f into kube-rs:master Jul 2, 2021
@nightkr nightkr deleted the feature/graceful-controller-shutdown branch July 2, 2021 03:47
@clux (Member) commented Jul 5, 2021

released in 0.58 :-)
