✨ Implement graceful shutdown #967
Conversation
(force-pushed from 01d7292 to 551da56)
wooo! compound German words for things that are weird/hard/take many words to express in English! Sorry for the delay in reviewing this, gonna take a peek today/tomorrow
(force-pushed from 551da56 to b3556ca)
Last changes look good to me
Comments inline. The most important one is probably around goroutine leaks caused by multiple errors occurring simultaneously.
pkg/manager/internal.go (outdated):

	case <-allStopped:
		return nil
	case <-gracefulShutdownTimer.C:
		return fmt.Errorf("not all runnables have stopped within the grace period of %s", cm.gracefulShutdownTimeout.String())
make this a typed error (see above note about detecting deadline-exceeded errors)
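A sketch of what such a typed (detectable) error could look like; the names here are illustrative, not the ones the PR settled on:

```go
package manager

import (
	"errors"
	"fmt"
	"time"
)

// errGracefulShutdownTimeout is an illustrative sentinel error, so callers can
// detect the timeout with errors.Is instead of matching the message text.
var errGracefulShutdownTimeout = errors.New("not all runnables have stopped within the grace period")

func gracefulShutdownTimeoutError(grace time.Duration) error {
	// Wrap the sentinel so the configured duration still appears in the message.
	return fmt.Errorf("%w of %s", errGracefulShutdownTimeout, grace)
}
```

Callers would then check errors.Is(err, errGracefulShutdownTimeout) rather than comparing strings.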
(also, no need for the call to .String() if the thing is already stringable)
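Background for the %s remark (my illustration, not part of the PR): time.Duration implements fmt.Stringer, so the fmt package calls String() for you:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	timeout := 30 * time.Second
	// %s calls the value's String() method via the fmt.Stringer interface,
	// so the explicit .String() is redundant.
	fmt.Printf("grace period of %s\n", timeout) // grace period of 30s
}
```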
Done
pkg/manager/internal.go (outdated):

	// else that keeps the binary running might allow controllers that
	// need leader election to run after we lost the lease.
	log.Info("leader lease lost, exiting")
	os.Exit(0)
raise a panic in case we've got defers and other stuff folks want to finish? Nice for cleanup of temporary files and such, esp when testing
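The concern, illustrated with a minimal example of mine (not code from the PR): os.Exit terminates the process immediately and skips deferred functions, while a panic runs them during stack unwinding.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Deferred cleanup (temp files, etc.) never runs if we os.Exit below.
	defer fmt.Println("cleanup: removing temp files")

	useExit := true
	if useExit {
		os.Exit(0) // exits immediately; the deferred Println is skipped
	}
	panic("leader lease lost") // by contrast, defers run while the stack unwinds
}
```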
The problem is that entering this is part of a normal shutdown of a Manager that has leader election enabled, so panicking here would mean that all controller-managers that have LE enabled will panic on shutdown. For tests, the whole func can be overridden.
I am not really sure what is best to do here. The os.Exit is not nice, and we also always exit with the same code, because there are different scenarios this has to cover but which we can not distinguish:
- We lost the leader lock unexpectedly
- We just ended normally without any error
- We had an error during graceful shutdown (we will just swallow that, because it's returned from the manager, but by the time that happens we will already have exited the binary here)
Maybe push the problem down by documenting "You must exit the binary immediately after the manager ended when leader election is enabled" (and removing the os.Exit())?
Alternatively, provide a way for external users to configure this, and default it to os.Exit? The immediate Exit is just annoying when dealing with stuff like defers.
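One shape the "make it configurable" idea could take (a sketch with hypothetical names, not the API this PR adds): keep os.Exit as the default, but let callers, and especially tests, swap in their own handler.

```go
package manager

import "os"

// onLostLease is a hypothetical hook invoked when the leader lease is lost.
// Defaulting to os.Exit(1) keeps the current behaviour; tests or callers that
// rely on defers running could override it with something less drastic.
var onLostLease = func() { os.Exit(1) }
```

The call site shown in the snippet above would then call onLostLease() instead of os.Exit(0).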
Okay, so after thinking about this some more, I still think what we do today is incorrect and unsafe, but it's not trivial to improve, so I would like to have a dedicated discussion and PR for that.
I have updated this PR to disable gracefulShutdown when leader election is lost.
(force-pushed from b3556ca to 7889b3e)
(force-pushed from 0953054 to 4a802fb)
@DirectXMan12 @michaelgugino @vincepri updated, PTAL
// from the errChan and waits for them to end. It must not be called more than once.
func (cm *controllerManager) engageStopProcedure(stopComplete chan struct{}) error {
	var cancel context.CancelFunc
	if cm.gracefulShutdownTimeout > 0 {
This probably should be >= 0 to be in line with the comments.
This doesn't really matter; we return if it's zero and then close the context via the defer cancel().
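To make the branch under discussion concrete, here is a rough sketch of the pattern (my simplification, not the exact code in the PR): a positive timeout bounds the wait via a context deadline, while a zero value skips graceful waiting entirely.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitWithGrace illustrates the pattern under discussion: a positive timeout
// bounds the wait for runnables to stop; zero means "don't wait gracefully".
func waitWithGrace(done <-chan struct{}, grace time.Duration) error {
	if grace <= 0 {
		return nil // nothing to wait for; shut down immediately
	}
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("not all runnables have stopped within the grace period of %s", grace)
	}
}

func main() {
	done := make(chan struct{})
	go func() { time.Sleep(10 * time.Millisecond); close(done) }()
	fmt.Println(waitWithGrace(done, time.Second)) // <nil>
}
```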
I checked out this patch locally, and I can confirm startNonLeaderElectionRunnables now waits for webhooks as it should with these changes, using the default grace period.
However, if I set the grace period to a short amount (one second or less is what I have tested), my tests hang indefinitely.
https://gist.github.com/michaelgugino/69757f35817a6ce7899874ccc8b24ba1
Also, I'm not sure I want an option that stops waiting (GracePeriod); I want an option that will wait for X seconds and then force-terminate the child processes. Ignoring them is fruitless for a parent process that does not terminate (such as the unit tests I'm working with). I want to hard-close the webhook processes; I don't care whether they finish or not.
Here's the PR where I'm attempting to utilize this. I've tried without my patch, with my patch, and now with this patch of controller-runtime.
https://github.com/openshift/machine-api-operator/pull/617/files
Anyway, even if we decide to use GracePeriod as we're doing in this patch, we should be more clear about goroutines left running in the background, and I think we should wait indefinitely by default (least surprise: I expect the manager to block until it's actually finished unless I specify otherwise). In any case, we need to investigate why these short waiting periods lock us up, as it's having the exact opposite of the desired effect.
I tried to reproduce this here: alvaroaleman/machine-api-operator@b8ba348. However, the tests pass. Furthermore, in the stacktrace I cannot find any reference to controller-runtime. Can you provide a repro for this?
Yup, I agree with that sentiment, but to my knowledge it's not possible to terminate goroutines.
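For readers following along: Go offers no way to kill a goroutine from the outside; the only mechanism is cooperative cancellation that the goroutine itself has to check, e.g. via a context (illustrative snippet, not code from this PR).

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	go func() {
		for {
			select {
			case <-ctx.Done():
				fmt.Println("runnable observed cancellation and returned")
				return
			default:
				time.Sleep(10 * time.Millisecond) // pretend to do work
			}
		}
	}()

	cancel() // ask the goroutine to stop; it must cooperate
	time.Sleep(50 * time.Millisecond)
	// A goroutine that never checks ctx.Done() can only be abandoned (leaked),
	// which is exactly what the grace-period timeout makes visible.
}
```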
How would you make this more clear? Extend the godocs?
I disagree with this, because the result in case of a
@alvaroaleman here's the reproducer: https://github.com/openshift/machine-api-operator/pull/620/files
Make sure you have kubebuilder installed, and you can run the tests with
So, what I think is the essence of the problem here is that the kube-apiserver is holding connections open to the webhook goroutine for one reason or another. There aren't really any fundamental differences between my patch and yours, other than that I'm running additional webhook handlers.
I was thinking there must be some way to signal the Go http.Server to stop immediately via a context or some method, or possibly some configurable option on http.Server or one of the related components.
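There is indeed such a knob in net/http: Server.Shutdown(ctx) waits for in-flight connections but gives up once the context expires, and Server.Close() force-closes listeners and connections. A hedged sketch of the "graceful, then forceful" combination (just the standard-library mechanism, not necessarily how the webhook server wires it up in this PR):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

func stopServer(srv *http.Server, grace time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()

	// Shutdown stops accepting new connections and waits for in-flight
	// requests, but returns with an error once the context deadline is hit.
	if err := srv.Shutdown(ctx); err != nil {
		// Force-close remaining listeners and connections.
		log.Printf("graceful shutdown failed (%v), force-closing", err)
		_ = srv.Close()
	}
}

func main() {
	srv := &http.Server{Addr: ":0", Handler: http.NotFoundHandler()}
	go func() { _ = srv.ListenAndServe() }()
	time.Sleep(100 * time.Millisecond)
	stopServer(srv, time.Second)
}
```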
Yeah, in the documentation, be explicit that the goroutines may be (will be, if we hit the timeout?) left running in the background. It's kind of not really what I think most would expect from the name of the option. Maybe something like 'orphanGoRoutinesAfter' or something like that to indicate that we're doing something bad. Personally, I would prefer to block forever than leak goroutines. GracePeriod implies that we're going to wait for a connection to finish/clean up, but if that time expires, we close the connection immediately. I'll see if I can create another reproducer to try to eliminate my changes specifically.
@michaelgugino those are just bugs in your tests, you are waiting for channels that are never closed when the
That might be possible for the http server but is not possible for other
Rename it to
Generally it is expected that the binary will end if the
You're right. We weren't hitting this before as the manager never produced an error, and we never expected it to, but this new patch set returns an error.

I'm kind of on the fence as to what the right thing is. On the one hand, the manager never really returned an error after the stop channel was closed, and the manager ended quickly, pretty much instantly. For webhooks, this is probably okay, as we probably just want to tear them down as quickly as possible; nobody is going to be too upset if a request goes unfulfilled. On the other hand, the existing behavior is definitely a bug: we should wait for the webhooks to stop.

Let's consider the case where we wait indefinitely by default. The manager will block forever, and that might be unexpected and hard to diagnose. This might impact things that are otherwise working today, and a simple rebase onto this patch set is not going to signal any change to the end user, as we're not breaking the interface. However, it would be pretty straightforward to see that the manager is blocking, though it might be hard to determine what it's waiting on. The flip side of this coin is that if anyone is eagerly restarting a manager without terminating a container (e.g., they have the manager in some while loop, and they're catching a signal or something to break that loop), we're going to leak goroutines. I can't think of a great reason to do this in reality, but some users might be doing that. IMO, leaking goroutines is definitely worse than blocking forever, as it's going to happen silently and eventually your container is probably going to crash.

My preference is to block indefinitely by default, make the timeout opt-in, and specify in some reasonable way that the timeout is going to orphan goroutines, and that those goroutines could potentially stick around forever unless the process is stopped. Personally, since I'm aware of this patch, I can live with either scenario. I think we should just consider the impact on other users that might not be aware of the change. I'm not super firm on this position, but I do want to highlight it for consideration. Otherwise, this patch works well for me.
For the record, leaking goroutines is what we already do today and not something this PR introduces; this PR just makes it visible. In my opinion, if a runnable doesn't stop within 30 seconds, it's unlikely to stop later on. We cannot kill goroutines, and just hanging is guaranteed to result in a lot of bug reports and questions, since it looks like a bug in c-r. Returning an error, OTOH, gives users a clue as to how to debug problems we cannot solve for them (getting all goroutines to stop).
Co-authored-by: david.benque <david.benque@datadoghq.com>
(force-pushed from 4a802fb to 53289b8)
/approve
Gave another review today, the code changes look good and this should be ready to go.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alvaroaleman, vincepri. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/lgtm
/retest
/test pull-controller-runtime-test-master
This PR implements graceful shutdown.
It is a rebased version of #664 plus a set of "Verschlimmbesserungen" (German for "improvements" that actually make things worse):

- close(cm.internalStopper), but the cache kept it while waiting for sync, so if we shut down before sync completed, this is a deadlock
- Stop the Runnables as well so they are part of the graceful shutdown: https://github.com/kubernetes-sigs/controller-runtime/pull/664/files#r355783939
- Rename runnableTearDownTimeout to gracefulShutdownTimeout; IMHO the latter makes the idea clearer, as it's not using terms that are bound to concepts that only exist within c-r
- exit 1 on leader election loss. This is important, because otherwise controllers might continue to run after the leader lease was given up. Maybe we should only cancel the leader election context after graceful shutdown was finished?

Supersedes and thereby closes #664
Fixes #764
/assign @DirectXMan12 @alexeldeib
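For reference, a hedged usage sketch of what opting into the timeout looks like from a user's side (assuming the GracefulShutdownTimeout field in manager.Options as it eventually landed; the exact surface may differ between controller-runtime versions):

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	timeout := 30 * time.Second
	mgr, err := manager.New(ctrl.GetConfigOrDie(), manager.Options{
		// Bound how long the manager waits for runnables (webhooks,
		// controllers, caches) to stop before giving up with an error.
		GracefulShutdownTimeout: &timeout,
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```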