Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support double signal handling #285

Merged

Conversation

kosyfrances
Copy link
Contributor

Fixes #220

case <-ctx.Done():
level.Info(logger).Log("msg", "context channel closed, exiting")
}
cancelFunc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that after calling cancelFunc, we should do <- term again followed by os.Exit(1) or something. So the second ctrl-c will quit the operator immediately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, would be nice to figure out with calling cancelFunc twice won't give us problems. First call is after getting SIGTERM (or SIGINT) and the second call is deferred.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the first call, subsequent calls to a cancelFunc will do nothing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, see docs.

@kosyfrances kosyfrances force-pushed the kosy/double-signal-handling branch 2 times, most recently from 9ae26eb to 46ed0b2 Compare May 25, 2018 09:54
Copy link
Contributor

@iaguis iaguis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@krnowak krnowak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been reviewing this PR and I realized that for graceful shutdown of the operator using just cancelFunc() on SIGTERM is not enough. If we have main function like this:

func main() { 
	ctx, cancelFunc := context.WithCancel(context.Background())
	defer cancelFunc()

	if err := v1beta1(ctx, …); err != nil {
		…
		os.Exit(1)
	}

	if err := v1beta2(ctx, …); err != nil {
		…
		os.Exit(1)
	}

	term := make(chan os.Signal)
	// Relay these signals to the `term` channel.
	signal.Notify(term, syscall.SIGINT, syscall.SIGTERM)

	go func() {
		<-term
		level.Info(logger).Log("msg", "received SIGTERM, exiting gracefully...")
		cancelFunc()

		<-term
		os.Exit(1)
	}()

	<-ctx.Done()
	level.Info(logger).Log("msg", "context channel closed, exiting")

	os.Exit(0)
}

Then pressing ctrl-c will call cancelFunc(), which basically closes the ctx.Done() channel. Thus the main function will happily proceed to call os.Exit(0) clobbering all the goroutines that didn't have a chance to react to the closing of the context channel.

I think we also should use sync.WaitGroup. Like so, maybe:

func main() { 
	ctx, cancelFunc := context.WithCancel(context.Background())
	// we don't do defer cancelFunc() anymore
	var wg sync.WaitGroup

	wg.Add(2) // 2, for each controller we run.
	if err := v1beta1(ctx, &wg, …); err != nil {
		…
		os.Exit(1)
	}

	if err := v1beta2(ctx, &wg, …); err != nil {
		…
		os.Exit(1)
	}

	term := make(chan os.Signal)
	// Relay these signals to the `term` channel.
	signal.Notify(term, syscall.SIGINT, syscall.SIGTERM)

	go func() {
		<-term
		level.Info(logger).Log("msg", "received SIGTERM, exiting gracefully...")
		cancelFunc()

		<-term
		os.Exit(1)
	}()

	wg.Wait()
	level.Info(logger).Log("msg", "controllers stopped, exiting")

	os.Exit(0)
}

Each controller would then need to call wg.Done() when it is done. I suppose that by being "done" I should mean that all the goroutines the controller spawns reacted to the closing of the context channel, by calling controller_specific_wg.Done(), where controller_specific_wg is yet another WaitGroup, this time created by the controller.

So in short - use context.Context to tell goroutines to wrap up their work, and sync.WaitGroup to learn if they did so, so we can gracefully quit.

@@ -112,12 +112,17 @@ func run() int {
// Relay these signals to the `term` channel.
signal.Notify(term, syscall.SIGINT, syscall.SIGTERM)

select {
case <-term:
go func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make term a buffered channel of size 2, since we know how many signals it will receive.

select {
case <-term:
go func() {
<-term
level.Info(logger).Log("msg", "received SIGTERM, exiting gracefully...")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also be SIGINT, so I'd remove mention of the specific signal and just say: "received termination signal, exiting gracefully..."

}()

<-ctx.Done()
level.Info(logger).Log("msg", "context channel closed, exiting")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message is now redundant when exiting because of a signal. Are there cases when ctx gets closed and it's not because of signals?

I don't think so, so this message can probably go.

@asymmetric
Copy link
Contributor

Great writeup @krnowak!

@kosyfrances kosyfrances force-pushed the kosy/double-signal-handling branch 3 times, most recently from bc07d5a to aee6526 Compare May 31, 2018 13:35
@kosyfrances
Copy link
Contributor Author

@krnowak It's ready for another review

Copy link
Contributor

@krnowak krnowak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not enough, unfortunately. What I wanted to achieve is that we wait for all the goroutines we spawn to finish, before we quit the operator. With the current changes we only wait for the 2 goroutines to finish, the go controller.Run(…) ones.

You can find the goroutines we spawn with git grep 'go ' -- cmd pkg/controller (note the space after go).

So, basically if you have a function that takes a context as a parameter, spawns a bunch of goroutines and that waits until context channel is closed, like this:

func foo(ctx context.Context, …) {
	go fun1(ctx);
	go fun2(ctx);
	for i := 0; i < X; i++) {
		go worker(ctx);
	}

	<- ctx.Done()
}

Then the change you could do is something like:

func foo(ctx context.Context, …) {
	wg := sync.WaitGroup{}

	wg.Add(2 + X)
	go func() {
		fun1(ctx);
		wg.Done()
	}()
	go func() {
		fun2(ctx);
		wg.Done()
	}()
	for i := 0; i < X; i++) {
		go func() {
			worker(ctx);
			wg.Done()
		}()
	}

	// I'm not even sure if this line is needed anymore,
	// because we wait for the waitgroup anyway
	<- ctx.Done()
	wg.Wait()
}

If the foo function itself is being run in the goroutine, then you could convert the go call into the same go func(){…}() like I did for fun1 inside foo.

The v1beta1 function in main.go doesn't have that <-ctx.Done() thing at its end, but you could transform the go call just the same. That way you don't need to modify the prototype of the Run() functions for example. But passing wg to v1beta1 or v1beta2 functions still makes sense, though.

The v1beta2 function is a bit more tricky, because it spawns 3 goroutines (2 for some factories and 1 for the controller). I think I would just something like:

factoriesWg := sync.WaitGroup
factoriesWg.Add(2)
go func() {
	someFactoryFunctionOrSomething1(ctx)
	factoriesWg.Done()
}()
go func() {
	someFactoryFunctionOrSomething2(ctx)
	factoriesWg.Done()
}()
go func() {
	controller.Run(…)
	factoriesWg.Wait()
	wg.Done()
}()

Hopefully it makes it clear what I meant by some controller_specific_wg in my previous writeup.

@krnowak
Copy link
Contributor

krnowak commented Jun 1, 2018

One thing I missed. The defer hc.queue.Shutdown() will likely need to stop being a defer and should be put between <-ctx.Done() and wg.Wait(). Otherwise it will probably deadlock…

@krnowak
Copy link
Contributor

krnowak commented Jun 1, 2018

Another thing - likely all the wg.Done() calls should be deferred, but wg.Wait() calls - not.

@kosyfrances kosyfrances changed the title Support double signal handling WIP: Support double signal handling Jun 4, 2018
@kosyfrances kosyfrances force-pushed the kosy/double-signal-handling branch 2 times, most recently from 033efaf to 3abe3a3 Compare June 4, 2018 15:42
@kosyfrances kosyfrances changed the title WIP: Support double signal handling Support double signal handling Jun 4, 2018
@kosyfrances
Copy link
Contributor Author

@krnowak Another one :)

defer cancelFunc()

var wg sync.WaitGroup
wg.Add(4)
Copy link
Contributor

@krnowak krnowak Jun 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer this number to remain 2. This is because we pass the waitgroup twice. I understand that you made it 4, because v1beta2 function spawns not one goroutine, but three, so you needed to call Wait four times, not two. Still, I would prefer to keep the number of spawned goroutines in the function as an implementation detail. That way it can be easier to reason about the code, like "we create a waitgroup with a counter set to 2 and pass it to two functions, so if I add a third function (like v1beta3), I need to bump this number to 3".

Right now the counter is 4, but we pass it to only two functions, so it may left the reader wondering why is that (do both functions decrement the counter twice?) and to be able to answer that, the reader needs to check both v1beta1 and v1beta2 functions to see what they do. Another problem - you need to remember to bump this number whenever you add another goroutine in either function, which is easy to forget, because the counter is not really in the vicinity of a body of either of the functions.

So, for the v1beta2 function, I posted a possible solution at #285 (review) (see the last snippet there, with the factoriesWg thing).

@@ -144,12 +156,15 @@ func v1beta1(ctx context.Context, cSets Clientsets, logger log.Logger) error {
return err
}

go controller.Run(runtime.NumCPU(), ctx)
go func() {
go controller.Run(runtime.NumCPU(), ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you are at it, could please swap the order of parameters here? context.Context should be the first parameter of a function. I suppose this is some kind of an expectation from static analysis tools or something.

Others could say that passing ctx as a first parameter is idiomatic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and drop the go here. We don't want to spawn a goroutine inside the goroutine.


go controller.Run(runtime.NumCPU(), ctx)
go func() {
go controller.Run(runtime.NumCPU(), ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here about swapping the order of parameters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop go too.

@@ -232,7 +253,16 @@ func (hc *HabitatController) watchPods(ctx context.Context) {
DeleteFunc: hc.handlePodDelete,
})

go c.Run(ctx.Done())
var wg sync.WaitGroup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this looks wrong. It looks like an elaborate way of saying c.Run(ctx.Done()). ;)

What I think needs to be done there is that the watchPods function should receive waitgroup as a parameter, instead of creating it and it should not call wg.Wait() on it. Because right now, watchPods function will block, while it was meant to be a nonblocking function that spawns a goroutine.

go hc.deployInformer.Run(ctx.Done())
go hc.cmInformer.Run(ctx.Done())
var wg sync.WaitGroup
wg.Add(3 + workers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will need to be bumped to 4, if we start passing it to hc.watchPods too.

wg.Add(3 + workers)

go func() {
go hc.habInformer.Run(ctx.Done())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop go.

}()

go func() {
go hc.stsInformer.Run(ctx.Done())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop go.

}()

go func() {
go hc.cmInformer.Run(ctx.Done())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop go.

@@ -176,12 +188,21 @@ func (hc *HabitatController) Run(workers int, ctx context.Context) error {
// failed job, it will be restarted after a delay of 1 second.
for i := 0; i < workers; i++ {
level.Debug(hc.logger).Log("msg", "Starting worker", "id", i)
go wait.Until(hc.worker, time.Second, ctx.Done())
go func() {
go wait.Until(hc.worker, time.Second, ctx.Done())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop go.

wg.Add(1)

go func() {
go c.Run(ctx.Done())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop go.

@kosyfrances kosyfrances changed the title Support double signal handling WIP: Support double signal handling Jun 5, 2018
Signed-off-by: Kosy Anyanwu <kosy@kinvolk.io>
@kosyfrances
Copy link
Contributor Author

@krnowak Another one 💯

@krnowak
Copy link
Contributor

krnowak commented Jun 6, 2018

LFAD.

@krnowak krnowak changed the title WIP: Support double signal handling Support double signal handling Jun 6, 2018
@krnowak krnowak merged commit e2a1a49 into habitat-sh:master Jun 6, 2018
@krnowak krnowak deleted the kosy/double-signal-handling branch June 6, 2018 07:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants