Improve graceful shutdown. #1937
|
[approve ci] |
proxy/Main.cc
Outdated
static volatile bool sigusr1_received = false;
static volatile bool sigusr2_received = false;
static volatile bool signal_received[32];
Even though we don't use RT signals, wouldn't it be better to use SIGRTMAX defined in http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/signal.h.html ?
Yeah, it would be nice to have a constant there. NSIG is close, but I think that it is a Linux-ism. SIGRTMAX is usually 64, and we really only need to deal with the regular signals here, so it seemed like overkill :)
I'd have to agree with @danobi - it's just a static array, probably worth the extra space to have a portable symbolic constant.
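For illustration only, here is a minimal sketch of sizing the flag array from a symbolic constant instead of a literal 32. It is not the code from this PR: `kSignalSlots`, `note_signal` and `consume_signal` are hypothetical names, and the fallback value is an assumption. One caveat: on glibc, as far as I know, SIGRTMAX expands to a function call, so it cannot size a static array directly. NSIG with a fallback is used here instead.

```cpp
#include <csignal>
#include <cstddef>

// Assumption: NSIG is available from <signal.h> on Linux, the BSDs and macOS,
// but is not guaranteed everywhere, so fall back to a conservative value.
#if defined(NSIG)
constexpr std::size_t kSignalSlots = NSIG;
#else
constexpr std::size_t kSignalSlots = 65; // hypothetical fallback
#endif

// One flag per signal number; sig_atomic_t is the type the standard
// guarantees can be written safely from a signal handler.
static volatile std::sig_atomic_t signal_received[kSignalSlots];

// Handler body: async-signal-safe because it only stores a flag.
extern "C" void note_signal(int signo)
{
  if (signo > 0 && static_cast<std::size_t>(signo) < kSignalSlots) {
    signal_received[signo] = 1;
  }
}

// Called from normal (non-handler) context: returns true once per
// recorded signal and clears the flag.
bool consume_signal(int signo)
{
  if (signo <= 0 || static_cast<std::size_t>(signo) >= kSignalSlots) {
    return false;
  }
  if (!signal_received[signo]) {
    return false;
  }
  signal_received[signo] = 0;
  return true;
}
```

The bounds check in both functions means an unexpected signal number is ignored rather than written past the end of the array.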
|
It seems nice, autest failed though. I'll take a look at what's happening.
to schedule (simple) shutdown? (signal of graceful shutdown is still the flag, right?) |
maskit
left a comment
It works fine to me.
I approve this to check whether Jenkins prevents an accidental merge. (Don't merge if he doesn't.)
|
Jenkins did his job :) |
|
We should figure out the autest problem before merging. |
|
@jpeach Well, I just wanted to try Jenkins. |
|
So the error that So this is a shutdown race or this patch perturbed the shutdown sequence in an unexpected way. |
maskit
left a comment
I confirmed the issue too.
An easy way to figure out the race would be introducing another flag like prepare_shutdown and checking it instead of shutdown_event_system in ProcessManager. It's not a perfect way but it should work as before.
I'm not a big fan of adding a new flag but I think we will need a flag which will be used instead of http2_drain.
|
I tried |
|
Thanks! @jpeach Finally we moved all the |
return;
} /* End ProcessManager::reconfigure */
if (RecGetRecordInt("proxy.config.process_manager.timeout", &timeout) != REC_ERR_OKAY) {
  // Default to 5sec if the timeout is unspecified.
tiny inconsistency between the comment and the code here.
It seems correct to me. Has this been fixed already?
|
@maskit The actual problem with the test failures is that simply scanning for the string There's some cleanup to do around how |
|
@jpeach Does that mean there's no bad |
|
If |
|
Right, it seems we are on the same page. The global flag would be just a small hint, but not a perfect way. |
|
I don't know why but I still see I also see |
|
The lowercase |
|
Got it, .. but it seems ATS shuts down immediately right after receiving SIGTERM. |
|
@maskit I hoisted the |
|
@maskit Can you please take another look? |
|
It seems something is wrong. With this change, I don't see logs like below and see this error message. |
|
Is this going to try to make sure we flush the buffers? I know we are having issues with buffers not flushing, and with code coverage tests failing because we have to kill ATS to shut it down, which prevents code coverage data from being written correctly to disk. |
|
@dragon512 this will help address this, depending on how you kill but it is not at all clear to me how |
|
@zwoop is ICC busted? |
|
By default, autest will try a Ctrl-C and then a TERM signal after a delay. I think I have the delay set to zero for ATS, because previously ATS would never shut down when the Ctrl-C was passed. Not sure what was going on here. |
|
On May 30, 2017, at 10:12 AM, dragon512 ***@***.***> wrote:
> By default, autest will try a Ctrl-C and then a TERM signal after a delay. I think I have the delay set to zero for ATS, because previously ATS would never shut down when the Ctrl-C was passed. Not sure what was going on here.
The signal Ctrl-C sends depends on the terminal setup. What signal is `autest` actually sending and which process is it sending it to? If it is sending SIGINT, then that has never caused a shutdown. I don't know if documentation is the right solution, but somehow we need to make the `autest` behaviour and expectations more obvious.
|
|
I'm making a new drop of AuTest today. I'm going to add some --verbose messages to help clarify what is going on. As for AuTest, it basically does `os.killpg(pgid, signal.SIGINT)`, where `pgid` is a process group set up for the process we started. What do you suggest I should use instead of SIGINT? |
Clean up the ProcessManager to remove use of mgmt_log. Refactor to improve comment and code legibility. Move the code to tear down the message queue from the destructor into the stop() function, and capture the poll thread so that we can join it and stop the manager relatively gracefully.
Simplify ProcessManager message handling to make it clearer when we are stopping and when we are dealing with a process manager error pumping messages to traffic_manager. We do this by hoisting the message reading code into a helper function that just deals with reading the data, and propagating the error up to the manager thread. At that point, we can more easily know whether we are shutting down or not.
If we are trying to stop the ProcessManager, the poll thread might be sleeping, so we now send it SIGINT to break the sleep. This causes it to notice that it should no longer be running earlier.
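As a rough illustration of the idea in the commit message above (not the ProcessManager code itself), the sketch below starts a thread that sleeps in a loop, then clears its run flag and uses `pthread_kill` to send it SIGINT so that `nanosleep` returns early with EINTR. All names (`poll_running`, `poll_loop`, `run_poll_thread_demo`) are hypothetical, and the 2 second sleep and 100 ms pause are arbitrary values chosen for the demo.

```cpp
#include <atomic>
#include <cerrno>
#include <pthread.h>
#include <signal.h>
#include <time.h>

static std::atomic<bool> poll_running{true};
static std::atomic<bool> sleep_interrupted{false};

// Empty handler: its only job is to make blocking calls fail with EINTR
// instead of the default SIGINT action terminating the process.
extern "C" void wake_handler(int) {}

// Stand-in for a poll thread that sleeps between polls.
extern "C" void *poll_loop(void *)
{
  while (poll_running.load()) {
    struct timespec nap = {2, 0}; // bounded sleep, see note below
    if (nanosleep(&nap, nullptr) == -1 && errno == EINTR) {
      sleep_interrupted.store(true); // woken early by the signal
    }
  }
  return nullptr;
}

// Starts the thread, asks it to stop, wakes it, and joins it.
// Returns true if the thread was joined successfully.
bool run_poll_thread_demo()
{
  struct sigaction sa = {};
  sa.sa_handler = wake_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  if (sigaction(SIGINT, &sa, nullptr) != 0) {
    return false;
  }

  pthread_t tid;
  if (pthread_create(&tid, nullptr, poll_loop, nullptr) != 0) {
    return false;
  }

  // Give the thread a moment to reach its sleep (demo only).
  struct timespec pause = {0, 100 * 1000 * 1000};
  nanosleep(&pause, nullptr);

  poll_running.store(false);  // 1. tell the loop to stop
  pthread_kill(tid, SIGINT);  // 2. break the sleep so it notices now
  return pthread_join(tid, nullptr) == 0;
}
```

The signal only shortens the wait. If it lands between the flag check and the call to `nanosleep`, the thread still sleeps the full interval, which is why the sleep stays bounded in this sketch.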
Rather than sleeping in a signal handler, unify the signal handling so that we always bounce the actual work up to the SignalContinuation. Once we are there, then we can send a timed event to shut down after the timeout, or just cause a normal shutdown immediately. On the exit path, stop the ProcessManager to keep the integration tests happy.
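The pattern described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the ATS implementation: `ShutdownPoller` is a hypothetical stand-in for the SignalContinuation, `tick` stands in for the event loop calling it periodically, and the timeout is passed in directly rather than read from configuration.

```cpp
#include <chrono>
#include <csignal>

using Clock = std::chrono::steady_clock;

// Written by the handler, read from the event loop.
static volatile std::sig_atomic_t pending_signal = 0;

// The handler does no real work: it records the signal and returns.
extern "C" void on_signal(int signo)
{
  pending_signal = signo;
}

enum class LoopState { Running, Draining, Stopped };

// Hypothetical stand-in for a continuation polled by the event loop.
struct ShutdownPoller {
  explicit ShutdownPoller(Clock::duration timeout) : drain_timeout(timeout) {}

  // Called periodically from normal context, never from the handler.
  LoopState tick(Clock::time_point now)
  {
    if (state == LoopState::Running && pending_signal == SIGTERM) {
      pending_signal = 0;
      if (drain_timeout > Clock::duration::zero()) {
        // Timed shutdown: keep serving until the deadline passes.
        state    = LoopState::Draining;
        deadline = now + drain_timeout;
      } else {
        // No timeout configured: normal shutdown immediately.
        state = LoopState::Stopped;
      }
    }
    if (state == LoopState::Draining && now >= deadline) {
      state = LoopState::Stopped;
    }
    return state;
  }

  Clock::duration drain_timeout;
  LoopState state = LoopState::Running;
  Clock::time_point deadline{};
};
```

Because the handler never sleeps or exits, the process leaves through its normal exit path, where buffers and coverage data can be flushed as usual.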
|
@dragon512 send |
This reverts commit 1f54cf0. We no longer need to explicitly flush coverage data, because we don't _exit(2) from signal handler any more.
maskit
left a comment
It looks good to me.
I'm ok with this, but please consider using AutoStopCont instead of Alert().
if (mh->msg_id == MGMT_EVENT_SHUTDOWN) {
  mgmt_fatal(0, "[ProcessManager::processEventQueue] Shutdown msg received, exiting\n");
} /* Exit on shutdown */
Alert("exiting on shutdown message");
What do you think about scheduling AutoStop in mgmt_restart_shutdown_callback instead of shutting down immediately with Alert() here? I think using Alert() for a normal exit is still too much. Handling a shutdown event the same way as SIGTERM makes the normal shutdown process more consistent.
|
Reverting the gcov flush sounds fine to me as long as the coverage data gets pushed out during the testing process. |
|
Changed milestone to 8.0.0, because 7.1.0 doesn't have this. |
Get out of the signal handler as fast as possible, and use the event loop to signal graceful shutdown.