Conversation

@jpeach
Contributor

@jpeach jpeach commented May 18, 2017

Get out of the signal handler as fast as possible, and use the event loop to signal graceful shutdown.
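For context, a minimal sketch of the pattern this change moves toward: the handler only records that a signal arrived, and the regular event loop does the real work later. This is illustrative only; the function and array names here are hypothetical, not the actual proxy/Main.cc code.

    #include <signal.h>
    #include <string.h>

    /* Hypothetical names, not the PR code: the handler only sets a flag. */
    static volatile sig_atomic_t pending_signal[32];

    static void
    record_signal(int signo)
    {
      if (signo >= 0 && signo < 32) {
        pending_signal[signo] = 1; /* async-signal-safe: remember and return */
      }
    }

    static void
    install_handler(int signo)
    {
      struct sigaction sa;
      memset(&sa, 0, sizeof(sa));
      sa.sa_handler = record_signal;
      sigemptyset(&sa.sa_mask);
      sigaction(signo, &sa, nullptr);
    }

    /* Called from the event loop (in ATS terms, a continuation), where it is
       safe to log, allocate, and start a graceful shutdown. */
    static void
    drain_pending_signals()
    {
      for (int signo = 0; signo < 32; ++signo) {
        if (pending_signal[signo]) {
          pending_signal[signo] = 0;
          /* e.g. SIGTERM/SIGINT -> begin shutdown, SIGUSR1 -> dump stats */
        }
      }
    }

The point is simply that everything interesting happens outside the handler, where the usual thread-safety and logging rules apply.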

@jpeach jpeach requested review from maskit and zwoop May 18, 2017 05:32
@jpeach jpeach added this to the 8.0.0 milestone May 18, 2017
@jpeach
Contributor Author

jpeach commented May 18, 2017

[approve ci]

proxy/Main.cc Outdated

static volatile bool sigusr1_received = false;
static volatile bool sigusr2_received = false;
static volatile bool signal_received[32];
Member

Even though we don't use RT signals, wouldn't it be better to use SIGRTMAX defined in http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/signal.h.html ?

Contributor Author

Yeah, it would be nice to have a constant there. NSIG is close, but I think it is a Linux-ism. SIGRTMAX is usually 64, and we really only need to deal with the regular signals here, so it seemed like overkill :)

Member

I'd have to agree with @danobi - it's just a static array, probably worth the extra space to have a portable symbolic constant.
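As a sketch of the sizing question (assuming NSIG is available on the target platforms; the fallback value here is arbitrary), the array could be declared like this:

    #include <signal.h>

    #ifndef NSIG
    #define NSIG 32 /* conservative fallback where NSIG is not defined */
    #endif

    /* Covers all the classic signals; realtime signals are never handled here. */
    static volatile sig_atomic_t signal_received[NSIG];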

@maskit
Member

maskit commented May 18, 2017

It seems nice; autest failed though. I'll take a look at what's happening.

use the event loop to signal graceful shutdown.

to schedule (simple) shutdown? (The signal for graceful shutdown is still the flag, right?)

Member

@maskit maskit left a comment

It works fine for me.

I approve this to check whether Jenkins prevents an accidental merge. (Don't merge if he doesn't.)

@maskit
Member

maskit commented May 18, 2017

Jenkins did his job :)

@jpeach
Contributor Author

jpeach commented May 18, 2017

We should figure out the autest problem before merging.

@maskit
Member

maskit commented May 18, 2017

@jpeach Well, I just wanted to try Jenkins.

@jpeach
Contributor Author

jpeach commented May 19, 2017

So the error that autest finds from the process manager in mgmt/ProcessManager.cc is:

      // handle EOF
      if (res == 0) {
        close_socket(local_manager_sockfd);
        if (!shutdown_event_system) {
          mgmt_fatal(0, "[ProcessManager::pollLMConnection] Lost Manager EOF!");
        }
      }

So either this is a shutdown race, or this patch perturbed the shutdown sequence in an unexpected way.

Member

@maskit maskit left a comment

I confirmed the issue too.

An easy way to work around the race would be to introduce another flag, such as prepare_shutdown, and check it instead of shutdown_event_system in ProcessManager. It's not a perfect solution, but it should work as before.

I'm not a big fan of adding a new flag, but I think we will need one to use in place of http2_drain.

@maskit
Member

maskit commented May 19, 2017

I tried the prepare_shutdown flag approach based on this. I had to change some of the conditions below, and I'm not sure about the side effects, but at least the log output is clean.

if (pmgmt->require_lm && !prepare_shutdown) {
  pmgmt->pollLMConnection();

if (require_lm && !prepare_shutdown && mgmt_write_pipe(local_manager_sockfd, (char *)mh, sizeof(MgmtMessageHdr) + mh->data_len) <= 0) {

for (count = 0; count < max_msgs_in_a_row && !prepare_shutdown; ++count) {

@zizhong
Member

zizhong commented May 23, 2017

Thanks, @jpeach! Finally we moved all the sleep()s out of the signal handler.

return;
} /* End ProcessManager::reconfigure */
if (RecGetRecordInt("proxy.config.process_manager.timeout", &timeout) != REC_ERR_OKAY) {
// Default to 5sec if the timeout is unspecified.
Member

Tiny inconsistency between the comment and the code here.

Member

It seems correct to me. Has this been fixed already?

@jpeach
Contributor Author

jpeach commented May 23, 2017

@maskit The actual problem with the test failures is that simply scanning for the string Fatal does not always detect the error. When traffic_manager goes down, it is expected that traffic_server will read an EOF and also go down, which has the side effect of printing an error with the string Fatal in it. I made most of these cases level Alert, which is basically equivalent.

There's some cleanup to do around how Diags.cc calls ink_fatal_va() to really make this consistent. I think that's best left to another PR.

@maskit
Member

maskit commented May 23, 2017

@jpeach Does that mean there are no bad EOF cases? I agree that the Fatal during graceful shutdown was a false alert, but I'm not sure whether we can just change it to Alert. This is the reason I added prepare_shutdown, and I guess @dragon512 added the check for shutdown_event_system for the same reason.
a85a5a6

@jpeach
Contributor Author

jpeach commented May 23, 2017

If traffic_manager crashes, it is a problem. If traffic_manager exits on SIGTERM before traffic_server gets to it, it's not a problem. As you can see, there's no way to tell the difference. I don't think that adding more globals to add nuance to the shutdown really helps :)

@maskit
Member

maskit commented May 23, 2017

Right, it seems we are on the same page. The global flag would be just a small hint, not a perfect solution.
I'm going to approve this after some testing.

@maskit
Member

maskit commented May 23, 2017

I don't know why, but I still see Fatal: exiting with EOF from process manager in traffic.out with the latest patch; Jenkins looks happy though.

I also see [May 24 07:51:44.651] Server {0x2b706d4a4700} ALERT: exiting with EOF from process manager in diags.log

@jpeach
Contributor Author

jpeach commented May 23, 2017

The lowercase Fatal comes from ink_fatal_va and is hard-coded. We should think about how to clean that up separately.

@maskit
Member

maskit commented May 23, 2017

Got it... but it seems ATS shuts down immediately after receiving SIGTERM.

@jpeach
Contributor Author

jpeach commented May 24, 2017

@maskit I hoisted the stop() a little earlier. Lemme know what you think.

@jpeach
Contributor Author

jpeach commented May 30, 2017

@maskit Can you please take another look?

@maskit
Member

maskit commented May 30, 2017

It seems something is wrong. With this change, I don't see logs like the ones below:

[May 30 14:52:48.132] Server {0x2b856deff700} DEBUG: <ProcessManager.cc:248 (pollLMConnection)> (pmgmt) [ProcessManager::pollLMConnection] Message: '10009'
[May 30 14:52:49.133] Server {0x2b856deff700} DEBUG: <ProcessManager.cc:271 (pollLMConnection)> (pmgmt) [ProcessManager::pollLMConnection] enqueued 1 of max 10000 messages in a row
[May 30 14:52:49.133] Server {0x2b856deff700} DEBUG: <ProcessManager.cc:140 (processEventQueue)> (pmgmt) [ProcessManager] ==> Processing event id '10009' payload=272
[May 30 14:52:49.133] Server {0x2b856deff700} DEBUG: <ProcessManager.cc:163 (processSignalQueue)> (pmgmt) [ProcessManager] ==> Signalling local manager '16'
[May 30 14:52:49.133] Server {0x2b856deff700} DEBUG: <ProcessManager.cc:163 (processSignalQueue)> (pmgmt) [ProcessManager] ==> Signalling local manager '16'

and I see this error message:

$ sudo /opt/trafficserver/bin/traffic_ctl server stop
traffic_ctl: server stop failed: [14] Generic Fail message (ie. CoreAPI call).

@dragon512
Contributor

Is this going to try to make sure we flush the buffers? I know we are having issues with buffers not flushing, and with code coverage tests failing because we have to kill ATS to shut it down, which prevents code coverage data from being written correctly to disk.

@jpeach
Contributor Author

jpeach commented May 30, 2017

@dragon512 this will help address that, depending on how you kill the process, but it is not at all clear to me how autest manages the processes. It is possible to make code coverage work (see the ci/coverage script), and this can help.

@jpeach
Contributor Author

jpeach commented May 30, 2017

@zwoop is ICC busted?

@dragon512
Contributor

autest by default will try a Ctrl-C, then a TERM signal after a delay. I think I have the delay set to zero for ATS, because before this ATS would never shut down based on the Ctrl-C being passed... not sure what was going on there.

@jpeach
Contributor Author

jpeach commented May 30, 2017 via email

@dragon512
Contributor

I'm making a new drop of AuTest today... I'm going to add some --verbose messages to help clarify what is going on.

As far as AuTest goes, it basically does:

os.killpg(pgid, signal.SIGINT)   # pgid is a process group set up for the process we started
time.sleep(delay)                # configurable wait ("delay" is illustrative) so the process can shut down cleanly
os.killpg(pgid, signal.SIGKILL)  # die die die

What do you suggest I use instead of SIGINT?

Clean up the ProcessManager to remove the use of mgmt_log. Refactor to improve
comment and code legibility. Move the code to tear down the message queue
from the destructor into the stop() function, and capture the poll thread
so that we can join it and stop the manager relatively gracefully.
jpeach added 4 commits June 3, 2017 15:03
Simplify ProcessManager message handling to make it clearer when
we are stopping and when we are dealing with a process manager error
pumping messages to traffic_manager. We do this by hoisting the
message reading code into a helper function that just deals with
reading the data, and propagating the error up to the manager thread.
At that point, we can more easily know whether we are shutting down
or not.
If we are trying to stop the ProcessManager, the poll thread might
be sleeping, so we now send it SIGINT to break the sleep. This
causes it to notice earlier that it should no longer be running.
Rather than sleeping in a signal handler, unify the signal handling
so that we always bounce the actual work up to the SignalContinuation.
Once we are there, we can schedule a timed event to shut down after the
timeout, or just cause a normal shutdown immediately (a rough sketch of
the timed case follows these commit notes).

On the exit path, stop the ProcessManager to keep the integration
tests happy.
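As mentioned in the commit note about unifying the signal handling, the continuation can either stop immediately or schedule a delayed stop. A rough sketch of the delayed case, using only the standard library rather than the ATS event system (the function name and timeout value are illustrative):

    #include <chrono>
    #include <cstdlib>
    #include <thread>

    /* Illustrative only: ATS schedules this through its event system and a
       continuation; a detached thread just shows the shape of the idea. */
    static void
    schedule_shutdown_after(std::chrono::seconds drain_timeout)
    {
      std::thread([drain_timeout]() {
        std::this_thread::sleep_for(drain_timeout); /* let in-flight work drain */
        std::exit(0);                               /* then exit normally */
      }).detach();
    }

    /* Usage: schedule_shutdown_after(std::chrono::seconds(5)); */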
@jpeach
Contributor Author

jpeach commented Jun 3, 2017

@dragon512 send SIGTERM to traffic_cop when you want things to die.

@zwoop @maskit please review :)

This reverts commit 1f54cf0.

We no longer need to explicitly flush coverage data, because we don't
_exit(2) from the signal handler any more.
@jpeach jpeach requested a review from shinrich June 4, 2017 18:54
@jpeach
Contributor Author

jpeach commented Jun 4, 2017

@shinrich This branch reverts 1f54cf0 because we don't _exit() any more.

Member

@maskit maskit left a comment

It looks good to me.

I'm ok with this, but please consider using AutoStopCont instead of Alert().

if (mh->msg_id == MGMT_EVENT_SHUTDOWN) {
mgmt_fatal(0, "[ProcessManager::processEventQueue] Shutdown msg received, exiting\n");
} /* Exit on shutdown */
Alert("exiting on shutdown message");
Member

What do you think about scheduling AutoStop in mgmt_restart_shutdown_callback instead of an immediate shutdown with Alert() here? I think using Alert() for a normal exit is still too much. Handling a shutdown event the same as SIGTERM makes the normal shutdown process more consistent.

@shinrich
Member

shinrich commented Jun 5, 2017

Reverting the gcov flush sounds fine to me as long as the coverage data gets pushed out during the testing process.

@jpeach jpeach merged commit da1ce2c into apache:master Jun 6, 2017
@maskit maskit modified the milestones: 7.1.0, 8.0.0 Jun 9, 2017
@maskit maskit added the Backport Marked for backport for an LTS patch release label Jun 9, 2017
@masaori335 masaori335 modified the milestones: 7.1.0, 8.0.0 Feb 14, 2018
@masaori335
Contributor

Changed milestone to 8.0.0, because 7.1.0 doesn't have this.

Labels: Backport (Marked for backport for an LTS patch release), Cleanup, Documentation
