Conversation

@jpeach
Contributor

@jpeach jpeach commented May 18, 2017

Get out of the signal handler as fast as possible, and use the event loop to signal graceful shutdown.
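For context, a minimal sketch of the pattern this change moves toward: the handler only records that a signal arrived, and the regular event loop does the real work later. This is illustrative only; the function and array names here are hypothetical, not the actual proxy/Main.cc code.

    #include <signal.h>
    #include <string.h>

    /* Hypothetical names, not the PR code: the handler only sets a flag. */
    static volatile sig_atomic_t pending_signal[32];

    static void
    record_signal(int signo)
    {
      if (signo >= 0 && signo < 32) {
        pending_signal[signo] = 1; /* async-signal-safe: remember and return */
      }
    }

    static void
    install_handler(int signo)
    {
      struct sigaction sa;
      memset(&sa, 0, sizeof(sa));
      sa.sa_handler = record_signal;
      sigemptyset(&sa.sa_mask);
      sigaction(signo, &sa, nullptr);
    }

    /* Called from the event loop (in ATS terms, a continuation), where it is
       safe to log, allocate, and start a graceful shutdown. */
    static void
    drain_pending_signals()
    {
      for (int signo = 0; signo < 32; ++signo) {
        if (pending_signal[signo]) {
          pending_signal[signo] = 0;
          /* e.g. SIGTERM/SIGINT -> begin shutdown, SIGUSR1 -> dump stats */
        }
      }
    }

The point is simply that everything interesting happens outside the handler, where the usual thread-safety and logging rules apply.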

@jpeach jpeach requested review from maskit and zwoop May 18, 2017 05:32
@jpeach jpeach added this to the 8.0.0 milestone May 18, 2017
@jpeach
Contributor Author

jpeach commented May 18, 2017

[approve ci]

proxy/Main.cc Outdated

static volatile bool sigusr1_received = false;
static volatile bool sigusr2_received = false;
static volatile bool signal_received[32];
Member

Even though we don't use RT signals, wouldn't it be better to use SIGRTMAX defined in http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/signal.h.html ?

Contributor Author

Yeah, it would be nice to have a constant there. NSIG is close, but I think it is a Linux-ism. SIGRTMAX is usually 64, and we really only need to deal with the regular signals here, so it seemed like overkill :)

Member

I'd have to agree with @danobi - it's just a static array, probably worth the extra space to have a portable symbolic constant.
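As a sketch of the sizing question (assuming NSIG is available on the target platforms; the fallback value here is arbitrary), the array could be declared like this:

    #include <signal.h>

    #ifndef NSIG
    #define NSIG 32 /* conservative fallback where NSIG is not defined */
    #endif

    /* Covers all the classic signals; realtime signals are never handled here. */
    static volatile sig_atomic_t signal_received[NSIG];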

@maskit
Member

maskit commented May 18, 2017

It seems nice; autest failed though. I'll take a look at what's happening.

use the event loop to signal graceful shutdown.

to schedule (simple) shutdown? (The signal for graceful shutdown is still the flag, right?)

Member

@maskit maskit left a comment

It works fine for me.

I approve this to check whether Jenkins prevents an accidental merge. (Don't merge if he doesn't.)

@maskit
Member

maskit commented May 18, 2017

Jenkins did his job :)

@jpeach
Contributor Author

jpeach commented May 18, 2017

We should figure out the autest problem before merging.

@maskit
Member

maskit commented May 18, 2017

@jpeach Well, I just wanted to try Jenkins.

@jpeach
Contributor Author

jpeach commented May 19, 2017

So the error that autest finds from the process manager in mgmt/ProcessManager.cc is:

      // handle EOF
      if (res == 0) {
        close_socket(local_manager_sockfd);
        if (!shutdown_event_system) {
          mgmt_fatal(0, "[ProcessManager::pollLMConnection] Lost Manager EOF!");
        }
      }

So either this is a shutdown race, or this patch perturbed the shutdown sequence in an unexpected way.

Member

@maskit maskit left a comment

I confirmed the issue too.

An easy way to work around the race would be to introduce another flag, such as prepare_shutdown, and check it instead of shutdown_event_system in ProcessManager. It's not a perfect solution, but it should work as before.

I'm not a big fan of adding a new flag, but I think we will need one to use in place of http2_drain.

@maskit
Member

maskit commented May 19, 2017

I tried the prepare_shutdown flag approach based on this. I had to change some of the conditions below, and I'm not sure about the side effects, but at least the log output is clean.

if (pmgmt->require_lm && !prepare_shutdown) {
  pmgmt->pollLMConnection();

if (require_lm && !prepare_shutdown && mgmt_write_pipe(local_manager_sockfd, (char *)mh, sizeof(MgmtMessageHdr) + mh->data_len) <= 0) {

for (count = 0; count < max_msgs_in_a_row && !prepare_shutdown; ++count) {

@zizhong
Member

zizhong commented May 23, 2017

Thanks, @jpeach! Finally we moved all the sleep()s out of the signal handler.

return;
} /* End ProcessManager::reconfigure */
if (RecGetRecordInt("proxy.config.process_manager.timeout", &timeout) != REC_ERR_OKAY) {
// Default to 5sec if the timeout is unspecified.
Member

Tiny inconsistency between the comment and the code here.

Member

It seems correct to me. Has this been fixed already?

@jpeach
Contributor Author

jpeach commented May 23, 2017

@maskit The actual problem with the test failures is that simply scanning for the string Fatal does not always detect the error. When traffic_manager goes down, it is expected that traffic_server will read an EOF and also go down, which has the side effect of printing an error with the string Fatal in it. I made most of these cases level Alert, which is basically equivalent.

There's some cleanup to do around how Diags.cc calls ink_fatal_va() to really make this consistent. I think that's best left to another PR.

@maskit
Member

maskit commented May 23, 2017

@jpeach Does that mean there are no bad EOF cases? I agree that the Fatal during graceful shutdown was a false alert, but I'm not sure whether we can just change it to Alert. This is the reason I added prepare_shutdown, and I guess @dragon512 added the check for shutdown_event_system for the same reason.
a85a5a6

@jpeach
Contributor Author

jpeach commented May 23, 2017

If traffic_manager crashes, it is a problem. If traffic_manager exits on SIGTERM before traffic_server gets to it, it's not a problem. As you can see, there's no way to tell the difference. I don't think that adding more globals to add nuance to the shutdown really helps :)

@maskit
Member

maskit commented May 23, 2017

Right, it seems we are on the same page. The global flag would be just a small hint, not a perfect solution.
I'm going to approve this after some testing.

@maskit
Member

maskit commented May 23, 2017

I don't know why, but I still see Fatal: exiting with EOF from process manager in traffic.out with the latest patch; Jenkins looks happy though.

I also see [May 24 07:51:44.651] Server {0x2b706d4a4700} ALERT: exiting with EOF from process manager in diags.log

@jpeach
Contributor Author

jpeach commented May 23, 2017

The lowercase Fatal comes from ink_fatal_va and is hard-coded. We should think about how to clean that up separately.

@maskit
Member

maskit commented May 23, 2017

Got it... but it seems ATS shuts down immediately after receiving SIGTERM.

@jpeach
Contributor Author

jpeach commented May 24, 2017

@maskit I hoisted the stop() a little earlier. Lemme know what you think.

@jpeach
Contributor Author

jpeach commented May 30, 2017

@maskit Can you please take another look?

@maskit
Member

maskit commented May 30, 2017

It seems something is wrong. With this change, I don't see logs like the ones below:

[May 30 14:52:48.132] Server {0x2b856deff700} DEBUG: <ProcessManager.cc:248 (pollLMConnection)> (pmgmt) [ProcessManager::pollLMConnection] Message: '10009'
[May 30 14:52:49.133] Server {0x2b856deff700} DEBUG: <ProcessManager.cc:271 (pollLMConnection)> (pmgmt) [ProcessManager::pollLMConnection] enqueued 1 of max 10000 messages in a row
[May 30 14:52:49.133] Server {0x2b856deff700} DEBUG: <ProcessManager.cc:140 (processEventQueue)> (pmgmt) [ProcessManager] ==> Processing event id '10009' payload=272
[May 30 14:52:49.133] Server {0x2b856deff700} DEBUG: <ProcessManager.cc:163 (processSignalQueue)> (pmgmt) [ProcessManager] ==> Signalling local manager '16'
[May 30 14:52:49.133] Server {0x2b856deff700} DEBUG: <ProcessManager.cc:163 (processSignalQueue)> (pmgmt) [ProcessManager] ==> Signalling local manager '16'

and I see this error message:

$ sudo /opt/trafficserver/bin/traffic_ctl server stop
traffic_ctl: server stop failed: [14] Generic Fail message (ie. CoreAPI call).

@dragon512
Contributor

Is this going to try to make sure we flush the buffers? I know we are having issues with buffers not flushing, and with code coverage tests failing because we have to kill ATS to shut it down, which prevents code coverage data from being written correctly to disk.

@jpeach
Contributor Author

jpeach commented May 30, 2017

@dragon512 this will help address that, depending on how you kill the process, but it is not at all clear to me how autest manages the processes. It is possible to make code coverage work (see the ci/coverage script), and this can help.

@jpeach
Contributor Author

jpeach commented May 30, 2017

@zwoop is ICC busted?

@dragon512
Contributor

autest by default will try a Ctrl-C, then a TERM signal after a delay. I think I have the delay set to zero for ATS, because before this ATS would never shut down based on the Ctrl-C being passed... not sure what was going on there.

@jpeach
Contributor Author

jpeach commented May 30, 2017 via email

@dragon512
Contributor

I'm making a new drop of AuTest today... I'm going to add some --verbose messages to help clarify what is going on.

As far as AuTest goes, it basically does:

os.killpg(pgid, signal.SIGINT)   # pgid is a process group set up for the process we started
time.sleep(delay)                # configurable wait ("delay" is illustrative) so the process can shut down cleanly
os.killpg(pgid, signal.SIGKILL)  # die die die

What do you suggest I use instead of SIGINT?

Clean up the ProcessManager to remove the use of mgmt_log. Refactor to improve
comment and code legibility. Move the code to tear down the message queue
from the destructor into the stop() function, and capture the poll thread
so that we can join it and stop the manager relatively gracefully.
jpeach added 4 commits June 3, 2017 15:03
Simplify ProcessManager message handling to make it clearer when
we are stopping and when we are dealing with a process manager error
pumping messages to traffic_manager. We do this by hoisting the
message reading code into a helper function that just deals with
reading the data, and propagating the error up to the manager thread.
At that point, we can more easily know whether we are shutting down
or not.
If we are trying to stop the ProcessManager, the poll thread might
be sleeping, so we now send it SIGINT to break the sleep. This
causes it to notice earlier that it should no longer be running.
Rather than sleeping in a signal handler, unify the signal handling
so that we always bounce the actual work up to the SignalContinuation.
Once we are there, we can schedule a timed event to shut down after the
timeout, or just cause a normal shutdown immediately (a rough sketch of
the timed case follows these commit notes).

On the exit path, stop the ProcessManager to keep the integration
tests happy.
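As mentioned in the commit note about unifying the signal handling, the continuation can either stop immediately or schedule a delayed stop. A rough sketch of the delayed case, using only the standard library rather than the ATS event system (the function name and timeout value are illustrative):

    #include <chrono>
    #include <cstdlib>
    #include <thread>

    /* Illustrative only: ATS schedules this through its event system and a
       continuation; a detached thread just shows the shape of the idea. */
    static void
    schedule_shutdown_after(std::chrono::seconds drain_timeout)
    {
      std::thread([drain_timeout]() {
        std::this_thread::sleep_for(drain_timeout); /* let in-flight work drain */
        std::exit(0);                               /* then exit normally */
      }).detach();
    }

    /* Usage: schedule_shutdown_after(std::chrono::seconds(5)); */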
@jpeach
Contributor Author

jpeach commented Jun 3, 2017

@dragon512 send SIGTERM to traffic_cop when you want things to die.

@zwoop @maskit please review :)

This reverts commit 1f54cf0.

We no longer need to explicitly flush coverage data, because we don't
_exit(2) from the signal handler any more.
@jpeach jpeach requested a review from shinrich June 4, 2017 18:54
@jpeach
Contributor Author

jpeach commented Jun 4, 2017

@shinrich This branch reverts 1f54cf0 because we don't _exit() any more.

Member

@maskit maskit left a comment

It looks good to me.

I'm ok with this, but please consider using AutoStopCont instead of Alert().

if (mh->msg_id == MGMT_EVENT_SHUTDOWN) {
mgmt_fatal(0, "[ProcessManager::processEventQueue] Shutdown msg received, exiting\n");
} /* Exit on shutdown */
Alert("exiting on shutdown message");
Member

What do you think about scheduling AutoStop in mgmt_restart_shutdown_callback instead of an immediate shutdown with Alert() here? I think using Alert() for a normal exit is still too much. Handling a shutdown event the same as SIGTERM makes the normal shutdown process more consistent.

@shinrich
Member

shinrich commented Jun 5, 2017

Reverting the gcov flush sounds fine to me as long as the coverage data gets pushed out during the testing process.

@jpeach jpeach merged commit da1ce2c into apache:master Jun 6, 2017
@maskit maskit modified the milestones: 7.1.0, 8.0.0 Jun 9, 2017
@maskit maskit added the Backport Marked for backport for an LTS patch release label Jun 9, 2017
@masaori335 masaori335 modified the milestones: 7.1.0, 8.0.0 Feb 14, 2018
@masaori335
Contributor

Changed milestone to 8.0.0, because 7.1.0 doesn't have this.

Labels: Backport (Marked for backport for an LTS patch release), Cleanup, Documentation
