[NAT] Exit from natmgrd correctly after receiving SIGTERM signal. (so…
…nic-net#2232)

**What I did**
Fix an issue where *natmgrd* keeps running after receiving SIGTERM and performing its clean-up.

**Why I did it**
Supervisord sends SIGTERM to every application it controls. After receiving the signal, an application may optionally run a clean-up routine and should then exit. natmgrd runs its clean-up from the signal handler but does not exit afterwards. After sending the signal, supervisord waits 10 seconds (the default timeout) for the application to exit. If the application has not exited within that timeout, supervisord kills it. As a result, natmgrd is always killed by supervisord, which lengthens the container restart time.
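
The pattern this change moves to can be sketched in isolation as follows. This is a minimal standalone illustration, not the natmgrd code: the names `g_exit`, `g_cleaned_up`, `on_sigterm` and `run_daemon` are hypothetical, the clean-up is a stub, and the `std::raise` call only simulates supervisord delivering SIGTERM.

```cpp
#include <csignal>
#include <cstdlib>

namespace {
// Flag written by the signal handler and polled by the main loop.
// volatile std::sig_atomic_t is the type the standard permits for this.
volatile std::sig_atomic_t g_exit = 0;

// Records that the (stub) clean-up ran; hypothetical, for illustration only.
bool g_cleaned_up = false;
}

// The handler does nothing except request shutdown.
extern "C" void on_sigterm(int /*signo*/)
{
    g_exit = 1;
}

// Stand-in for the real clean-up work, executed in normal (non-signal) context.
void cleanup()
{
    g_cleaned_up = true;
}

// Runs the loop until SIGTERM is seen, cleans up, and returns the exit code.
int run_daemon()
{
    std::signal(SIGTERM, on_sigterm);

    int iterations = 0;
    while (!g_exit)
    {
        ++iterations;
        if (iterations == 3)
        {
            // Simulate supervisord sending SIGTERM to this process.
            std::raise(SIGTERM);
        }
    }

    cleanup();
    return EXIT_SUCCESS;
}
```

Calling `run_daemon()` from `main` and returning its result gives the process an exit status of 0 once the signal has been handled, which is the kind of status supervisord reports as a clean stop.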

Before the fix, shutting down the NAT container takes ~10 seconds. Docker kills the application with SIGKILL after the timeout:
```
# time /usr/bin/nat.sh stop

real    0m10.420s
user    0m0.227s
sys     0m0.046s

Apr 18 10:46:16.881285 r-leopard-32 INFO nat#supervisord 2022-04-18 10:46:16,879 WARN received SIGTERM indicating exit request
Apr 18 10:46:16.881285 r-leopard-32 INFO nat#supervisord 2022-04-18 10:46:16,880 INFO waiting for supervisor-proc-exit-listener, rsyslogd, natmgrd, natsyncd to die
Apr 18 10:46:17.883936 r-leopard-32 INFO nat#supervisord 2022-04-18 10:46:17,883 INFO stopped: natsyncd (terminated by SIGTERM)
Apr 18 10:46:17.883936 r-leopard-32 NOTICE nat#natmgrd: :- sigterm_handler: Got SIGTERM
Apr 18 10:46:17.891989 r-leopard-32 INFO nat#/supervisord: natmgrd conntrack v1.4.5 (conntrack-tools): connection tracking table has been emptied.
Apr 18 10:46:17.891989 r-leopard-32 NOTICE nat#natmgrd: :- sigterm_handler: Sending notification to orchagent to cleanup NAT entries in REDIS/ASIC
Apr 18 10:46:19.895501 r-leopard-32 INFO nat#supervisord 2022-04-18 10:46:19,894 INFO waiting for supervisor-proc-exit-listener, rsyslogd, natmgrd to die
Apr 18 10:46:22.898947 r-leopard-32 INFO nat#supervisord 2022-04-18 10:46:22,898 INFO waiting for supervisor-proc-exit-listener, rsyslogd, natmgrd to die
Apr 18 10:46:25.903148 r-leopard-32 INFO nat#supervisord 2022-04-18 10:46:25,902 INFO waiting for supervisor-proc-exit-listener, rsyslogd, natmgrd to die
Apr 18 10:46:26.115147 r-leopard-32 INFO dockerd[737]: time="2022-04-18T10:46:26.114201315Z" level=info msg="Container ec5804a8ccd413786392c27ac3e61d4dfe67c8e5558c91b6c6bf0712cf85d07a failed to exit within 10 seconds of signal 15 - using the force"

```
After the fix, shutting down the NAT container takes ~4 seconds. The application exits correctly after receiving SIGTERM:
```
# time /usr/bin/nat.sh stop

real    0m4.166s
user    0m0.219s
sys     0m0.036s

Apr 18 10:52:23.611991 r-leopard-32 INFO nat#supervisord 2022-04-18 10:52:23,610 WARN received SIGTERM indicating exit request
Apr 18 10:52:23.611991 r-leopard-32 INFO nat#supervisord 2022-04-18 10:52:23,610 INFO waiting for supervisor-proc-exit-listener, rsyslogd, natmgrd, natsyncd to die
Apr 18 10:52:23.613338 r-leopard-32 INFO nat#supervisord 2022-04-18 10:52:23,612 INFO stopped: natsyncd (terminated by SIGTERM)
Apr 18 10:52:24.620815 r-leopard-32 INFO nat#/supervisord: natmgrd conntrack v1.4.5 (conntrack-tools): connection tracking table has been emptied.
Apr 18 10:52:24.620815 r-leopard-32 NOTICE nat#natmgrd: :- cleanup: Sending notification to orchagent to cleanup NAT entries in REDIS/ASIC
Apr 18 10:52:25.407307 r-leopard-32 INFO nat#supervisord 2022-04-18 10:52:25,406 INFO stopped: natmgrd (exit status 0)
```

**How I verified it**
Stop the NAT container and check syslog to confirm that natmgrd exited correctly with return code '0'.
oleksandrivantsiv authored May 4, 2022
1 parent 5329c39 commit f06114c
Showing 1 changed file with 16 additions and 3 deletions.
19 changes: 16 additions & 3 deletions cfgmgr/natmgrd.cpp
@@ -62,15 +62,24 @@ NatMgr *natmgr = NULL;
 NotificationConsumer *timeoutNotificationsConsumer = NULL;
 NotificationConsumer *flushNotificationsConsumer = NULL;
 
+static volatile sig_atomic_t gExit = 0;
+
 std::shared_ptr<swss::NotificationProducer> cleanupNotifier;
 
 void sigterm_handler(int signo)
 {
+    SWSS_LOG_ENTER();
+
+    gExit = 1;
+}
+
+void cleanup()
+{
     int ret = 0;
     std::string res;
     const std::string conntrackFlush = "conntrack -F";
 
-    SWSS_LOG_NOTICE("Got SIGTERM");
+    SWSS_LOG_ENTER();
 
     /*If there are any conntrack entries, clean them */
     ret = swss::exec(conntrackFlush, res);
@@ -154,7 +163,7 @@ int main(int argc, char **argv)
         s.addSelectable(flushNotificationsConsumer);
 
         SWSS_LOG_NOTICE("starting main loop");
-        while (true)
+        while (!gExit)
         {
             Selectable *sel;
             int ret;
@@ -197,10 +206,14 @@ int main(int argc, char **argv)
             auto *c = (Executor *)sel;
             c->execute();
         }
+
+        cleanup();
     }
     catch(const std::exception &e)
     {
         SWSS_LOG_ERROR("Runtime error: %s", e.what());
+        return EXIT_FAILURE;
     }
-    return -1;
+
+    return 0;
 }
