Rework sysv stop script to fix issues #181

pforman · 2015-08-26T06:24:19Z

Addresses #174.

I commented the stop function extensively so the next person in there will understand what I was trying to do and the signal logic there. It's pretty similar to what it was doing before, but without the race condition and using more appropriate signals.

In my tests the SIGINT-triggered leave happens in less than 1 second, but takes enough time that a wait check is necessary to avoid the race. We wait up to 5 seconds, checking every second to see if it's finished.

Commit message follows.

The sysv stop script had been set to issue a SIGKILL (-9) to ensure that
a server would not perform a leave operation. That's excessive, and the
logic in that section had a race condition that could result in a client
unpredictably ending in either a "left" or "failed" state.

Reworked this section to issue a SIGINT to clients (which initiates a
leave by default) and then check for success. If the leave operation
fails or if the agent is acting as a server, then the standard killproc
is issued, which tries a SIGTERM, followed by a SIGKILL if the TERM is
not effective.

Also fixed a minor issue where the stop return status would often be
displayed in the prompt line. The start status is eaten by Consul's
default stdout logging, so there's no fixing that. It simply ends up in
the log file.

The sysv stop script had been set to issue a SIGKILL (-9) to ensure that a server would not perform a leave operation. That's excessive, and the logic in that section had a race condition that could result in a client unpredictably ending in either a "left" or "failed" state. Reworked this section to issue a SIGINT to clients (which initiates a leave by default) and then check for success. If the leave operation fails or if the agent is acting as a server, then the standard killproc is issued, which tries a SIGTERM, followed by a SIGKILL if the TERM is not effective. Also fixed a minor issue where the stop return status would often be displayed in the prompt line. The start status is eaten by Consul's default stdout logging, so there's no fixing that. It simply ends up in the log file.

aj-jester · 2015-08-26T14:48:44Z

templates/consul.sysv.erb

+        #
+        # A SIGTERM will mark the node as "failed" until it rejoins.
+        # killproc with no arguments uses TERM, then escalates to KILL.
+        killproc $KILLPROC_OPT $CONSUL


I realize you mentioned this in the comment but can we add -TERM anyway just to be explicit?

pforman · 2015-08-26T18:25:13Z

I'll try to roll up all the questions... Digging into how "killproc" works, it issues a TERM and then spin-waits until the process shuts down or exceeds its delay (3s by default). At that point it issues a KILL.

However, it doesn't do that if a specific signal is passed. It merely issues that signal once and then hopes for the best. That's why I didn't include -TERM on the server killproc invocation, because then it won't escalate to KILL.

I dropped the "else" clause on the client to take advantage of the signal escalation as well. INT, wait, TERM, wait, KILL. End early on success.

If a leave fails, I think it's reasonable to escalate to a TERM, since something's clearly wrong with the node. (I've never seen a leave fail, I'm not sure what circumstances could cause that, but perhaps a network partition?). If a TERM fails, we escalate to a KILL.

The $PID_FILE removal is normally handled by Consul, but in the case of a SIGKILL, that wouldn't happen. I think my patch is wrong and that line that should be reverted to

rm -f /var/lock/subsys/consul $PID_FILE

Good catch, I'll admit it was getting late last night.

I originally dropped that because the race to kill consul was removing it. Then consul itself would complain it couldn't remove the file, because the rm had been there first. Now that the wait happens in both client and server cases, it shouldn't cause any issues, and would be important in a SIGKILL case.

aj-jester · 2015-08-26T18:33:31Z

it was getting late last night

@pforman haha I know that feeling

So yah, I agree; INIT, wait, KILL and end early on success and clean up any lock/pid files 👍

pforman · 2015-08-26T18:43:02Z

Okay, I'll test and submit an update to revert the $PID_FILE cleanup. Thanks for the careful review.

The $PID_FILE was being removed by the stop command. Before the race condition for the INT/leave condition was fixed, the "rm" could remove the file before consul itself tried to, which made consul unhappy in the logs. Re-add the rm -f $PID_FILE to ensure its removed in a drastic SIGKILL case. The waits on INT or TERM will ensure that the rm only happens when it's needed.

solarkennedy · 2015-09-05T03:24:43Z

Yuck I hate init scripts :(

I'll review this carefully this weekend.

solarkennedy · 2015-09-07T18:57:57Z

Let's see how it goes!

Rework sysv stop script to fix issues

pforman · 2015-09-08T01:15:50Z

Thanks. I tested it in my use cases (Amazon Linux, so RH-derived) pretty aggressively, including looping stop-start for quite a while. I'll be embarrassed if I broke something :)

aj-jester reviewed Aug 26, 2015
View reviewed changes

pforman mentioned this pull request Sep 5, 2015

Strange (probably unnecessary) behavior in sysv stop script #174

Closed

solarkennedy added a commit that referenced this pull request Sep 7, 2015

Merge pull request #181 from pforman/master

32b264e

Rework sysv stop script to fix issues

solarkennedy merged commit 32b264e into voxpupuli:master Sep 7, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework sysv stop script to fix issues #181

Rework sysv stop script to fix issues #181

pforman commented Aug 26, 2015

aj-jester Aug 26, 2015

pforman commented Aug 26, 2015

aj-jester commented Aug 26, 2015

pforman commented Aug 26, 2015

solarkennedy commented Sep 5, 2015

solarkennedy commented Sep 7, 2015

pforman commented Sep 8, 2015

Rework sysv stop script to fix issues #181

Rework sysv stop script to fix issues #181

Conversation

pforman commented Aug 26, 2015

aj-jester Aug 26, 2015

Choose a reason for hiding this comment

pforman commented Aug 26, 2015

aj-jester commented Aug 26, 2015

pforman commented Aug 26, 2015

solarkennedy commented Sep 5, 2015

solarkennedy commented Sep 7, 2015

pforman commented Sep 8, 2015