-
-
Notifications
You must be signed in to change notification settings - Fork 313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework sysv stop script to fix issues #181
Conversation
The sysv stop script had been set to issue a SIGKILL (-9) to ensure that a server would not perform a leave operation. That's excessive, and the logic in that section had a race condition that could result in a client unpredictably ending in either a "left" or "failed" state. Reworked this section to issue a SIGINT to clients (which initiates a leave by default) and then check for success. If the leave operation fails or if the agent is acting as a server, then the standard killproc is issued, which tries a SIGTERM, followed by a SIGKILL if the TERM is not effective. Also fixed a minor issue where the stop return status would often be displayed in the prompt line. The start status is eaten by Consul's default stdout logging, so there's no fixing that. It simply ends up in the log file.
# | ||
# A SIGTERM will mark the node as "failed" until it rejoins. | ||
# killproc with no arguments uses TERM, then escalates to KILL. | ||
killproc $KILLPROC_OPT $CONSUL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize you mentioned this in the comment but can we add -TERM
anyway just to be explicit?
I'll try to roll up all the questions... Digging into how "killproc" works, it issues a TERM and then spin-waits until the process shuts down or exceeds its delay (3s by default). At that point it issues a KILL. However, it doesn't do that if a specific signal is passed. It merely issues that signal once and then hopes for the best. That's why I didn't include -TERM on the server killproc invocation, because then it won't escalate to KILL. I dropped the "else" clause on the client to take advantage of the signal escalation as well. INT, wait, TERM, wait, KILL. End early on success. If a leave fails, I think it's reasonable to escalate to a TERM, since something's clearly wrong with the node. (I've never seen a leave fail, I'm not sure what circumstances could cause that, but perhaps a network partition?). If a TERM fails, we escalate to a KILL. The $PID_FILE removal is normally handled by Consul, but in the case of a SIGKILL, that wouldn't happen. I think my patch is wrong and that line that should be reverted to
Good catch, I'll admit it was getting late last night. I originally dropped that because the race to kill consul was removing it. Then consul itself would complain it couldn't remove the file, because the |
@pforman haha I know that feeling So yah, I agree; INIT, wait, KILL and end early on success and clean up any lock/pid files 👍 |
Okay, I'll test and submit an update to revert the $PID_FILE cleanup. Thanks for the careful review. |
The $PID_FILE was being removed by the stop command. Before the race condition for the INT/leave condition was fixed, the "rm" could remove the file before consul itself tried to, which made consul unhappy in the logs. Re-add the rm -f $PID_FILE to ensure its removed in a drastic SIGKILL case. The waits on INT or TERM will ensure that the rm only happens when it's needed.
Yuck I hate init scripts :( I'll review this carefully this weekend. |
Let's see how it goes! |
Rework sysv stop script to fix issues
Thanks. I tested it in my use cases (Amazon Linux, so RH-derived) pretty aggressively, including looping stop-start for quite a while. I'll be embarrassed if I broke something :) |
Addresses #174.
I commented the stop function extensively so the next person in there will understand what I was trying to do and the signal logic there. It's pretty similar to what it was doing before, but without the race condition and using more appropriate signals.
In my tests the SIGINT-triggered leave happens in less than 1 second, but takes enough time that a wait check is necessary to avoid the race. We wait up to 5 seconds, checking every second to see if it's finished.
Commit message follows.
The sysv stop script had been set to issue a SIGKILL (-9) to ensure that
a server would not perform a leave operation. That's excessive, and the
logic in that section had a race condition that could result in a client
unpredictably ending in either a "left" or "failed" state.
Reworked this section to issue a SIGINT to clients (which initiates a
leave by default) and then check for success. If the leave operation
fails or if the agent is acting as a server, then the standard killproc
is issued, which tries a SIGTERM, followed by a SIGKILL if the TERM is
not effective.
Also fixed a minor issue where the stop return status would often be
displayed in the prompt line. The start status is eaten by Consul's
default stdout logging, so there's no fixing that. It simply ends up in
the log file.