NoOps in AppScale with Monit
Programs that run will eventually die. Hopefully they die because you decided that they don't need to run anymore, but sometimes they crash unexpectedly. That may be fine (although probably annoying) if you're running a small app that isn't critical to your business. But what do you do if it is business-critical? You may have an operations team who watches your web server and pages you at 2am if it goes down, but shouldn't there be a better way? There is! Let's talk about how we keep your apps running in AppScale.
Historically, AppScale has looked at the problem of keeping your services alive if they crash. It turns out there's already a great piece of software that handles this problem for you: god. God is a RubyGem that lets you write Ruby code to dictate which processes should be monitored, and it automatically revives them if they die. God has been pretty solid for us over the last five years, and monitored almost everything for us in AppScale. However, we've run into new issues as we've focused on making AppScale production-quality. Specifically, how do we kill processes that are running but using too much memory? God says it can do this, and gives a very nice piece of code on its homepage to do so, but it just didn't work for us. After a lot of messing around, we were able to get god to revive dead processes OR kill processes using too much CPU or memory, but not both. We need both, and after we were unable to get help from the god team on their mailing list, we searched for alternatives. We considered supervisord and upstart, but eventually converged on monit.
Why monit? Well, it does kill processes that use too much memory, and is able to revive dead processes, which is exactly what we need. As a nice added bonus, the syntax is very similar to that of god, so porting was not too much of an issue. In many cases, it's also much shorter to monitor a process with monit than god. Here's what our old god code used to look like for a single process:
God.watch do |w|
  w.name = "appscale-controller-17443"
  w.group = "controller"
  w.interval = 30.seconds # default
  w.start = "ruby /root/appscale/AppController/djinnServer.rb"
  w.stop = "ruby /root/appscale/AppController/terminate.rb"
  w.start_grace = 20.seconds
  w.restart_grace = 20.seconds
  w.log = "/var/log/appscale/controller-17443.log"
  w.pid_file = "/var/appscale/controller-17443.pid"

  w.behavior(:clean_pid_file)

  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.running = false
    end
  end

  w.restart_if do |restart|
    restart.condition(:memory_usage) do |c|
      c.above = 150.megabytes
      c.times = [3, 5] # 3 out of 5 intervals
    end

    restart.condition(:cpu_usage) do |c|
      c.above = 50.percent
      c.times = 5
    end
  end

  # lifecycle
  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minute
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 2.hours
    end
  end

  w.env = {
    "APPSCALE_HOME" => "/root/appscale",
  }
end
and here's the corresponding monit code:
check process controller-17443 matching "/usr/bin/ruby /root/appscale/AppController/djinnServer.rb"
  group controller
  start program = "/bin/bash -c 'HOME=/root /usr/bin/ruby /root/appscale/AppController/djinnServer.rb 1>>/var/log/appscale/controller-17443.log 2>>/var/log/appscale/controller-17443.log'"
  stop program = "/usr/bin/ruby /root/appscale/AppController/terminate.rb"
  if memory is greater than 250 MB for 5 cycles then restart
This is perhaps not a perfect comparison. Our god file spends many lines setting up the CPU and memory restart limits (which don't work for us on our Lucid machines), and the monit file is still a bit ugly in how it detects that the AppController is running (although it's much nicer than having to write PID files, which is monit's default). Having to exec a bash shell in monit to capture stdout and stderr from our processes is also clunky compared to god, but it works, so we can't complain too much there.
We make it sound like porting from god to monit was a piece of cake, but we actually ran into a number of issues along the way. One thing that we already mentioned is that getting the stdout and stderr of processes is a bit ugly with monit (since we have to exec a bash shell to do it). Another ugly thing is having to specify the full path to each executable (e.g., /usr/bin/python instead of just python), although that makes sense from a security standpoint. Monit also doesn't let you specify environment variables, so we have to get around this by doing bash -c 'export MY_VARIABLE=foo && /usr/bin/python myapp.py', which works but is ugly and not well-documented.
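If you generate monit files from code, the bash workaround above can be wrapped in a small helper. Here's a minimal Ruby sketch of the idea; the function name monit_start_command and its parameters are made up for illustration (they aren't AppScale's actual code), and it assumes that values and paths contain no spaces or quote characters:

```ruby
# Hypothetical helper (not part of AppScale or monit): builds the string
# for monit's "start program" line. The command is wrapped in a bash shell
# so that environment variables can be exported and stdout/stderr can be
# appended to a log file.
# Assumption: env values and paths contain no spaces or quote characters.
def monit_start_command(command, env: {}, log_file: nil)
  # One "export KEY=value" per variable, chained with && before the command.
  exports = env.map { |key, value| "export #{key}=#{value}" }
  inner = (exports + [command]).join(" && ")

  # Append both stdout and stderr to the same log file, if one was given.
  inner += " 1>>#{log_file} 2>>#{log_file}" if log_file

  "/bin/bash -c '#{inner}'"
end

# Example usage; the log path here is only a placeholder.
puts monit_start_command(
  "/usr/bin/python myapp.py",
  env: { "MY_VARIABLE" => "foo" },
  log_file: "/var/log/appscale/myapp.log"
)
```

The executable is still given by its full path, since monit won't look it up for you.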
Monit also loves to use process IDs to keep track of things, which is nice if you write all of the services yourself, but then you have to manage PID files yourself, which is annoying. Thankfully, newer versions of monit let you specify a string to search for to see if the process is running (like a ps ax | grep your-service-name). Finally, a rule that restarts a process when it uses more than X MB of memory only looks at that process itself. If the process stays under the limit but forks a child that exceeds it, the rule never fires. You have to instead say "if totalmem > X MB", so that child processes are included as well. The documentation talks about this a little bit, but more examples would have helped.
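Putting those two points together, here's a rough Ruby sketch that renders a monit entry using "matching" instead of a PID file and "totalmem" instead of "memory". The helper name monit_stanza and its keyword arguments are hypothetical, and the 250 MB and 5 cycle values are simply carried over from our example above:

```ruby
# Hypothetical helper (not part of AppScale or monit): renders a monit
# entry that finds the process by matching its command line (no PID file)
# and restarts it when the process plus its children use too much memory.
def monit_stanza(name:, match:, group:, start:, stop:, max_mb:, cycles: 5)
  <<~MONIT
    check process #{name} matching "#{match}"
      group #{group}
      start program = "#{start}"
      stop program = "#{stop}"
      if totalmem > #{max_mb} MB for #{cycles} cycles then restart
  MONIT
end

# Example usage with the AppController values from earlier in this post.
puts monit_stanza(
  name: "controller-17443",
  match: "/usr/bin/ruby /root/appscale/AppController/djinnServer.rb",
  group: "controller",
  start: "/bin/bash -c 'HOME=/root /usr/bin/ruby /root/appscale/AppController/djinnServer.rb 1>>/var/log/appscale/controller-17443.log 2>>/var/log/appscale/controller-17443.log'",
  stop: "/usr/bin/ruby /root/appscale/AppController/terminate.rb",
  max_mb: 250
)
```

The only difference from the monit file shown earlier is the memory rule, which now counts child processes too.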
Our new monit support debuted in AppScale 1.12.0, and we're very excited to have it in. Some things we've been looking into include customizing the memory threshold for killing a process, and whether it should be percentage-based instead of a hard limit. Check it out and, as always, let us know what you think on #appscale on freenode.net or on our Google Group!