Troubleshooting
We all love AppScale, but like all software, it occasionally has problems. This post outlines what to do when you run into a problem with AppScale: how to debug it and how to fix it. Of course, you can always ask us for help on IRC (#appscale on freenode.net). Let's start with some common problems we've seen people run into and how to get past them, and then look at what to do when the going gets tough.
AppScale runs many processes, each of which takes up memory. If there is not enough memory, the kernel's OOM Killer will start killing processes and AppScale will behave erratically. If AppScale is not working correctly, first make sure that you didn't run out of memory: check '/var/log/kern.log' and '/var/log/syslog' on your AppScale nodes for OOM Killer messages.
$ tail /var/log/kern.log
Feb 14 00:10:54 appscale-image0 kernel: [203916.804124] Out of memory: Kill process 28026 (python) score 182 or sacrifice child
Feb 14 00:10:54 appscale-image0 kernel: [203916.804320] Killed process 28036 (python) total-vm:810672kB, anon-rss:550012kB, file-rss:0kB
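If you have several nodes to check, a small helper can save some scrolling. The following is a minimal sketch, not part of AppScale: the function name count_oom_kills is made up for illustration, and it assumes the kernel writes "Out of memory" once per kill, as in the sample output above. Message formats differ between kernel versions, so treat the count as a hint rather than an exact figure.

```shell
#!/bin/sh
# Count OOM Killer events recorded in the given log files.
# Files that are missing or unreadable are skipped instead of causing an error.
# NOTE: count_oom_kills is a hypothetical helper, not an AppScale command.
count_oom_kills() {
    total=0
    for log in "$@"; do
        [ -r "$log" ] || continue
        # grep -c prints the number of matching lines; it exits non-zero
        # when there are none, so "|| true" keeps the loop going.
        n=$(grep -c 'Out of memory' "$log" || true)
        total=$((total + ${n:-0}))
    done
    echo "$total"
}

# Check the two logs mentioned above on this node.
count_oom_kills /var/log/kern.log /var/log/syslog
```

Any result other than 0 means the OOM Killer has been active on that node, and the node probably needs more memory or fewer roles.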
AppScale uses monit to monitor all the processes on the node. If monit itself gets killed off, downed processes will no longer be restarted. Is it running? Did it get killed off for some reason?
$ ps aux | grep monit
root 25043 0.1 0.0 103532 2796 ? Sl Feb12 3:32 /usr/local/bin/monit
root 28906 0.0 0.0 9396 900 pts/0 S+ 14:57 0:00 grep --color=auto monit
If you run "appscale status", do you get "[Errno 111] Connection refused"? If so, that generally means that the AppController is no longer running. This could be a bug in the AppController, so check the logs. Most commonly, it's because it was killed off by the OOM Killer.
To bring the processes back up, just restart monit.
$ monit -c /etc/monitrc
Starting monit daemon with http interface at [*:2812]
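The check and the restart can be combined so that monit is only started when it is actually missing. This is a rough sketch under a few assumptions: the helper names is_running and ensure_running are invented here, pgrep is assumed to be installed on the node, and the monit command line is the one shown above.

```shell
#!/bin/sh
# Succeeds if a process with exactly this name is currently running.
# Assumes pgrep is available; if it is not, the process is treated as absent.
is_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

# Run the given start command only when the named process is absent.
# usage: ensure_running NAME COMMAND [ARGS...]
# NOTE: ensure_running is a hypothetical helper, not an AppScale command.
ensure_running() {
    name=$1
    shift
    if is_running "$name"; then
        echo "$name is already running"
    else
        "$@"
    fi
}

# Example (run as root on the affected node):
#   ensure_running monit monit -c /etc/monitrc
```

Keeping the start command as an argument means the same helper can be reused for other daemons without editing it.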
ZooKeeper and Cassandra are critical services for data storage. You can see their logs in /var/log/zookeeper and /var/log/cassandra.
If you ran "appscale up" to start AppScale and it didn't start, it could have failed for any of the following reasons:
- (VirtualBox) AppScale hung at "Please wait for AppScale to start your machines."
- (EC2) You're using Spot Instances but AppScale is hung at "Waiting for machines to become available."
- (Eucalyptus) AppScale hung at "Waiting for machines to become available."
Let's look at each of these individually.
When running AppScale on VirtualBox, we've seen problems when VirtualBox 4.1.X is used. Specifically, the AppScale Tools will start up the AppController on port 17443 and then hang at "Please wait for AppScale to start your machines." In this case, the AppScale Tools are waiting for port 17443 to open on the VM but cannot reach the VM, even though the VM has that port open. Upgrading to VirtualBox 4.2 or newer should fix the problem.
If you're using Spot Instances (you've set "use_spot_instances : True" in your AppScalefile), there is a possibility that Amazon won't have any spare machines available at the price and instance type you requested. Typically it takes us about 5 minutes to get a Spot Instance, so if it takes you substantially longer than that (say, 10 minutes), then you can log into the AWS Dashboard, click on EC2, and then click on Spot Instances. There, you can see why your machines aren't available. You can cancel your Spot Instance Request and try again with a higher price or a different instance type, depending on the message the dashboard reports.
When running on Eucalyptus, if there are not enough virtual machines available, AppScale won't be able to start up. For example, if you tell AppScale to run on 8 machines and you only have 6 available, then that won't work! You'll see a message from the tools saying "Spawning 7 virtual machines" (since we spawn one machine and delegate the responsibility of starting up the other 7 to it), and the tools will eventually crash, since the AppController won't be able to get the remaining 7 machines. The solution is simple: make sure you have enough virtual machines available before you start AppScale! In Eucalyptus, an administrator can find out how many virtual machines are free by running "euca-describe-availability-zones verbose".
If, for some reason, running "appscale down" isn't able to terminate your AppScale deployment, you can bring your VMs back to a pristine state by running:
$ appscale clean
This script forcefully kills all of the AppScale-related processes.
So you've run into a problem we don't normally run into. How do you find out what's going on? For this case, we have a special command you can run. On the machine where you've installed the AppScale tools, run "appscale logs ~/Desktop/baz" and this will copy all of the logs from each machine in your AppScale deployment to ~/Desktop/baz (of course, change that path if you want your logs copied somewhere else). If this doesn't work for some reason, you can always use "scp" to copy the contents of the "/var/log/appscale" directory from each machine.
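If you end up on the "scp" fallback, a loop over your machines keeps each node's logs in its own directory. This is only a sketch: collect_logs is a made-up helper, the addresses in the example are placeholders, and logging in as root is an assumption about how your nodes are set up.

```shell
#!/bin/sh
# Copy /var/log/appscale from each node into DEST/<host>/.
# usage: collect_logs DEST HOST [HOST...]
# NOTE: collect_logs is a hypothetical helper, and the root login is an
# assumption; adjust the user to match your deployment.
collect_logs() {
    dest=$1
    shift
    for host in "$@"; do
        mkdir -p "$dest/$host"
        # -r copies the directory recursively; a failure on one node is
        # reported but does not stop the remaining copies.
        scp -r "root@$host:/var/log/appscale" "$dest/$host/" ||
            echo "warning: could not copy logs from $host" >&2
    done
}

# Example (placeholder addresses):
#   collect_logs ~/Desktop/baz 192.168.1.2 192.168.1.3
```

Separating the logs by host matters because every machine has a log with the same file name for services that run on all nodes.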
Logs you will find interesting include:
- controller-17443.log: The most interesting log! This log belongs to the AppController, our provisioning daemon. Since it sets up every other service in AppScale, this log will record exceptions if Cassandra couldn't be started, if the autoscaling algorithm ran into problems, and so on. This is the first place you want to look if you're having problems with AppScale. You'll find one of these on each machine in an AppScale deployment, since this service runs on all machines.
- app___app_id-*.log: These logs correspond to Google App Engine apps that AppScale is hosting. You'll want to check these out if you're running into problems with your App Engine apps, like if you want to include special libraries that App Engine doesn't normally support or are debugging your application at high load. You'll find one of these for each App Server process that runs on each machine running the "App Engine" role (see which machines are running this service by running "appscale status").
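Once the logs are on your machine, a quick search across every copy of the AppController log is a reasonable first pass. The sketch below makes assumptions: controller_errors is an invented helper, the words "error" and "exception" are just a starting pattern rather than a documented log format, and the search walks the whole directory because the layout of the copied logs may differ between versions.

```shell
#!/bin/sh
# Print lines mentioning errors or exceptions from every controller-17443.log
# found under DIR, prefixed with the file name and line number.
# usage: controller_errors DIR
# NOTE: controller_errors is a hypothetical helper, not an AppScale command.
controller_errors() {
    # /dev/null is passed as an extra file so grep always prints file names,
    # even when only one log is found.
    find "$1" -type f -name 'controller-17443.log' -exec \
        grep -n -i -E 'error|exception' /dev/null {} +
}

# Example:
#   controller_errors ~/Desktop/baz
```

To search the App Engine application logs instead, change the file name pattern to 'app___*.log'.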