An example build on AWS using latest generation GPU instances #1092
Comments
Make sure you pick a GPU instance that has compute capability >= 3.0. The wiki has a reference to an AMI but I'm not sure that it's up-to-date: https://github.com/BVLC/caffe/wiki/Setting-up-Caffe-on-Ubuntu-14.04 @cdoersch could you comment on any instance details from your recent installation? |
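The ">= 3.0" requirement above is a numeric comparison on a major.minor pair, which is easy to get wrong if the values are compared as strings. A minimal sketch follows, assuming you have already looked up your card's compute capability; the function names `parse_capability` and `meets_minimum` are hypothetical helpers, not part of Caffe or CUDA.

```python
def parse_capability(text):
    """Turn a 'major.minor' string such as '3.0' into an (int, int) tuple."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def meets_minimum(capability, minimum="3.0"):
    """Return True when capability >= minimum, compared numerically."""
    return parse_capability(capability) >= parse_capability(minimum)


# The capability values below are examples only; look up the real value
# for your card in NVIDIA's published compute capability table.
for value in ("2.1", "3.0", "10.0"):
    print(value, meets_minimum(value))
```

Comparing tuples of integers avoids the string pitfall where "10.0" would sort before "3.0".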
I've only gotten it to run on g2.2xlarge instances, which have the newest GPUs on EC2. I was using the StarCluster HVM AMI, which is Ubuntu 12.04. Confusingly, it comes with its own CUDA and NVIDIA driver, which are too old to run Caffe. I find that from stock Ubuntu, this does the trick:
Otherwise, I just followed the directions from the caffe website. |
There are a few docker files for Caffe. |
I followed the instructions to set up Caffe on AWS but still have issues related to CUDA. @cdoersch do you have an AMI that you can share? |
I unfortunately don't have one at the moment; there are some additional customizations on the machine I'm using (not to mention that I'm currently running out of AWS funds). If you have a more specific issue, post it and I may be able to help. |
@cdoersch here is the error message. |
Looks like your GPU isn't recognized. What's the result of nvidia-smi -a? |
modprobe: ERROR: could not insert 'nvidia_340': Unknown symbol in module, or unknown parameter (see dmesg) |
Did you get any errors when you ran apt-get install -y cuda? On my system, that installs the 340 driver. |
I just launched a new instance and did everything from scratch. Still doesn't work. E1016 03:36:38.640488 13775 common.cpp:98] Cannot create Curand generator. Curand won't be available. Any thoughts? |
That apt-get remove would have gotten rid of cuda too, wouldn't it? For me, cuda installs into /usr/local/cuda-6.5 when done via apt-get install cuda. Make sure it's there. Which version of ubuntu are you using? If it's a more recent version, you may have an nvidia-340 in the ubuntu repositories, which may cause conflicts with what you would get from nvidia's repositories. Also, this isn't a caffe problem, it's a problem with your install of cuda. I wouldn't bother running caffe again until you can get nvidia-smi -a to output your graphics card model. That program will hopefully give more helpful error messages. |
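The two checks suggested above (is the CUDA directory there, and does `nvidia-smi -a` succeed) can be scripted so they are run before any Caffe command. This is only a sketch under stated assumptions: the default path is the `/usr/local/cuda-6.5` location quoted in the comment, and `cuda_dir_present`, `run_tool`, and `diagnose` are hypothetical helper names.

```python
import os
import subprocess


def cuda_dir_present(root="/usr/local/cuda-6.5"):
    """Check that the CUDA install directory mentioned above exists."""
    return os.path.isdir(root)


def run_tool(argv):
    """Run a command and return (succeeded, combined output text)."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError as exc:  # for example, the binary is not installed
        return False, str(exc)
    return result.returncode == 0, result.stdout


def diagnose():
    """Print both checks in the order suggested in the comment."""
    print("CUDA directory present:", cuda_dir_present())
    ok, output = run_tool(["nvidia-smi", "-a"])
    print("nvidia-smi succeeded:", ok)
    if not ok:
        # Show the tail of the output, where the error message usually is.
        print(output[-500:])
```

Calling `diagnose()` on the instance prints both results; if `nvidia-smi` fails, fixing the driver install comes before retrying Caffe.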
@AKSHAYUBHAT, I have an AMI that works with caffe on g2.2xlarge. We can chat offline if you want access, but I'm pretty sure it's just the HVM Ubuntu 14.04 AMI with cuda 6.5 and docker installed. Then I use my kmatzen/caffe or kmatzen/caffe-debug docker image. Both are available on hub.docker.com.
https://registry.hub.docker.com/u/kmatzen/caffe/dockerfile/ I also have a docker image called kmatzen/caffe-base that includes just the dependencies. The Dockerfile can be found in my repo: One thing you might want to change is that this Dockerfile references my mesos-base docker image. You could just change it to reference the ubuntu:14.04 docker image. |
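The base-image change described above amounts to rewriting the Dockerfile's `FROM` line. A small sketch of that edit, assuming the Dockerfile has a single `FROM` line; `swap_base_image` is a hypothetical helper and `example/mesos-base` is a placeholder, not the real image reference.

```python
def swap_base_image(dockerfile_text, new_base="ubuntu:14.04"):
    """Replace the image named on the first FROM line with new_base."""
    lines = dockerfile_text.splitlines()
    for i, line in enumerate(lines):
        parts = line.split()
        if parts and parts[0].upper() == "FROM":
            lines[i] = "FROM " + new_base
            break
    return "\n".join(lines)


# "example/mesos-base" is a placeholder name used only for illustration.
sample = "FROM example/mesos-base\nRUN apt-get update"
print(swap_base_image(sample))
```

Anything the original base image provided would then need to be installed explicitly on top of ubuntu:14.04.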
@kmatzen your explanation and docker images could help a lot of new users -- if you have a chance, please add a section (perhaps under installation) to the wiki https://github.com/BVLC/caffe/wiki. |
I've made an AMI with the latest version of Caffe on the g2.2xlarge instances. I ran into some issues setting up CUDA by starting with the AMI here: https://github.com/BVLC/caffe/wiki/Ubuntu-14.04-ec2-instance, so I thought this might be useful: Starting from the image in that wiki article, I had to do the following (I'm relatively confident I've not missed any steps, but if I did I apologize - feel free to ping me and I can try to help):
|
Thanks for recording the changes to bring the ec2 instance up-to-date! On Mon, Oct 20, 2014 at 2:36 PM, Achal Dave notifications@github.com
|
Sure; the reason I didn't is that you don't need most of these except on a GPU instance, so I wasn't sure whether I should modify that page to have a GPU section, confuse things by including a separate AMI, or create a new page. I'll try to do one of them soon. Edit (December): Haven't had a chance yet, sorry; if anyone else is up for it, feel free to do so. |
Awesome, thanks for sharing the AMI @achalddave. I found that performance is much better with cuDNN. To recompile Caffe with cuDNN I had to downgrade to g++-4.6 and upgrade to CUDA 6.5. Timing for ./examples/mnist/train_lenet.sh on a g2.2xlarge GPU: CUDA 6.0 = 239 secs |
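To reproduce a timing like the one above on your own instance, a wall-clock wrapper around the training script is enough. This is a rough sketch, not the method used for the number quoted; `time_command` is a hypothetical helper, and the command actually run here is a stand-in so the snippet works without a Caffe checkout.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv once and return (exit code, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    returncode = subprocess.call(argv)
    return returncode, time.perf_counter() - start


# Stand-in command so the sketch runs anywhere; on a Caffe checkout you
# would pass ["./examples/mnist/train_lenet.sh"] from the repository root.
code, seconds = time_command([sys.executable, "-c", "pass"])
print("exit code", code, "took %.2f s" % seconds)
```

Running the same script once per build (with and without cuDNN) on the same instance type gives comparable numbers.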
@AKSHAYUBHAT I wrote up instructions on how I got it working, including a public AMI with the nvidia kernel module + cuda 6.5 drivers that can be used as an easy starting point for the host OS. |
Thanks for all the great examples! I used this conversation heavily to create a chef cookbook that installs CUDA, cuDNN, and Caffe (with Python bindings) on an AWS g2.2xlarge running Ubuntu 14.04: https://github.com/robomakery/caffe-cookbook I am building AMIs with Caffe pre-installed using Packer and this cookbook. |
Thank you everyone. |
I have been trying unsuccessfully to install Caffe on the latest GPU instances. Is it possible to provide a public AMI that has Caffe pre-installed?