An example build on AWS using latest generation GPU instances #1092

Closed
ghost opened this issue Sep 16, 2014 · 21 comments

Comments

@ghost

ghost commented Sep 16, 2014

I have been trying unsuccessfully to install Caffe on the latest GPU instances. Is it possible to provide a public AMI that has Caffe pre-installed?

@shelhamer
Member

Make sure you pick a GPU instance that has compute capability >= 3.0.
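
As a rough illustration of that threshold, here is a minimal shell sketch that compares a compute capability string against a minimum. The function name cc_at_least is made up for this example, and the two values in the usage loop (3.0 for the GRID K520 in g2 instances, 2.0 for the Tesla M2050 in the older cg1 instances) come from NVIDIA's published specifications rather than from this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed if compute capability $1 is at least $2.
# Both arguments must be in "major.minor" form, e.g. "3.0".
cc_at_least() {
    have_major=${1%%.*}
    have_minor=${1#*.}
    need_major=${2%%.*}
    need_minor=${2#*.}
    if [ "$have_major" -ne "$need_major" ]; then
        [ "$have_major" -gt "$need_major" ]
    else
        [ "$have_minor" -ge "$need_minor" ]
    fi
}

# Usage: GRID K520 (g2.2xlarge) versus Tesla M2050 (cg1.4xlarge).
for cc in 3.0 2.0; do
    if cc_at_least "$cc" 3.0; then
        echo "$cc: meets the 3.0 minimum"
    else
        echo "$cc: below the 3.0 minimum"
    fi
done
```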

The wiki has a reference to an AMI but I'm not sure that it's up-to-date: https://github.com/BVLC/caffe/wiki/Setting-up-Caffe-on-Ubuntu-14.04

@cdoersch could you comment on any instance details from your recent installation?

@cdoersch
Contributor

I've only gotten it to run on g2.2xlarge instances, which have the newest GPUs on EC2. I was using the StarCluster HVM AMI, which is Ubuntu 12.04. Confusingly, it comes with its own version of CUDA and an NVIDIA driver that's too old to run Caffe. I find that from stock Ubuntu, this does the trick:

wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1204/x86_64/cuda-repo-ubuntu1204_6.5-14_amd64.deb
dpkg -i cuda-repo-ubuntu1204_6.5-14_amd64.deb
apt-get update
apt-get install -y cuda

Otherwise, I just followed the directions from the caffe website.

@kloudkl
Contributor

kloudkl commented Sep 26, 2014

There are a few docker files for Caffe.

@mmoghimi

I followed the instructions to set up Caffe on AWS but still have issues related to CUDA.

@cdoersch do you have an AMI that you can share?

@cdoersch
Contributor

I unfortunately don't have one at the moment; there are some additional customizations on the machine I'm using (not to mention that I'm currently running out of AWS funds). If you have a more specific issue, post it and I may be able to help.

@mmoghimi

@cdoersch here is the error message.
http://pastebin.com/hfeEMVty

@cdoersch
Contributor

Looks like your GPU isn't recognized. What's the result of nvidia-smi -a?

@mmoghimi

modprobe: ERROR: could not insert 'nvidia_340': Unknown symbol in module, or unknown parameter (see dmesg)
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
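
The modprobe message points at dmesg for the underlying cause. Below is a minimal sketch of how one might first check whether any nvidia kernel module is loaded at all. The helper name nvidia_module_loaded is hypothetical; it reads a module list in the /proc/modules format from standard input, so it can be tried on any machine.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the module list on stdin (in the format of
# /proc/modules, one module per line, name first) contains an nvidia module.
nvidia_module_loaded() {
    grep -q '^nvidia'
}

# Usage on a real instance (guarded so the script also runs where
# /proc/modules is not available).
if [ -r /proc/modules ]; then
    if nvidia_module_loaded < /proc/modules; then
        echo "an nvidia kernel module is loaded"
    else
        echo "no nvidia kernel module is loaded; check: dmesg | grep -i nvidia"
    fi
fi
```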

@cdoersch
Contributor

Did you get any errors when you ran apt-get install -y cuda? On my system, that installs the 340 driver.

@mmoghimi

I just launched a new instance and did everything from scratch.
apt-get install cuda installs the 340 driver, but I get an error when I try to run make runtest.
I then uninstalled the packages with
sudo apt-get remove --purge nvidia-340 nvidia-modprobe nvidia
and reinstalled the driver from the .run file NVIDIA-Linux-x86_64-340.46.run.

Still doesn't work.

E1016 03:36:38.640488 13775 common.cpp:98] Cannot create Curand generator. Curand won't be available.
F1016 03:36:38.640584 13775 benchmark.cpp:87] Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device is detected

Any thoughts?

@cdoersch
Contributor

That apt-get remove would have gotten rid of cuda too, wouldn't it?

For me, cuda installs into /usr/local/cuda-6.5 when done via apt-get install cuda. Make sure it's there.

Which version of ubuntu are you using? If it's a more recent version, you may have an nvidia-340 in the ubuntu repositories, which may cause conflicts with what you would get from nvidia's repositories.

Also, this isn't a caffe problem, it's a problem with your install of cuda. I wouldn't bother running caffe again until you can get nvidia-smi -a to output your graphics card model. That program will hopefully give more helpful error messages.
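
The order of checks suggested in this comment (driver first, then the toolkit directory, and only then Caffe) can be sketched as a small script. The function name cuda_toolkit_present is made up for illustration, and judging an install by the presence of bin/nvcc is an assumption of this sketch; the /usr/local/cuda-6.5 path is the one mentioned above and may differ on other setups.

```shell
#!/bin/sh
# Hypothetical helper: succeed if directory $1 looks like a CUDA toolkit
# install, judged only by the presence of an executable bin/nvcc.
cuda_toolkit_present() {
    [ -x "$1/bin/nvcc" ]
}

# 1. Driver: nvidia-smi should report the card before anything else is tried.
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -a >/dev/null 2>&1; then
    echo "driver: ok"
else
    echo "driver: nvidia-smi missing or failing; fix this first"
fi

# 2. Toolkit: path taken from the comment above.
if cuda_toolkit_present /usr/local/cuda-6.5; then
    echo "toolkit: found in /usr/local/cuda-6.5"
else
    echo "toolkit: not found in /usr/local/cuda-6.5"
fi
```

On newer Ubuntu releases, apt-cache policy nvidia-340 shows which repository a package would be installed from, which may help spot the kind of conflict described above.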

@kmatzen
Contributor

kmatzen commented Oct 16, 2014

@AKSHAYUBHAT, I have an AMI that works with caffe on g2.2xlarge. We can chat offline if you want access, but I'm pretty sure it's just the HVM Ubuntu 14.04 AMI with cuda 6.5 and docker installed. Then I use my kmatzen/caffe or kmatzen/caffe-debug docker image. Both are available on hub.docker.com.

sudo docker run -t -i --privileged -v /mnt/datastore:/datastore kmatzen/caffe /bin/bash

https://registry.hub.docker.com/u/kmatzen/caffe/dockerfile/
https://registry.hub.docker.com/u/kmatzen/caffe-debug/dockerfile/

I also have a docker image called kmatzen/caffe-base that includes just the dependencies. The Dockerfile can be found in my repo:
https://github.com/kmatzen/caffe/blob/mesos/docker/base/Dockerfile

One thing you might want to change is that this Dockerfile references my mesos-base docker image. You could just change it to reference the ubuntu:14.04 docker image.
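
A minimal sketch of that base-image swap, assuming the Dockerfile has a single FROM line. The helper name use_ubuntu_base is hypothetical; it reads a Dockerfile on standard input and writes the modified version to standard output, leaving the original file untouched.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the FROM line of a Dockerfile read from stdin
# so that it references the stock ubuntu:14.04 image.
use_ubuntu_base() {
    sed 's|^FROM .*|FROM ubuntu:14.04|'
}

# Usage with a local copy of the Dockerfile:
#   use_ubuntu_base < Dockerfile > Dockerfile.ubuntu
```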

@shelhamer
Member

@kmatzen your explanation and docker images could help a lot of new users -- if you have a chance, please add a section (perhaps under installation) to the wiki https://github.com/BVLC/caffe/wiki.

@achalddave
Contributor

I've made an AMI with the latest version of Caffe for g2.2xlarge instances. I ran into some issues setting up CUDA when starting from the AMI here: https://github.com/BVLC/caffe/wiki/Ubuntu-14.04-ec2-instance, so I thought this might be useful: ami-03f2e746 in N. California.

Starting from the image in that wiki article, I had to do the following (I'm relatively confident I've not missed any steps, but if I did I apologize - feel free to ping me and I can try to help):

  • Get the latest official caffe repo. The one in that image is an unmaintained fork as far as I can tell.
  • sudo apt-get install libgflags-dev liblmdb-dev (unrelated to gpu - this was not installed in the image as I think it's a more recent dependency)
  • Add the following to /etc/modprobe.d/blacklist-nouveau.conf
blacklist nvidiafb
blacklist nouveau
blacklist rivafb
blacklist rivatv
blacklist vga16fb
options nouveau modeset=0
  • Run sudo update-initramfs -u, then sudo reboot
  • sudo apt-get install linux-image-extra-virtual
  • Remove gcc-4.6, install gcc-4.8 if necessary, make sure gcc-4.8 is available.
sudo apt-get remove gcc-4.6
sudo apt-get install gcc-4.8
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.8 20
sudo update-alternatives --install /usr/bin/cc cc /usr/bin/gcc-4.8 20
  • (unsure). I ran sudo apt-get install linux-headers-$(uname -r), but I forget if this was necessary.
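
The blacklist step in the list above can be scripted. This is a minimal sketch, assuming the caller has write access to the target path (root for /etc/modprobe.d). The function name write_nouveau_blacklist is made up for illustration, and the file contents are copied from the list above.

```shell
#!/bin/sh
# Hypothetical helper: write the nouveau blacklist entries to file $1,
# replacing any existing content of that file.
write_nouveau_blacklist() {
    cat > "$1" <<'EOF'
blacklist nvidiafb
blacklist nouveau
blacklist rivafb
blacklist rivatv
blacklist vga16fb
options nouveau modeset=0
EOF
}

# Usage (as root), followed by the initramfs rebuild and reboot from the list:
#   write_nouveau_blacklist /etc/modprobe.d/blacklist-nouveau.conf
#   update-initramfs -u && reboot
```

After the reboot, lsmod | grep nouveau should print nothing if the blacklist took effect.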

@shelhamer
Member

Thanks for recording the changes to bring the ec2 instance up-to-date!
Could you update the wiki page to reflect your steps?
https://github.com/BVLC/caffe/wiki/Ubuntu-14.04-ec2-instance


@achalddave
Contributor

Sure; the reason I didn't is that you don't need most of these except on a GPU instance, so I wasn't sure whether I should modify that page to have a GPU section, confuse things by including a separate AMI, or create a new page. I'll try to do one of them soon.

Edit (December): Haven't had a chance yet, sorry. If anyone else is up for it, feel free to do so.

@shuggiefisher

Awesome, thanks for sharing the AMI @achalddave. I found that performance is much better with cuDNN. To recompile Caffe with cuDNN I had to downgrade to g++-4.6 and upgrade to CUDA 6.5.

./examples/mnist/train_lenet.sh on g2.2xlarge

GPU CUDA 6.0 = 239 secs
CPU = 1075 secs
GPU CUDA 6.5 w/CuDNN g++-4.6 = 47 secs
CPU g++-4.6 = 1052 secs
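
For reference, the speedups implied by these timings can be computed with a small awk-based helper. The function name speedup is hypothetical; the numbers are the ones reported above.

```shell
#!/bin/sh
# Hypothetical helper: print how many times faster $2 seconds is than
# $1 seconds, rounded to one decimal place.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

echo "CUDA 6.0 GPU vs CPU:         $(speedup 1075 239)x"
echo "CUDA 6.5 + cuDNN GPU vs CPU: $(speedup 1052 47)x"
echo "cuDNN vs plain CUDA 6.0 GPU: $(speedup 239 47)x"
```

That works out to roughly 4.5x, 22.4x and 5.1x respectively, for this one MNIST example on a single instance type.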

@tleyden

tleyden commented Oct 27, 2014

> I have been unsuccessfully trying to get Caffe to install on latest GPU instances, is it possible to provide a public AMI that has caffe pre-installed?

@AKSHAYUBHAT I wrote up instructions on how I got it working, including a public AMI with the nvidia kernel module + cuda 6.5 drivers that can be used as an easy starting point for the host OS.

See Running Caffe on AWS GPU Instance via Docker

@dylanvaughn

Thanks for all the great examples! I used this conversation heavily to create a chef cookbook that installs CUDA, cuDNN, and Caffe (with Python bindings) on an AWS g2.2xlarge running Ubuntu 14.04:

https://github.com/robomakery/caffe-cookbook

I am building AMIs with Caffe pre-installed using Packer and this cookbook.

@ghost
Author

ghost commented Jan 16, 2015

Thank you everyone.

This issue was closed.