This repository has been archived by the owner on Jan 7, 2025. It is now read-only.

Issue about multiple GPUs #1064

Closed
korabelnikov opened this issue Sep 14, 2016 · 8 comments

Comments

@korabelnikov

Training doesn't work with multiple GPUs.

  1. Create an MNIST dataset and a LeNet model.
  2. Select several GPUs.
  3. The graphs stay blank instead of showing figures.

It works fine with a single GPU. Tested with both the Caffe and Torch engines.

image

@lukeyeager
Member

So it just hangs? Do you get any error messages in the Caffe/Torch logs?

Which GPUs do you have? You can use digits/device_query.py.
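
If digits/device_query.py is not at hand, the driver's own tool can answer the same question. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_csv` and `list_gpus` are made up for illustration and are not part of DIGITS.

```python
import subprocess


def parse_gpu_csv(text):
    """Parse 'index, name, memory.total' CSV rows into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Assumes the GPU name itself contains no comma.
        index, name, memory = [field.strip() for field in line.split(",", 2)]
        gpus.append({"index": int(index), "name": name, "memory": memory})
    return gpus


def list_gpus():
    """Ask nvidia-smi for one CSV row per GPU and parse the result."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,name,memory.total",
         "--format=csv,noheader"],
        text=True,
    )
    return parse_gpu_csv(out)
```

Calling `list_gpus()` on the training host returns one entry per visible GPU, which is enough to confirm how many devices the container actually sees.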

@korabelnikov
Author

korabelnikov commented Sep 14, 2016

@lukeyeager Yes. I left it running for a few hours and got this:
image
I have 4 K80 GPUs.

torch log:

tput: No value for $TERM and no -T specified
2016-09-14 15:22:50 [INFO ] Loading mean tensor from /usr/share/digits/digits/jobs/20160914-152248-263f/mean.jpg file
2016-09-14 15:22:50 [INFO ] Loading label definitions from /usr/share/digits/digits/jobs/20160914-150454-2a0a/labels.txt file
2016-09-14 15:22:50 [INFO ] found 10 categories
2016-09-14 15:22:50 [INFO ] creating data readers
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/train_db
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/train_db
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/train_db
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/train_db
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] found 45002 images in train db/usr/share/digits/digits/jobs/20160914-150454-2a0a/train_db
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/val_db
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/val_db
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/val_db
2016-09-14 15:22:50 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160914-150454-2a0a/val_db
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] Image channels are 1, Image width is 28 and Image height is 28
2016-09-14 15:22:50 [INFO ] found 14998 images in train db/usr/share/digits/digits/jobs/20160914-150454-2a0a/val_db
2016-09-14 15:22:51 [INFO ] Loading network definition from /usr/share/digits/digits/jobs/20160914-152248-263f/model
Using CuDNN backend
2016-09-14 15:22:51 [INFO ] Train batch size is 64 and validation batch size is 32
2016-09-14 15:22:51 [INFO ] Network definition:
DataParallelTable: 2 x nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> output]
(1): nn.MulConstant
(2): cudnn.SpatialConvolution(1 -> 20, 5x5)
(3): cudnn.SpatialMaxPooling(2x2, 2,2)
(4): cudnn.SpatialConvolution(20 -> 50, 5x5)
(5): cudnn.SpatialMaxPooling(2x2, 2,2)
(6): nn.View(-1)
(7): nn.Linear(800 -> 500)
(8): cudnn.ReLU
(9): nn.Linear(500 -> 10)
(10): nn.LogSoftMax
}
2016-09-14 15:22:51 [INFO ] Network definition ends
2016-09-14 15:22:51 [INFO ] switching to CUDA
2016-09-14 15:22:52 [INFO ] initializing the parameters for learning rate policy: step
2016-09-14 15:22:52 [INFO ] initializing the parameters for Optimizer
2016-09-14 15:22:52 [INFO ] During training. details will be logged after every 5000 images
2016-09-14 15:22:52 [INFO ] Training epochs to be completed for each validation : 1
2016-09-14 15:22:52 [INFO ] Training epochs to be completed before taking a snapshot : 1
2016-09-14 15:22:52 [INFO ] While logging, epoch value will be rounded to 3 significant digits
2016-09-14 15:22:52 [INFO ] started training the model
2016-09-14 15:23:39 [INFO ] Validation (epoch 0): loss = -14.095491760067, accuracy = 0.11988265102014
2016-09-14 15:23:39 [INFO ] Training (epoch 0.001): loss = 1.1556785106659, lr = 0.01
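
A log like the one above can be checked for a stall by looking at the timestamp of its last line and the loss values reported so far. This is a rough sketch that assumes the line format shown in this log; `summarize_log` is a hypothetical helper, not something DIGITS provides.

```python
import re
from datetime import datetime

# "2016-09-14 15:23:39 [INFO ] message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[\w+\s*\] (.*)$")
# "Training (epoch 0.001): loss = 1.15..." or "Validation (epoch 0): loss = -14.09..."
LOSS_RE = re.compile(
    r"^(Training|Validation) \(epoch ([\d.]+)\): "
    r"loss = (-?[\d.]+(?:e[-+]?\d+)?)"
)


def summarize_log(text):
    """Return (timestamp of last log line, list of (phase, epoch, loss))."""
    last_time = None
    losses = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skips untimestamped lines such as "tput: ..."
        last_time = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        loss = LOSS_RE.match(match.group(2))
        if loss:
            losses.append(
                (loss.group(1), float(loss.group(2)), float(loss.group(3)))
            )
    return last_time, losses
```

Comparing the returned timestamp with the current time shows how long the job has been silent; here the last entry is from 15:23:39, shortly after training started.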

@lukeyeager
Member

Looks like you're using the deb package to install, right? Can you send me the output of this command:

$ dpkg -l | grep 'cudart\|libcudnn\|libnccl\|caffe\|torch\|digits'

@korabelnikov
Author

korabelnikov commented Sep 15, 2016

@lukeyeager I'm using the nvidia-docker DIGITS image.

root@15966d15b7e8:/usr/share/digits# dpkg -l | grep 'cudart\|libcudnn\|libnccl\|caffe\|torch\|digits'
ii  caffe-nv                           0.15.9-1+cuda7.5                        amd64        Fast open framework for Deep Learning
ii  caffe-nv-tools                     0.15.9-1+cuda7.5                        amd64        Fast open framework for Deep Learning (Tools)
ii  cuda-cudart-7-5                    7.5-18                                  amd64        CUDA Runtime native Libraries
ii  digits                             4.0.0-1                                 amd64        NVIDIA DIGITS webserver
ii  libcaffe-nv0                       0.15.9-1+cuda7.5                        amd64        Fast open framework for Deep Learning (Libs)
ii  libcudnn5                          5.1.3-1+cuda7.5                         amd64        cuDNN runtime libraries
ii  libnccl1                           1.2.3-1+cuda7.5                         amd64        NVIDIA Collectives Communication Library (NCCL) Runtime
ii  python-caffe-nv                    0.15.9-1+cuda7.5                        amd64        Fast open framework for Deep Learning (Python)
ii  torch7-nv                          0.9.99-1+cuda7.5                        amd64        NVidia Torch Bundle (with CUDA). Made for DIGITS.
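
One quick sanity check on a listing like this is whether every package that carries a `+cuda` suffix in its version names the same CUDA release. The sketch below assumes the `dpkg -l` column layout shown above; `cuda_suffixes` is an illustrative helper name only.

```python
def cuda_suffixes(dpkg_output):
    """Map each installed package to the CUDA suffix in its version, or None."""
    result = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Installed packages start with the status flag "ii".
        if len(fields) < 3 or fields[0] != "ii":
            continue
        name, version = fields[1], fields[2]
        if "+cuda" in version:
            result[name] = version.split("+cuda", 1)[1]
        else:
            result[name] = None
    return result


def distinct_cuda_versions(dpkg_output):
    """Return the set of CUDA versions named by the listed packages."""
    return {v for v in cuda_suffixes(dpkg_output).values() if v is not None}
```

If `distinct_cuda_versions` returns more than one entry, the packages were built against different CUDA releases, which would be worth ruling out before looking at hardware.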

@lukeyeager
Member

Software looks fine. I bet it's a GPU and/or system problem.

Do you have a particularly fancy motherboard? See NVIDIA/caffe#10 - that might be related.

@korabelnikov
Author

@lukeyeager Thanks, I will try that.

@korabelnikov
Author

@mpkh, please take a look.

@korabelnikov
Author

NVIDIA/caffe#10 solved the issue.
