
uploading images #24

Closed
ngam opened this issue May 11, 2022 · 9 comments

ngam commented May 11, 2022

I will try to upload some images later this week. We can at least document the process for interested community members if they have access to V100 or A100 GPUs and want some more performance!

Originally posted by @ngam in pangeo-data/pangeo-docker-images#320 (comment)

ngam commented May 11, 2022

@weiji14, let me know if you get a chance to test them.

docker pull ngam00/ngc-pt-pangeo

Note the 00 in the username above; alternatively, once it finishes uploading to the GitHub container registry:

docker pull ghcr.io/ngam/ngc-pt-pangeo

Missing packages from these images are listed in #21. I haven't had a chance to run any benchmarks yet, but I will look into that soon...
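
As a quick sanity check after pulling (a minimal sketch; it assumes the image's default python can import torch, which should hold for an NGC-based image):

docker run --rm --gpus all ghcr.io/ngam/ngc-pt-pangeo \
    python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

This should print the PyTorch version and True if the GPU is visible inside the container.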

@ngam ngam changed the title I will try to upload some images later this week. We can at least document the process for interested community members if they have access to V100 or A100 GPUs and want some more performance! uploading images May 11, 2022

weiji14 commented May 13, 2022

Cool, thanks @ngam, I'll try and give this a spin on my GPU over the weekend. Is there a good benchmark you'd recommend to test this on? Preferably something light that takes <16GB of GPU RAM.
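
(A simple way to keep an eye on GPU memory while trying out a candidate benchmark, using standard nvidia-smi query flags:

nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv -l 1

This just prints the device name and memory use once per second; stop it with Ctrl-C.)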

ngam commented May 16, 2022

Sorry I didn't respond here... I'm not really sure about benchmarks; I usually only run my own models, and usually in TensorFlow.

Let me know if you manage to get something going.

weiji14 commented May 16, 2022

OK, I found an easy-ish benchmark script at https://github.com/cresset-template/cresset/blob/7762a947ff567003befbab3d217364f9fcf98b67/benchmark.py. To run it, do:

git clone https://github.com/cresset-template/cresset.git
cd cresset/

Below are the tests I ran on an NVIDIA RTX A5000 Laptop GPU; the only thing I changed was the Docker image (ghcr.io/ngam/ngc-pt-pangeo vs pangeo/pytorch-notebook:2022.05.10).

NGC-based ghcr.io/ngam/ngc-pt-pangeo

docker run -it --rm \
           --gpus all \
           --volume $PWD:/home/jovyan \
           ghcr.io/ngam/ngc-pt-pangeo \
           python /home/jovyan/benchmark.py

Results:

=============
== PyTorch ==
=============

NVIDIA Release 22.04 (build 36527063)
PyTorch Version 1.12.0a0+bd13bc6

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2022 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

Python Version: 3.8.13
PyTorch Version: 1.12.0a0+bd13bc6
PyTorch CUDA Version: 11.6
PyTorch cuDNN Version: 8400
PyTorch Architecture List: ('sm_52', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'compute_86')
GPU Device Name: NVIDIA RTX A5000 Laptop GPU
GPU Compute Capability: 8.6
NVIDIA Driver Version: 510.60.02
Automatic Mixed Precision Enabled: False.
TorchScript Enabled: False.
                                                                                               
Model: r3d_18.
Input shapes: ((1, 3, 64, 128, 128),).
Average time:  39.796 milliseconds.
Total time:  41 seconds.
                                                                                               
Model: Transformer.
Input shapes: ((1, 512, 512), (1, 512, 512)).
Average time:   5.212 milliseconds.
Total time:   5 seconds.
                                                                                               
Model: resnet50.
Input shapes: ((2, 3, 512, 512),).
Average time:  16.112 milliseconds.
Total time:  16 seconds.
                                                                                               
Model: vgg19.
Input shapes: ((1, 3, 512, 512),).
Average time:  16.649 milliseconds.
Total time:  17 seconds.
                                                                                               
Model: fcn_resnet50.
Input shapes: ((1, 3, 512, 512),).
Average time:  23.188 milliseconds.
Total time:  24 seconds.
                                                                                               
Model: deeplabv3_resnet50.
Input shapes: ((1, 3, 512, 512),).
Average time:  27.268 milliseconds.
Total time:  28 seconds.
                                                                                               
Model: retinanet_resnet50_fpn.
Input shapes: ((1, 3, 512, 512),).
Average time:  41.549 milliseconds.
Total time:  43 seconds.
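
Side note: the startup banner above warns that the default 64 MB SHMEM allocation may be insufficient for PyTorch. A re-run with the flags NVIDIA recommends in that banner would look like the following (same mount as before; the numbers above were collected without these flags):

docker run -it --rm \
           --gpus all --ipc=host \
           --ulimit memlock=-1 --ulimit stack=67108864 \
           --volume $PWD:/home/jovyan \
           ghcr.io/ngam/ngc-pt-pangeo \
           python /home/jovyan/benchmark.py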

Pangeo's image pangeo/pytorch-notebook:2022.05.10

docker run -it --rm \
           --gpus all \
           --volume $PWD:/home/jovyan \
           pangeo/pytorch-notebook:2022.05.10 \
           python /home/jovyan/benchmark.py

Results:

Python Version: 3.9.12
PyTorch Version: 1.11.0
PyTorch CUDA Version: 11.2
PyTorch cuDNN Version: 8201
PyTorch Architecture List: ('sm_35', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'compute_50')
GPU Device Name: NVIDIA RTX A5000 Laptop GPU
GPU Compute Capability: 8.6
NVIDIA Driver Version: 510.60.02
Automatic Mixed Precision Enabled: False.
TorchScript Enabled: False.
                                                                                               
Model: r3d_18.
Input shapes: ((1, 3, 64, 128, 128),).
Average time:  37.156 milliseconds.
Total time:  38 seconds.
                                                                                               
Model: Transformer.
Input shapes: ((1, 512, 512), (1, 512, 512)).
Average time:   5.373 milliseconds.
Total time:   6 seconds.
                                                                                               
Model: resnet50.
Input shapes: ((2, 3, 512, 512),).
Average time:  18.056 milliseconds.
Total time:  18 seconds.
                                                                                               
Model: vgg19.
Input shapes: ((1, 3, 512, 512),).
Average time:  17.039 milliseconds.
Total time:  17 seconds.
                                                                                               
Model: fcn_resnet50.
Input shapes: ((1, 3, 512, 512),).
Average time:  27.493 milliseconds.
Total time:  28 seconds.
                                                                                               
Model: deeplabv3_resnet50.
Input shapes: ((1, 3, 512, 512),).
Average time:  31.933 milliseconds.
Total time:  33 seconds.
                                                                                               
Model: retinanet_resnet50_fpn.
Input shapes: ((1, 3, 512, 512),).
Average time:  43.319 milliseconds.
Total time:  44 seconds.

Differences

See https://www.diffchecker.com/ZTpD1Par. It's not exactly an apples-to-apples comparison, as there are lots of library version mismatches (e.g. CUDA 11.6 vs CUDA 11.2, cuDNN 8400 vs cuDNN 8201, etc.), but in general the differences seem fairly minor.

Other than the r3d_18 model, where pangeo/pytorch-notebook was faster than ghcr.io/ngam/ngc-pt-pangeo by 3 seconds, it seems like ghcr.io/ngam/ngc-pt-pangeo is faster for the other models (generally the deeper/more complicated ones). The biggest difference was for deeplabv3_resnet50, where ghcr.io/ngam/ngc-pt-pangeo took 28 seconds while pangeo/pytorch-notebook took 33 seconds, a difference of 5 seconds.

I'd be tempted to update the Pangeo notebook with newer CUDA/cuDNN/PyTorch versions to make the comparison fair before saying confidently that the NGC containers win out, but the NGC-based one is definitely in the lead right now 😃
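
For a quick side-by-side, the average times reported above (in milliseconds, lower is better):

Model                     ngc-pt-pangeo    pytorch-notebook
r3d_18                           39.796              37.156
Transformer                       5.212               5.373
resnet50                         16.112              18.056
vgg19                            16.649              17.039
fcn_resnet50                     23.188              27.493
deeplabv3_resnet50               27.268              31.933
retinanet_resnet50_fpn           41.549              43.319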

ngam commented May 16, 2022

Yes, but I'm glad it's only minor! I think what we can do is try harder to push the conda-forge feedstocks to copy the NGC builds... I'm already doing that with TensorFlow.

weiji14 commented May 16, 2022

Yeah, but like you said, those tiny differences might add up. Say someone was training a neural network for 1 hour: 10 seconds saved per minute would mean 10 × 60 = 600 seconds, or 10 minutes less per hour. If you expand that to 1 day/24 hours, that's 240 minutes, or 4 hours saved!

If you can pin the ngc-pt-pangeo Docker image to pytorch=1.11.0 (down from 1.12.0a0+bd13bc6), I can try to work on updating the CUDA version in the pytorch-notebook image to CUDA 11.6; then we can maybe get a fairer benchmark comparison.

ngam commented May 16, 2022

1.12.0a0+bd13bc6

this weird pin is from NGC... https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_22-04.html#rel_22-04

weiji14 commented May 16, 2022

1.12.0a0+bd13bc6

this weird pin is from NGC... https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_22-04.html#rel_22-04

Interesting, so they are pinning specific PyTorch commits?! I'm usually OK with bleeding-edge software, but I'm not sure this is OK for general Pangeo users 😅

ngam commented May 17, 2022

Yeah, but like you said, those tiny differences might add up. Say someone was training a neural network for 1 hour: 10 seconds saved per minute would mean 10 × 60 = 600 seconds, or 10 minutes less per hour. If you expand that to 1 day/24 hours, that's 240 minutes, or 4 hours saved!

You're absolutely right on this, btw. Also, take into account an additional point: toy models are double-edged swords; they're somewhat optimized and relatively light. I suspect that for an actual researcher who ends up paying close attention to performance, the time saved will be a bit more. So I don't want to discount this premise; it is very important --- this is what drove me to do this to begin with :)
