Unpin CUDA docker image for GPU CI #12373

Merged
merged 117 commits into from
May 6, 2022
Changes from 116 commits
Commits
66cb20e
unpin CUDA docker image for GPU CI
Borda Mar 18, 2022
599a0e5
Apply suggestions from code review
akihironitta Mar 18, 2022
a43fe66
Merge branch 'master' into ci/gpu
akihironitta Mar 25, 2022
4c8c6e3
Merge branch 'master' into ci/gpu
akihironitta Mar 26, 2022
eee00cb
Merge branch 'master' into ci/gpu
akihironitta Mar 27, 2022
d000885
pytest fail fast
akihironitta Mar 27, 2022
6c4e0cb
Show error stacktrace
akihironitta Mar 27, 2022
f07e27f
Install gdb
akihironitta Mar 27, 2022
c536086
sudo install gdb
akihironitta Mar 27, 2022
5e85e1d
Revert "sudo install gdb"
akihironitta Mar 28, 2022
e581ffb
Revert "Install gdb"
akihironitta Mar 28, 2022
fde7a90
Revert "Show error stacktrace"
akihironitta Mar 28, 2022
956257b
Show stacktrace
akihironitta Mar 28, 2022
85f709b
Revert "pytest fail fast"
akihironitta Mar 28, 2022
2bd1ab1
Merge branch 'master' into ci/gpu
akihironitta Mar 29, 2022
23f19d6
Merge branch 'master' into ci/gpu
akihironitta Mar 29, 2022
64e3478
Revert "Show stacktrace"
akihironitta Mar 29, 2022
bca3a3b
Temporarily trim gpu testing
akihironitta Mar 29, 2022
2f71d5d
Install and run gdb
akihironitta Mar 29, 2022
1df7125
Update gdb installation
akihironitta Mar 29, 2022
007ef98
Merge branch 'master' into ci/gpu
akihironitta Apr 9, 2022
d3f0e36
Merge branch 'master' into ci/gpu
akihironitta Apr 10, 2022
16f7971
install sudo
akihironitta Apr 10, 2022
eb9e872
sudo apt install gdb
akihironitta Apr 10, 2022
b8a19ec
Fix bagua version
akihironitta Apr 10, 2022
b4b904d
pytest -q
akihironitta Apr 10, 2022
c22eb9f
Revert tmp changes
akihironitta Apr 10, 2022
3e3736a
Apply suggestions from code review
Borda Apr 10, 2022
a25b444
Merge branch 'master' into ci/gpu
Borda Apr 12, 2022
441be7f
Merge branch 'master' into ci/gpu
akihironitta Apr 14, 2022
eb05b0c
Increase timeout minutes
akihironitta Apr 14, 2022
4f2567b
Skip tests requiring bf16 CUDA support
akihironitta Apr 14, 2022
d31b9e8
Merge branch 'master' into ci/gpu
Borda Apr 18, 2022
486aec8
Merge branch 'ci/gpu' of https://github.com/PyTorchLightning/pytorch-…
Borda Apr 18, 2022
36a1624
Merge branch 'master' into ci/gpu
akihironitta Apr 22, 2022
eb82963
Try pt1.11
akihironitta Apr 22, 2022
11c5348
Increase shared memory size
akihironitta Apr 22, 2022
a3da197
Revert "Enable inference mode for evaluation (#12715)"
akihironitta Apr 22, 2022
866e4be
Don't fail fast
akihironitta Apr 22, 2022
eee7e2e
Use new rank_zero_debug
akihironitta Apr 22, 2022
abd86ed
Fix and move import statement to the top
akihironitta Apr 22, 2022
21b57a9
Revert "Revert "Enable inference mode for evaluation (#12715)""
akihironitta Apr 22, 2022
eb29d0e
Increase recursion limit
akihironitta Apr 22, 2022
9ae85b0
Prune standard tests
akihironitta Apr 22, 2022
fe5fe42
Revert "Increase recursion limit"
akihironitta Apr 22, 2022
dfe387e
check installation
Apr 22, 2022
cd66935
Revert "check installation"
akihironitta Apr 23, 2022
315efad
Fix deepspeed installation
akihironitta Apr 22, 2022
8e07366
Adapt to deepspeed>=0.5.9
Apr 23, 2022
8414874
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 23, 2022
536ea14
Update fairscale version
akihironitta Apr 23, 2022
a85584c
Skip deepspeed==0.6.{0,1}
akihironitta Apr 23, 2022
95ae5ce
Pin deepspeed<0.6.0
akihironitta Apr 23, 2022
6992da7
Merge branch 'master' into ci/gpu
akihironitta Apr 24, 2022
2d47134
Merge branch 'master' into ci/gpu
akihironitta Apr 25, 2022
40055b1
Fix ddp_comm_hook tests
Apr 25, 2022
3c47727
Merge branch 'ci/gpu' of https://github.com/PyTorchLightning/pytorch-…
Apr 25, 2022
186a612
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 25, 2022
0407de0
Fix ddp_comm_hook tests
Apr 25, 2022
12c027a
Merge branch 'ci/gpu' of https://github.com/PyTorchLightning/pytorch-…
Apr 25, 2022
6bb66b4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 25, 2022
f6727d8
Disable spot instance
akihironitta Apr 25, 2022
9e665a1
Fix ddp_comm_hook tests
akihironitta Apr 25, 2022
ba64da9
Fix ddp_comm_hook tests
akihironitta Apr 25, 2022
be57469
Merge branch 'ci/gpu' of https://github.com/PyTorchLightning/pytorch-…
Apr 25, 2022
2cfc206
Refactor ddp_comm_hook tests
Apr 25, 2022
6094de8
Remove old ddp_comm_hook tests
Apr 25, 2022
29f563c
Revert "Remove old ddp_comm_hook tests"
akihironitta Apr 26, 2022
6941f0b
Revert "Refactor ddp_comm_hook tests"
akihironitta Apr 26, 2022
dc62831
Refactor ddp_comm_hook tests
akihironitta Apr 26, 2022
ea22f02
Unpin deepspeed
akihironitta Apr 27, 2022
7bf29aa
Merge branch 'master' into ci/gpu
akihironitta Apr 27, 2022
42bb05d
Avoid pip install==0.6.2
akihironitta Apr 27, 2022
d354815
Update gpu-tests.yml
Borda Apr 27, 2022
05d1c22
Apply suggestions from code review
Borda Apr 27, 2022
7f3a45f
Revert "Avoid pip install==0.6.2"
akihironitta Apr 27, 2022
581b5de
deepspeed>=0.6.3
akihironitta Apr 27, 2022
afb26ce
Merge branch 'ci/gpu' of github.com:PyTorchLightning/pytorch-lightnin…
akihironitta Apr 27, 2022
8507662
Revert timeout
akihironitta Apr 28, 2022
12b15ad
Revert shared memory size
akihironitta Apr 28, 2022
2351c38
Merge branch 'master' into ci/gpu
akihironitta Apr 28, 2022
d898175
Fix materialization
akihironitta Apr 29, 2022
65bfe9c
Revert "Fix materialization"
akihironitta May 2, 2022
978d691
Use torch1.8
akihironitta May 2, 2022
9031cb2
Revert "Prune standard tests"
akihironitta May 2, 2022
454b1f3
Add cuda-py3.7-torch1.8
akihironitta May 3, 2022
720417a
Temporarily enable pushing docker image on the PR
akihironitta May 3, 2022
a8080f5
Downgrade ubuntu20.04 to 18.04
akihironitta May 3, 2022
74c71cc
Revert "Downgrade ubuntu20.04 to 18.04"
akihironitta May 3, 2022
424503e
Revert "Temporarily enable pushing docker image on the PR"
akihironitta May 3, 2022
7611496
Revert "Add cuda-py3.7-torch1.8"
akihironitta May 3, 2022
79c2c44
Use LTS
akihironitta May 4, 2022
42c3479
Update tests/standalone_tests.sh
akihironitta May 4, 2022
d79b81c
Build cuda10.2-py3.7-torch1.8
akihironitta May 4, 2022
d47f60e
Exclude ampere arch in cuda 10.2 build
akihironitta May 4, 2022
f7eceb6
Revert change in conda dockerfile
akihironitta May 4, 2022
b4ea779
Use cuda 10.2 in GPU CI
akihironitta May 4, 2022
c9ee309
Disable gpu testing
akihironitta May 4, 2022
dd6db0a
Exclude ampere arch in cuda 10.2 build
akihironitta May 4, 2022
87cc404
Merge branch 'master' into ci/gpu
akihironitta May 4, 2022
5a9b27c
Fix env var
akihironitta May 5, 2022
97a53a0
Exclude CUDA 10.2
akihironitta May 5, 2022
7518bd3
Update comment
akihironitta May 5, 2022
10dc191
push to docker hub
akihironitta May 5, 2022
9220e65
Revert "Disable gpu testing"
akihironitta May 5, 2022
b22b763
Revert "push to docker hub"
akihironitta May 5, 2022
38a854e
pytest detailed info
akihironitta May 5, 2022
3fb3b8c
horovodrun --check-build
akihironitta May 5, 2022
537d5ff
Install dependencies for horovod testing
akihironitta May 5, 2022
33f27d3
Revert "Revert "push to docker hub""
akihironitta May 5, 2022
b2747c6
Revert "Revert "Revert "push to docker hub"""
akihironitta May 5, 2022
867d2e4
Update comment
akihironitta May 5, 2022
4066815
Revert "pytest detailed info"
akihironitta May 5, 2022
ea9b01b
Install PyTorch LTS in benchmarks
akihironitta May 5, 2022
e4d3588
pls stop draining my energy/time/patience...
akihironitta May 5, 2022
3c528ba
:crossed_fingers:
akihironitta May 5, 2022
520c115
Add UBUNTU_VERSION
akihironitta May 6, 2022

9 changes: 7 additions & 2 deletions .azure-pipelines/gpu-benchmark.yml
@@ -28,13 +28,18 @@ jobs:
cancelTimeoutInMinutes: "2"
pool: azure-gpus-spot
container:
# TODO: Unpin sha256
image: "pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.8@sha256:b75de74d4c7c820f442f246be8500c93f8b5797b84aa8531847e5fb317ed3dda"
image: "pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.8"
options: "--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all --shm-size=32g"
workspace:
clean: all

steps:
- bash: |
# TODO: Prepare a docker image with 1.8.2 (LTS) installed and remove manual installation.
pip install torch==1.8.2+cu102 torchvision==0.9.2+cu102 torchtext==0.9.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
pip list
displayName: 'Install PyTorch LTS'

- bash: |
python -m pytest tests/benchmarks -v --durations=0
displayName: 'Testing: benchmarks'
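
Note on the LTS pin: `torch==1.8.2+cu102` combines a public release (`1.8.2`) with a local version label (`cu102`) that names the CUDA build. As a minimal sketch, not part of this PR, a CI step could sanity-check that the installed wheel matches the pin. The helper names below are hypothetical, the parsing is a simplified take on PEP 440, and the version strings are hard-coded; a real job would pass `torch.__version__` instead.

```python
from typing import NamedTuple, Optional


class PinnedVersion(NamedTuple):
    release: str          # public release, e.g. "1.8.2"
    local: Optional[str]  # local label, e.g. "cu102"; None without a "+..." suffix


def split_version(version: str) -> PinnedVersion:
    """Split 'release+local' into its two parts (simplified PEP 440 handling)."""
    release, sep, local = version.partition("+")
    return PinnedVersion(release, local if sep else None)


def matches_pin(installed: str, pin: str) -> bool:
    """True when both the release and the local label equal the pin."""
    return split_version(installed) == split_version(pin)


# Hard-coded for illustration; a CI step would use torch.__version__ here.
print(split_version("1.8.2+cu102"))
print(matches_pin("1.8.2+cu102", "1.8.2+cu102"))
print(matches_pin("1.8.1+cu102", "1.8.2+cu102"))
```

Comparing the local label as well as the release would catch a CPU-only or differently built wheel being pulled in by a later `pip install`.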
5 changes: 3 additions & 2 deletions .azure-pipelines/gpu-tests.yml
@@ -29,8 +29,7 @@ jobs:
container:
# base ML image: mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04
# run on torch 1.8 as it's the LTS version
# TODO: Unpin sha256
image: "pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.8@sha256:b75de74d4c7c820f442f246be8500c93f8b5797b84aa8531847e5fb317ed3dda"
image: "pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.8"
# default shm size is 64m. Increase it to avoid:
# 'Error while creating shared memory: unhandled system error, NCCL version 2.7.8'
options: "--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all --shm-size=512m"
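
Note on unpinning: with the `@sha256:...` digest dropped, the job pulls whatever image the `base-cuda-py3.7-torch1.8` tag currently points to rather than one fixed build. The sketch below is illustrative only. The function name is hypothetical, and it handles just the simple `repository:tag[@digest]` shape used in this pipeline, not registry hosts with ports.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    repository: str
    tag: Optional[str]
    digest: Optional[str]


def parse_image_ref(ref: str) -> ImageRef:
    """Split 'repository:tag@digest' into parts; tag and digest are optional."""
    name, _, digest = ref.partition("@")
    repository, _, tag = name.partition(":")
    return ImageRef(repository, tag or None, digest or None)


pinned = parse_image_ref(
    "pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.8"
    "@sha256:b75de74d4c7c820f442f246be8500c93f8b5797b84aa8531847e5fb317ed3dda"
)
unpinned = parse_image_ref("pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.8")

# Same repository and tag; only the pinned reference carries a digest.
print(pinned.tag == unpinned.tag)
print(pinned.digest)
print(unpinned.digest)
```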
@@ -55,6 +54,8 @@ jobs:
CUDA_VERSION_MM=$(python -c "import torch ; print(''.join(map(str, torch.version.cuda.split('.')[:2])))")
pip install "bagua-cuda$CUDA_VERSION_MM>=0.9.0"
pip install . --requirement requirements/devel.txt
# TODO: Prepare a docker image with 1.8.2 (LTS) installed and remove manual installation.
pip install torch==1.8.2+cu102 torchvision==0.9.2+cu102 torchtext==0.9.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
pip list
displayName: 'Install dependencies'
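
Note on `CUDA_VERSION_MM`: the one-liner keeps the major and minor parts of `torch.version.cuda` and joins them without a separator, which gives the suffix used in the `bagua-cuda...` package name. A small sketch of the same transformation, with the CUDA version passed in as a plain string so that it runs without PyTorch installed (the function name is made up for illustration):

```python
def cuda_version_mm(cuda_version: str) -> str:
    """Return major+minor with no separator, e.g. '11.3.1' -> '113'."""
    return "".join(cuda_version.split(".")[:2])


# The CUDA versions that appear in the docker build matrix of this PR.
for version in ("10.2", "11.1", "11.3.1"):
    print(version, "->", f"bagua-cuda{cuda_version_mm(version)}")
```

Because only the first two components are kept, a patch-level version such as `11.3.1` maps to the same package suffix as `11.3`.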

1 change: 1 addition & 0 deletions .github/workflows/ci_dockers.yml
@@ -75,6 +75,7 @@ jobs:
matrix:
include:
# the config used in '.azure-pipelines/gpu-tests.yml'
- {python_version: "3.7", pytorch_version: "1.8", cuda_version: "10.2"}
- {python_version: "3.7", pytorch_version: "1.10", cuda_version: "11.1"}
- {python_version: "3.7", pytorch_version: "1.11", cuda_version: "11.3.1"}
# latest (used in Tutorials)
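
Note on the matrix entry: the Azure pipelines in this PR pull the tag `base-cuda-py3.7-torch1.8`, which suggests that each matrix entry is published under a tag built from its Python and PyTorch versions. The sketch below assumes that naming scheme; it is inferred from the tag, not read from the workflow file, and the function name is hypothetical.

```python
MATRIX = [
    {"python_version": "3.7", "pytorch_version": "1.8", "cuda_version": "10.2"},
    {"python_version": "3.7", "pytorch_version": "1.10", "cuda_version": "11.1"},
    {"python_version": "3.7", "pytorch_version": "1.11", "cuda_version": "11.3.1"},
]


def base_cuda_tag(entry: dict) -> str:
    """Build an image tag from a matrix entry (assumed naming scheme)."""
    return f"base-cuda-py{entry['python_version']}-torch{entry['pytorch_version']}"


for entry in MATRIX:
    print(base_cuda_tag(entry), "built with CUDA", entry["cuda_version"])
```

Under this assumed scheme the CUDA version is not part of the tag, so the matrix entry alone decides which CUDA build a given tag contains.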
1 change: 1 addition & 0 deletions .github/workflows/events-nightly.yml
@@ -115,6 +115,7 @@ jobs:
matrix:
include:
# the config used in '.azure-pipelines/gpu-tests.yml'
- {python_version: "3.7", pytorch_version: "1.8", cuda_version: "10.2"}
- {python_version: "3.7", pytorch_version: "1.10", cuda_version: "11.1"}
- {python_version: "3.7", pytorch_version: "1.11", cuda_version: "11.3.1"}
# latest (used in Tutorials)
11 changes: 10 additions & 1 deletion dockers/base-cuda/Dockerfile
@@ -14,7 +14,8 @@

ARG CUDA_VERSION=11.3.1

FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04
# TODO: Update OS to ubuntu20.04 when dropping CUDA 10.2
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu18.04

ARG PYTHON_VERSION=3.9
ARG PYTORCH_VERSION=1.8
@@ -47,6 +48,8 @@ RUN \
ca-certificates \
software-properties-common \
libopenmpi-dev \
openmpi-bin \
ssh \
&& \

# Install python
@@ -110,10 +113,14 @@ ENV \
HOROVOD_WITH_MPI=1

RUN \
# CUDA 10.2 doesn't support ampere architecture (8.0).
if [[ "$CUDA_VERSION" < "11.0" ]]; then export TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST//";8.0"/}; echo $TORCH_CUDA_ARCH_LIST; fi && \
HOROVOD_BUILD_CUDA_CC_LIST=${TORCH_CUDA_ARCH_LIST//";"/","} && \
export HOROVOD_BUILD_CUDA_CC_LIST=${HOROVOD_BUILD_CUDA_CC_LIST//"."/""} && \
echo $HOROVOD_BUILD_CUDA_CC_LIST && \
cmake --version && \
pip install --no-cache-dir -r ./requirements/strategies.txt && \
horovodrun --check-build && \
rm -rf requirements/
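
Note on the architecture list: the shell substitutions in this `RUN` step drop `;8.0` when the CUDA version sorts below `11.0`, then turn the semicolon-separated list into the comma-separated, dot-free form exported as `HOROVOD_BUILD_CUDA_CC_LIST`. The Python sketch below mirrors that logic so the behaviour can be checked in isolation. The function names are hypothetical, and the starting value of `ARCH_LIST` is an assumed example, since the Dockerfile's actual default is not shown in this diff.

```python
def trim_arch_list(arch_list: str, cuda_version: str) -> str:
    """Drop ';8.0' for CUDA versions that sort below '11.0'.

    Like `[[ "$CUDA_VERSION" < "11.0" ]]`, this is a string comparison,
    not a numeric one.
    """
    if cuda_version < "11.0":
        arch_list = arch_list.replace(";8.0", "")
    return arch_list


def horovod_cc_list(arch_list: str) -> str:
    """Convert '3.7;5.0' into '37,50'."""
    return arch_list.replace(";", ",").replace(".", "")


ARCH_LIST = "3.7;5.0;6.0;7.0;7.5;8.0"  # assumed example value

for cuda in ("10.2", "11.3.1"):
    print(cuda, "->", horovod_cc_list(trim_arch_list(ARCH_LIST, cuda)))

# Caveat of comparing versions as strings: "9.2" sorts after "11.0".
print("9.2" < "11.0")
```

The string comparison gives the intended result for the versions built here (`10.2`, `11.1`, `11.3.1`), but a single-digit major version such as `9.2` would not be treated as older than `11.0`.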

RUN \
@@ -127,6 +134,8 @@ RUN \
fi

RUN \
# CUDA 10.2 doesn't support ampere architecture (8.0).
if [[ "$CUDA_VERSION" < "11.0" ]]; then export TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST//";8.0"/}; echo $TORCH_CUDA_ARCH_LIST; fi && \
# install NVIDIA apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" https://github.com/NVIDIA/apex/archive/refs/heads/master.zip && \
python -c "from apex import amp"
2 changes: 1 addition & 1 deletion tests/models/test_amp.py
@@ -101,7 +101,7 @@ def test_amp_cpus(tmpdir, strategy, precision, devices):

@RunIf(min_gpus=2, min_torch="1.10")
@pytest.mark.parametrize("strategy", [None, "dp", "ddp_spawn"])
@pytest.mark.parametrize("precision", [16, "bf16"])
@pytest.mark.parametrize("precision", [16, pytest.param("bf16", marks=RunIf(bf16_cuda=True))])
@pytest.mark.parametrize("devices", [1, 2])
def test_amp_gpus(tmpdir, strategy, precision, devices):
"""Make sure combinations of AMP and strategies work if supported."""
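
Note on the `precision` parametrization: wrapping a single value in `pytest.param(..., marks=...)` attaches a mark to that one case, so the `bf16` variants can be skipped on unsupported hardware while the `16` variants still run. The sketch below shows the same pattern with plain `pytest.mark.skipif` in place of Lightning's `RunIf`. The `BF16_CUDA_AVAILABLE` flag is a hypothetical stand-in for a real capability check, and the test body is a placeholder.

```python
import pytest

# Hypothetical stand-in for a real probe of bf16 support on the CUDA device.
BF16_CUDA_AVAILABLE = False

requires_bf16_cuda = pytest.mark.skipif(
    not BF16_CUDA_AVAILABLE,
    reason="bf16 is not supported on this CUDA device",
)

# Only the "bf16" case carries the skip mark; 16 always runs.
PRECISIONS = [16, pytest.param("bf16", marks=requires_bf16_cuda)]


@pytest.mark.parametrize("precision", PRECISIONS)
def test_precision_is_known(precision):
    """Placeholder body; the real test trains a model with the given precision."""
    assert precision in (16, "bf16")
```

Marking the parameter rather than the whole test keeps the skip as narrow as possible: every combination that does not need bf16 continues to be exercised.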