[CI] gpu test with pytorch nightly #3543

ydcjeff · 2020-09-18T04:13:08Z

What does this PR do?

Fixes #2090
Creates .drone.jsonnet for multiple testing

DO NOT MERGE THIS UNTIL WE SCALE UP DRONE

Borda

let's write it as jsonnet script because just one parameter changes, right?

ydcjeff · 2020-09-18T07:47:28Z

> let's write it as jsonnet script because just one parameter changes, right?
> * https://discourse.drone.io/t/porting-matrix-builds-to-1-0-multi-machine-pipelines/2966
> * https://github.com/suzuki-shunsuke/drone-jsonnet-generator

Yes, only the Docker image changes.
Okay, I will take a look.
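
To illustrate the idea of a script where only the Docker image varies, here is a minimal jsonnet sketch. It is not the `.drone.jsonnet` from this PR: the helper name `pipeline` is hypothetical and the command list is shortened. It also assumes that Drone accepts a jsonnet file that evaluates to a top-level array of pipeline objects.

```jsonnet
// Sketch only: `pipeline` is an illustrative helper, not the PR's actual code.
// Everything is shared between pipelines except `name` and `image`.
local pipeline(name, image) = {
  kind: 'pipeline',
  type: 'docker',
  name: name,
  platform: { os: 'linux', arch: 'amd64' },
  steps: [
    {
      name: 'testing',
      image: image,
      // Command list shortened for illustration.
      commands: [
        'python --version',
        'nvidia-smi',
        'pip install -r ./requirements/base.txt -q --upgrade-strategy only-if-needed',
        'coverage run --source pytorch_lightning -m py.test pytorch_lightning tests -v --durations=25',
      ],
    },
  ],
  trigger: {
    branch: ['master'],
    event: ['push', 'pull_request'],
  },
};

// Assumption: one array element per generated pipeline document.
[
  pipeline('torch-GPU', 'pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.6'),
  pipeline('torch-GPU-nightly', 'pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.7'),
]
```

Adding another PyTorch version would then be one more `pipeline(...)` call rather than a copied block of YAML.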

Borda · 2020-09-18T12:48:24Z

I think we will need to coordinate then: when you add it, ping me, because we would need to temporarily change the Drone config file. So let's keep the old one and add this jsonnet as a new file...

Borda · 2020-10-01T14:59:18Z

@ydcjeff we can move forward...

ydcjeff · 2020-10-01T15:40:17Z

@Borda btw, are we going to run them one after another or in parallel?

Borda · 2020-10-01T17:24:39Z

Well, would you mind waiting with this one? We are sometimes reaching computational limits on Drone. Right now we have a queue of about 2 hours, and running this one would double it... so let's prepare it but wait with merging it...

ydcjeff · 2020-10-01T17:25:57Z

Yeah, I am fine either way. I have now set up Drone to run them one after another in jsonnet.
Here's the jsonnet-to-YAML output:

---
kind: pipeline
type: docker
name: torch-GPU

platform:
  os: linux
  arch: amd64

steps:
- name: testing
  image: pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.6
  commands:
  - export PATH=$PATH:/root/.local/bin
  - python --version
  - pip install pip -U
  - pip --version
  - nvidia-smi
  - apt-get update && apt-get install -y cmake
  - pip install -r ./requirements/base.txt -q --upgrade-strategy only-if-needed
  - pip install -r ./requirements/devel.txt -q --upgrade-strategy only-if-needed
  - pip install -r ./requirements/examples.txt -q --upgrade-strategy only-if-needed
  - pip list
  - python -c "import torch ; print(' & '.join([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]) if torch.cuda.is_available() else 'only CPU')"
  - coverage run --source pytorch_lightning -m py.test pytorch_lightning tests -v --durations=25
  - python -m py.test benchmarks pl_examples -v --maxfail=2 --durations=0
  - coverage report
  - codecov --token $CODECOV_TOKEN --flags=gpu,pytest --name='GPU-coverage' --env=linux --build $DRONE_BUILD_NUMBER --commit $DRONE_COMMIT
  - python tests/collect_env_details.py
  environment:
    CODECOV_TOKEN:
      from_secret: codecov_token
    HOROVOD_GPU_OPERATIONS: NCCL
    HOROVOD_WITHOUT_MPI: 1
    HOROVOD_WITHOUT_MXNET: 1
    HOROVOD_WITHOUT_TENSORFLOW: 1
    HOROVOD_WITH_GLOO: 1
    HOROVOD_WITH_PYTORCH: 1
    MKL_THREADING_LAYER: GNU
    SLURM_LOCALID: 0

trigger:
  branch:
  - master
  event:
  - push
  - pull_request

---
kind: pipeline
type: docker
name: torch-GPU-nightly

platform:
  os: linux
  arch: amd64

steps:
- name: testing
  image: pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.7
  commands:
  - export PATH=$PATH:/root/.local/bin
  - python --version
  - pip install pip -U
  - pip --version
  - nvidia-smi
  - apt-get update && apt-get install -y cmake
  - pip install -r ./requirements/base.txt -q --upgrade-strategy only-if-needed
  - pip install -r ./requirements/devel.txt -q --upgrade-strategy only-if-needed
  - pip install -r ./requirements/examples.txt -q --upgrade-strategy only-if-needed
  - pip list
  - python -c "import torch ; print(' & '.join([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]) if torch.cuda.is_available() else 'only CPU')"
  - coverage run --source pytorch_lightning -m py.test pytorch_lightning tests -v --durations=25
  - python -m py.test benchmarks pl_examples -v --maxfail=2 --durations=0
  - coverage report
  - codecov --token $CODECOV_TOKEN --flags=gpu,pytest --name='GPU-coverage' --env=linux --build $DRONE_BUILD_NUMBER --commit $DRONE_COMMIT
  - python tests/collect_env_details.py
  environment:
    CODECOV_TOKEN:
      from_secret: codecov_token
    HOROVOD_GPU_OPERATIONS: NCCL
    HOROVOD_WITHOUT_MPI: 1
    HOROVOD_WITHOUT_MXNET: 1
    HOROVOD_WITHOUT_TENSORFLOW: 1
    HOROVOD_WITH_GLOO: 1
    HOROVOD_WITH_PYTORCH: 1
    MKL_THREADING_LAYER: GNU
    SLURM_LOCALID: 0

trigger:
  branch:
  - master
  event:
  - push
  - pull_request

depends_on:
- torch-GPU

...
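
The `depends_on` entry on `torch-GPU-nightly` is what makes the two pipelines run one after another instead of in parallel. As a rough sketch of how that ordering could be generated rather than written by hand, the hypothetical `chain` helper below (not part of this PR) makes each pipeline depend on the one before it. Steps and triggers are left out for brevity, so the objects shown are not complete Drone pipelines.

```jsonnet
// Sketch only: `chain` is a hypothetical helper, not taken from the PR.
// It adds depends_on to every pipeline except the first, pointing at the
// previous pipeline's name, so the pipelines run sequentially.
local chain(pipelines) = [
  pipelines[i] + (
    if i == 0 then {} else { depends_on: [pipelines[i - 1].name] }
  )
  for i in std.range(0, std.length(pipelines) - 1)
];

chain([
  // Steps, platform and trigger omitted for brevity.
  { kind: 'pipeline', type: 'docker', name: 'torch-GPU' },
  { kind: 'pipeline', type: 'docker', name: 'torch-GPU-nightly' },
])
```

Dropping the `chain(...)` call would leave the pipelines without `depends_on`, in which case Drone is free to schedule them in parallel when runners are available.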

ydcjeff · 2020-10-01T17:35:28Z

> Well, would you mind waiting with this one? We are sometimes reaching computational limits on Drone. Right now we have a queue of about 2 hours, and running this one would double it... so let's prepare it but wait with merging it...

@Borda regarding nightly testing, would it be better to start supporting the PT nightly version one month ahead of the stable launch?
We could also keep those nightly PRs as drafts and test them without merging, so we do not need follow-up PRs for fixes. I thought this could save some GPU resources.

So far there have already been 2 PRs for PyTorch nightly, so I started thinking about that...

What do you think?

codecov · 2020-10-01T17:36:56Z

Codecov Report

Merging #3543 (5898c64) into master (6831ba9) will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #3543   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         118     118           
  Lines        9018    9018           
======================================
  Hits         8389    8389           
  Misses        629     629

edenlightning · 2020-10-20T19:04:25Z

Hey @ydcjeff, how's it going? Any way we can help?

Borda · 2020-10-20T19:09:52Z

@edenlightning we can merge it as it is now, but to activate it (switch the source in the Drone config) we need to get scalable Drone testing running, because this basically doubles the number of tests performed and we are already at capacity...

stale · 2020-11-03T23:09:32Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

Borda

update after #3658

.drone.jsonnet

Borda · 2020-11-05T11:01:39Z

This is not tested, but I would still merge it to use as a baseline and debug it later...
For testing, we need to temporarily switch the Drone setting and see if it works...

ydcjeff · 2020-11-05T11:14:57Z

Okay 👌

.drone.jsonnet

ydcjeff · 2020-11-23T12:12:05Z

I messed up my fork's master and re-forked.
Let me know if you want to merge; I can create a new PR.

Borda · 2020-12-04T08:34:34Z

@ydcjeff it seems this PR is no longer valid since you lost the original fork. Mind creating it again and referring to this already approved PR? 🐰

ydcjeff marked this pull request as draft September 18, 2020 04:13

Borda added the feature and ci labels Sep 18, 2020

Borda self-requested a review September 18, 2020 07:39

Borda reviewed Sep 18, 2020

Borda changed the title ~~Add Torch nightly GPU CI~~ [blocked by #3541]Add Torch nightly GPU CI Sep 18, 2020

Borda mentioned this pull request Sep 18, 2020

Enable PyTorch 1.7 in conda CI #3541

Merged

7 tasks

ydcjeff changed the title ~~[blocked by #3541]Add Torch nightly GPU CI~~ [blocked by #3658]Add Torch nightly GPU CI Sep 25, 2020

Borda changed the title ~~[blocked by #3658]Add Torch nightly GPU CI~~ Add Torch nightly GPU CI Oct 1, 2020

Borda approved these changes Oct 1, 2020

ydcjeff added 3 commits October 2, 2020 17:33

pt nightly gpu test

9ccf500

drop --user

add .drone.jsonnet

26df33f

use HOROVOD_GPU_OPERATIONS

3ef48d0

ydcjeff changed the title ~~Add Torch nightly GPU CI~~ Add Torch nightly GPU CI [CI SKIP] Oct 2, 2020

ydcjeff changed the title ~~Add Torch nightly GPU CI [CI SKIP]~~ Add Torch nightly GPU CI Oct 2, 2020

normal .drone.yml [ci skip]

c0bc6b0

edenlightning added this to the 1.1 milestone Oct 4, 2020

stale bot added the won't fix label Nov 3, 2020

ydcjeff removed the won't fix label Nov 4, 2020

ydcjeff requested review from SeanNaren, tchaton, teddykoker and williamFalcon as code owners November 4, 2020 08:31

ydcjeff changed the title ~~Add Torch nightly GPU CI~~ [CI] gpu test with pytorch nightly Nov 4, 2020

exclude jsonnet

141f216

awaelchli approved these changes Nov 5, 2020

Borda reviewed Nov 5, 2020

Apply suggestions from code review

ce095be

Borda reviewed Nov 5, 2020

Apply suggestions from code review

00bbb7f

Borda reviewed Nov 5, 2020

Apply suggestions from code review

6a26b5e

Merge branch 'master' into tests/pt-1.7-gpu

13df3c9

Merge branch 'master' into tests/pt-1.7-gpu

44272e0

ydcjeff commented Nov 7, 2020

Jeff Yang added 5 commits November 7, 2020 13:45

Apply suggestions from code review

a3faee9

Merge branch 'master' into tests/pt-1.7-gpu

2de2035

Merge branch 'master' into tests/pt-1.7-gpu

f72d96c

Merge branch 'master' into tests/pt-1.7-gpu

5648e99

Merge branch 'master' into tests/pt-1.7-gpu

5898c64

Borda added the ready label Nov 24, 2020

tchaton approved these changes Nov 25, 2020

SeanNaren approved these changes Nov 25, 2020

Borda closed this Dec 4, 2020

ydcjeff mentioned this pull request Dec 4, 2020

[ci] pytorch nightly gpu test #4968

Merged

11 tasks
