@ydcjeff ydcjeff commented Sep 18, 2020

What does this PR do?

Fixes #2090
Creates .drone.jsonnet for testing against multiple PyTorch versions

DO NOT MERGE THIS UNTIL WE SCALE UP DRONE

@ydcjeff ydcjeff marked this pull request as draft September 18, 2020 04:13
@Borda Borda added the feature (Is an improvement or enhancement) and ci (Continuous Integration) labels Sep 18, 2020
@Borda Borda self-requested a review September 18, 2020 07:39
Collaborator

@Borda Borda left a comment

@Borda Borda changed the title from "Add Torch nightly GPU CI" to "[blocked by #3541]Add Torch nightly GPU CI" Sep 18, 2020
@ydcjeff
Author

ydcjeff commented Sep 18, 2020

> let's write it as jsonnet script because just one parameter changes, right?
>
> * https://discourse.drone.io/t/porting-matrix-builds-to-1-0-multi-machine-pipelines/2966
>
> * https://github.com/suzuki-shunsuke/drone-jsonnet-generator

Yes, only the Docker image changes.
Okay, I will take a look.

@Borda Borda mentioned this pull request Sep 18, 2020
7 tasks
@Borda
Collaborator

Borda commented Sep 18, 2020

I think we would then need to coordinate: when you add it, ping me, because I would need to temporarily change the Drone config file. So let's keep the old one and add this jsonnet as a new file...

@ydcjeff ydcjeff changed the title from "[blocked by #3541]Add Torch nightly GPU CI" to "[blocked by #3658]Add Torch nightly GPU CI" Sep 25, 2020
@Borda Borda changed the title from "[blocked by #3658]Add Torch nightly GPU CI" to "Add Torch nightly GPU CI" Oct 1, 2020
@Borda
Collaborator

Borda commented Oct 1, 2020

@ydcjeff we can move forward...

@ydcjeff
Author

ydcjeff commented Oct 1, 2020

@Borda btw, are we going to run them one after another or in parallel?

@Borda
Collaborator

Borda commented Oct 1, 2020

Well, I would rather wait with this one, as we sometimes reach computational limits on Drone. Right now we have a queue of about 2 hours, and running this one would double it... so let's prepare it but wait with merging it...

@ydcjeff
Author

ydcjeff commented Oct 1, 2020

Yeah, I am fine either way. I have now set up Drone to run the pipelines one after another in jsonnet.
Here is the jsonnet-to-YAML output.

---
kind: pipeline
type: docker
name: torch-GPU

platform:
  os: linux
  arch: amd64

steps:
- name: testing
  image: pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.6
  commands:
  - export PATH=$PATH:/root/.local/bin
  - python --version
  - pip install pip -U
  - pip --version
  - nvidia-smi
  - apt-get update && apt-get install -y cmake
  - pip install -r ./requirements/base.txt -q --upgrade-strategy only-if-needed
  - pip install -r ./requirements/devel.txt -q --upgrade-strategy only-if-needed
  - pip install -r ./requirements/examples.txt -q --upgrade-strategy only-if-needed
  - pip list
  - python -c "import torch ; print(' & '.join([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]) if torch.cuda.is_available() else 'only CPU')"
  - coverage run --source pytorch_lightning -m py.test pytorch_lightning tests -v --durations=25
  - python -m py.test benchmarks pl_examples -v --maxfail=2 --durations=0
  - coverage report
  - codecov --token $CODECOV_TOKEN --flags=gpu,pytest --name='GPU-coverage' --env=linux --build $DRONE_BUILD_NUMBER --commit $DRONE_COMMIT
  - python tests/collect_env_details.py
  environment:
    CODECOV_TOKEN:
      from_secret: codecov_token
    HOROVOD_GPU_OPERATIONS: NCCL
    HOROVOD_WITHOUT_MPI: 1
    HOROVOD_WITHOUT_MXNET: 1
    HOROVOD_WITHOUT_TENSORFLOW: 1
    HOROVOD_WITH_GLOO: 1
    HOROVOD_WITH_PYTORCH: 1
    MKL_THREADING_LAYER: GNU
    SLURM_LOCALID: 0

trigger:
  branch:
  - master
  event:
  - push
  - pull_request

---
kind: pipeline
type: docker
name: torch-GPU-nightly

platform:
  os: linux
  arch: amd64

steps:
- name: testing
  image: pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.7
  commands:
  - export PATH=$PATH:/root/.local/bin
  - python --version
  - pip install pip -U
  - pip --version
  - nvidia-smi
  - apt-get update && apt-get install -y cmake
  - pip install -r ./requirements/base.txt -q --upgrade-strategy only-if-needed
  - pip install -r ./requirements/devel.txt -q --upgrade-strategy only-if-needed
  - pip install -r ./requirements/examples.txt -q --upgrade-strategy only-if-needed
  - pip list
  - python -c "import torch ; print(' & '.join([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]) if torch.cuda.is_available() else 'only CPU')"
  - coverage run --source pytorch_lightning -m py.test pytorch_lightning tests -v --durations=25
  - python -m py.test benchmarks pl_examples -v --maxfail=2 --durations=0
  - coverage report
  - codecov --token $CODECOV_TOKEN --flags=gpu,pytest --name='GPU-coverage' --env=linux --build $DRONE_BUILD_NUMBER --commit $DRONE_COMMIT
  - python tests/collect_env_details.py
  environment:
    CODECOV_TOKEN:
      from_secret: codecov_token
    HOROVOD_GPU_OPERATIONS: NCCL
    HOROVOD_WITHOUT_MPI: 1
    HOROVOD_WITHOUT_MXNET: 1
    HOROVOD_WITHOUT_TENSORFLOW: 1
    HOROVOD_WITH_GLOO: 1
    HOROVOD_WITH_PYTORCH: 1
    MKL_THREADING_LAYER: GNU
    SLURM_LOCALID: 0

trigger:
  branch:
  - master
  event:
  - push
  - pull_request

depends_on:
- torch-GPU

...
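
The two pipelines above differ only in their name, their image, and the `depends_on` entry, which is what makes a jsonnet template attractive here. A minimal sketch of such a template follows. It is not the `.drone.jsonnet` from this PR: the `pipeline` helper and the `repo` local are hypothetical names, and the command and environment lists are abbreviated.

```jsonnet
// Hypothetical sketch, not the actual .drone.jsonnet from this PR.
// One function builds a pipeline; only name, image and dependencies vary.
local pipeline(name, image, depends_on=[]) = {
  kind: 'pipeline',
  type: 'docker',
  name: name,
  platform: { os: 'linux', arch: 'amd64' },
  steps: [
    {
      name: 'testing',
      image: image,
      commands: [
        'python --version',
        'nvidia-smi',
        // the remaining commands from the YAML output above would go here
      ],
      environment: {
        CODECOV_TOKEN: { from_secret: 'codecov_token' },
        // the remaining variables from the YAML output above would go here
      },
    },
  ],
  trigger: {
    branch: ['master'],
    event: ['push', 'pull_request'],
  },
  // Emit depends_on only when a dependency is given; a null field name is dropped.
  [if std.length(depends_on) > 0 then 'depends_on']: depends_on,
};

local repo = 'pytorchlightning/pytorch_lightning';

// A top-level array yields one YAML document per pipeline.
[
  pipeline('torch-GPU', repo + ':base-cuda-py3.7-torch1.6'),
  pipeline('torch-GPU-nightly', repo + ':base-cuda-py3.7-torch1.7', depends_on=['torch-GPU']),
]
```

Passing `depends_on=['torch-GPU']` makes the nightly pipeline wait for the stable one, which matches the "one after another" setup. Leaving it out would let Drone schedule both pipelines in parallel.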

@ydcjeff
Author

ydcjeff commented Oct 1, 2020

> Well, I would rather wait with this one, as we sometimes reach computational limits on Drone. Right now we have a queue of about 2 hours, and running this one would double it... so let's prepare it but wait with merging it...

@Borda regarding nightly testing, would it be better to start supporting the PT nightly version one month before the stable release?
We could also keep those nightly PRs as drafts and test them without merging, so we do not need follow-up PRs for fixes. I thought this could save some GPU resources.

So far there have already been 2 PRs for PyTorch nightly, so I started thinking about that...

What do you think?

@codecov

codecov bot commented Oct 1, 2020

Codecov Report

Merging #3543 (5898c64) into master (6831ba9) will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #3543   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         118     118           
  Lines        9018    9018           
======================================
  Hits         8389    8389           
  Misses        629     629           

@ydcjeff ydcjeff changed the title from "Add Torch nightly GPU CI" to "Add Torch nightly GPU CI [CI SKIP]" Oct 2, 2020
@ydcjeff ydcjeff changed the title from "Add Torch nightly GPU CI [CI SKIP]" to "Add Torch nightly GPU CI" Oct 2, 2020
@edenlightning edenlightning added this to the 1.1 milestone Oct 4, 2020
@edenlightning
Contributor

Hey @ydcjeff, how's it going? Any way we can help?

@Borda
Collaborator

Borda commented Oct 20, 2020

@edenlightning we can merge it as it is now, but to activate it (switch the source in the Drone config) we need to get scalable Drone testing running, because this basically doubles the number of tests performed and we are already at capacity...

@stale

stale bot commented Nov 3, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the won't fix (This will not be worked on) label Nov 3, 2020
@ydcjeff ydcjeff removed the won't fix (This will not be worked on) label Nov 4, 2020
@ydcjeff ydcjeff changed the title from "Add Torch nightly GPU CI" to "[CI] gpu test with pytorch nightly" Nov 4, 2020
Collaborator

@Borda Borda left a comment

update after #3658

@Borda
Collaborator

Borda commented Nov 5, 2020

This is not tested, but I would still merge it, use it as a baseline, and debug it later...
For testing, we need to temporarily switch the Drone setting and see if it works...

@ydcjeff
Author

ydcjeff commented Nov 5, 2020

Okay 👌

@ydcjeff
Author

ydcjeff commented Nov 23, 2020

I messed up my fork's master branch and re-forked the repo.
Let me know if you want to merge; I can create a new PR.

@Borda Borda added the ready (PRs ready to be merged) label Nov 24, 2020
@Borda
Collaborator

Borda commented Dec 4, 2020

@ydcjeff it seems this PR is no longer valid since you lost the original fork. Would you mind creating it again and referring to this already approved PR? 🐰

@Borda Borda closed this Dec 4, 2020
@ydcjeff ydcjeff mentioned this pull request Dec 4, 2020
11 tasks
Labels

ci Continuous Integration feature Is an improvement or enhancement ready PRs ready to be merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[test] Add tests against PT nightly using GPUs

6 participants