Pin Docker image for testing on GPUs #12368

Merged on Mar 18, 2022 (3 commits)

akihironitta commented Mar 17, 2022

What does this PR do?

Temporary fix for #12314. To unblock other PRs, this pins the GPU test job to the most recent Docker image in which the tests passed.

Specifically, the one used in https://dev.azure.com/PytorchLightning/pytorch-lightning/_build/results?buildId=61012&view=logs&j=3afc50db-e620-5b81-6016-870a6976ad29&t=bd0a18db-4e66-438b-93e9-00d86028355e.
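
For readers unfamiliar with pinning, here is a minimal sketch of the difference between a floating tag and a digest-pinned reference. It assumes the pin is expressed as a `sha256` digest, as the earlier title of this PR suggests. The image names and the helper `is_pinned_by_digest` are hypothetical and are not the ones used in this repository.

```python
import re

# A digest reference ends in "@sha256:" followed by 64 lowercase hex characters.
_DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}\Z")


def is_pinned_by_digest(image_ref: str) -> bool:
    """Return True if image_ref names an image by content digest, not by tag alone."""
    return _DIGEST_SUFFIX.search(image_ref) is not None


# Hypothetical references for illustration only; not the images used in this PR.
floating = "example/gpu-tests:latest"             # can change whenever the tag is re-pushed
pinned = "example/gpu-tests@sha256:" + "0" * 64   # always resolves to the same image content
```

A tag can be moved to a newly built (and possibly broken) image, whereas a digest identifies one specific image, which is why pinning can unblock CI until the underlying build is fixed.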

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • [n/a] Did you make sure to update the documentation with your changes? (if necessary)
  • [n/a] Did you write any new necessary tests? (not for typos and docs)
  • [n/a] Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • [n/a] Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

cc @tchaton @rohitgr7 @akihironitta @carmocca @Borda

@akihironitta changed the title from "[wip] Pin docker image sha" to "Pin Docker image for testing on GPUs" Mar 17, 2022
@akihironitta added the bug, priority: 0, and ci labels Mar 17, 2022
@akihironitta marked this pull request as ready for review March 17, 2022 23:05
akihironitta commented Mar 17, 2022

@daniellepintz added this to the 1.6 milestone Mar 17, 2022

daniellepintz left a comment

Thanks @akihironitta!

@mergify bot added the ready label Mar 18, 2022
@mergify bot requested a review from a team March 18, 2022 00:03

Borda commented Mar 18, 2022

Also curious how we uploaded a corrupted image, since each image should go through its own testing at the end of the build...

@akihironitta enabled auto-merge (squash) March 18, 2022 00:16

akihironitta commented

@Borda The corrupted image doesn't raise any error during the build, as seen in this "successful run" of build-CUDA (3.7, 1.8), for instance. The build itself succeeded, but because the Dockerfile currently has no sanity check (`import horovod.torch` raises a warning, not an exception), the built image was uploaded to the hub. As you've already done in your PR, something like `from horovod.torch import nccl_built; nccl_built()` should avoid this issue, since it raises an exception if the dependency isn't installed in the image.
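
A minimal sketch of the kind of sanity check described above, assuming it would run as a step during the image build so that a non-zero exit status fails the build before the image is uploaded. The helper names `sanity_check` and `exit_code` are hypothetical and are not part of this PR or of Horovod. Only `horovod.torch.nccl_built` comes from the comment above.

```python
import importlib
import sys


def sanity_check(module_name: str, func_name: str) -> None:
    """Import module_name and call its zero-argument function func_name.

    Any exception (missing module, missing attribute, or an error raised by
    the function itself) propagates to the caller.
    """
    module = importlib.import_module(module_name)
    getattr(module, func_name)()


def exit_code(module_name: str, func_name: str) -> int:
    """Return 0 if the check passes and 1 otherwise, for use as a process exit status."""
    try:
        sanity_check(module_name, func_name)
    except Exception as err:
        print(f"sanity check failed for {module_name}.{func_name}: {err!r}", file=sys.stderr)
        return 1
    return 0


# Intended use inside the image build (assumption; not executed here):
#   sys.exit(exit_code("horovod.torch", "nccl_built"))
```

The point of the design is that a bare import is not enough: the check has to call something that fails loudly, and the failure has to reach the build as a non-zero exit status.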

@akihironitta mentioned this pull request Mar 18, 2022
@akihironitta merged commit b8b855d into master Mar 18, 2022
@akihironitta deleted the ci/fix-cuda-horovod2 branch March 18, 2022 01:16

awaelchli commented

Thank you so much @akihironitta

Labels: bug, ci, priority: 0, ready