
Conversation

@Borda (Collaborator) commented Sep 25, 2020

What does this PR do?

Update the Docker images used to the ones we regularly rebuild...
Next PR: switch the Docker image to the PT 1.6 version for Drone testing

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in a GitHub issue, there's a high chance it will not be merged.

After PR merged

Did you have fun?

Make sure you had fun coding 🙃

@Borda Borda added the ci Continuous Integration label Sep 25, 2020
@Borda (Collaborator, Author) commented Sep 25, 2020

Failing on Docker with:


ImportError while loading conftest '/drone/src/tests/conftest.py'.
tests/__init__.py:13: in <module>
    os.mkdir(TEMP_PATH)
E   PermissionError: [Errno 13] Permission denied: '/drone/src/test_temp'
Traceback (most recent call last):
  File "/home/flash/miniconda/envs/lightning/lib/python3.7/site-packages/coverage/cmdline.py", line 740, in do_run
    runner.run()
  File "/home/flash/miniconda/envs/lightning/lib/python3.7/site-packages/coverage/execfile.py", line 247, in run
    exec(code, main_mod.__dict__)
  File "/home/flash/miniconda/envs/lightning/lib/python3.7/site-packages/py/test.py", line 4, in <module>
    sys.exit(pytest.main())
SystemExit: ExitCode.USAGE_ERROR
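The PermissionError comes from `tests/__init__.py` calling `os.mkdir()` on a path inside the checkout, which a non-root container user may not be allowed to write. A minimal, hypothetical sketch of a more tolerant setup (the `TEMP_PATH` name is taken from the traceback; the path value and the fallback location are assumptions, not the repository's actual code):

```python
import os
import tempfile

# TEMP_PATH mirrors the name from the traceback; the value here is illustrative.
TEMP_PATH = os.path.join(os.getcwd(), "test_temp")

try:
    # exist_ok=True also avoids a race when several workers import the package
    os.makedirs(TEMP_PATH, exist_ok=True)
except PermissionError:
    # checkout not writable (e.g. non-root user in the CI container):
    # fall back to a guaranteed-writable temporary directory
    TEMP_PATH = tempfile.mkdtemp(prefix="test_temp_")

print(os.path.isdir(TEMP_PATH))  # True on either path
```

With this shape, the import succeeds whether or not the source tree is writable, which sidesteps the permission issue without requiring sudo in the image.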

@Borda (Collaborator, Author) commented Sep 25, 2020

Need to rebuild the Docker image in #3661.

@ydcjeff commented Sep 26, 2020

Could this error be the non-root user creation?

@Borda (Collaborator, Author) commented Sep 26, 2020

> Could this error be the non-root user creation?

It seems so... I'll try to get a version with sudo like we have for XLA...

@Borda Borda marked this pull request as ready for review September 29, 2020 09:03
@mergify mergify bot requested a review from a team September 29, 2020 09:03
@Borda (Collaborator, Author) commented Sep 29, 2020

@ydcjeff I have built this one locally and pushed it to Docker Hub directly to test that it is fine to run the tests...

@Borda (Collaborator, Author) commented Sep 29, 2020

Seems we have failing tests for PT 1.6:

FAILED tests/models/test_gpu.py::test_multi_gpu_model_ddp[variation_fit---max_epochs 1 --gpus 2 --distributed_backend ddp]
FAILED tests/models/test_gpu.py::test_multi_gpu_model_ddp[variation_test---max_epochs 1 --gpus 2 --distributed_backend ddp]

is it something you have seen? @awaelchli @williamFalcon
http://35.192.60.23/PyTorchLightning/pytorch-lightning/10779/1/2

@Borda Borda changed the title upgrade PT 1.6 version for Drone testing [wip] upgrade PT 1.5 version for Drone testing Sep 29, 2020
@Borda Borda changed the title [wip] upgrade PT 1.5 version for Drone testing [blocked by #3739] upgrade PT 1.6 version for Drone testing Sep 30, 2020
mergify bot (Contributor) commented Sep 30, 2020

This pull request is now in conflict... :(

@Borda Borda changed the title [blocked by #3739] upgrade PT 1.6 version for Drone testing upgrade PT 1.6 version for Drone testing Oct 1, 2020
@Borda Borda force-pushed the ci/drone-pt1.6 branch 2 times, most recently from db49fa9 to e739764 Compare October 1, 2020 21:21
mergify bot (Contributor) commented Oct 2, 2020

This pull request is now in conflict... :(

mergify bot (Contributor) commented Oct 6, 2020

This pull request is now in conflict... :(

@Borda Borda force-pushed the ci/drone-pt1.6 branch 4 times, most recently from e15e717 to 21b635d Compare October 13, 2020 20:43
codecov bot commented Oct 13, 2020

Codecov Report

Merging #3658 into master will increase coverage by 0%.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #3658   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         111     111           
  Lines        8068    8046   -22     
======================================
- Hits         7487    7467   -20     
+ Misses        581     579    -2     

@Borda (Collaborator, Author) commented Oct 14, 2020

Probably DDP just needs to set some env variables...

del os.environ['WORLD_SIZE']
# TODO: try this skip
# if 'WORLD_SIZE' in os.environ:
#     del os.environ['WORLD_SIZE']
@Borda (Collaborator, Author) commented Oct 14, 2020

@awaelchli it seems there is no restore anywhere else... so how is it possible it passes on master now?

@Borda (Collaborator, Author) commented:

any idea why the env deletion was added here? especially after training...
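On the restore question above: one way to make the mutation safe is to delete the variable only if present and to scope any change so it is undone afterwards. A hypothetical sketch using only the standard library (the 'WORLD_SIZE' key comes from the snippet under review; everything else is illustrative, not the PR's actual fix):

```python
import os
from unittest import mock

# Guarded deletion: pop() with a default never raises KeyError,
# unlike a bare `del os.environ['WORLD_SIZE']`.
os.environ.pop("WORLD_SIZE", None)

# Scoped mutation: patch.dict restores the environment on exit,
# so one test cannot leak WORLD_SIZE into the next.
with mock.patch.dict(os.environ, {"WORLD_SIZE": "2"}):
    assert os.environ["WORLD_SIZE"] == "2"

assert "WORLD_SIZE" not in os.environ  # restored after the block
```

The `patch.dict` context manager is what would make "no restore anywhere else" a non-issue, since cleanup happens even if the test body raises.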

pep8speaks commented Oct 14, 2020

Hello @Borda! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-26 05:32:38 UTC

@Borda Borda marked this pull request as draft October 14, 2020 20:02
@Borda (Collaborator, Author) commented Oct 16, 2020

When I run the tests on a similar machine...

=================================================================== short test summary info =====================================================================
FAILED tests/backends/test_ddp.py::test_multi_gpu_model_ddp_fit_only[--max_epochs 1 --gpus 2 --distributed_backend ddp] - FileNotFoundError: [Errno 2] No such ...
FAILED tests/backends/test_ddp.py::test_multi_gpu_model_ddp_test_only[--max_epochs 1 --gpus 2 --distributed_backend ddp] - FileNotFoundError: [Errno 2] No such...
FAILED tests/backends/test_ddp.py::test_multi_gpu_model_ddp_fit_test[--max_epochs 1 --gpus 2 --distributed_backend ddp] - FileNotFoundError: [Errno 2] No such ...
FAILED tests/models/test_amp.py::test_amp_with_apex - RuntimeError: CUDA error: no kernel image is available for execution on the device
FAILED tests/trainer/test_trainer.py::test_gradient_clipping_fp16 - AssertionError: Gradient norm != 1.0: nan
FAILED tests/trainer/optimization/test_manual_optimization.py::test_multiple_optimizers_manual_native_amp - RuntimeError: unscale_() has already been called on...
FAILED tests/trainer/optimization/test_manual_optimization.py::test_multiple_optimizers_manual_apex - RuntimeError: CUDA error: no kernel image is available fo...
======================================== 7 failed, 1351 passed, 19 skipped, 1 xfailed, 2544 warnings in 603.30s (0:10:03) ========================================

@mergify mergify bot requested a review from a team October 25, 2020 19:12
@mergify mergify bot requested a review from a team October 26, 2020 05:32
@tchaton (Contributor) left a comment:

Great Addition :)

@ydcjeff mentioned this pull request Oct 26, 2020
@SeanNaren SeanNaren merged commit ce8abd6 into master Oct 26, 2020
@SeanNaren SeanNaren deleted the ci/drone-pt1.6 branch October 26, 2020 10:47
@ydcjeff mentioned this pull request Oct 27, 2020
@Borda Borda mentioned this pull request Nov 22, 2020
