Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ddp plugin test fix #3

Merged
merged 71 commits into from
Apr 10, 2021
Merged

Ddp plugin test fix #3

merged 71 commits into from
Apr 10, 2021

Conversation

shuyingsunshine21
Copy link
Owner

merge ddp plugin test fix

ArvinZhuang and others added 30 commits March 25, 2021 15:07
…I#6654)

* Fix checkpoint callback issue for TPUs

* update changelog

* add barrier

* apply code suggestions

* update trainer test

* remove spaces

* fix tpu tests

* Apply suggestions from code review

* add comment

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
… of train/val/test (Lightning-AI#6498)

* update docs

* add hook and update docs

* update tests

* chlog

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* chlog

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* use external deprecate

* simplify

* simplify

* simplify

* flake8

* .

* others

* .
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Add artifcact_location arg to MLFlow logger

* Add CHANGELOG URL

* Update test
…#6417)

* add warning non reduced

* add test

* update test

* update changelog

* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* update

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
* use latest

* remake

* examples
* support python 3.9

* update CI

* onnxruntime

* .

* .

* onnxruntime

* t 55

* t 75

* add script

* use

* onnx

* onnx

* onnx

* whl

* np

* find

* 21

* Apply suggestions from code review

* Apply suggestions from code review

* onnx

* CI

* req

* ~ dockers

* min

* .

* drop horovod

* drop horovod

* drop horovod

* fix

* fix

* .
…ightning-AI#6719)

* update_logic

* update

* Update tests/utilities/test_xla_device_utils.py

* Update pytorch_lightning/utilities/xla_device.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* Update pytorch_lightning/utilities/xla_device.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* update test

* Update tests/utilities/test_xla_device_utils.py

* update

* Apply fix

* Docstring

* flake8

* update

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
…ng-AI#6689)

* move save_checkpoint responsability to accelerator

* update
* Add base hook for model parallel

* fix callback signature

* Simplify hook

* Add hook logic

* add tests

* add property setter

* add logic for being called once

* Update changelog

* Fix

* fix return type

* fix lambda callback test

* Fix tests

* Apply code suggestions

* add logic for setup_optimizers_predispatch

* add common dummy model

* Swap call order

* Remove test that isn't needed anymore

* Update tests

* Add a bit more doc

* Few code review fixes

* Update pytorch_lightning/accelerators/accelerator.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Change hook name

* Fix test

* Test setup hook, refactor names

* Swap call order of callbacks and model initialization

* Change name of context manager

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* update chlog v1.2.5

* legacy
* fix_hydra

* update changelog

Co-authored-by: Your Name <you@example.com>
* Update Bolts link

* Update Bolts link

* formt

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Add context to call hook to handle all modules defined within the hook

* Expose some additional parameters

* Added docs, exposed parameters

* Make sure we only configure if necessary

* Setup activation checkpointing regardless, saves the user having to do it manually

* Add some tests that fail currently

* update

* update

* update

* add tests

* change docstring

* resolve accumulate_grad_batches

* resolve flake8

* Update DeepSpeed to use latest version, add some comments

* add metrics

* update

* Small formatting fixes, clean up some code

* Few cleanups

* No need for default state

* Fix tests, add some boilerplate that should move eventually

* Add hook removal

* Add a context manager to handle hook

* Small naming cleanup

* wip

* move save_checkpoint responsability to accelerator

* resolve flake8

* add BC

* Change recommended scale to 16

* resolve flake8

* update test

* update install

* update

* update test

* update

* update

* update test

* resolve flake8

* update

* update

* update on comments

* Push

* pull

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* update

* Apply suggestions from code review

* Swap to using world size defined by plugin

* update

* update todo

* Remove deepspeed from extra, keep it in the base cuda docker install

* Push

* pull

* update

* update

* update

* update

* Minor changes

* duplicate

* format

* format2

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
* Added base docs

* Add more information

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Add 1.2.6 sections to CHANGELOG

* Update CHANGELOG.md

* legacy

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
* Update logic for checking TPUs availability

* fix flake8

* add fix
THasthika and others added 28 commits April 6, 2021 11:37
There seem to be 3 arguments missing in the `lr_find` call in the tunining.py file.
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
…ghtningCLI` (Lightning-AI#4492)

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
)

* add changelog

* add clip by value

* fix bug in training tricks.rst

* fix bug in trainer.rst

* Update trainer.rst

* Update trainer.rst

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/precision/deepspeed_precision.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/utilities/enums.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* yapf formatting

* update training tricks

* update based on comment

* update based on comment

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* update based on comment

* pep8

* mypy

* mypy

* Update docs/source/advanced/training_tricks.rst

Co-authored-by: thomas chaton <thomas@grid.ai>

* Update sharded_native_amp.py

* Update test_sharded_parity.py

* update test codes

* Update test_tpu.py

* Update pytorch_lightning/trainer/connectors/training_trick_connector.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update test_trainer.py

* Update enums.py

* Update enums.py

* add super-class initialization to precision plugins.

* add clip_grad horovod cpu test

* add clip_grad horovod cpu test

* use subprocess check_call

* change order of horovod tests

* set max_epochs 2 in horovod test

* remove clip_grad_val test from horovod-cpu

* remove "type: ignore"

* divide clip grad val test in horovod

* update based on comments

* add super-class initialization to precision plugins.

* bugfix

* bugfix

* revert some changes

* revert some changes

* Update tests/models/test_horovod.py

* merge master

* Delete signature test

No point in testing a signature

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
* Update seed.py

* Update pytorch_lightning/utilities/seed.py

Co-authored-by: thomas chaton <thomas@grid.ai>

* Update seed.py

* Update seed.py

* Update seed.py

Co-authored-by: thomas chaton <thomas@grid.ai>
…astic (Lightning-AI#6802)

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Fix some test errors
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* checkpoint consolidation

* Update ddp_spawn.py

* Update test_metric_result_integration.py

* Update test_results.py

* Update utils.py

* Update utils.py

* Update test_all_gather_grad.py

* Update test_all_gather_grad.py

* Update test_results.py

* Revert "Update test_results.py"

This reverts commit 9d4a2b8.

* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"

This reverts commit c5053da, reversing
changes made to 0d23d75.

* Revert "Update test_all_gather_grad.py"

This reverts commit 0d23d75.

* Revert "Update utils.py"

This reverts commit 70fe5da.

* Revert "Update utils.py"

This reverts commit a9aae99.

* Revert "Update test_results.py"

This reverts commit ea74906.

* Revert "Update test_metric_result_integration.py"

This reverts commit bf70e43.

* Revert "Update ddp_spawn.py"

This reverts commit f172101.

* Revert "checkpoint consolidation"

This reverts commit 536c132.

* Revert "Revert "checkpoint consolidation""

This reverts commit 3a9fde9.

* Revert "Revert "Revert "checkpoint consolidation"""

This reverts commit 7a369f4.

* Revert "Revert "Update ddp_spawn.py""

This reverts commit 8222dc9.

* Revert "Revert "Update test_metric_result_integration.py""

This reverts commit 6c095b2.

* Revert "Revert "Update test_results.py""

This reverts commit 250d0aa.

* Revert "Revert "Update utils.py""

This reverts commit 8651d54.

* Revert "Revert "Update test_all_gather_grad.py""

This reverts commit dcdcd29.

* modify distributed environment to make test pass

* add DDP communication hook

* remove test related setting

* remove more test related setting

* fix ddp comm hook util import issue

* comments

* one more fix for test_custom_plugin

* fix ddp spwan

* fix sgd

* address comments and add tests

* 1. add is gpu checking 2. modify test a bit 3. formatting

* formatting nit

* fix conda 3.7 1.7 issue for no torch.distributed.algorithms module

* need at least 1.8.0

* minor fix

* modify changelog

* changelog should link to PR number instead of issue number

* refine a bit on doc for register_ddp_comm_hook function, like ddp_comm_wrapper explanation and add hyperparameter for power sgd states in example usge

* move single device checking before call register_ddp_comm_hook

* formatting

* comments

* typo

* pre-commit formatting
* Update Changelog for v1.2.7

* legacy

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
* Update mlflow.py

Lightning-AI#6745 adds additional info about the run, as in the native API

* Update mlflow.py

trying to fix some backward compatibility issues with `resolve_tags`

* wip on backward compatibility

added a default for `getattr` in case the `registry` object exists, but has no proper attribute (weird case but who knows...)

* fix pep

* impoert

* fix registry import

* try fix failing tests

removed the first if statement, so that `resolve_tags` would be defined either case

* fix formatting

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Lightning-AI#6877)

* Ensure we move the model to eval mode before running evaluation

* Ensure we set the flag appropriately across all stages

* Add test, move hooks logic

* Apply same fix to the validate loop

* Update pytorch_lightning/trainer/trainer.py

* Fix function name

* Fix order, add predict

* Shorten the name

* Fix input dm, drop duplicate on predict start hook call, as it's called in the setup function

* Use hook, remove double call
@shuyingsunshine21 shuyingsunshine21 merged commit e274758 into master Apr 10, 2021
@shuyingsunshine21 shuyingsunshine21 deleted the ddp_plugin_test_fix branch April 10, 2021 02:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.