Ddp plugin test fix #3

shuyingsunshine21 · 2021-04-10T02:01:02Z

merge ddp plugin test fix

…I#6654) * Fix checkpoint callback issue for TPUs * update changelog * add barrier * apply code suggestions * update trainer test * remove spaces * fix tpu tests * Apply suggestions from code review * add comment Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: chaton <thomas@grid.ai>

… of train/val/test (Lightning-AI#6498) * update docs * add hook and update docs * update tests * chlog * Update CHANGELOG.md Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * chlog Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* use external deprecate * simplify * simplify * simplify * flake8 * . * others * .

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Add artifact_location arg to MLFlow logger * Add CHANGELOG URL * Update test

…#6417) * add warning non reduced * add test * update test * update changelog * Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> * update Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* use latest * remake * examples

…ightning-AI#6667) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* support python 3.9 * update CI * onnxruntime * . * . * onnxruntime * t 55 * t 75 * add script * use * onnx * onnx * onnx * whl * np * find * 21 * Apply suggestions from code review * Apply suggestions from code review * onnx * CI * req * ~ dockers * min * . * drop horovod * drop horovod * drop horovod * fix * fix * .

…ightning-AI#6719) * update_logic * update * Update tests/utilities/test_xla_device_utils.py * Update pytorch_lightning/utilities/xla_device.py Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> * Update pytorch_lightning/utilities/xla_device.py Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> * update test * Update tests/utilities/test_xla_device_utils.py * update * Apply fix * Docstring * flake8 * update Co-authored-by: Your Name <you@example.com> Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>

…ng-AI#6689) * move save_checkpoint responsibility to accelerator * update

* Add base hook for model parallel * fix callback signature * Simplify hook * Add hook logic * add tests * add property setter * add logic for being called once * Update changelog * Fix * fix return type * fix lambda callback test * Fix tests * Apply code suggestions * add logic for setup_optimizers_predispatch * add common dummy model * Swap call order * Remove test that isn't needed anymore * Update tests * Add a bit more doc * Few code review fixes * Update pytorch_lightning/accelerators/accelerator.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Change hook name * Fix test * Test setup hook, refactor names * Swap call order of callbacks and model initialization * Change name of context manager Co-authored-by: SeanNaren <sean@grid.ai> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* update chlog v1.2.5 * legacy

* fix_hydra * update changelog Co-authored-by: Your Name <you@example.com>

* Update Bolts link * Update Bolts link * format Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Add context to call hook to handle all modules defined within the hook * Expose some additional parameters * Added docs, exposed parameters * Make sure we only configure if necessary * Setup activation checkpointing regardless, saves the user having to do it manually * Add some tests that fail currently * update * update * update * add tests * change docstring * resolve accumulate_grad_batches * resolve flake8 * Update DeepSpeed to use latest version, add some comments * add metrics * update * Small formatting fixes, clean up some code * Few cleanups * No need for default state * Fix tests, add some boilerplate that should move eventually * Add hook removal * Add a context manager to handle hook * Small naming cleanup * wip * move save_checkpoint responsability to accelerator * resolve flake8 * add BC * Change recommended scale to 16 * resolve flake8 * update test * update install * update * update test * update * update * update test * resolve flake8 * update * update * update on comments * Push * pull * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * update * Apply suggestions from code review * Swap to using world size defined by plugin * update * update todo * Remove deepspeed from extra, keep it in the base cuda docker install * Push * pull * update * update * update * update * Minor changes * duplicate * format * format2 Co-authored-by: SeanNaren <sean@grid.ai> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>

* Added base docs * Add more information * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

… returns (Lightning-AI#6734)

* Add 1.2.6 sections to CHANGELOG * Update CHANGELOG.md * legacy Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>

)

* Update logic for checking TPUs availability * fix flake8 * add fix

There seem to be 3 arguments missing in the `lr_find` call in the tuning.py file.

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

…ghtningCLI` (Lightning-AI#4492) Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

) * add changelog * add clip by value * fix bug in training tricks.rst * fix bug in trainer.rst * Update trainer.rst * Update trainer.rst * Update CHANGELOG.md Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/plugins/precision/deepspeed_precision.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/utilities/enums.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * yapf formatting * update training tricks * update based on comment * update based on comment * Update pytorch_lightning/trainer/trainer.py Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> * update based on comment * pep8 * mypy * mypy * Update docs/source/advanced/training_tricks.rst Co-authored-by: thomas chaton <thomas@grid.ai> * Update sharded_native_amp.py * Update test_sharded_parity.py * update test codes * Update test_tpu.py * Update pytorch_lightning/trainer/connectors/training_trick_connector.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Update test_trainer.py * Update enums.py * Update enums.py * add super-class initialization to precision plugins. * add clip_grad horovod cpu test * add clip_grad horovod cpu test * use subprocess check_call * change order of horovod tests * set max_epochs 2 in horovod test * remove clip_grad_val test from horovod-cpu * remove "type: ignore" * divide clip grad val test in horovod * update based on comments * add super-class initialization to precision plugins. * bugfix * bugfix * revert some changes * revert some changes * Update tests/models/test_horovod.py * merge master * Delete signature test No point in testing a signature Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: thomas chaton <thomas@grid.ai> Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
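The commit notes above add gradient clipping "by value" alongside the existing clipping behaviour. As a rough illustration of the difference between the two strategies, here is a dependency-free sketch on plain Python lists. The function names `clip_by_value` and `clip_by_norm` are hypothetical and are not taken from the Lightning code base; the real implementation operates on tensors.

```python
import math


def clip_by_value(grads, clip_val):
    """Clamp every gradient entry independently into [-clip_val, clip_val]."""
    return [max(-clip_val, min(clip_val, g)) for g in grads]


def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector so its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm == 0.0 or total_norm <= max_norm:
        return list(grads)  # already within the limit, nothing to do
    scale = max_norm / total_norm
    return [g * scale for g in grads]


grads = [0.5, -3.0, 2.0]
by_value = clip_by_value(grads, 1.0)  # direction may change, entries are capped
by_norm = clip_by_norm(grads, 1.0)    # direction is preserved, length is capped
```

Clipping by value caps each entry on its own and can change the gradient direction, while clipping by norm keeps the direction and only shrinks the length.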

* Update seed.py * Update pytorch_lightning/utilities/seed.py Co-authored-by: thomas chaton <thomas@grid.ai> * Update seed.py * Update seed.py * Update seed.py Co-authored-by: thomas chaton <thomas@grid.ai>

…astic (Lightning-AI#6802) Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix some test errors Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * checkpoint consolidation * Update ddp_spawn.py * Update test_metric_result_integration.py * Update test_results.py * Update utils.py * Update utils.py * Update test_all_gather_grad.py * Update test_all_gather_grad.py * Update test_results.py * Revert "Update test_results.py" This reverts commit 9d4a2b8. * Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate" This reverts commit c5053da, reversing changes made to 0d23d75. * Revert "Update test_all_gather_grad.py" This reverts commit 0d23d75. * Revert "Update utils.py" This reverts commit 70fe5da. * Revert "Update utils.py" This reverts commit a9aae99. * Revert "Update test_results.py" This reverts commit ea74906. * Revert "Update test_metric_result_integration.py" This reverts commit bf70e43. * Revert "Update ddp_spawn.py" This reverts commit f172101. * Revert "checkpoint consolidation" This reverts commit 536c132. * Revert "Revert "checkpoint consolidation"" This reverts commit 3a9fde9. * Revert "Revert "Revert "checkpoint consolidation""" This reverts commit 7a369f4. * Revert "Revert "Update ddp_spawn.py"" This reverts commit 8222dc9. * Revert "Revert "Update test_metric_result_integration.py"" This reverts commit 6c095b2. * Revert "Revert "Update test_results.py"" This reverts commit 250d0aa. * Revert "Revert "Update utils.py"" This reverts commit 8651d54. * Revert "Revert "Update test_all_gather_grad.py"" This reverts commit dcdcd29. * modify distributed environment to make test pass * add DDP communication hook * remove test related setting * remove more test related setting * fix ddp comm hook util import issue * comments * one more fix for test_custom_plugin * fix ddp spwan * fix sgd * address comments and add tests * 1. add is gpu checking 2. modify test a bit 3. 
formatting * formatting nit * fix conda 3.7 1.7 issue for no torch.distributed.algorithms module * need at least 1.8.0 * minor fix * modify changelog * changelog should link to PR number instead of issue number * refine a bit on doc for register_ddp_comm_hook function, like ddp_comm_wrapper explanation and add hyperparameter for power sgd states in example usage * move single device checking before call register_ddp_comm_hook * formatting * comments * typo * pre-commit formatting
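The notes above say that `torch.distributed.algorithms` is not available on PyTorch 1.7 and that at least 1.8.0 is needed for the DDP communication hook. A minimal sketch of such a version gate follows, written without importing torch so it runs anywhere. The helper names `parse_version` and `supports_ddp_comm_hooks` are hypothetical and are not the ones used in the PR.

```python
import re

# Assumption taken from the commit notes: comm hooks need PyTorch >= 1.8.0.
MIN_TORCH_FOR_COMM_HOOKS = (1, 8, 0)


def parse_version(version: str) -> tuple:
    """Extract the leading numeric release, e.g. '1.8.1+cu111' -> (1, 8, 1).

    Local and pre-release suffixes are ignored, and a missing patch
    component is treated as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def supports_ddp_comm_hooks(torch_version: str) -> bool:
    """Return True if the given version string meets the minimum release."""
    return parse_version(torch_version) >= MIN_TORCH_FOR_COMM_HOOKS
```

In practice the argument would be `torch.__version__`, and a test or import that depends on the hook would be skipped when the check returns `False`.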

* Update Changelog for v1.2.7 * legacy Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>

…ghtning-AI#6878)

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>

* Update mlflow.py Lightning-AI#6745 adds additional info about the run, as in the native API * Update mlflow.py trying to fix some backward compatibility issues with `resolve_tags` * wip on backward compatibility added a default for `getattr` in case the `registry` object exists, but has no proper attribute (weird case but who knows...) * fix pep * import * fix registry import * try fix failing tests removed the first if statement, so that `resolve_tags` would be defined either case * fix formatting Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

prepare v1.3.0rc

Lightning-AI#6877) * Ensure we move the model to eval mode before running evaluation * Ensure we set the flag appropriately across all stages * Add test, move hooks logic * Apply same fix to the validate loop * Update pytorch_lightning/trainer/trainer.py * Fix function name * Fix order, add predict * Shorten the name * Fix input dm, drop duplicate on predict start hook call, as it's called in the setup function * Use hook, remove double call

…lightning into ddp_plugin_test_fix

ArvinZhuang and others added 30 commits March 25, 2021 15:07

Match the number of outputs of backward with forward for AllGatherGrad (

b8ef52b

Lightning-AI#6625)

Update CODEOWNERS (Lightning-AI#6220)

92a1671

Support teardown hook on DataModule (Lightning-AI#4673)

40976e4

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: chaton <thomas@grid.ai>

Simplify deprecations (Lightning-AI#6620)

217c12a

* use external deprecate * simplify * simplify * simplify * flake8 * . * others * .

Resolve schedule step bug for PyTorch Profiler (Lightning-AI#6674)

0ea8f39

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Add artifact_location arg to MLFlow logger (Lightning-AI#6677)

6b990f3

* Add artifact_location arg to MLFlow logger * Add CHANGELOG URL * Update test

Do not add return dict items to callback_metrics (Lightning-AI#6682)

bc61361

Do not describe when there's no summary (Lightning-AI#6681)

b730a5a

Automatically find and run special tests (Lightning-AI#6669)

21fc5eb

Remove legacy Result parameters (Lightning-AI#6016)

f0c5479

remake nvidia docker (Lightning-AI#6686)

dcf6e4e

* use latest * remake * examples

More explicit exception message when testing with fast_dev_run=True (L…

cca0eca

…ightning-AI#6667) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

[refactor] Move save_function to accelerator 1/n [DeepSpeed] (Lightni…

646cf2f

…ng-AI#6689) * move save_checkpoint responsibility to accelerator * update

update readme by v1.2.x (Lightning-AI#6728)

3c86193

Remove logger_connector legacy code (Lightning-AI#6733)

9044470

update chlog v1.2.5 (Lightning-AI#6742)

583fcf2

* update chlog v1.2.5 * legacy

[bugfix] Add support for omegaconf and tpu (Lightning-AI#6741)

bb92754

* fix_hydra * update changelog Co-authored-by: Your Name <you@example.com>

[docs] Update Bolts link (Lightning-AI#6743)

9876df1

* Update Bolts link * Update Bolts link * format Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

DeepSpeed ZeRO Docs update (Lightning-AI#6752)

f9bb7c6

* Added base docs * Add more information * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Remove legacy support for the magic log/progress_bar keys in dict…

0dd2dee

… returns (Lightning-AI#6734)

Add 1.2.6 section to CHANGELOG (Lightning-AI#6732)

495c385

* Add 1.2.6 sections to CHANGELOG * Update CHANGELOG.md * legacy Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>

Update clip gradients signature for precision plugins (Lightning-AI#6764

a72a799

)

Update logic for checking TPUs availability (Lightning-AI#6767)

13f67ad

* Update logic for checking TPUs availability * fix flake8 * add fix

THasthika and others added 28 commits April 6, 2021 11:37

Fixed missing arguments in lr_find call (Lightning-AI#6784)

f581411

There seem to be 3 arguments missing in the `lr_find` call in the tuning.py file.

Fix EarlyStopping logic when min_epochs not met (Lightning-AI#6705)

127c52a

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Simple reproducibility with minimum boilerplate CLI training with `Li…

b7f3a3c

…ghtningCLI` (Lightning-AI#4492) Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Update sync_dist warning for multiple processes (Lightning-AI#6790)

a17c027

CI: fixture for global rank variable reset (Lightning-AI#6839)

b7a22ba

Update seed_everything() (Lightning-AI#6843)

a2c6057

* Update seed.py * Update pytorch_lightning/utilities/seed.py Co-authored-by: thomas chaton <thomas@grid.ai> * Update seed.py * Update seed.py * Update seed.py Co-authored-by: thomas chaton <thomas@grid.ai>

[fix] Better support for rank_zero_only setting for SLURM and torchel…

86e1d9f

…astic (Lightning-AI#6802) Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Docs fixes (Lightning-AI#6870)

19e67d1

Update Changelog for v1.2.7 (Lightning-AI#6874)

9fbe724

* Update Changelog for v1.2.7 * legacy Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>

Fix csv extension check (Lightning-AI#6436)

01b9cf8

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>

Add separators to performance docs (Lightning-AI#6882)

128f6ab

Remove hardcoding of rank_zero_only.rank in accelerator connector (Li…

968ac09

…ghtning-AI#6878)

Fix finetuning complex models correctly unfreezes. (Lightning-AI#6880)

eb15abc

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>

Fix DDP_SPAWN compatibility with bug_report_model.py (Lightning-AI#6892)

87f0aea

TPUSpawn + IterableDataset error message (Lightning-AI#6875)

1c2ecbf

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Merge pull request Lightning-AI#6885 from PyTorchLightning/v1.3.0rc

851fd7f

prepare v1.3.0rc

rebase

179d47e

fix exception raising (Lightning-AI#6901)

90e37ba

Merge branch 'master' of https://github.com/PyTorchLightning/pytorch-…

b461e44

…lightning into ddp_plugin_test_fix

fix version for ddp plugin test

e1bbc4d

fix

8270d0d

fix

803d5dd

changelog

ce1a19b

Update CHANGELOG.md

c6a13be

shuyingsunshine21 merged commit e274758 into master Apr 10, 2021

shuyingsunshine21 deleted the ddp_plugin_test_fix branch April 10, 2021 02:02
