Distributed training with gradient accumulation #5100
Merged
Conversation
epwalsh approved these changes on Apr 7, 2021
Woo! Great catch!
Something is wrong with the GPU runners. I bounced them both.
epwalsh pushed a commit that referenced this pull request on Apr 22, 2021
* Fixes distributed training with gradient accumulation
* Fix in case we don't do anything in a batch group
* Test for the problematic condition
* Formatting
* More formatting
* Changelog
* Fix another test
* Fix even more tests
* Fixes one more test
* I can fix these tests all day.
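For context, the squashed commit message above only names the failure mode. Below is a minimal sketch of the general pattern at issue: gradient accumulation under PyTorch `DistributedDataParallel`, where the gradient all-reduce is deferred with `no_sync()` until the last micro-batch of a batch group. This is illustrative background, not the actual trainer code from this PR; the helper names (`train_one_batch_group`, `compute_loss`) and the single-process `gloo` setup are assumptions made for the example.

```python
import contextlib
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def train_one_batch_group(ddp_model, optimizer, batch_group, compute_loss):
    """Accumulate gradients over the micro-batches of one batch group,
    running the cross-worker gradient all-reduce only on the last backward()."""
    optimizer.zero_grad()
    num_batches = len(batch_group)
    for i, batch in enumerate(batch_group):
        is_last = i == num_batches - 1
        # DDP normally all-reduces gradients on every backward(); skip that
        # for all but the last micro-batch so each worker synchronizes exactly
        # once per batch group.
        sync_context = contextlib.nullcontext() if is_last else ddp_model.no_sync()
        with sync_context:
            loss = compute_loss(ddp_model, batch) / num_batches
            loss.backward()
    optimizer.step()


if __name__ == "__main__":
    # Single-process "distributed" setup with the gloo backend so the sketch
    # runs end to end; a real job would launch one process per GPU.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = DDP(torch.nn.Linear(4, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    def mse_loss(m, batch):
        inputs, targets = batch
        return torch.nn.functional.mse_loss(m(inputs), targets)

    batch_group = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]
    train_one_batch_group(model, optimizer, batch_group, mse_loss)
    dist.destroy_process_group()
```

The commit item "Fix in case we don't do anything in a batch group" appears to concern the same class of problem: all workers have to agree on when gradient synchronization runs, otherwise distributed collectives can hang or average stale gradients.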
epwalsh added a commit that referenced this pull request on Apr 22, 2021
* Formatting
* New activation functions
* Makes position embeddings optional in the transformer embeddings
* Adds T5
* Various fixes to make this start up
* Share weights
* Adds one test that passes, and one test that fails
* use min_value_of_dtype in apply_mask
* fixes, add beam search
* encoder fixes
* fix
* fix beam search
* fix tests
* rename to just 'T5'
* fix initialization from pretrained
* add Model, DatasetReader, and Predictor
* remove useless dataset reader
* move high-level peices to allennlp-models
* revert predictor changes
* remove unneeded hidden_size
* remove stray comment
* bool masks
* CHANGELOG
* fix test file name
* revert other change
* revert other change
* Distributed training with gradient accumulation (#5100)
* Fixes distributed training with gradient accumulation
* Fix in case we don't do anything in a batch group
* Test for the problematic condition
* Formatting
* More formatting
* Changelog
* Fix another test
* Fix even more tests
* Fixes one more test
* I can fix these tests all day.
* Add link to gallery and demo in README (#5103)
* Add link to gallery in README
* Update README.md
* try emojis Is this overkill?
* Adding a metadata field to the basic classifier (#5104)
* Adding metadata parameter to BasicClassifier
* Fix
* Updating the changelog
* reformatting
* updating parameter type
* fixing import Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
* additional W&B params (#5114)
* additional W&B params
* add wandb_kwargs
* fix
* fix docs
* Add eval_mode argument to pretrained transformer embedder (#5111)
* Add eval_mode argument to pretrained transformer embedder
* Edit changelog entry
* Lint
* Update allennlp/modules/token_embedders/pretrained_transformer_embedder.py
* Apply suggestions from code review Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com> Co-authored-by: Evan Pete Walsh <petew@allenai.org>
* specify 'truncation' to avoid transformers warning (#5120)
* specify 'truncation' to avoid transformers warning
* Update docs
* Remove `stride` param
* Update CHANGELOG.md Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
* Predicting with a dataset reader on a multitask model (#5115)
* Create a way to use allennlp predict with a dataset and a multitask model
* Fix type ignoration
* Changelog
* Fix to the predictor
* fix bug with interleaving dataset reader (#5122)
* fix bug with interleaving dataset reader
* more tests
* Update allennlp/data/dataset_readers/interleaving_dataset_reader.py
* Update allennlp/data/dataset_readers/interleaving_dataset_reader.py
* remove jsonpickle from dependencies (#5121) Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
* Update docstring for basic_classifier (#5124)
* improve error message from Registrable class (#5125) Co-authored-by: Akshita Bhagia <akshita23bhagia@gmail.com>
* Prepare for release v2.3.0
* fix docs CI
* Take the number of runs in the test for distributed metrics (#5127)
* Take the number of runs in the test for distributed metrics
* Changelog
* Add influence functions to interpret module (#4988)
* creating a new functionality to fields and instances to support outputing instnaces to json files
* creating tests for the new functionality
* fixing docs
* Delete __init__.py
* Delete influence_interpreter.py
* Delete use_if.py
* Delete simple_influence_test.py
* fixing docs
* finishing up SimpleInfluence
* passing lint
* passing format
* making small progress in coding
* Delete fast_influence.py Submit to the wrong branch
* Delete faiss_utils.py wrong branch
* Delete gpt2_bug.py not sure why it's included
* Delete text_class.py not sure why it's included
* adding test file
* adding testing files
* deleted unwanted files
* deleted unwanted files and rearrange test files
* small bug
* adjust function call to save instance in json
* Update allennlp/interpret/influence_interpreters/influence_interpreter.py Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>
* Update allennlp/interpret/influence_interpreters/influence_interpreter.py Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>
* Update allennlp/interpret/influence_interpreters/influence_interpreter.py Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>
* move some documentation of parameters to base class
* delete one comment
* delete one deprecated abstract method
* changing interface
* formatting
* formatting err
* passing mypy
* passing mypy
* passing mypy
* passing mypy
* passing integration test
* passing integration test
* adding a new option to the do-all function
* modifying the callable function to the interface
* update API, fixes
* doc fixes
* add `from_path` and `from_archive` methods
* fix docs, improve logging
* add test
* address @matt-gardner's comments
* fixes to documentation
* update docs Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com> Co-authored-by: Evan Pete Walsh <petew@allenai.org>
* Update CONTRIBUTING.md (#5133)
* Update CONTRIBUTING.md
* updated changelog Co-authored-by: Akshita Bhagia <akshita23bhagia@gmail.com> Co-authored-by: Arjun Subramonian <arjuns@ip-192-168-0-106.us-west-2.compute.internal>
* fix #5132 (#5134)
* fix
* Prepare for release v2.3.1
* Fairness Metrics (#5093)
* Added three definitions of fairness
* Updated CHANGELOG
* Added DemographicParityWithoutGroundTruth and finished tests
* finished refactoring Independence, Separation, and Sufficiency to accumulate
* added distributed functionality to Independence, Sufficiency, and Separation
* Finished aggregate and distributed functionality for DemographicParityWithoutGroundTruth
* fixed GPU and doc issues
* fixed GPU and doc issues
* fixed GPU and doc issues
* fixed GPU issues
* fixed GPU issues
* added init file
* fixed typo
* minor docstring changes
* minor changes to docstring
* Added simple explanations of fairness metrics to docstrings
* Further vectorized all metric implementations
* Fixed device issue Co-authored-by: Arjun Subramonian <arjuns@Arjuns-MacBook-Pro.local> Co-authored-by: Akshita Bhagia <akshita23bhagia@gmail.com> Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
* fix cached_path for hub downloads (#5141)
* fix cached_path for hub downloads
* fix test name
* fix type hint
* Update allennlp/common/file_utils.py Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* fix
* fix

Co-authored-by: epwalsh <epwalsh10@gmail.com>
Co-authored-by: Evan Pete Walsh <petew@allenai.org>
Co-authored-by: Jacob Morrison <jacob1morrison@gmail.com>
Co-authored-by: Nelson Liu <nelson-liu@users.noreply.github.com>
Co-authored-by: Akshita Bhagia <akshita23bhagia@gmail.com>
Co-authored-by: Leo Liu <zeyuliu2@uw.edu>
Co-authored-by: ArjunSubramonian <arjun.subramonian@gmail.com>
Co-authored-by: Arjun Subramonian <arjuns@ip-192-168-0-106.us-west-2.compute.internal>
Co-authored-by: Arjun Subramonian <arjuns@Arjuns-MacBook-Pro.local>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
dirkgr added a commit that referenced this pull request on May 10, 2021
dirkgr added a commit that referenced this pull request on May 10, 2021
leonweber added a commit to leonweber/allennlp that referenced this pull request on May 10, 2022
No description was provided for this pull request.