
[squash] Big science v1 #1406

Merged
jeffra merged 41 commits into big-science-v2 from big-science on Sep 27, 2021

jeffra (Collaborator) commented on Sep 27, 2021

Squash all Big Science changes into a single commit for easier rebasing.

Shaden Smith and others added 30 commits June 6, 2021 11:27
* unit test for bugfix #1135

* formatter

* fix test in presence of mpi4py

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
… VC & cuda build tool (#1151)

* Add Windows support in README, use C++17 on Windows to support the latest VC build tools

* Add detailed C++ build tools version in README

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
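The Windows build change above comes down to choosing a compiler flag for the C++ standard by platform. This is a minimal sketch. `cxx_std_flag` is a hypothetical helper, and the `-std=c++14` value for non-Windows builds is an assumption, because this PR only mentions C++17 for Windows. MSVC spells the option `/std:c++17`, while GCC and Clang use `-std=c++NN`.

```python
import sys


def cxx_std_flag(platform: str = sys.platform) -> str:
    """Return a C++ standard flag for the host compiler (hypothetical helper).

    MSVC (Windows) uses /std:c++17; GCC/Clang use -std=c++NN.
    The non-Windows value below is an illustrative assumption.
    """
    if platform == "win32":
        return "/std:c++17"
    return "-std=c++14"


# Hypothetical usage: compile args for a C++ extension build
extra_compile_args = {"cxx": [cxx_std_flag()]}
```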
* Add `import os` to inference tutorials

* assign deepspeed-initialized model to hf model
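The inference tutorials need `import os` because each process reads its rank and world size from environment variables. This is a minimal sketch of that pattern. It assumes that the launcher exports `LOCAL_RANK` and `WORLD_SIZE`. The defaults are only there so the script also runs as a single process without a launcher.

```python
import os

# The launcher exports these per process; fall back to a
# single-process run when they are absent.
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
print(f"rank {local_rank} of {world_size}")
```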
* largest_partitioned_params calculation fix

largest partitioned params was getting calculated incorrectly

* Update stage3.py

* Update stage3.py

* formatting fix

* changing sub-group size default to 1e9

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
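For reference, the new sub-group size default corresponds to the ZeRO stage 3 `sub_group_size` setting. Here it is written as a Python config dict. The `train_batch_size` value is an illustrative assumption and is not part of this PR.

```python
# DeepSpeed-style config dict; "sub_group_size" sits under
# "zero_optimization" and applies to ZeRO stage 3.
ds_config = {
    "train_batch_size": 8,  # illustrative value only
    "zero_optimization": {
        "stage": 3,
        "sub_group_size": int(1e9),  # the new default of 1e9
    },
}
```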
* Fix docstring

* Make screenshots clickable for easier viewing

* Navigation menu in alphabetical order; more clickable screenshots

* Rename 1Cycle doc

* Tweak naming

* Remove no longer used flag

* ZeRO3 Offload release

* Single GPU results

* Rearrange figures

* Single GPU text

* tweak intro

* zero3-offload section

* Add asynchronous i/o docs

* Fix print_per_steps doc
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* Fix bugs about non-contiguous tensor broadcasting

* Fix typo

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
* undo noise

* another
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Shaden Smith and others added 11 commits July 26, 2021 08:52
* removes repeated overflow log

* pipe_replicated

* _pipe_replicated -> ds_pipe_replicated

* Adds send/recv fallback to bcast when torch version <= 1.8
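A "torch version <= 1.8" gate must compare version components as integers. A plain string comparison would put "1.10" before "1.8". This is a minimal sketch. `needs_bcast_fallback` is a hypothetical helper name and is not the function this PR adds.

```python
def needs_bcast_fallback(torch_version: str) -> bool:
    """Return True for torch 1.8 or older (hypothetical helper).

    Compares (major, minor) as integers so "1.10" sorts after "1.8".
    """
    # Strip local build suffixes such as "+cu111" before parsing
    release = torch_version.split("+")[0]
    major, minor = (int(part) for part in release.split(".")[:2])
    return (major, minor) <= (1, 8)
```

In practice you would call it as `needs_bcast_fallback(torch.__version__)`.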
…er (#1263)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* Use mpu in DeepSpeedConfig() call

* Improve argument naming
* FP16 fused and unfused grad norm query.

* API for obtaining global unclipped gradient norm across parameter groups

* Use global norm not group norms

Co-authored-by: Shaden Smith <shaden.smith@microsoft.com>
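The difference between the global norm and per-group norms can be shown in plain Python. The global L2 norm is the square root of the sum of the squared group norms. This is a minimal sketch. Gradients are represented as flat lists of floats, and the helper names are hypothetical. This is not DeepSpeed's API.

```python
import math
from typing import Sequence


def group_norm(grads: Sequence[float]) -> float:
    """L2 norm of one parameter group's flattened gradients."""
    return math.sqrt(sum(g * g for g in grads))


def global_norm(groups: Sequence[Sequence[float]]) -> float:
    """L2 norm across all groups: sqrt of the sum of squared group norms."""
    return math.sqrt(sum(group_norm(g) ** 2 for g in groups))


groups = [[3.0], [4.0]]
print([group_norm(g) for g in groups])  # [3.0, 4.0]
print(global_norm(groups))              # 5.0
```

Any single group's norm (3.0 or 4.0 here) understates the overall gradient magnitude of 5.0. Clipping decisions should therefore use the global value.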
* restore fp16 params if no zero ckpts available

* formatting
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
jeffra merged this pull request into big-science-v2 on Sep 27, 2021
jeffra added a commit that referenced this pull request on Sep 28, 2021
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Shaden Smith <shaden.smith@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
7 participants