Big science fix passing multiple tensors #1400
Conversation
…activation Avoid partitioning small activations
* removes repeated overflow log
* pipe_replicated
* _pipe_replicated -> ds_pipe_replicated
* Adds send/recv fallback to bcast when torch version <= 1.8
…er (deepspeedai#1263) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* Use mpu in DeepSpeedConfig() call
* Improve argument naming
* FP16 fused and unfused grad norm query.
* API for obtaining global unclipped gradient norm across parameter groups
* Use global norm not group norms
Co-authored-by: Shaden Smith <shaden.smith@microsoft.com>
* restore fp16 params if no zero ckpts available
* formatting
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
stas00
left a comment
Great work, Thomas!
I left a few small suggestions.
I think the only issue here is that this new code doesn't have a test, so unless we add one it's likely to break down the road.
I'm not an expert on the DS test suite; I added a few tests following monkey-see-monkey-do, but this one might not be so easy to add. Let's first see if this is acceptable in general to the DS team, and then we can work on a test.
deepspeed/runtime/pipe/engine.py
Should this not be a tuple? It could be just fine; I was just comparing with the original and it was there...
This is a tuple; in Python you don't need to specify `tuple`, parentheses work just fine.
I think it will be a list.

```python
>>> list_ = [1, 2, 3]
>>> is_tuple = (list_)
>>> type(is_tuple)
<class 'list'>
```

Only the line below with the `tuple` keyword does the type casting; parentheses alone cannot cast a list into a tuple, so you need to keep the `tuple` keyword.

```python
inputs = ([part.to_meta(), part.data()])        # list
inputs = tuple([part.to_meta(), part.data()])   # tuple
```
Ah, well played, I didn't notice, which now begs the question of why it works...
EDIT: It actually doesn't matter, because you end up looping over the elements of an iterable. But I'll make the desired changes.
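A minimal check of that point (illustrative values, not from the PR): iteration behaves the same whether the container is a list or a tuple, which is why the original code happened to work anyway.

```python
parts_list = [1, 2, 3]
parts_tuple = (1, 2, 3)

# Looping over the elements gives the same result for both containers.
assert [x for x in parts_list] == [x for x in parts_tuple]
```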
@ShadenSmith, @jeffra - could you please have a look at the proposed changes? This is currently a blocker for adding prefix-lm to Meg-DS. Thank you!
@thomasw21, also please run: to auto-format your code - as you can see, the build CI failed.
@thomasw21 I'll test your branch now. If it works, this will be great work.
@thomasw21 I got an error from this line:

```python
assert all([torch.is_tensor(elt) and elt.requires_grad is False for elt in outputs[1:]])
```

Why must all the output tensors except the first one not require grad when a tuple is output?
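For readers following along, a small illustration of what that assertion enforces (tensor names are hypothetical, not from the PR): when a stage outputs a tuple, every element after the first must be a tensor with `requires_grad == False`.

```python
import torch

hidden_states = torch.randn(2, 4, requires_grad=True)   # first output may carry grad
attention_mask = torch.ones(2, 4, dtype=torch.bool)      # bool tensors never require grad

outputs = (hidden_states, attention_mask)

# This is the condition asserted on the trailing outputs.
assert all(torch.is_tensor(elt) and elt.requires_grad is False
           for elt in outputs[1:])
```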
407ff0f#diff-26611f6be759237464a03bb1328cbc16555888836b3504dc3703e2e25d2a3ca3
Hey @hyunwoongko! Concerning the grad issue, I had two reasons:
I made an issue to make it more accessible for others.
…sing-multiple-tensors
@stas00 Thanks! I'll wait for general approval of the approach before writing tests, I guess? Something important to note is that passing bool tensors works because we use torch 1.7+ (there was an issue with bool tensors not working in a distributed setting using NCCL).
Perhaps then it should check that. @jeffra, what do you think? Do you have internal needs to support torch<1.7 for PP? Additionally, could one of you please confirm that this line of changes is agreeable to you?
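If such a guard were added, it could look something like the sketch below (the exact check and error message are assumptions, not DeepSpeed code):

```python
import torch

TORCH_MAJOR, TORCH_MINOR = map(int, torch.__version__.split('.')[:2])

# Bool tensors could not be communicated over NCCL before torch 1.7,
# so refuse to take the multi-tensor pipeline path on older versions.
if (TORCH_MAJOR, TORCH_MINOR) < (1, 7):
    raise RuntimeError(
        "Passing bool tensors between pipeline stages requires torch >= 1.7")
```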
…:thomasw21/DeepSpeed into big-science-fix-passing-multiple-tensors
I should have fixed all the failing tests:
ShadenSmith
left a comment
This is great, thanks a ton!


Fixes necessary for the implementation of prefix LM:
torch.distributed (for pytorch > 1.7)
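To make the summary concrete, a hypothetical sketch of the core idea (function and variable names are illustrative, not the DeepSpeed API): a pipeline stage forwards a tuple of tensors, including a bool mask, to the next stage instead of a single activation tensor.

```python
import torch
import torch.distributed as dist

def send_stage_outputs(outputs, next_stage_rank):
    """Forward a tuple of tensors to the next pipeline stage.

    Only the first tensor is expected to carry gradients; trailing tensors
    (e.g. a bool prefix-LM mask) must have requires_grad == False.
    """
    assert isinstance(outputs, tuple)
    for tensor in outputs:
        # Requires torch >= 1.7 when any of the tensors is a bool tensor.
        dist.send(tensor.contiguous(), dst=next_stage_rank)
```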