Add deepspeed support #817

Closed
williamFalcon opened this issue Feb 11, 2020 · 29 comments · Fixed by #5954

@williamFalcon
Contributor

Let's support this!

https://github.com/microsoft/DeepSpeed

@williamFalcon williamFalcon added feature Is an improvement or enhancement help wanted Open to be worked on labels Feb 11, 2020
@sudarshan85

Forgive me if I'm wrong, but doesn't Lightning already provide many of the functions supported by DeepSpeed? Also, going by a cursory reading of DeepSpeed, isn't it just another wrapper for PyTorch? Or am I wrong?

@williamFalcon
Contributor Author

i haven’t had a chance to read in depth but this is likely a library that operates on top of models which means lightning can use it

@ghost

ghost commented Feb 12, 2020

I think it's something like Lightning with more features, based on the CIFAR example. One option is to have Lightning depend on DeepSpeed for training-related functionality while Lightning focuses on reproducibility.

@williamFalcon
Contributor Author

williamFalcon commented Feb 12, 2020

It's not like lightning at all lol. It's more like apex... or ddp.

To add support in lightning we need to create a flag and follow the readme instructions:

https://github.com/microsoft/DeepSpeed

Api

Create a deepspeed object for the configs.

Code changes

When the flag is enabled

Trainer(distributed_backend='deepspeed')

OR 
Trainer(backend_engine='deepspeed')

The trainer does the following:

1. Init model, optimizers (like amp)

model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                     model=model,
                                                     model_parameters=params)

2. do a slightly different forward (like ddp)

Note: We need to forward to training_step, validation_step and test_step accordingly. See DDP override.

for step, batch in enumerate(data_loader):
    # forward() method
    loss = model_engine(batch)

    # runs backpropagation
    model_engine.backward(loss)

    # weight update
    model_engine.step()

3. do a slightly different thing for checkpoint saving

model_engine.save_checkpoint(args.save_dir, ckpt_id, client_sd=client_sd)

4. 16-bit and ddp

We need to make sure when deepspeed is enabled to defer to the library so it can handle 16-bit and ddp.

5. set up config automatically

Since the trainer flags have most of what's needed, we can automatically set up the config for the user (https://github.com/microsoft/DeepSpeed#deepspeed-configuration).

{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 1,
  "zero_optimization": true,
  "disable_allgather": true,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015,
      "max_grad_norm": 1.0
    }
  },

  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}
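
Step 5 could look roughly like the sketch below. The helper name `build_deepspeed_config` and its arguments are hypothetical (they mirror typical Trainer flags and are not an existing Lightning API); the keys are the ones from the config above.

```python
import json


def build_deepspeed_config(batch_size_per_gpu, num_gpus, accumulate_grad_batches=1,
                           precision=32, lr=1.5e-4, gradient_clip_val=1.0):
    """Map Trainer-style flags onto the DeepSpeed config (hypothetical helper)."""
    return {
        # DeepSpeed expects the global batch size:
        # per-GPU batch * accumulation steps * number of GPUs
        "train_batch_size": batch_size_per_gpu * accumulate_grad_batches * num_gpus,
        "gradient_accumulation_steps": accumulate_grad_batches,
        "steps_per_print": 1,
        "zero_optimization": True,
        "optimizer": {
            "type": "Adam",
            "params": {"lr": lr, "max_grad_norm": gradient_clip_val},
        },
        # a loss_scale of 0 selects dynamic loss scaling
        "fp16": {"enabled": precision == 16, "loss_scale": 0},
    }


config = build_deepspeed_config(batch_size_per_gpu=4, num_gpus=2, precision=16)
print(json.dumps(config, indent=2))
```

Deriving the config from flags the user has already set keeps the two from drifting apart; anything the flags cannot express would still need a user-supplied override.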

@neggert @jeffling anyone want to take this?
@luiscape might be a good issue to try?
@Borda also a good issue to start with

@williamFalcon williamFalcon added this to the 0.6.1 milestone Feb 12, 2020
@williamFalcon
Contributor Author

williamFalcon commented Feb 12, 2020

@jeffra, @ShadenSmith, @samyam, anyone interested in adding this to lightning? 😄

Awesome job!

@williamFalcon williamFalcon modified the milestones: 0.6.1, 0.7.0 Feb 12, 2020
@ghost

ghost commented Feb 12, 2020

@williamFalcon okay my bad, I didn't read through it properly. Btw, I think the Training Optimizers, Advanced Parameter Search and Simplified Data Loader seem like good features to include in Lightning if the DeepSpeed backend is used. Or is it better for users to call them manually using the DeepSpeed library?

@williamFalcon
Contributor Author

williamFalcon commented Feb 12, 2020

@xingzhaolee these are all features we should automatically enable when someone uses the deepspeed backend.

We should also make that configurable so users can modify it if they want to:

def configure_deepspeed(self, ...):
   # do auto setup stuff for users

Then if you want a different way of doing this, override this function and add your own version.
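
A minimal sketch of that override pattern, using stand-in classes rather than the real LightningModule or Trainer. The class names, the returned dict and `setup_backend` are all illustrative assumptions:

```python
class LightningModuleSketch:
    """Stand-in for a LightningModule; not the real class."""

    def configure_deepspeed(self):
        # Default: automatic setup chosen by the framework
        return {"zero_optimization": True, "fp16": {"enabled": True, "loss_scale": 0}}


class MyModel(LightningModuleSketch):
    def configure_deepspeed(self):
        # Start from the defaults and change only what differs
        config = super().configure_deepspeed()
        config["fp16"]["enabled"] = False
        return config


def setup_backend(module):
    """What a trainer might do: ask the module for its config before initializing DeepSpeed."""
    return module.configure_deepspeed()


print(setup_backend(LightningModuleSketch()))
print(setup_backend(MyModel()))
```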

@jeffra

jeffra commented Feb 13, 2020

@jeffra, @ShadenSmith, @samyam, anyone interested in adding this to lightning? 😄

Awesome job!

@williamFalcon Thanks for reaching out to us, this could be great. We are having internal discussions about how to proceed and will get back to you soon. We're also in the process of learning more about Lightning, it looks like great work you all have done :)

@Borda Borda modified the milestones: 0.6.1, 0.6.2 Feb 25, 2020
@Borda Borda modified the milestones: 0.7.2, 0.7.3 Apr 3, 2020
@Borda Borda modified the milestones: 0.7.4, 0.7.5 Apr 24, 2020
@Borda Borda modified the milestones: 0.7.6, 0.8.0, 0.7.7 May 13, 2020
@Borda Borda removed this from the 0.7.7 milestone May 26, 2020
@SeanNaren
Contributor

DeepSpeed is made up of many components, as @williamFalcon said above, but I think we're only after a few pieces here. ZeRO optimization is the key piece, and the code can be seen here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage2.py

Stages 1 and 2 of ZeRO optimization have been released. For reference, the three stages (from paper here):

1) Optimizer State Partitioning (P_os): 4x memory reduction, same communication volume as DP;
2) Add Gradient Partitioning (P_os+g): 8x memory reduction, same communication volume as DP;
3) Add Parameter Partitioning (P_os+g+p): Memory reduction is linear with DP degree N_d. For example, splitting across 64 GPUs (N_d = 64) will yield a 64x memory reduction. There is a modest 50% increase in communication volume.
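
To make the numbers above concrete, here is a small calculation of per-GPU model-state memory. It assumes the mixed-precision Adam accounting from the ZeRO paper (2 bytes per fp16 parameter, 2 per fp16 gradient, 12 for fp32 optimizer state) and ignores activations and buffers; the function name is made up for illustration.

```python
def zero_model_state_bytes(num_params, dp_degree, stage):
    """Per-GPU bytes of model state for mixed-precision Adam.

    stage 0 = plain data parallel, 1 = optimizer state partitioned,
    2 = gradients also partitioned, 3 = parameters also partitioned.
    """
    params = 2 * num_params   # fp16 parameters
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 parameters + momentum + variance
    if stage >= 1:
        optim /= dp_degree
    if stage >= 2:
        grads /= dp_degree
    if stage >= 3:
        params /= dp_degree
    return params + grads + optim


for stage in range(4):
    gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs this works out to roughly 120, 31.4, 16.6 and 1.9 GB. The 4x and 8x figures are the limits for a large DP degree: 16 bytes per parameter shrinks towards 4 and then 2.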

It's called an optimizer but it's a bit more involved. The goal API is to create an accelerator that encompasses this functionality, something like below:

model = MegatronLM()  # too big for single-GPU training

trainer = Trainer(
    accelerator='ddp',
    num_gpus=2
)
trainer.fit(model)  # Crashes because of CUDA out of memory

trainer = Trainer(
    accelerator='deepspeed',
    num_gpus=2
)
trainer.fit(model)  # Actually trains!

I'm currently stepping through the optimizer and separating it from the fp16 components to play nice with native amp. If anyone is interested message me or comment! Happy to collab :)

@javismiles

sounds great! looking forward to the v1 ;)

yup! we're actively working on this. Expect a v1 of it in the next few weeks via an rc. (cc @SeanNaren )

@blefaudeux

FYI we have an implementation of the optimizer side in https://github.com/facebookresearch/fairscale/blob/master/fairscale/optim/oss.py, compatible with standard PyTorch (i.e. same param groups for instance, so that schedulers don't see a change). The issue with any implementation, though, is that to get the full benefits you need to change the way the DP engine works; that's true of 1) and 2) above. If you keep the normal PyTorch DDP then the gradients are all-reduced and you waste some traffic. cc @ananthsub
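
The core idea of sharding the optimizer state can be sketched without any framework: split the parameters across ranks so that each rank keeps optimizer state only for the parameters it owns. The greedy split below is an illustrative assumption and is not taken from the FairScale source; `partition_parameters` is a made-up name.

```python
def partition_parameters(param_sizes, world_size):
    """Greedily assign each parameter (by index) to the rank with the least load so far."""
    shards = [[] for _ in range(world_size)]
    loads = [0] * world_size
    # Placing the largest parameters first gives a more even split
    for idx in sorted(range(len(param_sizes)), key=lambda i: -param_sizes[i]):
        rank = loads.index(min(loads))
        shards[rank].append(idx)
        loads[rank] += param_sizes[idx]
    return shards, loads


sizes = [4_000_000, 1_000_000, 1_000_000, 2_000_000, 500, 500]
shards, loads = partition_parameters(sizes, world_size=2)
print(shards, loads)
```

Each rank would then run a regular optimizer over its own shard only and share the updated parameters with the other ranks afterwards, which is where the extra communication comes from.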

@SeanNaren
Contributor

thanks @blefaudeux! That's a clean implementation :) have you seen a large performance degradation using the fairscale + ddp implementation?

@williamFalcon
Contributor Author

we could drop the v1 to use the non optimized version first? then quickly move to a v2 where we modify the ddp stuff as well?

we already have lightningddp which modifies the original ddp a bit.

@blefaudeux

thanks @blefaudeux! That's a clean implementation :) have you seen a large performance degradation using the fairscale + ddp implementation?

(With the standard DDP, using the linked optimizer as a drop-in "replacement" (more precisely, a wrapper) for a normal optimizer.) A couple of percent if multi-node, but that would depend on the interconnect. Intra-node it's actually faster, on top of saving memory.

Now with more custom DDP like what DeepSpeed is doing there's a lot of potential in terms of speed and memory, but it's a little more complicated to integrate; working on it. I can mention pytorch/pytorch#42849 and pytorch/pytorch#37002 here. Ideally it would be nice to be able to customize the communication patterns without duplicating/forking.

@SeanNaren
Contributor

I assume the memory saving is pretty much the same? I think that's definitely key, so as @williamFalcon said we could start from there!

@blefaudeux

blefaudeux commented Oct 14, 2020

I assume the memory saving is pretty much the same? I think that's definitely key, so as @williamFalcon said we could start from there!

You can save a bit more if you own the communications, because you can release the gradients as soon as they have been reduced to the appropriate rank; that's 2) basically. So 1) is drop-in (usable with normal DDP and you get some savings), you can get a 1.5 of sorts by releasing all the now-useless gradients at the beginning of the sharded optimizer step (that's what the fairscale implementation above does), and 2) is when you drop the gradients as soon as possible, earlier. Example: a toy problem training a RN101 on 4 GPUs. First is DDP, second is OSS+DDP, third is OSS+custom DDP (the losses should be exactly the same, fixing that).

@SeanNaren
Contributor

Thanks @blefaudeux! We'll get the fairscale OSS integrated into a new accelerator, then look towards DDP changes to reduce the overhead further. Out of curiosity has there been any progress integrating gradient/parameter partitioning?

@blefaudeux

blefaudeux commented Oct 14, 2020

Thanks @blefaudeux!

of course !

We'll get the fairscale OSS integrated into a new accelerator, then look towards DDP changes to reduce the overhead further.

You might need to sync with Ananth (@ananthsub), within FB there's already a lightning/fairscale integration running, could be worth it unifying the efforts ?

Out of curiosity has there been any progress integrating gradient/parameter partitioning?

I've an experimental branch which gives these results currently (last chunk. 'OSS experimental'), following the ideas presented in this RFC (split the model in chunks, use autograd hooks to load/drop the parameters on the fly while keeping reasonably close to pytorch, each rank owns the optimization for one chunk only), very much WIP though. The savings depend a lot on the model size and optimizer, and with this test problem the activations dominate actually so it's not the most impressive usecase (still useful).

@SeanNaren
Contributor

Just an update if anyone is tracking this: we technically haven't gotten 'DeepSpeed' support yet, but that's primarily a design choice, as the upstream API could be improved so that it does not hurt the user experience.

What this means is that FairScale, which has been integrated into Lightning for some time, currently provides most of the features whilst being accessible to all Lightning modules in different domains, and I highly suggest looking at our sharded training as a replacement. There are some exciting improvements coming up as well to continue the memory/speed efficiency in FairScale and from other integrations :)

@Spenhouet

Is there also going to be support for ZeRO-Offload?

https://www.deepspeed.ai/tutorials/zero-offload/

Or does this also depend on FairScale implementing it? Just in case, I created a feature request here: facebookresearch/fairscale#337

@edenlightning edenlightning modified the milestones: 1.2, 1.3 Feb 8, 2021
@SeanNaren SeanNaren mentioned this issue Feb 13, 2021
15 tasks
@edenlightning edenlightning modified the milestones: 1.3, 1.2 Feb 16, 2021
@SeanNaren
Contributor

SeanNaren commented Feb 17, 2021

An update here! DeepSpeed has finally been integrated as a plugin into Lightning, see our docs here. We've worked hard to make the API flexible whilst reducing friction as much as possible.

If you run into any problems please leave an issue or message us on our PyTorch Lightning slack channel!

Is there also going to be support for ZeRO-Offload?

There already is, and it's the staple feature with presets out of the box, so you don't need to modify your code to use it (for most cases). We also give instructions for tuning to reasonable parameters.

Currently the plugin is available from PyTorch Lightning master, but we'll be releasing 1.2 soon with the feature, with technical details and benchmarks to follow.

Currently the plugin does not support multiple optimizers, so you'll need to fall back on Sharded Training while we add this support to DeepSpeed!
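
The fallback described above could be expressed as a small helper. This is only a sketch: the plugin strings `'deepspeed'` and `'ddp_sharded'` and the `gpus`/`precision`/`plugins` arguments are assumptions about the Trainer API around the 1.2 release, so check the docs for the exact names.

```python
def choose_plugin(num_optimizers):
    """Pick a memory-saving plugin name (the strings are assumed, see the note above)."""
    # The DeepSpeed plugin handles a single optimizer for now
    return "deepspeed" if num_optimizers == 1 else "ddp_sharded"


trainer_kwargs = {"gpus": 4, "precision": 16, "plugins": choose_plugin(num_optimizers=2)}
print(trainer_kwargs)
# With Lightning installed, these would be passed on as Trainer(**trainer_kwargs)
```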

@williamFalcon
Contributor Author

waaa

@jeffra

jeffra commented Feb 17, 2021

An update here! DeepSpeed finally has been integrated as a plugin into Lightning, see our docs here. We've worked hard to make the API flexible whilst reducing friction as much as possible.

If you see run into any problems please leave an issue or message us on our PyTorch Lightning slack channel!

Is there also going to be support for ZeRO-Offload?

There already is, and it's the staple feature with presets out the box, so you don't need to modify your code to use it (for most cases), we also give instructions to tune to reasonable parameters.

Currently the plugin is available from PyTorch Lightning master, but we'll be releasing 1.2 soon with the feature with technical details and benchmarks soon to come.

I'd once again like to suggest that users first check out Sharded Training as this works out the box for more use cases and has complete Lightning Support, where we are still ironing out kinks with DeepSpeed. FairScale will be introducing some exciting features like ZeRO-offload whilst being PyTorch-compliable so keep an eye :)

This is super exciting!! :) thanks for your contributions to DeepSpeed and all your work getting DeepSpeed integrated into lightning! I think we have a chat coming up soon, we would love to hear more about any kinks you discovered with DeepSpeed and how we can iron those out together. Also especially curious on details regarding PyTorch incompatibilities with DeepSpeed?

@SeanNaren
Contributor

So epic to see you here @jeffra! You and your team have done amazing work, I can't wait to see what you guys come up with, and I hope we can assist in your work! Thank you for your kind words but it's really your team that did all the work :)

I'll be hopefully pushing a few PRs to DeepSpeed/Lightning to ease integration (a lot of the issues are kinks we need to iron out on our end in PyTorch Lightning). These range from small issues like configuration of the throughput timer, to more involved changes like multi-optimizer/multi-scheduler support or allowing lightning to control Apex/AMP settings outside of DeepSpeed.

I'll be tracking these via issues so we can iterate on them and make the integration even better. I've been using DeepSpeed for a while now across a multitude of models and it's been an incredible experience with the level of customisability. I'm glad we're moving towards the community being able to fine-tune/train larger models!

@jeffra

jeffra commented Feb 18, 2021

Thanks for the kind words @SeanNaren! :) Very much looking forward to a great collaboration going forward.
