self-balancing architecture #50
Could use something similar to this to approximate mem usage per layer/module and then balance accordingly.
That's helpful. You also need to account for the size of the input and output, including taking batch size into account. Sometimes the problem is that the layer output blows up the RAM, so we'd probably need to try/catch a few passes through each block and calculate its full memory requirement. The memory requirement is weights + input + output, and GPU 0 has the added overhead of the optimizer, which in the case of Adam includes the grads.
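For illustration, a rough sketch (not an existing Lightning API; all names here are hypothetical) of the kind of per-block measurement described above, counting weights plus input and output at a given batch size. It ignores intermediate activations inside the block, gradients, and optimizer state, which a real balancer would also need to account for.

import torch
import torch.nn as nn

def tensor_bytes(t):
    return t.numel() * t.element_size()

def estimate_block_memory(block, sample_input):
    # Approximate bytes for one block: parameters + input + output.
    weight_bytes = sum(tensor_bytes(p) for p in block.parameters())
    try:
        with torch.no_grad():
            output = block(sample_input)
    except RuntimeError:
        # e.g. OOM or a shape mismatch; fall back to weights only
        return weight_bytes
    return weight_bytes + tensor_bytes(sample_input) + tensor_bytes(output)

# Example: measure each block of a toy model at batch size 32
blocks = [nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)]
x = torch.randn(32, 1024)
for block in blocks:
    print(type(block).__name__, estimate_block_memory(block, x))
    with torch.no_grad():
        x = block(x)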
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@SeanNaren, Fairscale should partially provide this feature with
@tchaton, not exactly! This is covered by #4443, which introduces the pipe accelerator (allows you to split a model across GPUs). The self-balancing part isn't easy, but it can be done via functions like this in fairscale: I've been looking into the pipe accelerator, but there are a few nice changes coming up with this PR: facebookresearch/fairscale#156. It would be nice to get them in first before adding the plugin/accelerator for this :)
Has there been any progress on this feature? I see that there's a Beta section in the documentation here: https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#model-parallelism-beta but I don't know if this works with DDP.
Any updates on this issue?
Hey @tchaton, a small update :) It's been a while, and supporting transparent self-balancing architectures with no friction hasn't been solved, primarily due to the difficulty of engineering such balancing. In most cases this requires a lot of engineering effort, and even our pipe implementation is very specific and provides little flexibility in use. The current roadmap leans towards Fully Sharded Data Parallel replacing the need for self-balancing, by allowing the user to annotate layers (or automating the annotation) with FSDP, signalling that these layers should be loaded into memory, do any necessary computation, and be de-allocated as soon as possible. This allows the model size to scale drastically in exchange for time. If anyone is interested, look at our initial integration, which we're working on with the FairScale team to prove out and to ensure we have rigid tests/benchmarks in place: #6152
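To make the annotation idea concrete, here is a minimal sketch at the raw FairScale level (assuming a FairScale version that exposes enable_wrap/wrap and an already-initialized distributed process group); this is only an illustration, not the Lightning integration itself:

import torch
from fairscale.nn import enable_wrap, wrap
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group(...) has already been called.
with enable_wrap(wrapper_cls=FSDP):
    # Each wrapped layer is sharded across ranks and is only gathered into
    # memory around its own forward/backward computation, then freed again.
    layer_1 = wrap(torch.nn.Linear(1024, 1024))
    layer_2 = wrap(torch.nn.Linear(1024, 1024))
    model = torch.nn.Sequential(layer_1, layer_2)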
A lot has changed since this issue, and I'd like to summarize. There are two ways to consider scaling architectures:
1. Manually splitting and balancing the model itself across devices.
2. Sharding model states across devices, as done by DeepSpeed and FairScale.
Approach 1 is extremely difficult to get right and to keep efficient when architectures are large and complicated. Approach 2, which has become more prominent in recent years via DeepSpeed and now FairScale, offers an elegant way to scale model architecture with minimal annotation. Fully Sharded Data Parallel has been merged and offers the ability to leverage approach 2 and, in most cases, solve the underlying scaling issue. I have a PR for FSDP documentation (#7791) which will hopefully explain in more detail how this works :) Once merged, we should be able to close this!
EDIT, code example:
import torch.nn as nn
import pytorch_lightning as pl
from pytorch_lightning import Trainer
from fairscale.nn import wrap

class MyModule(pl.LightningModule):
    def configure_sharded_model(self):
        # layers will be sharded across all devices
        model_a = wrap(SomeModel())
        layer_1 = wrap(nn.Linear(...))
        layer_2 = wrap(nn.Linear(...))
        self.model = nn.Sequential(model_a, layer_1, layer_2)

    def forward(self, x):
        x = self.model(x)
        return x

model = MyModule()
trainer = Trainer(gpus=4, plugins='fsdp')
trainer.fit(model)
Closing this super old issue. You can find guides at https://lightning.ai/docs/pytorch/latest/advanced/model_parallel.html for the Trainer and https://lightning.ai/docs/fabric/latest/advanced/model_parallel/fsdp.html for Fabric
This is a really awesome feature we're looking to add. Super hard problem also if any ninjas want to try to tackle it :) (you'll be legendary haha).
Problem:
Some models are too big to fit in memory, so they can't use any of the distributed training currently available (even in PyTorch).
But... we can break up the model and put parts on each GPU. The trick though is to do it automatically, because doing this manually is a PITA (trust me, I spent weeks dealing with this haha).
Proposed solution:
A user hook in LightningModule where the user returns the modules they want balanced (see the rough sketch below for one possible shape).
So the above does two cool things:
That's the easy part lol... the hard part is deciding how to balance... optimizing for speed so you minimize data transfer across GPUs while not blowing up the RAM and using the RAM efficiently.
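Purely as an illustration (none of these names exist in Lightning; the hook and the balancer are hypothetical), a sketch of what such a hook plus a naive parameter-memory-based balancer could look like. A real version would also need to account for activation and optimizer memory, as discussed earlier in this thread, and for transfer cost between GPUs.

import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU())
        self.middle = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU())
        self.decoder = nn.Linear(4096, 1024)

    def configure_balanced_modules(self):  # hypothetical hook name
        # The user lists the blocks, in execution order, that they want balanced.
        return [self.encoder, self.middle, self.decoder]

def param_bytes(module):
    return sum(p.numel() * p.element_size() for p in module.parameters())

def naive_balance(modules, num_gpus):
    # Greedily assign blocks (kept in order, so data only flows forward
    # between GPUs) so per-GPU parameter memory stays roughly even.
    total = sum(param_bytes(m) for m in modules)
    budget = total / num_gpus
    assignment, gpu, used = [], 0, 0
    for m in modules:
        size = param_bytes(m)
        if used + size > budget and gpu < num_gpus - 1:
            gpu, used = gpu + 1, 0
        assignment.append(gpu)
        used += size
    return assignment

model = MyModel()
print(naive_balance(model.configure_balanced_modules(), num_gpus=2))  # e.g. [0, 1, 1]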
Anyone want to give this a shot?