GitHub - zhaojuanmao/fairscale: PyTorch extensions for high performance and large scale training.

Description

FairScale is a PyTorch extension library for high performance and large scale training. This library extends basic PyTorch capabilities while adding new SOTA scaling techniques. FairScale makes available the latest distributed training techniques in the form of composable modules and easy to use APIs. These APIs are a fundamental part of a researcher's toolbox as they attempt to scale models with limited resources.

FairScale was designed with the following values in mind:

Usability - Users should be able to understand and use FairScale APIs with minimum cognitive overload.
Modularity - Users should be able to combine multiple FairScale APIs as part of their training loop seamlessly.
Performance - FairScale APIs provide the best performance in terms of scaling and efficiency.

Installation

To install FairScale, please see the following instructions. You should be able to install a pip package or build directly from source.

Getting Started

The full documentation contains instructions for getting started, deep dives and tutorials about the various FairScale APIs.

Examples

Here are a few sample snippets from a subset of FairScale offerings:

Pipe

Run a 4-layer model on 2 GPUs. The first two layers run on cuda:0 and the next two layers run on cuda:1.

import torch

import fairscale

model = torch.nn.Sequential(a, b, c, d)
model = fairscale.nn.Pipe(model, balance=[2, 2], devices=[0, 1], chunks=8)

Optimizer state sharding (ZeRO)

See a more complete example here, but a minimal example could look like the following :

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

def train(
    rank: int,
    world_size: int,
    epochs: int):

    # DDP init example
    dist.init_process_group(backend='nccl', init_method="tcp://localhost:29501", rank=rank, world_size=world_size)

    # Problem statement
    model = myAwesomeModel().to(rank)
    dataloader = mySuperFastDataloader()
    loss_fn = myVeryRelevantLoss()
    base_optimizer = torch.optim.SGD # pick any pytorch compliant optimizer here
    base_optimizer_arguments = {} # pass any optimizer specific arguments here, or directly below when instantiating OSS

    # Wrap the optimizer in its state sharding brethren
    optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)

    # Wrap the model into ShardedDDP, which will reduce gradients to the proper ranks
    model = ShardedDDP(model, optimizer)

    # Any relevant training loop, nothing specific to OSS. For example:
    model.train()
    for e in range(epochs):
        for batch in dataloader:
            # Train
            model.zero_grad()
            outputs = model(batch["inputs"])
            loss = loss_fn(outputs, batch["label"])
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Supposing that WORLD_SIZE and EPOCHS are somehow defined somewhere
    mp.spawn(
        train,
        args=(
            WORLD_SIZE,
            EPOCHS,
        ),
        nprocs=WORLD_SIZE,
        join=True,
    )

AdaScale SGD

AdaScale can be used to wrap a SGD optimizer and to be used in DDP (Distributed Data Parallel) training or non-DDP with gradient accumulation. The benefit is to re-use the same LR schedule from a baseline batch size when effective batch size is bigger.

Note that AdaScale does not help increase per-GPU batch size.

from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR  # or your scheduler
from fairscale.optim import AdaScale

...
optim = AdaScale(SGD(model.parameters(), lr=0.1))
scheduler = LambdaLR(optim, ...)
...
# Note: the train loop should be with DDP or with gradient accumulation.
last_epoch = 0
step = 0
done = False
while not done:
    for sample in dataset:
        ...
        step += optim.gain()
        optim.step()
        epoch = step // len(dataset)
        if last_epoch != epoch:
            scheduler.step()
            last_epoch = epoch
        if epoch > max_epoch:
            done = True

Primary goal is to allow scaling to bigger batch sizes without losing model accuracy. (However, training time might be longer comparing to without AdaScale.)

At a high level, we want ML researchers to:

go parallel more easily (i.e. no need to find new learning rate schedules)
not worrying about losing accuracy
potentially higher GPU efficiency (fewer steps, less networking overhead, etc.)

Testing

We use circleci to test on PyTorch versions 1.6.0, 1.7.1, and 1.8.1. Please create an issue if you are having trouble with installation.

Contributors

We welcome outside contributions! Please see the CONTRIBUTING instructions for how you can contribute to FairScale.

License

FairScale is licensed under the BSD-3-Clause License.

Citing FairScale

If you use FairScale in your publication, please cite it by using the following BibTeX entry.

@Misc{FairScale2021,
  author =       {Mandeep Baines, Shruti Bhosale, Vittorio Caggiano, Naman Goyal, Siddharth Goyal, Myle Ott, Benjamin Lefaudeux, Vitaliy Liptchinsky, Mike Rabbat, Sam Sheiffer, Anjali Sridhar, Min Xu},
  title =        {FairScale:  A general purpose modular PyTorch library for high performance and large scale training},
  howpublished = {\url{https://github.com/facebookresearch/fairscale}},
  year =         {2021}
}

Name		Name	Last commit message	Last commit date
Latest commit History 489 Commits
.circleci		.circleci
.github		.github
benchmarks		benchmarks
conda.recipe		conda.recipe
docs		docs
fairscale		fairscale
stubs/torch		stubs/torch
tests		tests
.coveragerc		.coveragerc
.editorconfig		.editorconfig
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
NOTICE		NOTICE
README.md		README.md
RELEASE.md		RELEASE.md
codecov.yml		codecov.yml
pyproject.toml		pyproject.toml
requirements-benchmarks.txt		requirements-benchmarks.txt
requirements-dev.txt		requirements-dev.txt
requirements-test.txt		requirements-test.txt
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Description

Installation

Getting Started

Examples

Pipe

Optimizer state sharding (ZeRO)

AdaScale SGD

Testing

Contributors

License

Citing FairScale

About

Releases

Packages

Languages

License

zhaojuanmao/fairscale

Folders and files

Latest commit

History

Repository files navigation

Description

Installation

Getting Started

Examples

Pipe

Optimizer state sharding (ZeRO)

AdaScale SGD

Testing

Contributors

License

Citing FairScale

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages