Training hangs with DeepSpeed when DDP workers have a different number of training batches #13498

Closed · xinyangz opened this issue on Jul 1, 2022 · 2 comments
Labels: data handling, strategy: deepspeed

Comments

xinyangz commented Jul 1, 2022

🐛 Bug

My use case involves streaming a large dataset for distributed training. During this process, each distributed worker may get a different number of training batches. Please see the BoringModel example below for an equivalent case.

With the DeepSpeed integration turned on, the code hangs after one full epoch. All GPUs sit at 100% utilization while GPU power draw stays low. I cannot pinpoint the error because a keyboard interrupt doesn't work and I have to kill the processes.

Training does not hang when DeepSpeed is turned off. I'm not sure whether this is a Lightning bug or a DeepSpeed bug.
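
My guess (an assumption on my part, not verified against the Lightning/DeepSpeed internals) is that the rank with fewer batches finishes its epoch while another rank is still blocked inside a collective op that never gets matched. The standalone sketch below reproduces that symptom with plain torch.distributed, independent of Lightning and DeepSpeed:

import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # CPU/gloo is enough to show the deadlock pattern
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29501", rank=rank, world_size=world_size
    )
    steps = 3 + rank  # rank 0 runs 3 steps, rank 1 runs 4 -> mismatched collectives
    for _ in range(steps):
        dist.all_reduce(torch.ones(1))  # rank 1's 4th all_reduce is never matched by rank 0
    print(f"rank {rank} done after {steps} steps")  # only rank 0 gets here promptly
    time.sleep(60)  # keep rank 0 alive so rank 1's hang is easy to observe


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)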

To Reproduce

import os

import torch
from torch.utils.data import DataLoader, Dataset, IterableDataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        num_sanity_val_steps=0,
        max_epochs=10,
        enable_model_summary=False,
        strategy="deepspeed_stage_1",     # DeepSpeed turned on
        accelerator="gpu",
        devices=2,    # bug happens when n_gpu > 1
    )
    train_data = DataLoader(RandomDataset(32, 64 + trainer.local_rank * 4), batch_size=2)   # each DDP worker gets a different number of batches
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    # trainer.test(model, dataloaders=test_data)


if __name__ == "__main__":
    run()

Expected behavior

Training finishes without hanging.

Environment

  • CUDA:
    - GPU: 2x A100-SXM4-40GB
    - available: True
    - version: 11.3
  • Packages:
    - numpy: 1.21.6
    - pyTorch_debug: False
    - pyTorch_version: 1.11.0+cu113
    - pytorch-lightning: 1.6.4
    - tqdm: 4.64.0
  • System:
    - OS: Linux
    - architecture: 64bit, ELF
    - processor: x86_64
    - python: 3.7.12
  • Any other relevant information: DeepSpeed version 0.6.5

Additional context

cc @justusschock @awaelchli @ninginthecloud @rohitgr7 @otaj @SeanNaren @akihironitta

xinyangz added the needs triage label on Jul 1, 2022
awaelchli (Contributor) commented:

This is not currently supported: we don't handle uneven dataset sizes across ranks. Besides, the following

train_data = DataLoader(RandomDataset(32, 64 + trainer.local_rank * 4), batch_size=2)

would lead to a different set of problems on its own.
Return the same dataset on all ranks so that the distributed sampler can shard the data equally across devices. The data should be split evenly between the ranks; there is no reason for one GPU to do more work while the others sit idle.

I highly recommend you reconsider your approach. Sorry if this wasn't clear from the docs/tutorials.
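
For reference, here's a minimal sketch of the supported pattern, based on your reproduction script above (untested, and assuming the default replace_sampler_ddp=True): build the identical dataset on every rank and let the DistributedSampler that Lightning adds shard it evenly.

def run_fixed():
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        num_sanity_val_steps=0,
        max_epochs=10,
        enable_model_summary=False,
        strategy="deepspeed_stage_1",
        accelerator="gpu",
        devices=2,
    )
    # Same dataset on every rank (no dependence on trainer.local_rank);
    # Lightning wraps the loader in a DistributedSampler so each GPU
    # processes an equal share of the batches and the ranks stay in sync.
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)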

akihironitta added the strategy: deepspeed and data handling labels and removed the needs triage label on Jul 3, 2022
xinyangz (Author) commented Jul 7, 2022

Thank you for the clarification.
