
DeepSpeed internal error on CPU #12607

Closed
carmocca opened this issue Apr 4, 2022 · 7 comments · Fixed by #12699
Assignees
Labels
bug (Something isn't working) · good first issue (Good for newcomers) · strategy: deepspeed
Milestone

Comments

@carmocca
Contributor

carmocca commented Apr 4, 2022

🐛 Bug

DeepSpeed raises an internal error when the Trainer runs on CPU. DeepSpeed presumably doesn't support CPU training, so we should raise a MisconfigurationException in that case.
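A guard along these lines could fail fast before handing control to DeepSpeed. This is a hypothetical sketch, not the actual patch from #12699; the function name and stand-in exception class are illustrative only:

```python
# Hypothetical sketch of a fail-fast guard (NOT the actual fix in #12699).

class MisconfigurationException(Exception):
    """Stand-in for pytorch_lightning.utilities.exceptions.MisconfigurationException."""


def validate_deepspeed_accelerator(accelerator: str) -> None:
    """Reject accelerators DeepSpeed cannot run on (assumed here: GPU-only)."""
    if accelerator != "gpu":
        raise MisconfigurationException(
            f"The DeepSpeed strategy is not supported with accelerator='{accelerator}'. "
            "Please run on GPU, e.g. Trainer(accelerator='gpu', strategy='deepspeed')."
        )
```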

To Reproduce

Code

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(1, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        accelerator="cpu",
        strategy="deepspeed",
    )
    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()

Stacktrace

Traceback (most recent call last):
  File "/home/carmocca/git/pytorch-lightning/pl_examples/bug_report/bug_report_model.py", line 66, in <module>
    run()
  File "/home/carmocca/git/pytorch-lightning/pl_examples/bug_report/bug_report_model.py", line 62, in run
    trainer.fit(model, train_dataloaders=train_data)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._call_and_handle_interrupt(
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1218, in _run
    self.strategy.setup(self)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 360, in setup
    self.init_deepspeed()
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 459, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 492, in _initialize_deepspeed_train
    model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 424, in _setup_model_and_optimizer
    deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
  File "/home/carmocca/git/py39/lib/python3.9/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/carmocca/git/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 247, in __init__
    self._set_distributed_vars(args)
  File "/home/carmocca/git/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 831, in _set_distributed_vars
    if device_rank >= 0:
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
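The last frame shows the root cause: on a CPU-only host DeepSpeed ends up with `device_rank = None`, and Python 3 refuses ordering comparisons between `None` and `int`. A minimal reproduction of just that comparison:

```python
# Reproduces the comparison DeepSpeed performs in _set_distributed_vars:
# on a CPU-only machine device_rank ends up as None, and Python 3 does not
# allow ordering comparisons between None and int.
device_rank = None
try:
    if device_rank >= 0:
        pass
except TypeError as exc:
    print(exc)  # '>=' not supported between instances of 'NoneType' and 'int'
```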

Expected behavior

A clearer error message, e.g. a MisconfigurationException stating that the DeepSpeed strategy requires a GPU.

Environment

-e git+https://github.com/PyTorchLightning/pytorch-lightning@523fa74bfe4fcd387c042c7cb22c8abcf3e9f968#egg=pytorch_lightning
torch==1.11.0
torchmetrics==0.7.3
torchtext==0.12.0
deepspeed==0.6.1

cc @Borda @SeanNaren @awaelchli @rohitgr7 @akihironitta

@carmocca added the bug (Something isn't working) and strategy: deepspeed labels on Apr 4, 2022
@carmocca added this to the 1.6.x milestone on Apr 4, 2022
@carmocca added the good first issue (Good for newcomers) label on Apr 4, 2022
@myxik
Contributor

myxik commented Apr 8, 2022

Can I get this assigned to me, please?

@akihironitta
Contributor

@myxik Sure! Thank you :)

@gabriead

The error still appears when using DeepSpeed in combination with PyTorch Lightning on Azure.

@carmocca
Contributor Author

@gabriead Can you share the error stacktrace? What PyTorch Lightning version are you using?

@gabriead

Hi @carmocca, sure. This is the stacktrace of the first exception:

File "TrainingManagerWithDatastore.py", line 214, in main
    trainer.fit(model, data_module)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1217, in _run
    self.strategy.setup(self)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 364, in setup
    self.init_deepspeed()
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 463, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 496, in _initialize_deepspeed_train
    model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 428, in _setup_model_and_optimizer
    deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 247, in __init__
    self._set_distributed_vars(args)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 831, in _set_distributed_vars
    if device_rank >= 0:
TypeError: '>=' not supported between instances of 'NoneType' and 'int'

and the second:

  File "TrainingManagerWithDatastore.py", line 214, in main
    trainer.fit(model, data_module)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1217, in _run
    self.strategy.setup(self)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 364, in setup
    self.init_deepspeed()
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 463, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 496, in _initialize_deepspeed_train
    model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 428, in _setup_model_and_optimizer
    deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 238, in __init__
    self._configure_with_arguments(args, mpu)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 851, in _configure_with_arguments
    assert ompi_local_rank == local_rank, f"LOCAL_RANK ({local_rank}) != OMPI_COMM_WORLD_LOCAL_RANK ({ompi_local_rank}), " \
AssertionError: LOCAL_RANK (0) != OMPI_COMM_WORLD_LOCAL_RANK (1), not sure how to proceed as we're seeing conflicting local rank info.

[2022-07-28T13:03:57.863518] Finished context manager injector with Exception.
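The second failure comes from DeepSpeed asserting that `LOCAL_RANK` and `OMPI_COMM_WORLD_LOCAL_RANK` agree. A quick diagnostic (a hypothetical helper, not part of either library) is to dump both variables before launching:

```python
import os

# Print the two rank variables DeepSpeed compares in
# _configure_with_arguments; on the failing Azure run they disagreed
# (LOCAL_RANK=0 vs OMPI_COMM_WORLD_LOCAL_RANK=1).
def dump_rank_env() -> dict:
    names = ("LOCAL_RANK", "OMPI_COMM_WORLD_LOCAL_RANK")
    values = {name: os.environ.get(name) for name in names}
    for name, value in values.items():
        print(f"{name}={value}")
    return values
```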

PyTorch Lightning version: 1.6.5 (latest)

@awaelchli
Contributor

I don't think this fix was included in any 1.6.x release. Would you mind trying master?

@carmocca
Contributor Author

Or better yet, the 1.7.0rc0 release: pip install --pre -U pytorch_lightning
