
DeepSpeed internal error on CPU #12607

Closed
carmocca opened this issue Apr 4, 2022 · 7 comments · Fixed by #12699
Assignees
Labels
bug (Something isn't working) · good first issue (Good for newcomers) · strategy: deepspeed
Milestone

Comments

@carmocca
Contributor

carmocca commented Apr 4, 2022

🐛 Bug

DeepSpeed raises an internal error when the Trainer runs on CPU. DeepSpeed presumably doesn't support CPU training, so we should raise a MisconfigurationException in that case.
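A guard along these lines could fail fast before handing control to DeepSpeed. This is a hypothetical sketch, not the actual patch from #12699; the function name and stand-in exception class are illustrative only:

```python
# Hypothetical sketch of a fail-fast guard (NOT the actual fix in #12699).

class MisconfigurationException(Exception):
    """Stand-in for pytorch_lightning.utilities.exceptions.MisconfigurationException."""


def validate_deepspeed_accelerator(accelerator: str) -> None:
    """Reject accelerators DeepSpeed cannot run on (assumed here: GPU-only)."""
    if accelerator != "gpu":
        raise MisconfigurationException(
            f"The DeepSpeed strategy is not supported with accelerator='{accelerator}'. "
            "Please run on GPU, e.g. Trainer(accelerator='gpu', strategy='deepspeed')."
        )
```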

To Reproduce

Code

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(1, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        accelerator="cpu",
        strategy="deepspeed",
    )
    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()

Stacktrace

Traceback (most recent call last):
  File "/home/carmocca/git/pytorch-lightning/pl_examples/bug_report/bug_report_model.py", line 66, in <module>
    run()
  File "/home/carmocca/git/pytorch-lightning/pl_examples/bug_report/bug_report_model.py", line 62, in run
    trainer.fit(model, train_dataloaders=train_data)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._call_and_handle_interrupt(
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1218, in _run
    self.strategy.setup(self)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 360, in setup
    self.init_deepspeed()
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 459, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 492, in _initialize_deepspeed_train
    model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 424, in _setup_model_and_optimizer
    deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
  File "/home/carmocca/git/py39/lib/python3.9/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/carmocca/git/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 247, in __init__
    self._set_distributed_vars(args)
  File "/home/carmocca/git/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 831, in _set_distributed_vars
    if device_rank >= 0:
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
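The last frame shows the root cause: on a CPU-only host DeepSpeed ends up with `device_rank = None`, and Python 3 refuses ordering comparisons between `None` and `int`. A minimal reproduction of just that comparison:

```python
# Reproduces the comparison DeepSpeed performs in _set_distributed_vars:
# on a CPU-only machine device_rank ends up as None, and Python 3 does not
# allow ordering comparisons between None and int.
device_rank = None
try:
    if device_rank >= 0:
        pass
except TypeError as exc:
    print(exc)  # '>=' not supported between instances of 'NoneType' and 'int'
```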

Expected behavior

A clearer error message, e.g. a MisconfigurationException stating that the DeepSpeed strategy requires a GPU.

Environment

-e git+https://github.com/PyTorchLightning/pytorch-lightning@523fa74bfe4fcd387c042c7cb22c8abcf3e9f968#egg=pytorch_lightning
torch==1.11.0
torchmetrics==0.7.3
torchtext==0.12.0
deepspeed==0.6.1

cc @Borda @SeanNaren @awaelchli @rohitgr7 @akihironitta

@carmocca added the bug (Something isn't working) and strategy: deepspeed labels on Apr 4, 2022
@carmocca added this to the 1.6.x milestone on Apr 4, 2022
@carmocca added the good first issue (Good for newcomers) label on Apr 4, 2022
@myxik
Contributor

myxik commented Apr 8, 2022

Can I get this assigned to me, please?

@akihironitta
Contributor

@myxik Sure! Thank you :)

@gabriead

The error still appears when using DeepSpeed in combination with PyTorch Lightning on Azure.

@carmocca
Contributor Author

@gabriead Can you share the error stacktrace? What PyTorch Lightning version are you using?

@gabriead

Hi @carmocca, sure. This is the stacktrace of the first exception:

File "TrainingManagerWithDatastore.py", line 214, in main
    trainer.fit(model, data_module)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1217, in _run
    self.strategy.setup(self)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 364, in setup
    self.init_deepspeed()
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 463, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 496, in _initialize_deepspeed_train
    model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 428, in _setup_model_and_optimizer
    deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 247, in __init__
    self._set_distributed_vars(args)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 831, in _set_distributed_vars
    if device_rank >= 0:
TypeError: '>=' not supported between instances of 'NoneType' and 'int'

and the second:

  File "TrainingManagerWithDatastore.py", line 214, in main
    trainer.fit(model, data_module)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1217, in _run
    self.strategy.setup(self)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 364, in setup
    self.init_deepspeed()
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 463, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 496, in _initialize_deepspeed_train
    model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 428, in _setup_model_and_optimizer
    deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 238, in __init__
    self._configure_with_arguments(args, mpu)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 851, in _configure_with_arguments
    assert ompi_local_rank == local_rank, f"LOCAL_RANK ({local_rank}) != OMPI_COMM_WORLD_LOCAL_RANK ({ompi_local_rank}), " \
AssertionError: LOCAL_RANK (0) != OMPI_COMM_WORLD_LOCAL_RANK (1), not sure how to proceed as we're seeing conflicting local rank info.

[2022-07-28T13:03:57.863518] Finished context manager injector with Exception.
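The second failure comes from DeepSpeed asserting that `LOCAL_RANK` and `OMPI_COMM_WORLD_LOCAL_RANK` agree. A quick diagnostic (a hypothetical helper, not part of either library) is to dump both variables before launching:

```python
import os

# Print the two rank variables DeepSpeed compares in
# _configure_with_arguments; on the failing Azure run they disagreed
# (LOCAL_RANK=0 vs OMPI_COMM_WORLD_LOCAL_RANK=1).
def dump_rank_env() -> dict:
    names = ("LOCAL_RANK", "OMPI_COMM_WORLD_LOCAL_RANK")
    values = {name: os.environ.get(name) for name in names}
    for name, value in values.items():
        print(f"{name}={value}")
    return values
```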

PyTorch Lightning version: 1.6.5 (latest)

@awaelchli
Contributor

I don't think this fix was included in any 1.6.x release. Would you mind trying master?

@carmocca
Contributor Author

Or better yet, the 1.7.0rc0 release: pip install --pre -U pytorch_lightning
