Trainer failing silently for multi-node processing #8993
Comments
Hey @MartaTintore, here is an example:

import os

import submitit
import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


NUM_GPUS_PER_NODE = 8
NUM_NODES = 2


def train():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    model = BoringModel()

    trainer = Trainer(
        default_root_dir=os.getcwd(),
        gpus=NUM_GPUS_PER_NODE,
        num_nodes=NUM_NODES,
        accelerator="ddp",
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
    )
    # 2 nodes x 8 GPUs should give a world size of 16, detected from SLURM.
    assert trainer.world_size == 16
    assert isinstance(trainer.training_type_plugin.cluster_environment, SLURMEnvironment)
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)


def main():
    logdir = "debug_log_dir"
    os.makedirs(logdir, exist_ok=True)

    # executor is the submission interface (logs are dumped in the folder)
    executor = submitit.AutoExecutor(folder=logdir)
    executor.update_parameters(
        mem_gb=80 * NUM_GPUS_PER_NODE,
        timeout_min=1500,
        slurm_partition={NAME},  # fill in your partition name
        gpus_per_node=NUM_GPUS_PER_NODE,
        tasks_per_node=NUM_GPUS_PER_NODE,  # one task per GPU
        cpus_per_task=10,
        nodes=NUM_NODES,
        slurm_constraint="volta32gb",
    )
    job = executor.submit(train)


if __name__ == "__main__":
    main()

You may have to fill in a detail here and there, I have not run this myself.
Hi, thank you for the reply. By running the basics I realized the problem was with the flags with which I was launching the job. Conclusion: I need to specify both (i)
Thanks again for your help!
@MartaTintore happy to hear that you were able to find the problem. Yes, that's good feedback. Lightning could do a better job of validating that the Trainer parameters match the world size set by the SLURM environment. However, on this point:
Unfortunately I believe we cannot know this! It's perfectly valid to run one or multiple training processes purely on CPU, but Lightning cannot know that unless the user provides this information somehow.
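
A minimal sketch of the kind of validation being discussed, assuming the job runs under SLURM so that the standard SLURM_NTASKS variable is set. The helper below is illustrative only, not an existing Lightning API:

import os


def validate_world_size(num_nodes: int, gpus_per_node: int) -> None:
    # SLURM sets SLURM_NTASKS to the total number of tasks in the job,
    # which is the world size the DDP processes will form.
    slurm_ntasks = os.environ.get("SLURM_NTASKS")
    if slurm_ntasks is None:
        # Not running under SLURM (e.g. a local or CPU-only run):
        # there is nothing to compare against.
        return
    expected = num_nodes * gpus_per_node
    if int(slurm_ntasks) != expected:
        raise RuntimeError(
            f"Trainer is configured for a world size of {expected} "
            f"(num_nodes={num_nodes} x gpus={gpus_per_node}), but SLURM "
            f"launched {slurm_ntasks} tasks. Check the submission flags."
        )

Such a check would only catch mismatches when the job is actually submitted through SLURM; as noted above, a CPU-only run outside SLURM gives Lightning nothing to compare against.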
🐛 Bug
When I run the Trainer.fit command with multiple nodes, the program fails silently and hangs forever.
If I specify 1 node with multiple GPUs, the process runs. But as soon as I specify 2 or more nodes, the process just hangs indefinitely.
To Reproduce
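A condensed sketch of the configuration described above, assuming a SLURM submission with one task per GPU on each node; gpus=2 reflects the two Quadro GP100s listed in the environment below. This is illustrative, not the author's original reproduction script:

from pytorch_lightning import Trainer

# Runs fine: a single node with multiple GPUs.
trainer_single_node = Trainer(gpus=2, num_nodes=1, accelerator="ddp", max_epochs=1)

# Hangs indefinitely at startup: the same settings with num_nodes >= 2.
trainer_multi_node = Trainer(gpus=2, num_nodes=2, accelerator="ddp", max_epochs=1)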
Expected behavior
Training the model on multiple nodes.
Environment
CUDA:
- GPU:
- Quadro GP100
- Quadro GP100
- available: True
- version: 10.2
Packages:
- numpy: 1.19.2
- pyTorch_debug: False
- pyTorch_version: 1.9.0
- pytorch-lightning: 1.4.2
- tqdm: 4.61.0
System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.5
- version: #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020