trainer.fit() stuck and cannot interrupt kernel #5947
ifsheldon asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
You mentioned Jupyter Lab. Did you run this in a notebook cell?
Hi! I am migrating from "old" PyTorch to pytorch-lightning. When I tried some trivial training that integrates existing models, I found that `trainer.fit()` gets stuck even before the GPUs start running.
By "stuck" I mean that I waited for 5 minutes and nothing seemed to be running: I checked with `htop` and `nvidia-smi`, and both the CPUs and the GPUs were idle.
My code is just a one-pager, as below.
I used Jupyter Lab to run the code, and I requested 32 cores, 512 GB of memory and 4 V100 GPUs on a shared cluster. But when the trainer was stuck, none of the GPUs were running and no processes were shown in `nvidia-smi`. I also could not interrupt the kernel, so the only thing I could do was restart it.
I have read the tutorials, and the code seems good to me, but I am not sure whether it is good to go. Did I miss something?
Thank you!
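For a hang like the one described above, where the kernel cannot be interrupted, one possible way to find out where the call is blocked is Python's standard-library `faulthandler` module. The sketch below is only an illustration, not part of the original post: the helper name `run_with_stack_dump` and the log file name `hang_traceback.log` are made up, and a `time.sleep` call stands in for the hanging call. The dump goes to a real file rather than `sys.stderr` so that it can be read from another terminal while the notebook kernel is unresponsive.

```python
import faulthandler
import time
from pathlib import Path


def run_with_stack_dump(fn, timeout_s=60.0, log_path="hang_traceback.log"):
    """Call fn(). If it is still running after timeout_s seconds, write the
    stack of every thread to log_path, then keep waiting for fn to return.

    The helper name and the default log path are hypothetical.
    """
    with open(log_path, "w") as log_file:
        # Arm a watchdog that dumps all thread stacks once after timeout_s.
        faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log_file)
        try:
            return fn()
        finally:
            # Disarm the watchdog if fn returned (or raised) before the timeout.
            faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Stand-in for a slow or hanging call: sleeps longer than the timeout.
    run_with_stack_dump(lambda: time.sleep(2), timeout_s=1.0)
    print(Path("hang_traceback.log").read_text())
```

In a notebook, the same helper could wrap the real call, for example `run_with_stack_dump(lambda: trainer.fit(model))`, where `trainer` and `model` are whatever objects the notebook already defines. The file then shows which line each thread was executing when the timeout expired.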