why won't auto resubmit work? #3351
-
After the job finishes, does it terminate completely, or is there still a process running in the background? And did you observe checkpoints being saved?
-
Yes, the checkpoint is saved. Also, the job terminates completely, and the Slurm log file says, "Job cancelled because time value exceeded".
-
@vr25 would you mind adding the important snippets of your code here, so that it also helps others in the future?
-
@Borda Both the code and the Slurm script are linked in the reported bug description. Please let me know if you cannot access them. Thanks!
-
@vr25 I know this is not related to the issue reported, but looking at your script here, I noticed that your main script is not guarded by
-
@awaelchli Yes, you're right, I just got the code from here. Sorry, but which last line is commented out? The one in the code or in the Slurm script?
-
Here:
trainer = pl.Trainer(max_epochs=1000, gpus=1) #4, num_nodes=4, distributed_backend='ddp',)
Everything after the # is commented out. The ddp backend should be included, I'm pretty sure.
-
@vr25 In your sbatch options, you have this:
#SBATCH --time=00:00:02
Seems to me like you're asking for 2 seconds rather than 2 minutes, which would explain why it's not working. There's no time for the code to kick in at this speed 😛
You should try
Edit: typo
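To make the difference concrete, here is a small sketch that converts a Slurm --time string into seconds. The helper name slurm_time_to_seconds is hypothetical, and the accepted formats follow the sbatch documentation as we understand it. How the scheduler rounds or enforces very short limits is not modeled here.

```python
def slurm_time_to_seconds(value):
    """Convert a Slurm --time string to a number of seconds.

    Handles the forms: "minutes", "minutes:seconds",
    "hours:minutes:seconds", "days-hours", "days-hours:minutes",
    and "days-hours:minutes:seconds".
    """
    days = 0
    if "-" in value:
        # With a day prefix, the remaining fields are hours[:minutes[:seconds]].
        day_part, rest = value.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        if len(parts) > 3:
            raise ValueError(f"unsupported time format: {value!r}")
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in value.split(":")]
        if len(parts) == 1:
            # A bare number means minutes.
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:
            # Two fields mean minutes:seconds.
            hours, minutes, seconds = 0, parts[0], parts[1]
        elif len(parts) == 3:
            hours, minutes, seconds = parts
        else:
            raise ValueError(f"unsupported time format: {value!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


print(slurm_time_to_seconds("00:00:02"))  # 2
print(slurm_time_to_seconds("00:02:00"))  # 120
```

Under these assumptions, 00:00:02 is a two-second request, while a two-minute request would be written as 00:02:00 (or simply 2).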
-
Oh!! Right, thanks for pointing it out.
-
@nathanpainchaud you have good eyes!!
-
If the hint by @nathanpainchaud did not work, let us know and we can reopen.
-
@awaelchli (just tagged based on the previous issue here)
🐛 Bug
Auto resubmit doesn't seem to work.
To Reproduce
Steps to reproduce the behavior:
sbatch sample_05_gpu.sh
Code sample
Please take a look at my code and job script.
Expected behavior
Should checkpoint and restart from the last checkpoint every 2 minutes.
Environment
torch = 1.3.1
pytorch_lightning = 0.9.0
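For context on what the expected behavior relies on, below is a rough sketch of the general signal-and-requeue pattern: the scheduler sends a warning signal shortly before the time limit, a handler saves a checkpoint, and the job asks Slurm to requeue itself. This is not Lightning's actual implementation. RequeueOnSignal, build_requeue_command, and the injected save_checkpoint and run_command callables are hypothetical names, and the sketch assumes the job script requests the warning with something like #SBATCH --signal=SIGUSR1@90.

```python
import os
import signal


def build_requeue_command(job_id):
    """Return the Slurm command that puts the given job back in the queue."""
    return ["scontrol", "requeue", str(job_id)]


class RequeueOnSignal:
    """Hypothetical helper: checkpoint and requeue when a warning signal arrives."""

    def __init__(self, save_checkpoint, run_command):
        # Both callables are injected so the sketch stays framework-agnostic.
        self.save_checkpoint = save_checkpoint
        self.run_command = run_command
        self.triggered = False

    def handle(self, signum, frame):
        """Signal handler: save state first, then ask Slurm to requeue the job."""
        self.triggered = True
        self.save_checkpoint()
        job_id = os.environ.get("SLURM_JOB_ID")
        if job_id is not None:
            self.run_command(build_requeue_command(job_id))

    def install(self, signum=signal.SIGUSR1):
        """Register the handler for the warning signal (SIGUSR1 by default)."""
        signal.signal(signum, self.handle)
```

In a real job, run_command could be subprocess.call, and install() would be called once before training starts. The pattern only helps if the time limit leaves enough room for the signal to arrive and the checkpoint to be written.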