Submitit with sbatch #1726

Open
pfrwilson opened this issue Jan 10, 2023 · 6 comments
Labels
question Further information is requested

Comments

@pfrwilson

Hello, thanks for the great project!

I am working with a slurm cluster which requires compute jobs to be submitted with sbatch rather than srun. It appears that submitit uses srun to submit jobs. Therefore it crashes with a gres-not-available error when I use submitit. Is sbatch supported with submitit?

Thanks in advance!

Paul

@hkayabilisim

The same thing happens to me too!

I think the library does submit via sbatch, but inside the batch script the python command is sent with srun, and that srun call does not recognize the gres specification. The exact error message is:

srun: error: Unable to create step for job 25393: Invalid generic resource (gres) specification

@jrapin
Contributor

jrapin commented Jan 19, 2023

Hello,
I've never actually used sbatch without srun, nor seen any slurm cluster which did not support srun. Are you sure it is srun that is unavailable, and not some other resource (partition, gpus, etc.)? Which version of slurm are you using? If this is the case, this is probably not something we will support unless someone wants to provide a fix (I don't have the bandwidth for this, sorry).

@pfrwilson
Author

Hi,

Yes, my slurm cluster fails to allocate GPUs if a job is submitted through srun directly rather than as a batch script through sbatch. The srun command can be used inside the slurm batch submission, but if it is run by itself (e.g. srun main.py), prompting the system for synchronous output, the system fails to allocate GRES. I'll have to ask the administrator why the choice was made not to support srun. Typical wait times on the system are long anyway.

I was wondering if submitit could submit the job as you would an sbatch script (e.g. sbatch main.py), but it sounds like that is not supported... no worries!

@gwenzek
Contributor

gwenzek commented Mar 2, 2023

Submitit is calling sbatch submission.sh. You can find the submission.sh file in the log folder.
The generated sbatch script does contain a call to srun, but that's standard behavior AFAIK.
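
To check this on your own cluster, one option is to open the generated script and list its `#SBATCH` options and its `srun` line. The following is a minimal sketch using only the Python standard library. `SAMPLE_SCRIPT` is a hypothetical, simplified stand-in (not the actual text submitit writes), and the helper names `sbatch_directives` and `srun_commands` are made up for illustration.

```python
import re

# Hypothetical, simplified stand-in for a generated submission script.
# The real file's contents depend on your submitit version and settings.
SAMPLE_SCRIPT = """\
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --gres=gpu:1
#SBATCH --nodes=1

# launch the task
srun --unbuffered python -u main.py
"""


def sbatch_directives(script_text):
    """Return the option part of every '#SBATCH ...' line."""
    return [
        line.strip()[len("#SBATCH"):].strip()
        for line in script_text.splitlines()
        if line.strip().startswith("#SBATCH")
    ]


def srun_commands(script_text):
    """Return the non-comment lines whose command is srun."""
    commands = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip the shebang, comments and #SBATCH lines
        if re.match(r"srun(\s|$)", stripped):
            commands.append(stripped)
    return commands


if __name__ == "__main__":
    print(sbatch_directives(SAMPLE_SCRIPT))
    print(srun_commands(SAMPLE_SCRIPT))
```

To inspect a real job, read the submission script from your log folder (for example with `pathlib.Path(...).read_text()`) and pass its text to the two helpers instead of `SAMPLE_SCRIPT`.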

@gwenzek gwenzek added the question Further information is requested label Mar 2, 2023
@hkayabilisim

I've realized that in the SLURM installation used in our supercomputing center, the resource definition (gres) used in submitit causes an error. This is something related to the configuration of SLURM in our center.
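
The thread does not say which part of the gres request the cluster rejected. As a starting point for that kind of diagnosis, the sketch below collects every `--gres` value from a batch script and flags those whose resource name is not in a set you supply. The helper names are hypothetical, and the set of configured names is an assumption you would fill in yourself, for example from the output of `sinfo -o "%G"` or from your administrator.

```python
import re

# Matches '--gres=VALUE' or '--gres VALUE' and captures VALUE.
GRES_PATTERN = re.compile(r"--gres[=\s]+(\S+)")


def gres_requests(script_text):
    """Collect each gres item requested in '#SBATCH' lines and commands."""
    found = []
    for line in script_text.splitlines():
        stripped = line.strip()
        is_directive = stripped.startswith("#SBATCH")
        if stripped.startswith("#") and not is_directive:
            continue  # ordinary comment or shebang
        for value in GRES_PATTERN.findall(stripped):
            # A single option may list several items: gpu:1,mps:100
            found.extend(item for item in value.split(",") if item)
    return found


def unknown_gres(requested, configured_names):
    """Return requests whose name (text before the first ':') is not configured."""
    return [r for r in requested if r.split(":")[0] not in configured_names]


if __name__ == "__main__":
    # Hypothetical script text, for illustration only.
    script = "#SBATCH --gres=gpu:2\nsrun --gres=gpu:2 python main.py\n"
    requested = gres_requests(script)
    print(requested)
    # 'configured' is an assumed example value, not real cluster data.
    configured = {"gpu"}
    print(unknown_gres(requested, configured))
```

An empty result from `unknown_gres` only means the resource names match. Counts, types, or per-partition limits can still make Slurm reject the request.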

@pfrwilson
Author

pfrwilson commented Mar 2, 2023 via email


4 participants