[BUG] Cannot submit jobs from a GPU Node #482

Open
ziw-liu opened this issue Sep 24, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@ziw-liu
Contributor

ziw-liu commented Sep 24, 2024

When submitting jobs from a GPU node, the generated batch job requests a GRES (generic resource) that is not valid:

srun: error: Unable to create step for job 16413634: Invalid generic resource (gres) specification

Submitting the same job on a login node succeeds.
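
One thing worth comparing between the two nodes is which Slurm variables the shell already carries. Inside an existing allocation (for example an interactive GPU session), Slurm exports variables such as `SLURM_JOB_ID`, and child commands can pick up resource-related ones. Whether that explains the GRES error is an assumption, not something confirmed in this thread. A minimal Python sketch for listing the candidates follows; the helper names `inside_allocation` and `inherited_slurm_vars` are hypothetical and not part of the project:

```python
import os
from typing import Dict, Mapping, Optional


def inside_allocation(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if the environment looks like it belongs to a running Slurm job."""
    env = os.environ if environ is None else environ
    return "SLURM_JOB_ID" in env


def inherited_slurm_vars(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect SLURM_*/SBATCH_* variables whose names mention memory or GPUs/GRES.

    The keyword filter is a heuristic chosen for this sketch, not a Slurm rule.
    """
    env = os.environ if environ is None else environ
    keywords = ("MEM", "GPU", "GRES")
    return {
        name: value
        for name, value in sorted(env.items())
        if name.startswith(("SLURM_", "SBATCH_")) and any(k in name for k in keywords)
    }


if __name__ == "__main__":
    print("inside an allocation:", inside_allocation())
    for name, value in inherited_slurm_vars().items():
        print(f"{name}={value}")
```

Running it once on the login node and once on the GPU node, then comparing the printed variables, would show whether the two environments differ in a way that could affect the submitted job.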

@ziw-liu added the bug label on Sep 24, 2024
@ieivanov
Contributor

ieivanov commented Oct 8, 2024

Here is the error I got with a similar submission:

srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.
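
`srun` reads `SLURM_MEM_PER_CPU`, `SLURM_MEM_PER_GPU`, and `SLURM_MEM_PER_NODE` from its environment, so a value inherited from an enclosing allocation can collide with a different memory option on the new job. Assuming that is what happens here (this thread does not confirm it), one possible workaround is to drop those variables before launching the submission. The sketch below is illustrative only: `scrubbed_env` is a hypothetical helper, and the project may submit jobs through a different code path where the environment would need to be adjusted elsewhere.

```python
import os
from typing import Dict, Mapping, Optional

# The three variables named in the srun error message.
CONFLICTING_MEM_VARS = (
    "SLURM_MEM_PER_CPU",
    "SLURM_MEM_PER_GPU",
    "SLURM_MEM_PER_NODE",
)


def scrubbed_env(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment without the memory variables above.

    The original mapping is left untouched, so the current session keeps
    its own allocation settings.
    """
    env = dict(os.environ if environ is None else environ)
    for name in CONFLICTING_MEM_VARS:
        env.pop(name, None)
    return env


if __name__ == "__main__":
    removed = sorted(set(os.environ) - set(scrubbed_env()))
    print("variables that would be removed:", removed)
    # Example use (not executed here), passing the copy to a child process:
    #   subprocess.run(["sbatch", "job.sh"], env=scrubbed_env(), check=True)
```

Returning a copy rather than mutating `os.environ` keeps the change scoped to the child process that performs the submission.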

@edyoshikun
Contributor

I haven't been able to reproduce this bug. I used a reservation created with:

 sbatch --job-name=nomachine --constraint=nomachine --partition=interactive --mem-per-cpu=8G --cpus-per-task=16 --gpus=1 --time=5-0:00:00 --wrap "sleep 120h" --output=$HOME/logs/sbatch.out --nodelist=gpu-sm01-02

From that session I submitted jobs for deskew and reconstruction without a problem.

@talonchandler
Collaborator

talonchandler commented Feb 10, 2025

@tayllatheodoro, can you add your notes here? You said that you found a workaround by commenting out monitor_jobs?

@amitabhverma, @ieivanov gave me a verbal report that he ran the recOrder GUI reconstruction from a GPU node and hit an error that he figured was related to this issue.

@amitabhverma, can you remind me which HPC node you've been testing with? A GPU node or a login node?

@amitabhverma
Collaborator

@talonchandler I don't think my issue was the same; I never encountered the error from the original post. I could not run the reconstruction from a login node, but it runs fine from a GPU/nomachine node. I use the Bruno interactive to set up my session.
