
Workflow not being executed in parallel on batch computing nodes (SLURM cluster grouped execution) #2339

Open
gipert opened this issue Jul 1, 2023 · 5 comments


gipert commented Jul 1, 2023

I'm trying to write a profile to run this workflow on NERSC's Perlmutter supercomputer. Batch computing nodes have 128 CPU cores (×2 hyperthreads) and 512 GB of memory. Submission is managed through SLURM, and the maximum wall time is 12 h.

My workflow is mostly composed of a large number of ~1 h long, single-threaded jobs. I would like to instruct Snakemake to pack them efficiently and submit a much smaller number of jobs to SLURM. Jobs running on a node should make use of all available resources and run in parallel.

This is what I've written so far:

configfile: config.json
keep-going: true
quiet: rules

# profit from Perlmutter's scratch area: https://docs.nersc.gov/filesystems/perlmutter-scratch
# NOTE: should actually set this through the command line, since there is a
# scratch directory for each user and variable expansion does not work here:
#   $ snakemake --shadow-prefix "$PSCRATCH" [...]
# shadow-prefix: "$PSCRATCH"

# NERSC uses the SLURM job scheduler
# - https://snakemake.readthedocs.io/en/stable/executing/cluster.html#executing-on-slurm-clusters
slurm: true

# maximum number of cores requested from the cluster or cloud scheduler
cores: 256
# maximum number of cores used locally, on the interactive node
local-cores: 256
# maximum number of jobs that can exist in the SLURM queue at a time
jobs: 50

# reasonable defaults that do not stress the scheduler
max-jobs-per-second: 20
max-status-checks-per-second: 20

# (LEGEND) NERSC-specific settings
# - https://snakemake.readthedocs.io/en/stable/executing/cluster.html#advanced-resource-specifications
# - https://docs.nersc.gov/jobs
default-resources:
  - slurm_account="m2676"
  - constraint="cpu"
  - runtime=120
  - mem_mb=500
  - slurm_extra="--qos regular --licenses scratch,cfs"

# number of threads used by each rule
set-threads:
  - tier_ver=1
  - tier_raw=1

# memory and runtime requirements for each single rule
# - https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#resources
# - https://docs.nersc.gov/jobs/#available-memory-for-applications-on-compute-nodes
set-resources:
  - tier_ver:mem_mb=500
  - tier_ver:runtime=120
  - tier_raw:mem_mb=500
  - tier_raw:runtime=120

# we define groups in order to let Snakemake pack rule instances into the same
# SLURM job. Relevant docs:
# - https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-grouping
# - https://snakemake.readthedocs.io/en/stable/executing/grouping.html#job-grouping
groups:
  - tier_ver=sims
  - tier_raw=sims

# disconnected parts of the workflow can run in parallel (at most 256 of them)
# in a group
group-components:
    - sims=256
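
For reference, the profile-level groups/group-components settings above should be equivalent to tagging the rules with a group directive directly in the Snakefile. A minimal sketch with an invented rule body (the real inputs, outputs and shell command of tier_raw are not reproduced here):

# hypothetical Snakefile excerpt; only the threads and group directives matter
rule tier_raw:
    input:
        "generated/tier/ver/{simid}/{simid}_{n}.root",
    output:
        "generated/tier/raw/{simid}/{simid}_{n}.root",
    threads: 1
    group:
        "sims"
    shell:
        "process-tier-raw {input} {output}"

Combined with group-components sims=256, Snakemake should then pack up to 256 of these single-threaded instances into a single SLURM submission.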

And this is the relevant part of Snakemake's output:

> snakemake --profile workflow/profiles/nersc-batch --verbose
sbatch call: sbatch --job-name 02d1132e-27d6-4d5c-aed4-3a88e1d30e93 -o .snakemake/slurm_logs/group_sims/%j.log --export=ALL -A m2676 -t 120 -C cpu --mem 20000 --cpus-per-task=40 --qos regular --licenses scratch,cfs -D /global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/sims/benchmark-1 --wrap='/global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/tools/snakemake-mambaforge3/envs/snakemake/bin/python3.11 -m snakemake --snakefile '"'"'/global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/sims/benchmark-1/workflow/Snakefile'"'"' --target-jobs [ELIDED] --allowed-rules [tier_raw ... ELIDED ... tier_raw] --local-groupid '"'"'eaff24b4-a825-52a7-9d08-aac77f1f7b10'"'"' --cores '"'"'all'"'"' --attempt 1 --resources '"'"'mem_mb=20000'"'"' '"'"'disk_mib=38160'"'"' '"'"'disk_mb=40000'"'"' '"'"'mem_mib=19080'"'"' --wait-for-files-file '"'"'/global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/sims/benchmark-1/.snakemake/tmp.gg2t0bhd/snakejob_sims_eaff24b4-a825-52a7-9d08-aac77f1f7b10.waitforfilesfile.txt'"'"' --force --keep-target-files --keep-remote --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --rerun-triggers '"'"'input'"'"' '"'"'code'"'"' '"'"'mtime'"'"' '"'"'params'"'"' '"'"'software-env'"'"' --skip-script-cleanup  --shadow-prefix '"'"'/pscratch/sd/p/pertoldi'"'"' --conda-frontend '"'"'mamba'"'"' --wrapper-prefix '"'"'https://github.com/snakemake/snakemake-wrappers/raw/'"'"' --configfiles '"'"'/global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/sims/benchmark-1/config.json'"'"' --latency-wait 5 --scheduler '"'"'greedy'"'"' --scheduler-solver-path '"'"'/global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/tools/snakemake-mambaforge3/envs/snakemake/bin'"'"' --set-resources '"'"'tier_ver:mem_mb=500'"'"' '"'"'tier_ver:runtime=120'"'"' '"'"'tier_raw:mem_mb=500'"'"' '"'"'tier_raw:runtime=120'"'"' --default-resources '"'"'mem_mb=500'"'"' '"'"'disk_mb=max(2*input.size_mb, 1000)'"'"' '"'"'tmpdir=system_tmpdir'"'"' '"'"'slurm_account="m2676"'"'"' '"'"'constraint="cpu"'"'"' '"'"'runtime=120'"'"' '"'"'slurm_extra="--qos regular --licenses scratch,cfs"'"'"'  --slurm-jobstep --jobs 1 --mode 2'
Job eaff24b4-a825-52a7-9d08-aac77f1f7b10 has been submitted with SLURM jobid 10939730 (log: .snakemake/slurm_logs/group_sims/10939730.log).

And this is the content of that log file:

> cat .snakemake/slurm_logs/group_sims/10939730.log
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 256
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=20000, disk_mib=38160, disk_mb=40000, mem_mib=19080
Select jobs to execute...
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=500, mem_mib=477, disk_mb=1000, disk_mib=954
Select jobs to execute...

[Sat Jul  1 09:28:58 2023]
Job 0: Producing output file for job 'raw.l200a-wls-reflector-Rn222-to-Po214.0'
Reason: Missing output files: /global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/sims/benchmark-1/generated/tier/raw/l200a-fibers-Rn222-to-Po214/l200a-fibers-Rn222-to-Po214_0000.root

Changing to shadow directory: /pscratch/sd/p/pertoldi/shadow/tmpb727ncaf
Write-protecting output file /global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/sims/benchmark-1/generated/tier/raw/l200a-fibers-Rn222-to-Po214/l200a-fibers-Rn222-to-Po214_0000.root.
[Sat Jul  1 09:32:15 2023]
Finished job 0.
1 of 1 steps (100%) done
Write-protecting output file /global/cfs/cdirs/m2676/users/pertoldi/legend-prodenv/sims/benchmark-1/generated/tier/raw/l200a-fibers-Rn222-to-Po214/l200a-fibers-Rn222-to-Po214_0000.root.
[Sat Jul  1 09:32:16 2023]
Finished job 23.
1 of 40 steps (2%) done
Select jobs to execute...
srun: Job 10939730 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for StepId=10939730.1
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=500, mem_mib=477, disk_mb=1000, disk_mib=954
Select jobs to execute...

[Sat Jul  1 09:33:15 2023]
Job 0: Producing output file for job 'raw.l200a-wls-reflector-Rn222-to-Po214.0'
[...]

As you can see, jobs are executed serially on the node even though they are independent of each other.

What's wrong with my profile?


gipert commented Jul 2, 2023

Update: removing the --slurm-jobstep flag from the end of the Snakemake command executed on the batch node seems to fix the issue. That option takes care of prepending the right srun call:

call = f"srun -n1 --cpu-bind=q {self.format_job_exec(job)}"

but why does this produce a serial execution?
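
For context, the serial execution looks consistent with how SLURM handles job steps: unless a step is restricted with srun --exact (or allowed to share CPUs with --overlap), it may be handed the whole allocation, so the next srun has to wait, which would match the "Requested nodes are busy" message in the log above. A minimal sketch, independent of Snakemake, of packing many one-core steps into a single allocation (./my-task is a placeholder executable, and --exact assumes a reasonably recent SLURM):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=256
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00

# launch up to 256 one-core steps concurrently; --exact restricts each step
# to the resources it requests instead of the whole allocation
for i in $(seq 0 255); do
    srun --ntasks=1 --cpus-per-task=1 --exact ./my-task "$i" &
done
wait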


gipert commented Jul 2, 2023

Seems like I'm experiencing the same issue reported here: #2060

cmeesters (Member) commented

Sorry for looking into this issue so late. Since Snakemake v8, the executor code for SLURM has its own repo.

Does the issue persist for you after updating?
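
For anyone updating: in Snakemake >= 8 the SLURM support comes from the separate snakemake-executor-plugin-slurm package, and the profile switches from slurm: true to an executor key. A rough sketch of the changed bits, keeping the list form already used in the profile above (not tested on this workflow):

pip install snakemake-executor-plugin-slurm

# in the profile, replace "slurm: true" with:
executor: slurm
jobs: 50
default-resources:
  - slurm_account="m2676"
  - runtime=120
  - mem_mb=500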


gipert commented May 6, 2024

I need to check again. Is this snakemake/snakemake-executor-plugin-slurm#29 resolved?


pachi commented Oct 18, 2024

I had a similar case: v7.32.3 showed the same problem, while v8.23.2 works as expected.
