Minimal support for array jobs via <modellauncher> #3480

Open
wants to merge 10 commits into
base: develop

@infotroph infotroph commented Mar 11, 2025

Description

I wanted a quick way to submit array jobs to Slurm, and realized I could lightly abuse the modellauncher system by passing my own shell script as the binary. This patch mostly adds an @NJOBS@ placeholder to the qsub string, replaced with the number of jobs in the run, so that I can write <qsub.extra>-a 1-@NJOBS@</qsub.extra>. It also makes a couple of small cleanups in the modellauncher logic of start_model_runs.

Also: while I was at it, I deleted the long-defunct start.model.runs from PEcAn.remote and rolled that into this PR rather than opening a separate one.

Motivation and Context

I'm currently using this as

 <host>
  <name>localhost</name>
  <outdir>output/out</outdir>
  <rundir>output/run</rundir>
  <qsub>sbatch -J @NAME@ -o @STDOUT@ -e @STDERR@</qsub>
  <qsub.jobid>.*job ([0-9]+).*</qsub.jobid>
  <qstat>squeue -j @JOBID@ || echo DONE</qstat>
  <modellauncher>
    <binary>tools/slurm_array_submit.sh</binary>
    <qsub.extra>-a 1-@NJOBS@</qsub.extra>
  </modellauncher>
 </host>

with slurm_array_submit.sh being a thin wrapper that reads a path from line number $SLURM_ARRAY_TASK_ID of joblist.txt and calls the job.sh that lives at that path.
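
For readers who want a starting point, here is a minimal sketch of what such a wrapper could look like. It is not the script used in this PR. The function names are hypothetical, and it assumes that the path to joblist.txt arrives as the first argument and that each line of it holds a run directory containing a job.sh.

```shell
#!/bin/bash
# Hypothetical sketch of an array-job wrapper; not the script from this PR.

# Print line number $2 (1-based) of the job list file $1.
nth_job_dir() {
  local joblist="$1" task_id="$2"
  sed -n "${task_id}p" "$joblist"
}

# Run the job.sh found in the directory listed for the given array task.
run_array_task() {
  local joblist="$1" task_id="$2" job_dir
  job_dir="$(nth_job_dir "$joblist" "$task_id")"
  if [ -z "$job_dir" ]; then
    echo "No entry for task $task_id in $joblist" >&2
    return 1
  fi
  # Run via bash so the sketch does not depend on the executable bit.
  bash "$job_dir/job.sh"
}

# Only act when Slurm has set an array task ID.
# Assumption: the job list path is passed as the first argument.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
  run_array_task "${1:-joblist.txt}" "$SLURM_ARRAY_TASK_ID"
fi
```

Outside a Slurm array job the script does nothing, which makes the two functions easy to source and try by hand against a small joblist.txt.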

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I agree that PEcAn Project may distribute my contribution under any or all of
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

if (!is.null(qsub_extra)) {
  qsub <- paste(qsub, qsub_extra)
}
qsub <- gsub("@NAME@", paste0("PEcAn-", run_id_string), qsub_string)
qsub <- gsub("@STDOUT@", file.path(host_outdir, run_id_string, stdout_log), qsub)
qsub <- gsub("@STDERR@", file.path(host_outdir, run_id_string, stderr_log), qsub)

Moving these down so that @NAME@ etc. are replaced in qsub.extra too.
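
To illustrate why the order matters, here is a rough shell sketch of the idea: append the extra arguments first, then substitute, so that placeholders in either part are replaced. The function name and argument order are made up for illustration, and sed stands in for R's gsub(); this is not how PEcAn.remote implements it.

```shell
#!/bin/bash
# Hypothetical sketch of the ordering issue; not PEcAn's implementation.

# Args: base submit string, extra arguments, job name, number of jobs.
build_submit_command() {
  local qsub_string="$1" qsub_extra="$2" name="$3" njobs="$4"
  local cmd="$qsub_string"
  # Append the extra arguments first ...
  if [ -n "$qsub_extra" ]; then
    cmd="$cmd $qsub_extra"
  fi
  # ... then substitute, so placeholders in either part are replaced.
  # Note: values containing '|' or '&' would need escaping for sed.
  printf '%s\n' "$cmd" | sed -e "s|@NAME@|$name|g" -e "s|@NJOBS@|$njobs|g"
}

build_submit_command 'sbatch -J @NAME@' '-a 1-@NJOBS@' 'PEcAn-42' 10
# prints: sbatch -J PEcAn-42 -a 1-10
```

If the substitution ran before the append, the @NJOBS@ in the extra arguments would reach the scheduler unreplaced.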

  close(jobfile)
  compt_run <- 0
  jobfile <- NULL
}

No change in logic here, just consolidating if blocks for readability.

    }
    # HACK: Code below gets 'run' from names(jobids) so need an entry for
    # each run. But when using modellauncher all runs have the same jobid
    jobids[run] <- sub(settings$host$qsub.jobid, "\\1", out[length(out)])
  }

} else {
  pb <- utils::txtProgressBar(min = 0, max = nruns, style = 3)
  pbi <- 0
  for (run in job_modellauncher) {
    out <- PEcAn.remote::start_serial(
      run = run,

@infotroph infotroph Mar 11, 2025

Without this loop, run was left set to the last run processed by one of the loops above, which is likely to be a directory WITHOUT a launcher.sh in it.
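
The same pitfall is easy to reproduce in a few lines of shell, shown here only as an analogy: a loop variable keeps its last value after the loop ends, so later code that reads it without its own loop gets whatever ran last. The run names are invented, and which directory holds a launcher.sh is an assumption of the sketch.

```shell
#!/bin/bash
# Hypothetical illustration of a stale loop variable; run names are invented.

all_runs="run_001 run_002 run_003"
launcher_runs="run_001"   # assumed to be the only run with a launcher.sh

# An earlier loop leaves `run` set to its final value.
for run in $all_runs; do
  :  # per-run setup would happen here
done
stale_run="$run"

# Looping over the launcher runs explicitly avoids using the leftover value.
launched=""
for run in $launcher_runs; do
  launched="$launched $run"
done

echo "leftover value: $stale_run"
echo "launched:$launched"
```

Here the leftover value is run_003, while the explicit loop only touches run_001.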
