slurm --clusters support #2504
It is easy to make a custom batch system handler for a hardwired specific cluster. E.g. for cluster foo:

```python
from cylc.batch_sys_handlers.slurm import SLURMHandler


class SLURMClusterFooHandler(SLURMHandler):
    """SLURM job submission and manipulation for --clusters=foo."""
    KILL_CMD_TMPL = "scancel --clusters=foo '%(job_id)s'"
    POLL_CMD = "squeue -h --clusters=foo"
    # --clusters is optional here as it's also a job directive:
    SUBMIT_CMD_TMPL = "sbatch --clusters=foo '%(job)s'"


BATCH_SYS_HANDLER = SLURMClusterFooHandler()
```

Obviously it would be better to have a single slurm handler that extracts the cluster name, if present, from the job directives, since different jobs could potentially use different clusters. Without altering the core of Cylc, we could have the job submit, poll, and kill commands search the job script for the cluster before executing, every time they are invoked (a rough sketch of this is below). This seems a bit perverse though. Another option would be to have the task proxy remember the cluster name, if provided, and pass it to the job submit, poll, and kill methods each time, at the cost of one new task proxy attribute that will only ever be used by slurm jobs. [Actually, duh - no cost, all directives are already remembered.]

@cylc/core - any strong opinions on this? Or other ideas?
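A minimal sketch of the "search the job script" option, assuming poll and kill can be handed the path to the job script; the function names and arguments here are illustrative only, not the actual batch system handler API:

```python
import re

# Matches a directive line of the form "#SBATCH --clusters=NAME".
CLUSTERS_DIRECTIVE = re.compile(r"^#SBATCH\s+--clusters=(\S+)")


def get_cluster_from_job_script(job_file_path):
    """Return the value of an #SBATCH --clusters directive, if any."""
    try:
        with open(job_file_path) as handle:
            for line in handle:
                match = CLUSTERS_DIRECTIVE.match(line)
                if match:
                    return match.group(1)
    except IOError:
        pass
    return None


def get_poll_cmd(job_file_path):
    """Build an squeue command that targets the right cluster, if known."""
    cluster = get_cluster_from_job_script(job_file_path)
    cmd = ["squeue", "-h"]
    if cluster:
        cmd.append("--clusters=" + cluster)
    return cmd


def get_kill_cmd(job_file_path, job_id):
    """Build an scancel command that targets the right cluster, if known."""
    cluster = get_cluster_from_job_script(job_file_path)
    cmd = ["scancel"]
    if cluster:
        cmd.append("--clusters=" + cluster)
    cmd.append(job_id)
    return cmd
```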
@hjoliver - Thinking about this the other way round, could we maybe provide a config option under [job] that would be inserted into the slurm directives (ignored otherwise) and used in the commands accordingly? (Similar to the execution time limit entry.)
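For illustration, such an option might look like this in a suite.rc; the `slurm cluster` item name and the task name are hypothetical, no such setting exists yet:

```
[runtime]
    [[my_task]]
        [[[job]]]
            batch system = slurm
            # Hypothetical item: would be written out as "#SBATCH --clusters=foo"
            # and appended to the poll and kill command lines as "--clusters=foo".
            slurm cluster = foo
```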
Trouble is we need this already on the new HPC at NIWA (although we could make do with the nasty hard-wired kludge above).
Plans were made on the assumption that slurm worked like other batch systems in this respect. |
From local HPC engineer:
The latest release of slurm apparently supports a single unique job ID across a federated cluster. For the moment, though, if you submit a job to slurm on host X with `#SBATCH --clusters=Y` to make it run on host Y, any subsequent job interaction via the resulting job ID has to be done on host Y, or else with `--clusters=Y` on the command line (i.e. the job ID is not recognized on the original submission host).

This way of submitting remote jobs without ssh is fine with Cylc, if hosts X and Y see the same filesystem (i.e. the job looks local to Cylc, even though it technically isn't). But with slurm a subsequent job poll or kill fails, because the job ID is not recognized locally and Cylc does not know to use `--clusters=Y` on the squeue and scancel command lines. We should either make this work in Cylc or else document the problem and recommend using remote mode for the moment.