
slurm --clusters support #2504

Open · hjoliver opened this issue Dec 6, 2017 · 7 comments

@hjoliver (Member) commented Dec 6, 2017

The latest release of slurm apparently supports a single unique job ID across a federated cluster. For the moment, though, if you submit a job to slurm on host X with #SBATCH --clusters=Y to make it run on host Y, any subsequent job interaction via the resulting job ID has to be done on host Y, or else with --clusters=Y on the command line (i.e. the job ID is not recognized on the original submission host).

This way of submitting remote jobs without ssh is fine with Cylc, if hosts X and Y see the same filesystem (i.e. the job looks local to Cylc, even though it technically isn't). But with slurm, a subsequent job poll or kill fails because the job ID is not recognized locally, and Cylc does not know to use --clusters=Y on the squeue and scancel command lines.
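To make the behaviour concrete, a rough illustration (job.sh and <jobid> are placeholders, not from the original report):

# On host X, with "#SBATCH --clusters=Y" in job.sh:
sbatch job.sh                      # accepted on X, runs on cluster Y
squeue -h -j <jobid>               # fails on X: job ID not recognized locally
squeue -h --clusters=Y -j <jobid>  # works
scancel --clusters=Y <jobid>       # likewise for kill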

We should either make this work in Cylc or else document the problem and recommend using remote mode for the moment.

hjoliver added this to the soon milestone Dec 6, 2017
@hjoliver (Member, Author) commented Dec 6, 2017

It is easy to make a custom batch system handler for a specific hard-wired cluster. E.g. for cluster foo, make slurm_cluster_foo.py:

from cylc.batch_sys_handlers.slurm import SLURMHandler

class SLURMClusterFooHandler(SLURMHandler):
    """SLURM job submission and manipulation for --clusters=foo."""
    KILL_CMD_TMPL = "scancel --clusters=foo '%(job_id)s'"
    POLL_CMD = "squeue -h --clusters=foo"
    # --clusters is optional here as it is also set as a job directive.
    SUBMIT_CMD_TMPL = "sbatch --clusters=foo '%(job)s'"

BATCH_SYS_HANDLER = SLURMClusterFooHandler()
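(Assuming the module is on Cylc's Python search path, a task should then be able to select it with batch system = slurm_cluster_foo, if I recall the handler lookup correctly.)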

Obviously, though, it would be better to have a single slurm handler that extracts the cluster name, if present, from the job directives. Different jobs could potentially use different clusters.
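As a minimal sketch of that idea (the directives mapping shape here is illustrative, not the real handler API):

def get_slurm_cluster(directives):
    """Return the --clusters value from a job's slurm directives, if any."""
    for key in ("--clusters", "-M"):  # -M is slurm's short form of --clusters
        value = directives.get(key)
        if value:
            return value
    return None

# e.g. get_slurm_cluster({"--clusters": "foo", "--time": "01:00"}) -> "foo"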

Without altering the core of Cylc, we could have the job submit, poll, and kill commands search for the cluster in the job script before executing, every time they are invoked. This seems a bit perverse, though.
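For the record, that "search the job script" approach would look something like this hypothetical helper, re-run on every submit/poll/kill:

import re

def cluster_from_job_file(job_file_path):
    """Scan a slurm job script for an '#SBATCH --clusters=...' directive."""
    with open(job_file_path) as handle:
        match = re.search(r"^#SBATCH\s+--clusters=(\S+)", handle.read(), re.M)
    return match.group(1) if match else None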

Another option would be to have the task proxy remember the cluster name, if provided, and pass it to the job submit, poll, and kill methods each time, at the cost of one new task proxy attribute that would only ever be used by slurm jobs. [Actually, duh, no cost: all directives are already remembered.]

@cylc/core - any strong opinions on this? Or other ideas?

@arjclark (Contributor) commented Dec 6, 2017

@hjoliver - Thinking about this the other way round, could we maybe provide a config option under [job] that would be inserted into the slurm directives (and ignored by other batch systems) and used in the commands accordingly (similar to the execution time limit entry)?
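Something along these lines, perhaps (the "batch cluster" setting name is invented here for illustration, mirroring "execution time limit"):

[runtime]
    [[my_task]]
        [[[job]]]
            batch system = slurm
            batch cluster = foo  # hypothetical; would become --clusters=foo in directives and commands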

@matthewrmshin (Contributor) commented

Solving #2199 is likely to make this issue a lot easier. (We'll schedule that work to commence after #2468 is merged.)

@hjoliver (Member, Author) commented Dec 6, 2017

The trouble is we already need this on the new HPC at NIWA (although we could make do with the nasty hard-wired kludge above).

@hjoliver (Member, Author) commented Dec 6, 2017

Plans were made on the assumption that slurm worked like other batch systems in this respect.

@matthewrmshin (Contributor) commented

@hjoliver Understood. I think it is best to use a custom batch system handler for now. I had really wanted to work on #2199 before the end of this year, but that hasn't happened.

matthewrmshin modified the milestones: soon, cylc-8.0.0 Aug 28, 2019
hjoliver changed the title from "slurm --cluster support" to "slurm --clusters support" Jan 22, 2020
@hjoliver (Member, Author) commented

From a local HPC engineer:

Federation doesn't look like an appropriate way forward:

“A job is submitted to the local cluster (the cluster defined in the slurm.conf) and is then replicated across the clusters in the federation. Each cluster then independently attempts to schedule the job based off of its own scheduling policies.”

hjoliver modified the milestones: cylc-8.0.0, cylc-8.x Aug 4, 2021