Compute cluster or site platform awareness #2199

hjoliver · 2017-03-09T18:31:54Z

Ref: https://groups.google.com/forum/#!topic/cylc/dFVNTeyPcrs

We could make Cylc aware of the concept of a group of HPC login nodes. If the original job submit node goes down, we could try the alternative node(s) in the event that a poll or kill command fails with host not found.

matthewrmshin · 2017-03-10T09:26:24Z

We need a concept similar to rose host-select in cylc, but geared towards handling login nodes of clusters. (rose host-select was designed to be a poor-person's load balancing system for a group of similar compute servers. We no longer have this requirement at our site, but we still need to randomly select login nodes of clusters. Clearly, this should be handled by better DNS routing at sites, but that is not always the case.)

I think we can do something like this:

- In the global.rc, we'll have a [host-groups] section. Each entry will look something like host-group=host1, host2, .... The hosts in a host group can be assumed to share the same file system, batch system, etc.
- In the suite.rc, we'll allow [runtime][TASK][remote]host=HOST-GROUP. A random host in the specified host group will be used for job submission, poll, kill, log retrieval, etc.
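
A minimal sketch of the host-group lookup proposed above, in Python. The `HOST_GROUPS` mapping and the `select_host` function are hypothetical names used for illustration only, not part of Cylc; the sketch assumes that a name which is not a known host group is an ordinary host name.

```python
import random

# Hypothetical stand-in for a parsed [host-groups] section of global.rc.
HOST_GROUPS = {
    "hpc-login": ["host1", "host2", "host3"],
}


def select_host(name, host_groups=HOST_GROUPS, rng=random):
    """Return a random member if `name` is a host group, else `name` itself."""
    members = host_groups.get(name)
    if not members:
        # Not a host group (or an empty one): treat as a plain host name.
        return name
    return rng.choice(members)
```

With this shape, a task whose host setting names a group gets one of the group's members at random, while existing suites that name a single host behave as before.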

matthewrmshin · 2017-03-10T09:27:17Z

(Promoting the milestone and self assigned, to avoid this being lost in the ether.)

hjoliver · 2017-03-11T10:01:32Z

@matthewrmshin - your proposal sounds good, but you haven't explicitly addressed what to do if the (randomly) chosen host goes down. Presumably (as I suggested above) we'd need a retry-via-other-host mechanism to handle poll and kill (etc.) failures due to the target host going offline? Of course this would only work for jobs submitted to a batch scheduler (background jobs running on a particular login node are just screwed if the node goes down).

matthewrmshin · 2017-04-26T19:40:55Z

OK. We'll make sure to consider:

- A host group only makes sense for a relevant batch scheduler. This must be configurable.
- We will choose a random available host in the host group for job submit, poll and kill. If one host is unavailable, we'll pick the next one in the randomised list, until exhausted.
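
The fall-back behaviour described above could be sketched roughly as follows. This is an illustration only: `HostUnavailableError`, `run_on_host_group` and the `action` callable are hypothetical names, and how a real implementation would detect an unavailable host (e.g. a "host not found" failure from a poll or kill command) is left as an assumption.

```python
import random


class HostUnavailableError(Exception):
    """Hypothetical error raised when a host cannot be reached."""


def run_on_host_group(hosts, action, rng=random):
    """Try `action(host)` on each host in random order until one succeeds.

    Returns (host, result) for the first host that works, and raises
    HostUnavailableError once the randomised list is exhausted.
    """
    candidates = list(hosts)
    rng.shuffle(candidates)  # randomise once, then walk the list in order
    failures = []
    for host in candidates:
        try:
            return host, action(host)
        except HostUnavailableError as exc:
            failures.append((host, str(exc)))
    raise HostUnavailableError(f"all hosts unavailable: {failures}")
```

Only availability failures trigger a retry on the next host; any other error from `action` propagates immediately, so a genuine job submission failure is not masked by trying every login node.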

matthewrmshin · 2017-10-20T20:15:26Z

Change of title to allow a more general discussion of compute cluster support. (The suite host may be part of the cluster, so it is not limited to handling of login nodes.)

I can now see that global.rc should have a new clusters section that will mostly supersede the current hosts section.

# global.rc
[clusters]  # platforms?
    [[spicy]]
        login hosts = peppercorn, clove, cinnamon, fennel, star-anise
        batch system = slurm
        # and pretty much everything under a host subsection in the hosts section
    [[hedge-pea-sea]]
        login hosts = localhost
        batch system = pbs
        # and so on

With clusters, I think the following may also be relevant:

- Custom job management (e.g. submit, poll, kill) commands.
- List of file systems that are shared with the suite host? And other clusters?
- A URL for checking the status of the cluster?
- Custom logic to invoke for collecting job accounting information when a job completes?
- Batch scheduler directives that should be added to all jobs?
- Maximum number of jobs a user can have submitted to a cluster at a given time.
- Hold all tasks that target a cluster, e.g. when the cluster is scheduled for an outage. See: Hold (pause) by job host. #2144.
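
As a rough illustration of the last two items, per-cluster bookkeeping for a submission limit and a hold flag might look like the sketch below. The `ClusterState` class and its attribute names are hypothetical and do not correspond to anything in Cylc.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClusterState:
    """Hypothetical per-cluster bookkeeping kept by the suite."""

    max_submitted: Optional[int] = None  # None means no limit
    held: bool = False  # True while the cluster is held, e.g. for an outage
    submitted: set = field(default_factory=set)

    def can_submit(self):
        """A job may be submitted if the cluster is not held and not full."""
        if self.held:
            return False
        return self.max_submitted is None or len(self.submitted) < self.max_submitted

    def submit(self, job_id):
        """Record a submission; return False if it must wait."""
        if not self.can_submit():
            return False
        self.submitted.add(job_id)
        return True

    def job_finished(self, job_id):
        """Free a slot when a job completes or is removed."""
        self.submitted.discard(job_id)
```

Setting `held = True` ahead of a scheduled outage stops new submissions to that cluster without touching tasks that target other clusters.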

matthewrmshin · 2018-01-25T09:20:06Z

Somewhat related to #2144 and #2528. A recent unexpected outage meant that jobs were drained from the cluster while it remained down for an extended period of time. Suites were unable to poll or kill submitted/running tasks on the cluster. It would be nice if:

- Users are able to reset all jobs submitted to a cluster in a single command.
- Suites are able to detect this automatically (via a site setting?) and are then able to reset all affected tasks to go into statuses like submit-failed. (See also: new task state 'killed'? #2394.)
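
A bulk reset of this kind could be sketched as below. The task records (plain dicts with `name`, `cluster` and `state` keys) and the `reset_cluster_tasks` function are hypothetical, chosen only to show the selection logic; the set of "active" states is also an assumption.

```python
# Assumed set of states in which a task has a job on the cluster.
ACTIVE_STATES = {"submitted", "running"}


def reset_cluster_tasks(tasks, cluster, new_state="submit-failed"):
    """Reset every active task on `cluster` to `new_state`.

    `tasks` is a list of dicts with "name", "cluster" and "state" keys.
    Returns the names of the tasks that were changed.
    """
    changed = []
    for task in tasks:
        if task["cluster"] == cluster and task["state"] in ACTIVE_STATES:
            task["state"] = new_state
            changed.append(task["name"])
    return changed
```

Tasks on other clusters, and tasks that have already finished, are left untouched.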

hjoliver · 2018-01-25T09:43:04Z

This issue has superseded #2144, absorbing the hold by host/cluster feature request.

matthewrmshin · 2019-02-01T09:49:31Z

See also this discussion https://groups.google.com/forum/?fromgroups=#!topic/cylc/KoFhCGurLTo - we should also consider the ability to configure cluster specific environment variables or even extra custom logic.

wxtim · 2019-07-31T09:52:36Z

Having started to consider this I think that there are some issues with describing all possible job hosts as "clusters". I have come to the view that I prefer the phrase "job platforms" which doesn't imply anything about whether we are running our jobs on a raspi0, or a desktop, or a cray.

wxtim · 2021-01-28T14:16:48Z

@hjoliver Can we close this issue?
I think the only outstanding related issue is #3827.

hjoliver · 2021-01-28T20:13:21Z

Yes, good. Thanks for the reminder @wxtim

hjoliver added this to the some-day milestone Mar 9, 2017

matthewrmshin modified the milestones: later, some-day Mar 10, 2017

matthewrmshin self-assigned this Mar 10, 2017

matthewrmshin mentioned this issue Apr 26, 2017

cylc poll: tell suite that a task can be polled using a different login node on a cluster #2174

Closed

matthewrmshin mentioned this issue Apr 26, 2017

Hold (pause) by job host. #2144

Closed

matthewrmshin mentioned this issue May 17, 2017

Job host selection and preparation in workers or sub-processes #2292

Closed

matthewrmshin modified the milestones: soon, later Jun 22, 2017

dvalters self-assigned this Jul 13, 2017

dvalters removed their assignment Oct 18, 2017

matthewrmshin changed the title ~~Cylc awareness of multiple login nodes.~~ Compute cluster awareness Oct 20, 2017

matthewrmshin mentioned this issue Nov 8, 2017

Init remote via multiprocessing pool #2468

Merged

matthewrmshin mentioned this issue Dec 6, 2017

slurm --clusters support #2504

Open

matthewrmshin mentioned this issue Jan 16, 2018

Unable to get cylc to use PBS #2541

Closed

matthewrmshin added the efficiency For notable efficiency improvements label Jan 21, 2018

This was referenced Apr 11, 2018

global.rc: per host batch system commands #1413

Closed

Kill job on task state reset from submitted or running #2621

Closed

oliver-sanders mentioned this issue Apr 23, 2018

Rose user guide rose tutorial metomi/rose#2170

Merged

11 tasks

matthewrmshin changed the title ~~Compute cluster awareness~~ Compute cluster or site platform awareness May 3, 2018

matthewrmshin mentioned this issue Aug 15, 2018

Single batch job for multiple tasks #2754

Open

matthewrmshin mentioned this issue Sep 26, 2018

Support for different $HOME on cylc job remote and execution nodes. #2779

Closed

matthewrmshin mentioned this issue Nov 26, 2018

PoC - GraphQL endpoint Design & Implementation #2873

Closed

matthewrmshin mentioned this issue Jun 11, 2019

global.rc get_derived_host_item refactor #3189

Merged

5 tasks

wxtim mentioned this issue Jul 31, 2019

Replace suite.rc & global.rc with cylc-flow.rc #3260

Closed

matthewrmshin modified the milestones: soon, cylc-8.0a2 Aug 28, 2019

hjoliver assigned wxtim and unassigned matthewrmshin Oct 8, 2019

oliver-sanders mentioned this issue Jan 28, 2020

Host select #3489

Merged

9 tasks

hjoliver modified the milestones: cylc-8.0a2, cylc-8.0a3 Apr 30, 2020

hjoliver closed this as completed Jan 28, 2021

hjoliver modified the milestones: cylc-8.0a3, cylc-8.0b0 Feb 25, 2021
