Compute cluster or site platform awareness #2199
Comments
We need to have a concept similar I think we can do something like this:
(Promoting the milestone and self-assigning, to avoid this being lost in the ether.)
@matthewrmshin - your proposal sounds good, but you haven't explicitly addressed what to do if the (randomly) chosen host goes down. Presumably (as I suggested above) we'd need a retry-via-other-host mechanism to handle poll and kill (etc.) failures due to the target host going offline? Of course this would only work for jobs submitted to a batch scheduler (background jobs running on a particular login node are just screwed if the node goes down).
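A minimal sketch of the retry-via-other-host idea, for illustration only. Nothing here is Cylc code: `run_with_failover`, `run_on_host` and `AllHostsFailed` are hypothetical names, and treating `ConnectionError` as the "host is offline" signal is an assumption. The caller supplies whatever actually runs the poll or kill command on a given host.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


class AllHostsFailed(Exception):
    """Raised when the command could not be run on any login host."""


def run_with_failover(
    hosts: Sequence[str],
    run_on_host: Callable[[str], str],
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Try a command on the login hosts in random order.

    `run_on_host` is a hypothetical callable that runs the poll or kill
    command on one host and raises ConnectionError if that host cannot
    be reached (an assumption made for this sketch).
    Returns (host_used, command_output).
    """
    rng = rng or random.Random()
    candidates: List[str] = list(hosts)
    rng.shuffle(candidates)  # random choice of first host, then the rest
    failed: List[str] = []
    for host in candidates:
        try:
            return host, run_on_host(host)
        except ConnectionError:
            # Host looks offline: remember it and try the next one.
            failed.append(host)
    raise AllHostsFailed(
        "no reachable login host among: " + ", ".join(failed)
    )
```

As noted above, this can only help for jobs handed to a batch scheduler. A background job tied to one login node has no alternative host to ask.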
OK. We'll make sure to consider:
Change of title to allow a more general discussion of compute cluster support. (The suite host may be part of the cluster, so it is not limited to handling of login nodes.) I can now see that
# global.rc
[clusters] # platforms?
    [[spicy]]
        login hosts = peppercorn, clove, cinnamon, fennel, star-anise
        batch system = slurm
        # and pretty much everything under a host subsection in the hosts section
    [[hedge-pea-sea]]
        login hosts = localhost
        batch system = pbs
        # and so on
With clusters, I think the following may also be relevant:
Somewhat related to #2144 and #2528. A recent unexpected outage meant that jobs were drained from the cluster while it remained down for an extended period of time. Suites were unable to poll or kill submitted/running tasks on the cluster. It would be nice if:
This issue has superseded #2144, absorbing the hold by host/cluster feature request.
See also this discussion: https://groups.google.com/forum/?fromgroups=#!topic/cylc/KoFhCGurLTo - we should also consider the ability to configure cluster-specific environment variables or even extra custom logic.
Having started to consider this, I think there are some issues with describing all possible job hosts as "clusters". I have come to prefer the phrase "job platforms", which doesn't imply anything about whether we are running our jobs on a raspi0, a desktop, or a Cray.
Yes, good. Thanks for the reminder @wxtim
Ref: https://groups.google.com/forum/#!topic/cylc/dFVNTeyPcrs
We could make Cylc aware of the concept of a group of HPC login nodes. If the original job submit node goes down, we could try the alternative node(s) in the event that a poll or kill command fails with host not found.
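To make the "group of login nodes" idea concrete, here is a small sketch of looking up the alternative nodes for a job's original submit host. The `PLATFORMS` mapping and the `alternative_hosts` helper are hypothetical, not part of Cylc; the host names are borrowed from the example configuration in the comments on this issue.

```python
from typing import Dict, List

# Hypothetical in-memory form of a login-node group configuration.
PLATFORMS: Dict[str, List[str]] = {
    "spicy": ["peppercorn", "clove", "cinnamon", "fennel", "star-anise"],
    "hedge-pea-sea": ["localhost"],
}


def alternative_hosts(
    platforms: Dict[str, List[str]], submit_host: str
) -> List[str]:
    """Return the other login hosts in the same group as `submit_host`.

    An empty list means there is nothing to fall back on, either because
    the host is the only one in its group or because it is not listed.
    """
    for hosts in platforms.values():
        if submit_host in hosts:
            return [host for host in hosts if host != submit_host]
    return []
```

If a poll or kill on the submit host fails with host not found, the returned hosts are the candidates to retry on.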