Support for different $HOME on cylc job remote and execution nodes. #2779
Comments
@DamianAgius - just to check that I understand the problem, what is the relationship between the "ssh remote" and the "execution host"? Normally (well, in my experience anyway) it would be a login node (or similar) that sees the same home filesystem as the compute nodes. In your case, are both hosts on the same shared filesystem but with different home directories, or different filesystems and different home directories? If the "ssh remote" and the "execution host" do see the same filesystem, would it suffice to tell PBS the full - instead of relative - path to the desired job log location? (This would also require a change to Cylc, btw.)
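For concreteness, a sketch of that distinction (hypothetical paths, not taken from the thread):

```bash
# Relative path: resolved against the directory qsub runs from
#PBS -o cylc-run/my-suite/log/job/1/my-task/01/job.out

# Full path: unambiguous regardless of where qsub runs
#PBS -o /cluster/home/user/cylc-run/my-suite/log/job/1/my-task/01/job.out
```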
In this case, the SSH remote is a boundary node (not a login node) to multiple systems - effectively a suite setup and job submission proxy.
Roger that. So, the use case (partly from offline conversation) could be summarized as: the ssh remote is a single "boundary node" that fronts several HPC clusters with different home filesystems.
@DamianAgius - a further clarification request: does the PBS client on the boundary node put jobs on the different HPCs (with different home filesystems) based purely on the resources requested by the jobs? Because if users have to be aware of which HPC host to target then - I have to ask, before we consider modifying cylc for this - is a separate remote for each of the two different HPCs not a simpler option? (VMs are cheap and easy...)
@matthewrmshin - as the architect of recent cylc job subsystem improvements - is probably best placed to comment on the implications for cylc, if we have to support a single remote that fronts multiple different HPCs.
@hjoliver Each suite will be able to submit to one or more of the clusters, by:
This has been tested and works as expected. We separate the HPC clusters via DNS aliases, which allows the current Cylc configuration to work. It would be nice if Cylc supported not having to set up a DNS alias or round-robin for each cluster, but this is not essential.
Can we ask why there is a necessity to have alternate HOME file systems for each cluster?
There is not strictly a necessity; however, it has certain benefits and is convenient, especially if:
In this instance, the decision was made some time ago to have different home file systems - we will review the decision.
Understood. It is certainly an interesting design to have multiple clusters sharing the same front end host.
@matthewrmshin - so, is it fair to say this is not a trivial fix and therefore needs to wait on your improved cluster awareness work ... probably after the higher priorities that are now spinning up (web architecture, authentication, and GUI)? @DamianAgius - can you confirm you have a workaround for the current setup?
p.s. @DamianAgius - I don't think you answered the 2nd part of my question above: #2779 (comment)
Hi @hjoliver, #2199 would help, as we'll migrate most remote-host-based settings to become cluster-based settings. If it is important enough to solve this, we can in theory raise the priority of #2199 (at least partially) - the change should be mostly orthogonal to those associated with the web UI work - but it will distract the team (when it comes to reviewing and testing the changes, etc.).
@hjoliver Sorry for the delayed response to #2779 (comment) - I was on a week's leave. We are already using separate 'remote' configurations for each cluster, but these boundary nodes are not VMs - they are HPC nodes, with multiple file systems mounted to allow cross-cluster data transfers, and are also acting, for each cluster, as both external interfaces to non-HPC data sources and as the Cylc SSH 'remotes'. Extra info:
Just re-read this issue. @DamianAgius - as per your initial description above, everything (apart from job log retrieval?) works properly if cylc cd's to the cluster home dir location before doing the qsub? It would be an easy change to make cylc do that, even if it is just a temporary workaround. Testing this sort of change will be painful though...
Yes, although we are working on how to set up a test environment (for integration testing, not the in-built Cylc tests, of which I have little knowledge). We manually tested qsub'ing a very simple script to a cluster, from $HOME on the boundary node.
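The original snippet is not preserved here, but the test was presumably something along these lines (a reconstruction with hypothetical paths - the key point being full-path output directives):

```bash
# test.sh - trivial job script, submitted from $HOME on the boundary
# node with: qsub test.sh
# Full-path output directives, so PBS does not resolve them against
# the submission directory:
#PBS -o /cluster/home/user/test.out
#PBS -e /cluster/home/user/test.err
echo "hello from $(hostname)"
```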
That worked - the job ran and PBS copied the output files back correctly. Cylc does seem to use a relative path when setting up job log paths and then when trying to copy the job log back to the suite server VM - is there a way of configuring Cylc to use the full path for the job output files? I like easy changes, and am happy to test with a workaround.
Just talked to @DamianAgius. He envisages that the two HPC clusters (with different home paths) can continue to be accessed via two cylc remotes (that happen to resolve to the same physical "boundary node", but cylc doesn't need to know that). And, as described above, the different home paths for both clusters are visible on the boundary node. So in cylc, we would just have to add a per-remote "home directory" configuration to be used instead of $HOME.
After further discussion with @DamianAgius there's one more complication here: PBS has to be told the target cluster for job query or kill to work, so batch system support will need mods. (We have a similar issue with a heterogeneous Slurm cluster here).
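For example (hypothetical job IDs and server names - this is standard PBS client syntax, not anything Cylc does today):

```bash
# The job ID carries the name of the PBS server that owns the job,
# and query/kill must be directed at that server:
qstat 1234.pbs-server-a
qdel 1234.pbs-server-a

# Jobs at another server can be listed explicitly:
qstat @pbs-server-b
```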
Update: it turns out:
Therefore, a Rose PR has been submitted to resolve this issue: metomi/rose#2252
(That is only true of PBS 14+.) For PBS 13 (still needed at @DamianAgius's site for a bit longer) I've posted #2877.
We have an issue where we are trying to submit jobs from a suite server, via an SSH remote (which does the qsub), to an execution host. Both the SSH remote and the execution host have access to all the required filesystems and PBS; however, $HOME on the SSH remote is not the same as $HOME on the execution host.
Cylc uses a relative path for the PBS output files and submits the job from $HOME on the SSH remote. Therefore, even though the correct log directory exists on the execution host, PBS cannot copy the output files to the job output directory, as it tries to copy to the path given by $HOME on the SSH remote.
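To make the failure mode concrete (hypothetical paths; the directive form is illustrative rather than copied from a real Cylc job file):

```bash
# The job script's output directive uses a path relative to the
# directory qsub runs from:
#PBS -o cylc-run/suite/log/job/1/task/01/job.out

# qsub runs from $HOME on the SSH remote (say /remote/home/user), so
# PBS tries to copy output back to:
#   /remote/home/user/cylc-run/suite/log/job/1/task/01/job.out
# but the log directory actually exists under the execution host's
# $HOME (say /cluster/home/user):
#   /cluster/home/user/cylc-run/suite/log/job/1/task/01/job.out
```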
Everything works up to the point where PBS attempts to do that copy: the cylc-run directory, work directory, run directory, etc. are created in the correct locations on the shared file system, because the Rose configuration specifies the correct directories. However, the PBS directives in the cylc job files that specify the job output and error destinations are based on the directory the qsub occurs from.
Possible fix
We have tested and determined that if you qsub from the directory (on the SSH remote) that represents $HOME on the execution host, the job runs successfully.
Therefore, a configuration option to specify the qsub starting point would be great. E.g., the code would do something equivalent to the following, assuming $JOB_SUBMIT_DIR was our configured directory:
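The original snippet was not captured above; a minimal sketch of the idea (with $JOB_FILE as a stand-in for the generated job script) would be:

```bash
# Change to the configured submission directory (falling back to
# $HOME) before submitting, so PBS resolves relative output paths
# against the execution host's home directory:
cd "${JOB_SUBMIT_DIR:-$HOME}" && qsub "$JOB_FILE"
```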
This would then allow the job to run, and the output files to be copied back by PBS to the correct location on the execution host, which is visible to the SSH remote.
By default, $JOB_SUBMIT_DIR would be "$HOME" - but we would like an optional configuration item, per remote, that would set the directory used for the qsub.
We also note that we would need to use this directory (or possibly another) for job log retrieval, as the job logs exist on the SSH remote in $JOB_SUBMIT_DIR/cylc-run rather than $HOME/cylc-run (and therefore currently they cannot be copied back to the suite server).
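A sketch of what retrieval would then have to do (hypothetical host and suite names - Cylc drives this internally rather than via a user command):

```bash
# Pull job logs from the configured directory on the SSH remote,
# not from $HOME:
rsync -a "boundary-node:$JOB_SUBMIT_DIR/cylc-run/my-suite/log/job/" \
    "$HOME/cylc-run/my-suite/log/job/"
```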