Working with remote cluster workers and a local scheduler #257

Closed
m-albert opened this issue Apr 3, 2019 · 7 comments

@m-albert

m-albert commented Apr 3, 2019

Hi, I was wondering how to best manage remote workers using jobqueue.

I'm referring to the scenario in which I want the scheduler (and client) to run locally, but the workers to run on the cluster. Is this already supported (would it even fall into the intended use)?

If not, I was thinking of overriding this code in core.py to send the command over SSH. Does that make sense?

    def _submit_job(self, script_filename):
        return self._call(shlex.split(self.submit_command) + [script_filename])

Actually, I'd also have to change stop_jobs. But I could avoid interacting with the cluster environment altogether by simply using the scheduler's retire_workers method with close=True.
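
For concreteness, here's roughly the kind of override I have in mind (an untested sketch only; the subclass name, the ssh prefix and the host are placeholders, and it assumes _submit_job/_call keep the signatures quoted above):

    import shlex

    from dask_jobqueue import SLURMCluster


    class RemoteSLURMCluster(SLURMCluster):
        """Hypothetical subclass that runs the submit command on a remote login node."""

        # placeholder: where the job queuing system is reachable
        ssh_prefix = "ssh a@login.cluster"

        def _submit_job(self, script_filename):
            # Same as the upstream method, but prefixed with ssh so that
            # sbatch runs on the login node instead of locally.
            # Note: script_filename is a local temp file, so the script would
            # still have to be copied to the remote side somehow.
            return self._call(
                shlex.split(self.ssh_prefix)
                + shlex.split(self.submit_command)
                + [script_filename]
            )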

Thanks for any hints!

@m-albert
Author

m-albert commented Apr 3, 2019

Ok, I have it running now. Steps:

  1. start a local SLURMCluster
  2. copy a job script to the cluster (obtained from cluster.job_script())
  3. set
    cluster.submit_command = "ssh a@login.cluster sbatch %s" % script_filepath
    cluster.cancel_command = "ssh a@login.cluster scancel"

I have to copy the script manually because I couldn't figure out how to change the submit command so that subprocess.Popen executes a local script over ssh.
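
Put together, the whole workflow looks roughly like this (only a sketch; the hostname and remote path are placeholders, and it assumes the SLURM settings live in my jobqueue config):

    import subprocess

    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    remote_script = "/remote/path/job.sh"      # placeholder path on the cluster

    # 1. start a local SLURMCluster (the scheduler runs on this machine)
    cluster = SLURMCluster()
    client = Client(cluster)

    # 2. copy the generated job script to the cluster by hand
    with open("job.sh", "w") as f:
        f.write(cluster.job_script())
    subprocess.run(["scp", "job.sh", "a@login.cluster:%s" % remote_script])

    # 3. route submit/cancel through ssh
    cluster.submit_command = "ssh a@login.cluster sbatch %s" % remote_script
    cluster.cancel_command = "ssh a@login.cluster scancel"

    # scaling now submits jobs on the remote cluster; the workers connect
    # back to the local scheduler over TCP
    cluster.scale(4)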

Finally, I had to change one line of dask_jobqueue source code:
In core.py line 448:

        self._call([self.cancel_command] + list(set(jobs)))

to

        self._call(shlex.split(self.cancel_command) + list(set(jobs)))

I'd suggest this change be adopted upstream, since currently JobQueueCluster.submit_command can be extended with extra words (it is passed through shlex.split) while JobQueueCluster.cancel_command cannot.

@guillaumeeb
Member

Hi @quakberto, welcome here!

I'm referring to the scenario in which I want the scheduler (and client) to run locally, but the workers to run on the cluster. Is this already supported (would it even fall into the intended use)?

Dask-jobqueue is intended to be used from a node which has access to the job queuing system's submit and cancel commands. It also needs to be able to establish TCP communication between the client node and the cluster.

I'm really glad you found a solution to your problem. For your copy problem, you could maybe use scp {submission_script} user@cluster:/tmp; ssh ... as the submit_command. Not sure whether this would work.

I'd suggest this change be adopted upstream

Sounds reasonable, would you open a PR?

@lesteve
Member

lesteve commented Apr 9, 2019

I opened a PR about adding shlex.split for self.cancel_command in #261.

@quakberto I'd be interested in your motivation for your original goal, i.e. a local scheduler + client with remote workers. In other words, what is problematic about having the scheduler + client run on a node inside the cluster?

In most cluster configurations I have seen, the local scheduler (if it is outside the cluster as seems to be the case in your example) is unlikely to be able to connect by TCP to a worker node.

@m-albert
Author

m-albert commented Apr 9, 2019

@guillaumeeb thanks a lot for the welcome and the clarifications and suggestions! I tried

scp {submission_script} user@cluster:/tmp; ssh ...

but it didn't work. I guess shlex.split does not treat the ';' as a command separator, so the two commands don't get split. Apparently an alternative would be to use shell=True in subprocess.Popen.
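
A small illustration of what I mean (the script name and paths are made up):

    import shlex
    import subprocess

    cmd = "scp job.sh a@login.cluster:/tmp; ssh a@login.cluster sbatch /tmp/job.sh"

    # shlex.split only tokenizes the string; the ';' stays attached to a token,
    # so Popen would pass everything as arguments to scp instead of running
    # two separate commands:
    print(shlex.split(cmd))
    # ['scp', 'job.sh', 'a@login.cluster:/tmp;', 'ssh', 'a@login.cluster',
    #  'sbatch', '/tmp/job.sh']

    # With shell=True the string goes through /bin/sh, which does split it
    # into two commands (but dask-jobqueue does not use shell=True):
    subprocess.Popen(cmd, shell=True)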

@lesteve thanks for opening the PR, I'm pretty busy at the moment and was going to do it in the next few days.

Regarding the motivation to run the scheduler locally: in the cluster environment at my institute the cluster nodes are accessible by TCP; this is probably indeed not so common. My reasons are mainly convenience, in the sense that I could in principle also run everything on the cluster. By convenience I mostly mean that I run my code in local Jupyter notebooks and, depending on cluster availability, I can add cluster workers to my local scheduler.

This is especially convenient when there is i/o within tasks. Despite the TCP connectivity, I cannot mount shares on the cluster nodes, which means I would have to organise the data staging myself. Instead, what I do is connect two types of workers to the local scheduler, namely cluster workers and local workers. If I then restrict the tasks containing i/o to the local workers (either using worker restrictions or resources), the serialisation within distributed does the staging for me (see the sketch below).
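
Roughly, the resources-based variant looks like this (a sketch only; the scheduler address, paths and function bodies are made up). The local workers are started with something like dask-worker <scheduler> --resources "io=1", the cluster workers without it, so only the local workers can run the tasks that touch the share:

    from dask.distributed import Client

    client = Client("tcp://localhost:8786")             # local scheduler

    paths = ["/share/img_0.tif", "/share/img_1.tif"]    # only visible locally

    def load(path):
        # i/o task: restricted to local workers, which can see the share
        with open(path, "rb") as f:
            return f.read()

    def process(data):
        # compute-heavy part: fine to run on the cluster workers
        return len(data)

    loaded = [client.submit(load, p, resources={"io": 1}) for p in paths]
    results = client.gather([client.submit(process, fut) for fut in loaded])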

I guess that in any case I could also have the scheduler on the cluster and connect a local client to it (and, in that case, local workers too), but so far I haven't experienced problems with the configuration I described. On the contrary, the cluster network sometimes has problems, and keeping the scheduler local keeps it responsive.

@lesteve
Member

lesteve commented Apr 10, 2019

This is an interesting setup; I'd be very keen to see some code to understand better how you do it, in particular how you create your local scheduler and how you add remote workers to it with dask-jobqueue.

@guillaumeeb
Member

@quakberto I just merged the PR by @lesteve. Let us know if you need anything else to ease your use case.

@guillaumeeb
Member

Closing, as I think the modification to the codebase needed by @quakberto has been made, and the remaining part of the OP looks a bit uncommon to me.

Feel free to reopen if you think I'm wrong and/or this needs something else.
