Working with remote cluster workers and a local scheduler #257

Closed
m-albert opened this issue Apr 3, 2019 · 7 comments

@m-albert

m-albert commented Apr 3, 2019

Hi, I was wondering how to best manage remote workers using jobqueue.

I'm referring to the scenario in which I want the scheduler (and client) to run locally, but the workers to run on the cluster. Is this already supported (would it even fall into the intended use)?

If not, I was thinking of overriding this code in core.py to send the command over SSH. Does that make sense?

    def _submit_job(self, script_filename):
        return self._call(shlex.split(self.submit_command) + [script_filename])

Actually, I'd also have to change stop_jobs. But I could avoid interacting with the cluster environment altogether by simply using the scheduler's retire_workers method with close=True.
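
For concreteness, here's roughly the kind of override I have in mind (an untested sketch only; the subclass name, the ssh prefix and the host are placeholders, and it assumes _submit_job/_call keep the signatures quoted above):

    import shlex

    from dask_jobqueue import SLURMCluster


    class RemoteSLURMCluster(SLURMCluster):
        """Hypothetical subclass that runs the submit command on a remote login node."""

        # placeholder: where the job queuing system is reachable
        ssh_prefix = "ssh a@login.cluster"

        def _submit_job(self, script_filename):
            # Same as the upstream method, but prefixed with ssh so that
            # sbatch runs on the login node instead of locally.
            # Note: script_filename is a local temp file, so the script would
            # still have to be copied to the remote side somehow.
            return self._call(
                shlex.split(self.ssh_prefix)
                + shlex.split(self.submit_command)
                + [script_filename]
            )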

Thanks for any hints!

@m-albert
Author

m-albert commented Apr 3, 2019

Ok, I have it running now. Steps:

  1. start a local SLURMCluster
  2. copy a job script to the cluster (obtained from cluster.job_script())
  3. set
    cluster.submit_command = "ssh a@login.cluster sbatch %s" % script_filepath
    cluster.cancel_command = "ssh a@login.cluster scancel"

I have to copy the script manually because I couldn't figure out how to change the submit command so that subprocess.Popen executes a local script over ssh.
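
Put together, the whole workflow looks roughly like this (only a sketch; the hostname and remote path are placeholders, and it assumes the SLURM settings live in my jobqueue config):

    import subprocess

    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    remote_script = "/remote/path/job.sh"      # placeholder path on the cluster

    # 1. start a local SLURMCluster (the scheduler runs on this machine)
    cluster = SLURMCluster()
    client = Client(cluster)

    # 2. copy the generated job script to the cluster by hand
    with open("job.sh", "w") as f:
        f.write(cluster.job_script())
    subprocess.run(["scp", "job.sh", "a@login.cluster:%s" % remote_script])

    # 3. route submit/cancel through ssh
    cluster.submit_command = "ssh a@login.cluster sbatch %s" % remote_script
    cluster.cancel_command = "ssh a@login.cluster scancel"

    # scaling now submits jobs on the remote cluster; the workers connect
    # back to the local scheduler over TCP
    cluster.scale(4)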

Finally, I had to change one line of dask_jobqueue source code:
In core.py line 448:

        self._call([self.cancel_command] + list(set(jobs)))

to

        self._call(shlex.split(self.cancel_command) + list(set(jobs)))

I'd suggest this change be adopted upstream, since currently JobQueueCluster.submit_command can be extended with extra words (it is passed through shlex.split) while JobQueueCluster.cancel_command cannot.

@guillaumeeb
Member

Hi @quakberto, welcome here!

I'm referring to the scenario in which I want the scheduler (and client) to run locally, but the workers to run on the cluster. Is this already supported (would it even fall into the intended use)?

Dask-jobqueue is intended to be used from a node which has access to the job queuing system's submit and cancel commands. It also needs to be able to establish TCP communication between the client node and the cluster.

I'm really glad you found a solution to your problem. For your copy problem, you could maybe use scp {submission_script} user@cluster:/tmp; ssh ... as the submit_command. Not sure whether this would work.

I'd suggest this change be adopted upstream

Sounds reasonable, would you open a PR?

@lesteve
Member

lesteve commented Apr 9, 2019

I opened a PR about adding shlex.split for self.cancel_command in #261.

@quakberto I'd be interested in your motivation for your original goal, i.e. a local scheduler + client with remote workers. In other words, what is problematic about having the scheduler + client run on a node inside the cluster?

In most cluster configurations I have seen, the local scheduler (if it is outside the cluster as seems to be the case in your example) is unlikely to be able to connect by TCP to a worker node.

@m-albert
Author

m-albert commented Apr 9, 2019

@guillaumeeb thanks a lot for the welcome and the clarifications and suggestions! I tried

scp {submission_script} user@cluster:/tmp; ssh ...

but it didn't work. I guess shlex.split does not treat the ';' as a command separator, so the two commands don't get split. Apparently an alternative would be to use shell=True in subprocess.Popen.
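
A small illustration of what I mean (the script name and paths are made up):

    import shlex
    import subprocess

    cmd = "scp job.sh a@login.cluster:/tmp; ssh a@login.cluster sbatch /tmp/job.sh"

    # shlex.split only tokenizes the string; the ';' stays attached to a token,
    # so Popen would pass everything as arguments to scp instead of running
    # two separate commands:
    print(shlex.split(cmd))
    # ['scp', 'job.sh', 'a@login.cluster:/tmp;', 'ssh', 'a@login.cluster',
    #  'sbatch', '/tmp/job.sh']

    # With shell=True the string goes through /bin/sh, which does split it
    # into two commands (but dask-jobqueue does not use shell=True):
    subprocess.Popen(cmd, shell=True)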

@lesteve thanks for opening the PR, I'm pretty busy at the moment and was going to do it in the next few days.

Regarding the motivation to run the scheduler locally: in the cluster environment at my institute the cluster nodes are accessible by TCP; this is probably indeed not so common. My reasons are mainly convenience, in the sense that I could in principle also run everything on the cluster. By convenience I mostly mean that I run my code in local Jupyter notebooks and, depending on cluster availability, I can add cluster workers to my local scheduler.

This is especially convenient when there is i/o within tasks. Despite the TCP connectivity, I cannot mount shares on the cluster nodes, which means I would have to organise the data staging myself. Instead, what I do is connect two types of workers to the local scheduler, namely cluster workers and local workers. If I then restrict the tasks containing i/o to the local workers (either using worker restrictions or resources), the serialisation within distributed does the staging for me (see the sketch below).
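
Roughly, the resources-based variant looks like this (a sketch only; the scheduler address, paths and function bodies are made up). The local workers are started with something like dask-worker <scheduler> --resources "io=1", the cluster workers without it, so only the local workers can run the tasks that touch the share:

    from dask.distributed import Client

    client = Client("tcp://localhost:8786")             # local scheduler

    paths = ["/share/img_0.tif", "/share/img_1.tif"]    # only visible locally

    def load(path):
        # i/o task: restricted to local workers, which can see the share
        with open(path, "rb") as f:
            return f.read()

    def process(data):
        # compute-heavy part: fine to run on the cluster workers
        return len(data)

    loaded = [client.submit(load, p, resources={"io": 1}) for p in paths]
    results = client.gather([client.submit(process, fut) for fut in loaded])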

I guess that in any case I could also have the scheduler on the cluster and connect a local client to it (and, in that case, local workers too), but so far I haven't experienced problems with the configuration I described. On the contrary, the cluster network sometimes has problems, and keeping the scheduler local keeps it responsive.

@lesteve
Member

lesteve commented Apr 10, 2019

This is an interesting setup; I'd be very keen to see some code to understand better how you do it, in particular how you create your local scheduler and how you add remote workers to it with dask-jobqueue.

@guillaumeeb
Member

@quakberto I just merged the PR by @lesteve. Let us know if you need anything else to ease your use case.

@guillaumeeb
Member

Closing, as I think the modification to the codebase needed by @quakberto has been made, and the remaining part of the OP looks a bit uncommon to me.

Feel free to reopen if you think I'm wrong and/or this needs something else.
