Working with remote cluster workers and a local scheduler #257
Ok, I have it running now. Steps:

    cluster.submit_command = "ssh a@login.cluster sbatch %s" % script_filepath
    cluster.cancel_command = "ssh a@login.cluster scancel"

I have to copy the script manually because I couldn't figure out how to change the submit command so that subprocess.Popen executes a local script over ssh.

Finally, I had to change one line of the dask_jobqueue source code from

    self._call([self.cancel_command] + list(set(jobs)))

to

    self._call(shlex.split(self.cancel_command) + list(set(jobs)))

I'd suggest taking over this change, since currently the cancel command cannot contain any arguments.
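A minimal sketch of the setup from this comment, assuming a SLURMCluster and the dask-jobqueue API from the time of this issue, where submit_command and cancel_command are plain attributes on the cluster object; the constructor arguments, login host, user, and script path are illustrative:

```python
from dask_jobqueue import SLURMCluster

# Constructor arguments are illustrative; use whatever matches your cluster.
cluster = SLURMCluster(cores=4, memory="16GB", walltime="01:00:00")

# Path of the job script after it has been copied to the cluster by hand.
script_filepath = "/home/a/dask_worker_job.sh"

# Wrap the queue commands in ssh so that sbatch/scancel run on the login node
# while the cluster object (and the scheduler) stay on the local machine.
cluster.submit_command = "ssh a@login.cluster sbatch %s" % script_filepath
cluster.cancel_command = "ssh a@login.cluster scancel"

cluster.scale(2)  # submit two worker jobs through the ssh-wrapped command
```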
Hi @quakberto, welcome here!
Dask-jobqueue is intended to be used from a node which has access to the job queuing system's submit and cancel commands. It also needs to be able to establish TCP communications between the client node and the cluster. I'm really glad you found a solution to your problem. For your copy problem, you could maybe call a copy command such as scp yourself before submitting.
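A minimal sketch of that copy step, assuming plain scp access to the login node; the helper name, user, host, and paths below are hypothetical:

```python
import subprocess

def stage_script(local_path, remote="a@login.cluster",
                 remote_path="/home/a/dask_worker_job.sh"):
    """Copy a locally generated job script to the cluster login node."""
    subprocess.check_call(["scp", local_path, "%s:%s" % (remote, remote_path)])
    return remote_path

# e.g. before submitting:
# stage_script("job_script.sh")
```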
Sounds reasonable, would you open a PR?
I opened a PR about adding the shlex.split change.

@quakberto I'd be interested in your motivations regarding your original goal, i.e. local scheduler + client with remote workers. In other words, what is problematic about having the scheduler + client running on a node inside the cluster? In most cluster configurations I have seen, the local scheduler (if it is outside the cluster, as seems to be the case in your example) is unlikely to be able to connect by TCP to a worker node.
@guillaumeeb thanks a lot for the welcome and the clarifications and suggestions!

I tried chaining two commands with ';', but it didn't work. I guess shlex.split does not properly handle ';' to create two commands. Apparently an alternative would be to run the command through a shell.

@lesteve thanks for opening the PR, I'm pretty busy at the moment and was going to do it in the next days.

Regarding the motivation to run the scheduler locally: in the cluster environment at my institute the cluster nodes are accessible by TCP; probably this is indeed not so common. My reasons are mainly convenience, meaning that I could also run everything on the cluster. By convenience I mean mostly that I run my code in local Jupyter notebooks, and depending on cluster availability I can add cluster workers to my local scheduler. This is especially convenient when having I/O within tasks: despite TCP connectivity, I cannot mount shares on the cluster nodes, which means I would have to take care of organising the data staging myself. Instead, what I do is have two types of workers connected to the local scheduler, namely cluster workers and local workers. If I then restrict the tasks containing I/O to the local workers (either using restrictions or resources), the serialisation within distributed does the staging for me.

I guess that in any case I could also have the scheduler on the cluster and connect a local client to it (and in the latter case local workers), but so far I don't experience problems with the configuration I described. On the contrary, the cluster network sometimes has problems and having the scheduler local keeps it responsive.
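A minimal sketch of the kind of setup described in this comment, assuming the ssh-wrapped SLURMCluster from above for the remote workers, a separately started local worker carrying a custom "io" resource, and the resources argument of client.map to pin I/O tasks to it; all names, addresses, and paths are illustrative:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# The cluster object starts its scheduler in the local (notebook) process;
# the ssh-wrapped submit command makes the worker jobs run on the cluster
# and connect back to this scheduler over TCP.
cluster = SLURMCluster(cores=4, memory="16GB")
cluster.submit_command = "ssh a@login.cluster sbatch /home/a/dask_worker_job.sh"
cluster.cancel_command = "ssh a@login.cluster scancel"
cluster.scale(2)

client = Client(cluster)

# A local worker is started separately, pointed at the same scheduler, and
# advertises a custom "io" resource, e.g. from a local shell:
#
#     dask-worker <scheduler-address> --resources "io=1"
#
# where <scheduler-address> is cluster.scheduler_address.

def load_chunk(path):
    """Hypothetical I/O task reading from a share mounted only locally."""
    import numpy as np
    return np.load(path)

# Restrict the I/O tasks to workers that provide the "io" resource (the local
# ones); downstream computation can run on any worker, and distributed's
# serialisation moves the data to the cluster workers as needed.
futures = client.map(load_chunk, ["/data/a.npy", "/data/b.npy"],
                     resources={"io": 1})
```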
This is an interesting setup; I'd be very keen to see some code to understand better how you do it, in particular how you create your local scheduler and how you add remote workers to it with dask-jobqueue.
@quakberto I just merged the PR by @lesteve. Let us know if you need something else to ease your use case.
Closing, as I think the modification to the codebase needed by @quakberto has been made, and the other part of the OP looks a bit uncommon to me. Feel free to reopen if you think I'm wrong and/or this needs something else.
Hi, I was wondering how to best manage remote workers using jobqueue.

I'm referring to the scenario in which I want the scheduler (and client) to run locally, but the workers to run on the cluster. Is this already supported (would it even fall into the intended use)?

If not, I was thinking to override this code in core.py and send the command over SSH. Does that make sense? Actually, I'd also have to change stop_jobs. But I could avoid having to interact with the cluster environment by simply using the retire_workers method of the scheduler and setting close=True.

Thanks for any hints!
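A minimal sketch of that last idea, retiring remote workers from the client side instead of cancelling their jobs through the queueing system; the scheduler and worker addresses are illustrative, and the exact keyword (close_workers below) may differ between distributed versions:

```python
from dask.distributed import Client

client = Client("tcp://127.0.0.1:8786")  # the locally running scheduler

# Ask the scheduler to retire specific workers and shut them down; the
# corresponding batch jobs then finish on their own, so no scancel over
# ssh is needed.
workers_to_drop = ["tcp://cluster-node-01:34567"]
client.retire_workers(workers=workers_to_drop, close_workers=True)
```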