Allow to start a Scheduler in a batch job #186
With #306 we've got a lot of what is needed to start a Scheduler in a dedicated batch job. I also believe we've got everything needed on the Distributed side too, as dask-kubernetes doesn't use a LocalCluster anymore, see dask/dask-kubernetes#162.
Yes, everything should be set up. We probably need to abstract out the FooJob classes to split between scheduler and workers. All of the SpecCluster infrastructure is there though.
Just curious, could you elaborate a bit more on why this would be useful? I have some guesses, but I just want to make sure they are somewhat accurate.
In practice I doubt that the scheduler will be expensive enough that system administrators will care. They all ask about this, but I don't think that it will be important in reality.

Another reason to support this is networking rules. On some systems (10-20%?) compute nodes are unable to connect back to login nodes, so placing the scheduler on a compute node, and then connecting to that node from the client/login node, is nice. It may be, though, that this is a frequently requested but not actually useful feature.
In #354 @orbitfold seems to be in this particular case (at least for some clusters he has access to). @mrocklin could you give a few pointers that would help getting started on implementing a remote scheduler (my understanding is that using …)?
We need to make an object that follows this interface. The simplest example today is probably SSHCluster: https://github.com/dask/distributed/blob/master/distributed/deploy/ssh.py. But rather than starting things over SSH, it would presumably submit a job.
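The idea above could be sketched roughly like this: instead of SSH-ing into hosts the way SSHCluster does, a job class renders a batch script that runs `dask-scheduler` and submits it to the queueing system. This is a hypothetical sketch, not dask-jobqueue's real API — the class name `PBSSchedulerJob` and its parameters are illustrative assumptions:

```python
import shutil
import subprocess


class PBSSchedulerJob:
    """Hypothetical sketch of a job that starts the Dask scheduler inside
    a PBS batch job (names here are illustrative, not dask-jobqueue's API)."""

    def __init__(self, scheduler_file="scheduler.json", queue="regular",
                 memory="4GB", walltime="01:00:00"):
        self.scheduler_file = scheduler_file
        self.queue = queue
        self.memory = memory
        self.walltime = walltime

    def job_script(self):
        # The scheduler writes its address to a file on the shared
        # filesystem so that workers and clients can find it.
        return "\n".join([
            "#!/usr/bin/env bash",
            f"#PBS -q {self.queue}",
            f"#PBS -l mem={self.memory}",
            f"#PBS -l walltime={self.walltime}",
            f"dask-scheduler --scheduler-file {self.scheduler_file}",
        ])

    def submit(self):
        # Submit the rendered script with qsub, if it is available.
        if shutil.which("qsub") is None:
            raise RuntimeError("qsub not found on this machine")
        return subprocess.run(["qsub"], input=self.job_script(),
                              text=True, capture_output=True)
```

The point of the sketch is only the shape: rendering the script and submitting it replace SSHCluster's "connect and start a process" step, while the rest of the SpecCluster machinery stays the same.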
I am currently using dask-joblib on a PBS cluster and running the scheduler on the login node. It is indeed a bit problematic, because the login node has only 2 GB of memory and it quickly runs out if I am not careful with the size of the computation graphs. So I think I would definitely benefit from this feature.
Interesting, thanks for this use case! 2 GB is certainly very small, even more so when shared between all the cluster users. Are there some other nodes you can ssh to for heavier work, e.g. compilation of C++ code? On some clusters I am familiar with they are called … nodes.

One work-around in your use case is to start an interactive job in which you launch your Dask scheduler, i.e. you run the Python script that creates your cluster object there.
The cluster I am using is very small and has only one type of compute node. I am not sure what you mean by an "interactive" job. Maybe what you say is that I should start the scheduler there?

The current workaround for me is to use a scheduler file to set up the communication between scheduler and workers through the shared file system (without using …).

Ideally I could run a python script on the login node that launches PBS jobs for both the scheduler and the workers.
An interactive job means you submit a job through your job scheduler and you end up in an interactive shell on a compute node. Look at this for an example on a cluster that uses PBS.
This is a mildly annoying restriction if you ask me; maybe try to talk to IT to see whether they would be ready to lift it. People can still do it if they want using …
I am guessing you mean https://docs.dask.org/en/latest/setup/hpc.html#using-a-shared-network-file-system-and-a-job-scheduler. This seems like a reasonable work-around.
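For the record, the scheduler-file mechanism referenced there is just a small JSON file on the shared file system that the scheduler writes and that workers and clients poll until it appears. A minimal stdlib-only sketch of that handshake, with made-up helper names (real deployments simply pass `--scheduler-file` to `dask-scheduler` and `dask-worker` and never write this code themselves):

```python
import json
import os
import tempfile
import time


def write_scheduler_file(path, address):
    # In spirit, what `dask-scheduler --scheduler-file PATH` does:
    # publish the scheduler's contact address atomically, so readers
    # never see a half-written file.
    tmp = path + ".partial"
    with open(tmp, "w") as f:
        json.dump({"address": address}, f)
    os.rename(tmp, path)


def wait_for_scheduler(path, timeout=5.0, poll=0.1):
    # What a worker or client started with `--scheduler-file PATH` does:
    # poll the shared filesystem until the file appears, then connect
    # to the address found inside.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)["address"]
        time.sleep(poll)
    raise TimeoutError(f"no scheduler file at {path}")


# Demo: a temporary directory stands in for the shared filesystem.
with tempfile.TemporaryDirectory() as shared:
    sfile = os.path.join(shared, "scheduler.json")
    write_scheduler_file(sfile, "tcp://10.0.0.5:8786")
    print(wait_for_scheduler(sfile))  # → tcp://10.0.0.5:8786
```

Because the rendezvous goes through the file system rather than the network, it works even when compute nodes cannot open connections back to the login node.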
I have had other people tell me a similar thing (third bullet point of #186 (comment)). If you manage to make that work, please let us know (ideally in a separate issue).
Thank you! I didn't know this and it is very useful!
I'll see if I can talk to IT next year!
Yes, this is what I meant. It does work well enough in my case, because I do not need adaptive scaling of workers (yet). It is a little unfortunate to have to start the cluster separately (not from the notebook that I use for my computations).
@muammar I see that you have commented in #390 (comment). Could you please explain the admin rules that are in place on your cluster, just to get an idea of what you are allowed to do. You may be interested in my answer above: #186 (comment). Let me try to sum up:

1. Start an interactive job and launch your Dask scheduler (i.e. run your Python script) from inside it.
2. Submit a batch job that starts the scheduler, and use a scheduler file on the shared file system so that workers and clients can connect to it.
In both 1. and 2. you need to bear in mind that, as soon as your scheduler job finishes, you will lose all your workers after ~60 s. That may mean losing the result of lengthy computations.
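On that ~60 s point: `dask-worker` exposes a real `--death-timeout` option, the number of seconds a worker waits for an unreachable scheduler before shutting itself down, so raising it gives a grace window to restart the scheduler job without losing the workers. A small sketch of building such a worker command line (the helper function itself is hypothetical):

```python
def worker_command(scheduler_file="scheduler.json", death_timeout=60):
    # `--death-timeout` and `--scheduler-file` are real dask-worker
    # options; this helper merely assembles a command line with a
    # longer grace period than the default behavior described above.
    return [
        "dask-worker",
        "--scheduler-file", scheduler_file,
        "--death-timeout", str(death_timeout),
    ]


# e.g. give workers ten minutes to survive a scheduler restart:
print(" ".join(worker_command(death_timeout=600)))
```

Whether a long death timeout is appropriate depends on the site's accounting rules: idle workers waiting for a scheduler still burn allocation.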
One of the goals of the `ClusterManager` object is to be able to launch a remote scheduler. In dask-jobqueue's scope, this probably means submitting a job which will start a Scheduler, and then connecting to it. We probably still lack some remote interface between `ClusterManager` and the `scheduler` object for this to work, so it will probably mean extending APIs upstream. Identified Scheduler methods to provide:
I suspect that adaptive will need to change significantly too; this will maybe lead to a transitional adaptive logic in dask-jobqueue, and other remote functions to add to the scheduler.
This is in the scope of #170.