LSFCluster worker doesn't execute all threads on the same node #172
I think no core maintainer of this project uses LSF. But looking at the PR which introduced this, the issue was indeed identified: #78 (comment). One way or another, it did not get implemented. It would be nice to submit a PR to fix this!
Ok, I'll do it this week.
@louisabraham I'd like to make a small release by the end of the week before starting to work on bigger changes. I'd like to take this fix in.
I'll do it today. My editor autoformats the code, and it seems the different files don't use the same formatter. Is black, as used by dask_ml, OK?
Also, I would currently need some explanation about the difference between cores and ncpus.
We don't have a code format template or requirement yet, except flake8 checks. I don't know the black formatter, but it should be ok. For cores vs ncpus, could you point to the corresponding lines in the code?
Cores is not used in … The other arguments are used like this: dask-jobqueue/dask_jobqueue/core.py, lines 231 to 233 at b510bb1.
PBSCluster uses them like this: dask-jobqueue/dask_jobqueue/pbs.py, lines 84 to 89 at b510bb1.
And LSFCluster: dask-jobqueue/dask_jobqueue/lsf.py, lines 85 to 94 at b510bb1.
I just find the whole process unclear. When one looks at the documentation of PBSCluster or LSFCluster, they see very different interfaces and want to use … Also: dask-jobqueue/dask_jobqueue/core.py, lines 244 to 249 at b510bb1.
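For what it's worth, one quick way to see the difference from the user side (a sketch, with purely illustrative cores/memory values) is to print the generated job scripts:

```python
from dask_jobqueue import LSFCluster, PBSCluster

# Illustrative values only; no jobs are submitted, we just inspect the headers.
pbs = PBSCluster(cores=8, memory="16GB")
lsf = LSFCluster(cores=8, memory="16GB")

print(pbs.job_script())  # PBS header expresses the request as ncpus on one node
print(lsf.job_script())  # LSF header expresses the request with bsub's -n option
```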
I might not have understood how dask_worker works, but I am under the impression that each worker is supposed to be launched individually on only one computer, so one should ensure that all the allocated CPUs are on the same node. For example, if you set … Also, can you confirm to me that …
Maybe the points above should be explained in the documentation.
We launch one dask-worker command per job. But this might actually result in several worker processes if using processes > 1. Grouped workers in dask are somewhat hard to get right at first and can be tricky to understand. We already had several discussions about this. That's also why we need to use …
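A minimal sketch of that (illustrative numbers, and the exact worker flags depend on the installed dask-jobqueue/distributed versions): one job asking for 8 cores split over 2 worker processes.

```python
from dask_jobqueue import LSFCluster

# One job script, one dask-worker command, but 2 worker processes with
# 4 threads each (8 cores / 2 processes). Values are illustrative.
cluster = LSFCluster(cores=8, processes=2, memory="16GB")
print(cluster.job_script())  # the dask-worker line should request 2 procs, 4 threads
```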
That is correct, hence this issue for LSF.
That, I think, is not correct.
Why do we have both then?
My example did not cover all the cases I am interested in.
I don't think it parallelizes it automatically, does it? (I suppose here that no dask data structure is used, because I didn't look up how they are implemented.) I actually have the same questions with the threads: if you set …
Oh, I think I got it right by rereading for the nth time http://distributed.dask.org/en/latest/worker.html. Your code isn't supposed to create more processes. So I think in most cases, when you use a cluster, you are supposed to launch with … If the user spawns more processes (with multiprocessing) during the computation, I think they will not count as …
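If it helps, here is how I picture it (a hedged sketch, sizes are made up): user code only submits tasks, and the worker processes/threads configured on the cluster execute them.

```python
from dask.distributed import Client
from dask_jobqueue import LSFCluster

def square(x):
    return x ** 2

cluster = LSFCluster(cores=8, processes=2, memory="16GB")
cluster.scale(2)           # ask the job scheduler for a couple of jobs
client = Client(cluster)

# Tasks are distributed over the existing worker processes/threads;
# the user code never calls multiprocessing itself.
futures = client.map(square, range(100))
print(sum(client.gather(futures)))
```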
I did not look at LSFScheduler code earlier.
Yes, with dask-jobqueue this is our assumption. I think at first this is mainly because PBS Pro is less versatile than Slurm or LSF, and pretty much imposes this limitation if we want to keep things simple.
Yes, you're right, your code has to create tasks through the various dask APIs, and in some cases it can also benefit from pandas or numpy multithreading optimizations, though I do not know how dask threads are linked to OMP_NUM_THREADS or other equivalent threading configuration. This would be something to ask upstream or on Stack Overflow if no answer is to be found in the docs.
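One hedged option, if the interaction with BLAS/OpenMP threading turns out to matter (not something settled in this thread): pin the thread counts in the job environment via env_extra, assuming its entries are appended to the generated job script.

```python
from dask_jobqueue import LSFCluster

# Assumption: env_extra lines end up verbatim in the generated job script,
# so BLAS/OpenMP threading does not multiply with the dask worker threads.
cluster = LSFCluster(
    cores=8,
    processes=1,
    memory="16GB",
    env_extra=["export OMP_NUM_THREADS=1", "export MKL_NUM_THREADS=1"],
)
```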
Yes, that's exactly it: your code should not spawn new processes.
I'm not sure I understand what you mean. There are two extremes:
A good practice is to have …
Using Dask, a user should never do this. Dask basically acts (for one part of it) as a multi-node multiprocessing library; there is no need to use another parallelization module within it. See also http://docs.dask.org/en/latest/setup/single-machine.html#single-machine-scheduler and http://distributed.dask.org/en/latest/efficiency.html#adjust-between-threads-and-processes
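A hedged illustration of the guidance in those linked docs (values are only examples):

```python
from dask_jobqueue import LSFCluster

# Mostly pure-Python tasks (hold the GIL): favour many processes, few threads.
python_heavy = LSFCluster(cores=8, processes=8, memory="16GB")

# Mostly numpy/pandas tasks (release the GIL): favour one process, many threads.
numpy_heavy = LSFCluster(cores=8, processes=1, memory="16GB")
```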
Ok, then I got things wrong on #176 because I let the user choose how many processes should be executed on the same node. It is very simple to fix.
Yes, indeed I don't know what dask threads really are.
Precisely, in this case I don't see why one would set …
I can see examples where one wants to do that. If you have large datasets and want to do computations without copying the data between processes, you have to use the multiprocessing.sharedctypes module.
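For reference, a minimal sketch of that pattern outside of dask (sizes made up):

```python
import numpy as np
from multiprocessing import Process
from multiprocessing.sharedctypes import RawArray

def total(buf):
    # The child process reads the same memory; nothing is copied or pickled.
    print(np.frombuffer(buf).sum())

if __name__ == "__main__":
    shared = RawArray("d", 1_000_000)   # doubles in shared memory
    data = np.frombuffer(shared)        # numpy view on the same buffer, no copy
    data[:] = np.random.random(len(data))

    p = Process(target=total, args=(shared,))
    p.start()
    p.join()
```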
But maybe what you mean is that you would only use the job scheduler with one ncpus per job allocation? In that case, you're putting some load on the job scheduler, especially when you begin to scale to hundreds or thousands of cores and use dask-jobqueue adaptivity. It's often better to pack workers into bigger jobs, even if it means more waiting time. Generally, I agree that dask-jobqueue is great for filling the holes left by other jobs in the overall cluster resources, but I seldom use job allocations with fewer than 4 ncpus.
Did not know this one, nice!
I think we don't have the same experience with cluster computing :)
Maybe it will get integrated into dask.distributed at some point? But if you have already paid for a network transfer, a simple copy in RAM is nothing.
Okay, then I understand 😁. In the end, dask-jobqueue allows addressing many kinds of needs or cluster constraints!
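For completeness, the "bigger jobs plus adaptivity" approach described above could look like this (a sketch, with made-up sizes):

```python
from dask.distributed import Client
from dask_jobqueue import LSFCluster

# Pack several worker processes into each job, then let dask grow and shrink
# the number of jobs with the workload. Sizes are purely illustrative.
cluster = LSFCluster(cores=16, processes=4, memory="64GB")
cluster.adapt(minimum=0, maximum=400)
client = Client(cluster)
```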
I am not 100% sure, but it seems to me that nothing forces an LSF task to execute all the threads on the same node.

For example, PBSCluster uses the option `ncpus:n`, which requests cpus per node, and SLURMCluster specifies `-n 1` and uses `--cpus-per-task=n` to allocate n cpus on each host. However, LSFCluster uses the `-n` option. I think that without a `span[hosts=1]` option LSF can use processors from different hosts. See the relevant documentation there.

Am I overlooking some internals of LSF that make this acceptable?
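A hedged sketch of the workaround this issue points at, assuming job_extra inserts extra #BSUB directives into the generated script (as the equivalent option does for other cluster classes):

```python
from dask_jobqueue import LSFCluster

# Assumption: entries in job_extra become additional "#BSUB ..." lines, so the
# span requirement keeps all requested slots on a single host.
cluster = LSFCluster(
    cores=8,
    memory="16GB",
    job_extra=['-R "span[hosts=1]"'],
)
print(cluster.job_script())  # check the generated header for the span directive
```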