Multiple cores per process/thread #181
Comments
Funnily enough, I think we just had a conversation on this point in #179. As you point out, you can use `cluster = PBSCluster(cores=2, processes=2, memory='100GB', resource_spec='select=1:ncpus=32:mem=100GB')` or, with LSFCluster, `cluster = LSFCluster(cores=2, processes=2, ncpus=32)`. It seems we should document this better. This also seems useful, although we were previously thinking of removing this possibility... |
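For reference, a sketch of that workaround written out in full. The `cores`/`processes` kwargs control the dask side (here, two single-threaded workers), while `resource_spec`/`ncpus` control what the job scheduler actually reserves. The memory value is the placeholder from the example above, and on a real cluster other required kwargs (queue, walltime, memory for LSF) may also be needed:

```python
from dask_jobqueue import PBSCluster, LSFCluster

# PBS: dask starts 2 single-threaded workers per job, while the job itself
# reserves 32 cpus, leaving 16 cores per worker for code that threads internally
pbs_cluster = PBSCluster(cores=2, processes=2, memory='100GB',
                         resource_spec='select=1:ncpus=32:mem=100GB')

# LSF: the ncpus kwarg plays the same role as resource_spec does for PBS
lsf_cluster = LSFCluster(cores=2, processes=2, ncpus=32)
```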
I saw that on LSF you could select CPUs, but I wasn't aware of PBSCluster's ability to write out resource specs. To make things simpler, would it be possible to set the number of threads manually, so that you would have ... It is worth considering this in a larger context where one might wish to start heterogeneous workers with labels, where expanding ... |
I think there is something to do with cores/processes/threads/ncpus/resource_spec, at least some standardisation between JobQueueCluster implementations. I'm not sure what it should look like though; we previously removed the use of ... It would be more than welcome if you could propose some simple design choices here so we can discuss them, and then, if we agree on something, propose a PR! For the second part on the larger context, I'm not sure I get your point, but maybe it has another scope and should be discussed in another issue? |
I think my proposal would be similar to the one discussed. Scanning through ...
I am not entirely sure what mental model you want to convey so that this project remains relatively easy for new users while staying flexible for complex scenarios. But the above will cover our primary use case. I will hold off on the second part for later. |
Ping @mrocklin @jhamman @willirath @lesteve for opinion on how to best handle this.
No, it's threads, or maybe cores divided by processes. We've actually got one running worker per worker process. |
It feels like what you want is an advanced use case, so I don't think ... |
Replying to the above: I think resource_spec is per worker, not per task, and it is specific to some queue systems. IMO we have three levels:
The problem is that the number of cores is supposed to be the number of tasks in our current model, i.e. threads_by_task=1. I propose two alternatives:
I suppose that when handling parallelism manually, one doesn't want to share a process with another task. I would be more in favor of a boolean than of a threads_by_task option, which adds complexity to the system and might break existing parallelism mechanisms if the handcrafted code is executed by a thread (not sure about that). |
@louisabraham it is not completely straightforward to follow what you are saying, and from what I understand it feels like dask.distributed is the right level to address some of your points (for example, I guess by task you mean a function that is used in ...). Granted, there are plenty of things that could be added/improved in dask-jobqueue, and suggestions are always more than welcome! I would suggest we start from precise use cases like the one in the original post and see how we can tackle them. For anything non-trivial, my opinion is that you will have to specify ... My personal opinion is that being able to easily create heterogeneous clusters would be the right direction in which to extend (but even then, maybe some of this work may happen in dask.distributed). |
I didn't intend it this way.
Yes.
Totally agree. To achieve what the author wants with my proposal, the parameters are `cluster = ANYCluster(cores=32, processes=2, one_task_per_process=True)`. What disturbs me in @guillaumeeb's immediate fix (it is the only way to do it currently, ...)
is that there is no variable that can track the total number of CPUs allocated, and that the code is scheduler specific. On the other hand, cores equals the total number of tasks that can be executed concurrently. For the OP's problem: `cluster = ANYCluster(cores=2, processes=2, threads_by_task=16)`. I think the question we want to ask now is: should ... |
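As a sketch only, here is the arithmetic behind this hypothetical proposal (neither `ANYCluster` nor `threads_by_task` exists in dask-jobqueue): `cores` would count concurrently running tasks and `threads_by_task` the threads each task manages internally, so the job scheduler reservation would be their product.

```python
# bookkeeping for the hypothetical threads_by_task proposal above
cores = 2              # concurrent tasks per job (the OP's two C++ runs)
processes = 2          # dask worker processes per job
threads_by_task = 16   # threads each task manages internally (e.g. OpenMP)

cpus_to_reserve = cores * threads_by_task     # 32 CPUs requested from the scheduler
task_slots_per_worker = cores // processes    # 1 task slot per worker process
print(cpus_to_reserve, task_slots_per_worker)
```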
We have this use case of heterogeneous clusters at our site. We have modular supercomputing, with pure CPU nodes, nodes with GPUs, and KNL nodes all available. While these are heterogeneous architectures, we also have the use case of heterogeneous resources within clusters. We want the ability to self-manage some of the resources (see #84 (comment) for the use case). At present we simply use ...
This results in a submission script like ...
I don't know if the same approach would work with other schedulers. |
It seems that ultimately you do not want users touching ...
I would be hesitant to override the "core" definition on supercomputers, as it is well defined there, whereas "tasks" is well defined in Dask. Heterogeneous clusters would be difficult to tackle in a generic way, I think. It might be worth adding a ... PS: Hey @ocaisa! |
OK, then my proposal is the following: a ... Each dask worker receives the parameters ... Jobqueue allocates ... I think heterogeneous clusters are off-topic here. |
Yep, I agree, we should address this in another issue if it is something that interests people; see #103, already open. For the other part of the rich discussion here, it's a complicated design choice, but I think I'm more in favor of avoiding jobqueue-system-specific kwargs as much as possible. I think I would prefer what @dgasmith proposes: having three kwargs:
For job scheduler specificities, we would rely only on ... I have two remaining concerns:
Any thoughts? |
@guillaumeeb I'm not sure I follow. In your proposal, isn't ... Maybe your ... |
Not always, no. In what I propose, your ... In my opinion, threads_per_tasks is application specific and should not appear in dask-jobqueue. dask-jobqueue books scheduler resources, and then starts dask distributed with a given set of parameters. Possibly fewer dask threads than reserved cores, but in some cases we've also seen more threads than reserved cores. |
I see and approve.
For the sake of clarity, I think we could do:
with ... This way, it becomes way simpler to explain what dask-jobqueue does: allocate ... |
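To make the bookkeeping concrete, a small sketch of the mapping as I read this proposal (the helper below is illustrative, not dask-jobqueue code, and the actual implementation may differ in details): each job books `cores` CPUs and starts `processes` dask worker processes, each worker getting `cores // processes` threads.

```python
# illustrative helper, not dask-jobqueue code
def worker_layout(cores, processes):
    # each job starts `processes` dask workers and splits the threads evenly
    threads_per_worker = cores // processes
    return {'worker_processes': processes, 'threads_per_worker': threads_per_worker}

print(worker_layout(cores=32, processes=2))  # 2 workers with 16 dask threads each
print(worker_layout(cores=2, processes=2))   # 2 workers with 1 dask thread each (the OP workaround)
```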
+1 for the language on ... Agreed on allowing the flexibility of ... I would note that in an ideal world the ability to convey the ... |
I'm not totally familiar with how dask workers work, but from my understanding this kind of information could be stored in ... |
@dgasmith (or @louisabraham): would you provide a PR to illustrate this? That would probably make it easier for others to give an opinion.
The only problem that I see with that is that we would replace ... |
@guillaumeeb OK, I can do this.
We can put a warning and use ... |
I've tested with PBS: it takes the first defined value. So we might need (but in another issue/PR) to put ... |
I am still unconvinced that dask-jobqueue has to change to accommodate this use case. One reason I am not a big fan of the general direction taken by this issue: suppose some of your tasks do multi-threading internally while some of your tasks don't (e.g. in the OP case, once your C++ code has run you need to run additional analysis code in Python). Basically you have booked some resources that your dask-worker is unable to use during the Python analysis phase. I only recently realised that worker resources could be used for the original OP's use case, see this related SO question and answer. |
You are making the assumption that there is additional analysis that needs to be run and/or that it takes a significant amount of computing time, which is not always the case. The issue is not about changing your average case, but about allowing more flexibility for the project. As you note, dask.distributed has this functionality, as shown in the SO question. However, the adaptive cluster components and the ability to submit to PBS/Slurm/etc. are quite appealing in this project and are not present in dask.distributed. |
About your second point, let me try to develop a bit more what I had in mind to see if that clarifies things a bit. What I would suggest is to do something along these lines:

```python
from dask.distributed import Client
from dask_jobqueue import FooCluster

cluster = FooCluster(cores=32, processes=1, extra=['--resources threads=32'])
client = Client(cluster)

n_workers = 10
cluster.scale(n_workers)

for arg in arg_list:
    # resources will ensure that at most two tasks run concurrently on the same dask-worker
    client.submit(function_calling_multithreaded_cpp_code, arg, resources={'threads': 16})
```

My understanding of the dask model (and the SO answer goes in the same direction) is that each task consumes one thread. One thing you can do if that's not the case is to restrict the number of concurrent tasks that can run on a ... |
Thanks @lesteve, I was not aware of this alternative solution, which is really interesting. As a result, I'm really hesitant here, no strong opinion... On one hand I don't really like the fact that resource reservation is really scheduler specific in our API (but that's also a mirror of reality); on the other hand, there are already two solutions available for the OP case, one being generic for all JobQueueCluster implementations. |
Using resources is indeed very nice. ... what `cluster = FooCluster(cores=32, processes=1, extra=['--resources threads=32'])` does. I think ... If ... IMO, we should let the user set nthreads and nprocs directly, or check if they are present in the ... |
@lesteve even if the change is not really required for the original problem, I like the idea because:
This could also simplify a possible implementation of #133. Your proposed solution is nice, but it could lead to misunderstanding between the ... I think I slightly lean towards a modification to harmonize JobQueueCluster implementations and allow specifying nthreads, but again, I don't see this as fundamental. |
An issue with the proposed snippet is that you are unable to run thread-unsafe OMP tasks on 8 cores at 4 tasks per node, which is easily doable with dask.distributed (although granted, not in the cleanest way possible). Typically when we are running OMP processes, we are using dask merely to map a large number of computations without dependencies onto worker nodes without clogging the queue. Dask LocalCluster has the following parameters:
The ability to effect this set of parameters from ... Having distributed understand threads and processes, or even allow a task to be flagged as needing a private process, would be great, but I strongly suspect that is a much more long-term project. In short, I would very much like to handle this via resources, but I do not believe it is currently possible, unless I have some fundamental misunderstanding here. |
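The LocalCluster parameter list quoted in that comment did not survive in this thread; as an illustration only, here is a sketch of the kind of knobs it refers to, picking 4 single-threaded worker processes so that 4 OMP tasks per node each run in their own process:

```python
from dask.distributed import Client, LocalCluster

# 4 worker processes with a single thread each: at most 4 tasks run at once,
# and no two tasks ever share a process (relevant for thread-unsafe codes)
cluster = LocalCluster(n_workers=4, threads_per_worker=1, processes=True)
client = Client(cluster)
print(client)
```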
OK, I think I get what you are saying, sorry for not getting it earlier: if you have multiple threads per dask worker, you can limit the number of concurrent tasks on your worker, but you cannot ensure that each task is run in a separate process, which is a problem for thread-unsafe tasks. In the case where you have a thread-unsafe task, do I understand correctly that you are always going to use ... I am fine with adding a ... I am not very familiar with OpenMP, I have to admit, so out of curiosity, how common would you say thread-unsafe programs are? |
I should separate OMP and thread-unsafe tasks, as they are not always related. Thread-unsafety usually comes from the fact that some scientific codes write files that use the PID as an identifier, or have shared globals that should not, in fact, be shared. This stems from the days when these codes were actually separate binaries that were called via bash and data was passed in files, or from original design choices that were perhaps not advisable. You can use ... |
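A toy illustration of the failure mode described above (the PID-based file naming is a hypothetical stand-in for what such legacy codes do): two tasks running as threads of the same worker process share a PID, so their scratch files collide, whereas tasks placed in separate worker processes would not.

```python
import os
import threading

def scratch_file_name():
    # hypothetical legacy naming scheme: one scratch file per "process"
    return "/tmp/legacy_code.{}.dat".format(os.getpid())

names = []

def task():
    names.append(scratch_file_name())

threads = [threading.Thread(target=task) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(names)
assert names[0] == names[1]  # both "tasks" would write to the same scratch file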
So to be clear: ...
I don't feel this is great, but again, I'm happy to have this either way. |
I'd like to see this advance as a PR, so that we can ask other maintainers for advice.
I don't think this is true. In Dask, a worker is equivalent to one process. With dask-jobqueue (and distributed), we sometimes launch grouped workers (with ...). So whatever we call the kwarg, it should be the equivalent of ... |
There may be some misunderstanding between the ... I would be happy to make a PR, but I do not seem to follow your naming conventions and worker/process mental models. Please state these conventions and how you would like this to work, and I can make the changes. |
This is definitely unclear. If I'm not mistaken, ... The convention for me is what you stated in #181 (comment) or with your example in #181 (comment). So going with an ... |
From #181 (comment):
I did some tests and it appears that ... This means that there is a way with resources to ensure that tasks are run in separate processes. Unless I am missing something, this should get rid of the concern about using resources with non-threadsafe tasks. This is a snippet to show what I mean in more detail. I am using ...

```python
import os
import time
import threading
from pprint import pprint

from dask.distributed import Client, LocalCluster
from distributed.client import get_task_stream

cluster = LocalCluster(n_workers=3, threads_per_worker=2, resources={'slots': 1})
client = Client(cluster)
print('dashboard port:', client.scheduler_info()['services']['bokeh'])

def func():
    time.sleep(1)
    return (os.getpid(), threading.current_thread().name)

t0 = time.time()
with get_task_stream() as ts:
    futures = [client.submit(func, pure=False, resources={'slots': 1}) for i in range(21)]
    result = client.gather(futures)
print('time: ', int(time.time() - t0))
```

Looking at the timing (7 seconds is 21 / 3) and at the dashboard, you can see that three tasks run concurrently. The thing is, with the dashboard I don't think there is a way to figure out which row corresponds to which worker process. So here is a little bit of hacky code to show that the tasks were indeed run in different worker processes:

```python
# hacky code to show that there are three tasks running concurrently and
# that they run on different worker ports (i.e. different worker processes)
worker_port_and_start_list = [
    [each['worker'].split(':')[2], int(each['startstops'][0][1]) % 100]
    for each in ts.data]
pprint(worker_port_and_start_list)
```

Output (first column is the worker port, second column is the start time (rounded) of the task):
|
Thanks @lesteve. Generally speaking, I think there is no such thing as a ... So yes, ...
But anyway, I still think the proposed modification could be useful. |
Sure. I mainly wanted to make the point that dask resources can be used for the original OP problem: executing a Python function that manages multi-threading internally. This was not obvious to me for quite a long time, and probably not for others either. It would be great to have confirmation that it actually works on "production" use cases rather than only on my toy example above. IMO, if we add a ...
|
It is quite interesting that resources are per worker ( ... ). The fact this is not per node brings up some interesting questions about heterogeneous resources. Say you have a 16-core CPU and tasks that require 12 and 4 cores respectively. I think this kind of heterogeneous cluster usage is a bit out of scope for Dask at the moment, however (not really possible with simple threads/processes either). I'm simply mentioning this as something to think about; it might go back to a ... I am still open to making a PR if this change is still wanted. Thank you for the many comments and discussion! |
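A sketch of that 16-core scenario using dask resources (the resource name "cores" below is just a label the worker advertises, not an enforced CPU limit): the worker declares 16 units, the two task types request 12 and 4, so one of each can share the worker, but two 12-unit tasks cannot run at the same time.

```python
from dask.distributed import Client, LocalCluster

# one worker advertising an abstract pool of 16 "cores"
cluster = LocalCluster(n_workers=1, threads_per_worker=2, resources={'cores': 16})
client = Client(cluster)

def big_task(x):
    return x * 2   # stand-in for a routine that threads over 12 cores internally

def small_task(x):
    return x + 1   # stand-in for a routine that threads over 4 cores internally

big = client.submit(big_task, 1, resources={'cores': 12})
small = client.submit(small_task, 1, resources={'cores': 4})
print(client.gather([big, small]))
```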
This discussion gives me headaches! I think we are overthinking 🙂. At least we should document how the OP problem can be solved, both with ... For the ... |
I'm getting headaches as well about this 😉: I have a similar problem. Each one of my tasks (embarrassingly parallel) manages threading internally with OpenMP and needs 8 cores. I did start 2 workers using `SLURMCluster(walltime='01:00:00', memory='7 GB', job_extra=['--nodes=1', '--ntasks-per-node=1'], cores=8, processes=1)`, but when I then map my (OpenMP parallelized) function using ... |
I'm trying to ask a similar question here as well https://stackoverflow.com/questions/54469195/dask-joblib-ipyparallel-and-other-schedulers-for-embarrassingly-parallel-probl |
Can you try this and report back?

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# through ...
cluster = SLURMCluster(cores=8, processes=1, memory='7GB',
                       # each dask worker declares it has an amount 1
                       # of the resource named "processes"
                       extra=['--resources processes=1'])
client = Client(cluster)

n_workers = 10
cluster.scale(n_workers)

# resources ensure that at most one task (i.e. function in `.submit`)
# will run on each dask worker
futures = [client.submit(function_calling_multithreaded_cpp_code, arg,
                         # each task declares that it needs an amount 1
                         # of the resource named "processes"
                         resources={'processes': 1})
           for arg in arg_list]
```
|
@lesteve won't the problem be that the futures go out of scope? Do I have to do fire_and_forget? |
Sorry, I edited my snippet above. You may be able to use ... |
@lesteve it doesn't work when leaving the ... |
@lesteve |
IMO dask resources are the "dask way" of doing what you want. In particular, note that resources are specified not at the cluster level but at the task level (i.e. the function in ...). I completely agree that resources are not very easy to discover when you are new to dask. I'd be in favour of adding an example that explains how to execute a function using multi-threading in ... |
@lesteve I guess it makes sense if you want to be so fine-grained that different tasks can have different resource requests. I still don't understand what the ... Thanks for your help btw! |
Right, sorry, I am going to try to explain better (I'll edit my snippet above too to make it clearer):

```python
# each dask worker declares it has processes=1
cluster = SLURMCluster(cores=8, processes=1, memory='7GB', extra=['--resources processes=1'])

# each task (e.g. function submission) declares it needs processes=1;
# because each dask worker declared it had processes=1, this will ensure that
# at most one task runs on each dask worker
futures = [client.submit(function_calling_multithreaded_cpp_code, arg,
                         resources={'processes': 1})
           for arg in arg_list]
```

You can look at http://distributed.dask.org/en/latest/resources.html for a more detailed explanation of dask resources. |
Thank you!! That helped. So now I do adaptive scaling and it seems to only run on one worker. I've had that before, where it doesn't seem to load-balance. Is this something that should happen, or do I need to set this up? |
I don't think it is expected, can you open a new issue about this? |
I'm first trying to see if that is a problem with manual scaling as well. Ah... the ... |
Okay, I'm going to close this discussion, it is too long already. I've created #231 as an outcome we all agree with. I'm still open if someone wants to propose a PR to add ... |
We have a use case where we would like to use dask.distributed to parallelize a Python-bound C++ program that would work best if it could consume 8-32 threads depending on the problem size and will manage threading internally. Normally, a Dask worker is run on a node that has 32 cores, with the worker using 2 processes at 1 thread each so that we can give each process 16 cores.
Looking at the code currently, it seems that
`nthreads = ncores / nprocesses`
without exception. Is there a canonical way to change this so that we can orchestrate our normal Dask worker operation with dask-jobqueue?