Specify resources for dask builtin functions #2127
Hrm, so this works fine for me both on master and the latest release:

```python
In [1]: from dask.distributed import Client, LocalCluster
   ...: import pandas as pd
   ...: import dask.dataframe as dd
   ...: cluster = LocalCluster(processes=False)
   ...: cpu_worker = cluster.workers[0]
   ...: cpu_worker.name = 'cpu'
   ...: cpu_worker.set_resources(CPU=80)
   ...: client = Client(cluster)
   ...: pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
   ...: test_df = dd.from_pandas(pdf, npartitions=2)
   ...: test_df.compute(resources={tuple(test_df.__dask_keys__()): {'CPU': 1}})
   ...:
Out[1]:
   a  b
0  1  4
1  2  5
2  3  6
```

I might also suggest the following test, which sets up resources and names when creating the workers and verifies that tasks are allocated appropriately by checking the structured log:

```python
from dask.distributed import Client, LocalCluster
import pandas as pd
import dask.dataframe as dd

cluster = LocalCluster(n_workers=0, processes=False)
client = Client(cluster)
alice = cluster.start_worker(resources={'CPU': 80}, name='alice')
bob = cluster.start_worker(name='bob')

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
ddf = dd.from_pandas(pdf, npartitions=2)
ddf.compute(resources={tuple(ddf.__dask_keys__()): {'CPU': 1}})

assert alice.log
assert not bob.log
```
The exception is odd. If you were using something other than LocalCluster I would guess that you had a version mismatch between your workers, or between your workers and client, but given that everything is local I don't see how this could be. How did you install Dask? I don't suppose you can provide a conda environment.yml or something similar that reproduces the problem? (My guess is that this would be challenging, but I thought I'd ask anyway.)
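For reference, one quick way to rule out version skew is distributed's own version check. A minimal sketch, assuming a local cluster like the one in the snippets above:

```python
from dask.distributed import Client, LocalCluster

# get_versions() reports the package versions seen by the client, the
# scheduler, and each worker; check=True complains if they disagree.
cluster = LocalCluster(processes=False)
client = Client(cluster)

versions = client.get_versions(check=True)
print(versions["client"])
print(versions["scheduler"])
```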
I was on Dask 0.17.2 and just confirmed the exception issue is resolved when I upgraded to Dask 0.18.1. Thanks! I'm planning on chaining together a number of functions; is there any way to specify the resources when calling the functions, as opposed to when calling `.compute()`?
I agree that that would be valuable, but currently no: resources are specific to the distributed scheduler, while collections like dask.delayed and dask.dataframe are scheduler-agnostic. This is something that could be improved, though. I don't know how at the moment, but there is likely a better way around this.
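For what it's worth, the per-key resources dict can still cover a chain of operations by listing each intermediate collection's keys. A minimal sketch, following the worker-setup pattern shown earlier in the thread (the `start_worker` call reflects the distributed version current at the time, and the 'GPU' resource label is an assumption for illustration):

```python
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=0, processes=False)
client = Client(cluster)
cluster.start_worker(resources={'CPU': 1}, name='cpu')
cluster.start_worker(resources={'GPU': 1}, name='gpu')

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Two chained steps; each intermediate collection contributes its own keys
# to the resources dict, so different steps can target different workers.
step1 = ddf.assign(c=ddf.a + ddf.b)
step2 = step1.c.sum()

step2.compute(resources={
    tuple(step1.__dask_keys__()): {'CPU': 1},
    tuple(step2.__dask_keys__()): {'GPU': 1},
})
```

Note that graph optimization can fuse intermediate keys away, which is exactly the wrinkle discussed further down in this thread; passing `optimize_graph=False` keeps the keys intact.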
So for `dd.read_csv`, if I call `__dask_keys__()` it only returns the `from-delayed` tasks, while it looks like there are also `pandas_read_text` and `read-block` tasks which end up getting scheduled on the GPU workers. Is there a different function or a snippet which, given an object, returns every key that we need to define the resources for? I.e.:

```python
test = dd.read_csv("/path/to/some/file")
resources = {tuple(test.getallkeys()): {'CPU': 1}}  # getallkeys() is the hypothetical API being asked about
test.compute()
```
Hrm, short term `list(test.dask)` would probably serve your needs. This would include *all* keys that are used to create this dataset.
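To make the distinction concrete, a minimal sketch (the file path is a placeholder, as in the comment above): `__dask_keys__()` gives only the collection's output keys, while the task graph itself contains every key, including the intermediate read/parse tasks.

```python
import dask.dataframe as dd

test = dd.read_csv("/path/to/some/file")   # placeholder path

output_keys = test.__dask_keys__()  # only the final partition keys
all_keys = list(test.dask)          # every key in the graph, incl. read/parse tasks

# Keying the resources dict on all graph keys constrains every task:
resources = {tuple(all_keys): {'CPU': 1}}
# test.compute(resources=resources)  # requires workers that advertise a CPU resource
```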
Hmm, I'd expect the following to work, but it's still scheduling tasks on the GPU workers, including the `from-delayed` tasks as well:

```python
test = dd.read_csv("/path/to/some/file")
resources = {tuple(test.dask): {'CPU': 1}}
test.compute()
```
Hrm, can you try passing `compute(optimize_graph=False)`?
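A sketch of what that suggestion looks like in context, assuming a client whose workers advertise a CPU resource (as set up earlier in the thread) and a placeholder file path:

```python
import dask.dataframe as dd

test = dd.read_csv("/path/to/some/file")
resources = {tuple(test.dask): {'CPU': 1}}

# Graph optimization can fuse tasks and replace the keys that resources=
# refers to; optimize_graph=False keeps the original keys intact.
test.compute(resources=resources, optimize_graph=False)
```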
Still the same behavior. (Note: my example above forgot to specify resources in the compute call, but I am in fact setting it while testing.)
I'll take a look sometime today.
OK, it looks like this is failing to support tuple-based keys in the …

Short term you could do this as a workaround:

```python
result = client.compute(df).result()
```

My apologies for the dust here. Most users of …
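A slightly fuller sketch of that workaround, reusing the worker/resource setup pattern from earlier in the thread: `client.compute` accepts the same key-tuple `resources=` mapping directly and returns a future, so the constraint can be applied without going through the collection's own `.compute()`.

```python
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# One worker advertising a CPU resource, as in the earlier snippets.
cluster = LocalCluster(n_workers=0, processes=False)
client = Client(cluster)
cluster.start_worker(resources={'CPU': 80}, name='cpu')

df = dd.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}), npartitions=2)

# client.compute takes the resources constraint directly and returns a future.
future = client.compute(df, resources={tuple(df.__dask_keys__()): {'CPU': 1}})
result = future.result()
```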
If you use …
@mrocklin Unfortunately I have some pretty hard time constraints for what I'm working on. Creating 8 dask workers, each with a single GPU visible, is working well enough for my needs currently, but I'll hopefully have time to revisit this late next week to continue troubleshooting with you towards a solution. Apologies for the delay!
It's just fine. This has been a useful exercise to flush out some bugs, both technical bugs and usability bugs, in using resources with collections. Good luck!
I'm trying to specify resources for builtin dask functions such as `dd.read_csv`, with an end goal of running certain functions on "CPU workers" and other functions on "GPU workers". Here's a minimal example of trying to force `dd.read_csv` to run only on my "CPU worker":

…

This returns the following:

…

It would be great if you could specify resources as you create tasks, as opposed to when computing them, similar to how you can with `client.submit`, i.e.:

…
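For reference, a minimal sketch of the `client.submit` pattern being contrasted with here. The function names and scheduler address are hypothetical placeholders; the point is that `submit` takes a per-task `resources=` argument at call time, which is the kind of annotation being requested for the collection APIs.

```python
from dask.distributed import Client

# Assumes a cluster whose workers advertise CPU/GPU resources, e.g. workers
# started with `dask-worker ... --resources "CPU=1"` or `--resources "GPU=1"`.
client = Client("scheduler-address:8786")  # placeholder address

def load_csv(path):        # hypothetical CPU-bound step
    ...

def train_on_gpu(data):    # hypothetical GPU-bound step
    ...

# With submit, the resource constraint is attached per task, at call time:
data = client.submit(load_csv, "/path/to/some/file", resources={'CPU': 1})
model = client.submit(train_on_gpu, data, resources={'GPU': 1})
```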