Passing a distributed.Future to the kwargs of apply_ufunc should resolve the future #6803
Comments
This is still an issue. Is this the case for […]
I think I may have narrowed down the problem to a limitation in dask when using dask_gateway. If passing a Future to a worker, the worker will try to unpickle that Future, and as part of that, unpickle the Client object passed when creating such a Future. Unfortunately, in a dask_gateway context the client is behind a […]
test_future is not a dask collection. It's a distributed.Future, which points to an arbitrary, opaque data blob that xarray has no means to know about. FWIW, I could reproduce the issue, where the future in the kwargs is not resolved to the data it points to as one would expect:

```python
import distributed
import xarray

client = distributed.Client(processes=False)
x = xarray.DataArray([1, 2]).chunk()
test_future = client.scatter("Hello World")

def f(d, test):
    print(test)
    return d

y = xarray.apply_ufunc(
    f,
    x,
    dask="parallelized",
    output_dtypes=["float64"],
    kwargs={"test": test_future},
)
y.compute()
```

Expected print output: `Hello World`; instead, the function receives the unresolved `Future`.
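For contrast, distributed's own `submit` API does resolve `Future` objects found in a function's arguments before the function runs on a worker, which is the behavior the reporter expected from `apply_ufunc`'s kwargs. A minimal sketch of that baseline behavior (the `shout` function is illustrative):

```python
import distributed

client = distributed.Client(processes=False)
fut = client.scatter("Hello World")

def shout(s):
    # `s` arrives here as the scattered string, not as a Future
    return s.upper()

# client.submit substitutes the Future with its payload before calling shout
result = client.submit(shout, fut).result()
client.close()
```

Here `result` is the uppercased string, confirming the worker saw the resolved data.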
I can add that this problem is augmented in a dask_gateway system, where the task just fails. My interpretation is that the Future is resolved at the worker (or, in the case of apply_ufunc, at a thread of that worker) and embeds a reference to the Client object. The latter, however, uses a gateway connection that is not understood by the worker, as generally it is the scheduler that deals with those.
Having said the above, your design is... contrived. There isn't, as of today, a straightforward way to scatter a local dask collection. Workaround:

```python
import dask.array as da
import distributed
import numpy as np
import xarray

client = distributed.Client()

test = np.full((20,), 30)
a = da.from_array(test)
dsk = client.scatter(dict(a.dask), broadcast=True)
a = da.Array(dsk, name=a.name, chunks=a.chunks, dtype=a.dtype, meta=a._meta, shape=a.shape)
a_x = xarray.DataArray(a, dims=["new_z"])
```

Once you have `a_x`, you just pass it to the args (not kwargs) of `apply_ufunc`.
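The args-not-kwargs mechanics can also be seen without any scattering, on the default scheduler: wrap the auxiliary array in a chunked DataArray with its own dimension and declare that dimension as a core dim, so the worker function receives plain numpy arrays. A sketch under assumed names (`f`, `x`, the `new_z` dimension, and the array sizes are all illustrative):

```python
import numpy as np
import xarray

x = xarray.DataArray([1.0, 2.0]).chunk()
a_x = xarray.DataArray(np.full((20,), 30.0), dims=["new_z"]).chunk()

def f(d, test):
    # `test` arrives as an ordinary numpy array here, not a Future
    return d + test.mean()

y = xarray.apply_ufunc(
    f,
    x,
    a_x,                              # positional, not in kwargs
    input_core_dims=[[], ["new_z"]],  # consume new_z inside f
    dask="parallelized",
    output_dtypes=[np.float64],
)
result = y.compute()
```

Because `new_z` is a core dimension, `f` reduces over it (`test.mean()`) and the output keeps only the dimensions of `x`.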
I'm not sure I understand the code above. In my case I have an array of approximately 300k elements that each and every function call needs access to. That is why I tried to send the dataset to the cluster beforehand using scatter, but I cannot resolve the Future at the workers.
```python
new_data_future = xr.apply_ufunc(
    _copy_test,
    data,
    a_x,
    ...
)
```

instead of using kwargs. I've opened dask/distributed#7140 to simplify this. With it implemented, my snippet

```python
test = np.full((20,), 30)
a = da.from_array(test)
dsk = client.scatter(dict(a.dask), broadcast=True)
a = da.Array(dsk, name=a.name, chunks=a.chunks, dtype=a.dtype, meta=a._meta, shape=a.shape)
a_x = xarray.DataArray(a, dims=["new_z"])
```

would become

```python
test = np.full((20,), 30)
a_x = xarray.DataArray(test, dims=["new_z"]).chunk()
a_x = client.scatter(a_x)
```
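Until that issue is implemented, `client.scatter` on an xarray object hands back a plain `Future` rather than a lazy collection, so the only thing to do with it is resolve it again on the client, which round-trips the data. A minimal illustration of today's behavior (the array contents and `new_z` name are just placeholders):

```python
import distributed
import numpy as np
import xarray

client = distributed.Client(processes=False)

a_x = xarray.DataArray(np.full((20,), 30.0), dims=["new_z"])
fut = client.scatter(a_x)     # returns a Future wrapping the whole DataArray

kind = type(fut).__name__     # "Future", not a lazy DataArray
resolved = fut.result()       # pulls the data back to the client
client.close()
```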
I will try that, thanks for opening that issue. I do feel there is a need to revisit scatter's functionality and role, particularly around dynamic clusters. Having a better look at your initial comment, that may still work if you call […]
Not sure there's anything actionable here.
I think this thread is related to my problem, but I'm not 100% sure. I have a single xarray Dataset (holding multiple DataArrays) which I want to load into worker memory across a dask cluster, and then run a bunch of different computations on the same data. I guess it's up to dask to work out how it wants to distribute the chunks across worker memory, but one scheme I can imagine is each worker loading N_chunks / N_workers chunks for each DataArray in the Dataset: e.g. if there are 5 DataArrays in the Dataset and each DataArray is 20 chunks, then with 10 workers each worker would load 2 chunks from each DataArray into memory. Is this, or something like it, what a simple […] does?

(Note I have not done anything more than […].) The data loading seems pretty slow; I'm wondering if I should be heeding this warning and using […]
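For the multi-worker loading pattern described above, the usual approach is to persist the chunked Dataset rather than scatter it: `persist` materializes every chunk in worker memory, with the scheduler deciding which worker holds which chunk, and later computations reuse the resident chunks. A sketch with a hypothetical toy Dataset (two variables, four chunks each):

```python
import distributed
import numpy as np
import xarray

client = distributed.Client(processes=False)

# Toy stand-in for the real Dataset: 2 variables, 4 chunks of 5 each
ds = xarray.Dataset(
    {name: (("x",), np.arange(20.0)) for name in ["a", "b"]}
).chunk({"x": 5})

# persist() computes all chunks and keeps them in worker memory
ds = ds.persist()
total = float(ds["a"].sum())  # operates on the already-resident chunks
client.close()
```

Unlike `scatter`, this keeps the result a lazy xarray object, so it can be passed straight into further xarray operations.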
What is your issue?

I am trying to scatter a large array and pass it as a keyword argument to a function applied using `apply_ufunc`, but that is currently not working. The same function works if providing the actual array, but if providing the Future linked to the scattered data, the task fails.

Here is a minimal example to reproduce this issue: […]

I tried different variations of this, from explicitly calling `test.result()` to changing the way the Future was passed, but nothing worked. I also tried to raise exceptions within the function and various ways to print information, but that did not work either. This last point makes me think that when passing a Future, execution never actually reaches the scope of that function.

Am I trying to do something completely silly? Or is this unexpected behavior?