-
Notifications
You must be signed in to change notification settings - Fork 6.3k
[dask-on-ray] ValueError on read-only memory #10124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
cc @clarkzinzow, do you have any ideas? Also cc @ericl. |
@stephanie-wang Just ran the reproduction script and it worked for me off commit 7ffb37f711f0045f1d1199db139e31e58fcbbe00. In [1]: import ray
...: import dask
...: import dask.dataframe as dd
...: import pandas as pd
...: import numpy as np
...: from ray.experimental.dask import ray_dask_get
...:
...: import time
...:
...: N_ROWS = 1000
...: N_PARTITIONS = 100
...:
...: ray.init()
...:
...: df = dd.from_pandas(pd.DataFrame(np.random.randn(N_ROWS, 2), columns=['a', 'b']), npartitions=N_PARTITIONS)
...: start = time.time()
...: print(df.groupby('b').a.sum().compute(scheduler=ray_dask_get))
...: end = time.time()
...: print("ray", end - start)
...:
2020-08-15 00:26:50,891 INFO resource_spec.py:250 -- Starting Ray with 14.7 GiB memory available for workers and up to 7.36 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-08-15 00:26:51,499 INFO services.py:1215 -- View the Ray dashboard at localhost:8265
b
-1.135363 -0.535185
-1.075399 -0.164216
-0.661416 0.511669
-0.222915 -0.929146
0.416142 -0.553446
...
0.597348 0.610069
0.619706 -1.589376
0.723611 -2.171919
1.380745 0.196464
1.572565 0.296262
Name: a, Length: 1000, dtype: float64
ray 4.355887413024902 Dask collections generally don't allow in-place mutation, that shouldn't even be possible to express in a Dask graph, so I'm wondering what you're running into here... 🤔 |
I can reproduce, I believe it's due to some argument being passed to the task wrapper that is presented as a readonly Ray object. It's probably specific to particular dask/pandas/numpy versions. |
Do you mind giving your versions of Numpy, Dask and Pandas? I'm on Python 3.7.5 and:
|
Ah thanks for the traceback @ericl! |
dask==2.22.0 |
It looks like this is a recurring Pandas bug, most recent manifestation: pandas-dev/pandas#31710. |
Also this. It looks like the older bug was fixed for 0.24+, and the more recent bug was fixed for 1.1+. Edit: Wow, lots of these. |
Hmm I still see the issue for 1.1.0. Unless pandas has fixed itself to handle read-only arrays properly in all cases, we might have to add defensive copies for all numpy arrays passed as arguments to tasks. dask==2.22.0 |
@stephanie-wang as a workaround you can try this patch:
|
Upgrading to most recent Dask still works, upgrading Pandas breaks it. I'm guessing that this is another case in which Pandas isn't handling read-only arrays properly.
That seems terribly expensive, but until Pandas fixes this, that might be required. Although I'm pretty sure that this issue is exclusive to dataframes, so we'd only have to copy Pandas dataframe arguments; we've hit most of the NumPy surface with the Dask-Ray scheduler and have never hit this error. |
For this particular case, I think that we just need to add |
Hmm yeah, if we insert a copy only for pandas dataframes, that doesn't seem so bad (not sure if the copy will add that much overhead in the first place). We can also make it a config flag. |
I tested it out with I don't think we'll be seeing const fused type memory views in Cython 0.x anytime soon, so that nature of a fix won't be available in Pandas anytime soon. |
Note that this particular instance of this issue was fixed in Pandas 1.1.2: Would it be acceptable to close this with the recommendation that anyone encountering this issue on Pandas 1.{0,1}.x upgrade to 1.1.2+? |
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS): 0.9dev
I was trying out the new Ray backend plugin for Dask and ran into a read-only error. Looks like the Dask code tries to write in-place, but Ray doesn't allow it since objects are immutable.
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
If we cannot run your script, we cannot fix your issue.
The text was updated successfully, but these errors were encountered: