
dask_ml.decomposition.PCA: ValueError with data > 1 TB #592

Closed
demaheim opened this issue Dec 10, 2019 · 6 comments

Comments

@demaheim (Contributor) commented Dec 10, 2019

When I do dask_ml.decomposition.PCA().fit(x), where the array x has a size > 1 TB, I get the error ValueError: output array is read-only.

I use

dask-ml                   1.1.1
distributed               2.9.0

The script

from dask_jobqueue import SLURMCluster
from dask.distributed import Client
from dask_ml.decomposition import PCA
import dask.array as da

cluster = SLURMCluster()
nb_workers = 58
cluster.scale(nb_workers)
client = Client(cluster)
client.wait_for_workers(nb_workers)

x = da.random.random((1000000, 140000), chunks=(100000, 2000))
pca = PCA(n_components=64)
pca.fit(x)

gives the error

Traceback (most recent call last):
  File "value_error.py", line 48, in <module>
    pca.fit(x)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 190, in fit
    self._fit(X)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 338, in _fit
    raise e
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 325, in _fit
    singular_values,
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask/base.py", line 436, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 2573, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 1873, in gather
    asynchronous=asynchronous,
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 768, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 334, in sync
    raise exc.with_traceback(tb)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 318, in f
    result[0] = yield future
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 1729, in _gather
    raise exception.with_traceback(traceback)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/sklearn/utils/extmath.py", line 516, in svd_flip
    v *= signs[:, np.newaxis]
ValueError: output array is read-only
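
For scale, a rough size check (not part of the original report): da.random.random produces float64 values at 8 bytes per element, so the two shapes discussed in this issue sit on either side of roughly 1 TiB.

import dask.array as da

x_140k = da.random.random((1000000, 140000), chunks=(100000, 2000))
x_130k = da.random.random((1000000, 130000), chunks=(100000, 2000))

print(x_140k.nbytes)  # 1_120_000_000_000 bytes ≈ 1.12 TB (~1.02 TiB)
print(x_130k.nbytes)  # 1_040_000_000_000 bytes ≈ 1.04 TB (~0.95 TiB)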

Note that

  • If I use x = da.random.random((1000000, 130000), chunks=(100000, 2000)) (1.0 TB), the error does not appear.
  • When I look at the dashboard, the PCA seems to run fine and the error appears at the very end of the computation.
  • I temporarily fixed the error in extmath.py by changing
def svd_flip(u, v, u_based_decision=True):
    if u_based_decision:
        # columns of u, rows of v
        max_abs_cols = np.argmax(np.abs(u), axis=0)
        signs = np.sign(u[max_abs_cols, range(u.shape[1])])
        u *= signs
-        v *= signs[:, np.newaxis]
+        v_copy = np.copy(v)
+        v_copy *= signs[:, np.newaxis]
+        return u, v_copy
    else:

I don't think this is a good fix, because I assume the array v is locked (made read-only) by some other function.
Is there another way to fix the error?
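
A minimal standalone sketch of the failure, not from the original report: sklearn's svd_flip (as shown in the traceback) multiplies v in place, so it fails whenever v is backed by an immutable buffer, and copying v first, as in the patch above, sidesteps that. The frombuffer construction below is just one way to produce a read-only array for illustration; it is not dask_ml's actual code path.

import numpy as np
from sklearn.utils.extmath import svd_flip

u = np.random.rand(5, 3)
# Build a v that is backed by an immutable bytes buffer, hence read-only:
v = np.frombuffer(np.random.rand(3, 4).tobytes(), dtype=np.float64).reshape(3, 4)
assert not v.flags.writeable

# svd_flip(u, v)                # raises ValueError: output array is read-only
u2, v2 = svd_flip(u, v.copy())  # a writable copy works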

@TomAugspurger (Member)

Do you know why the array is readonly? I don't immediately see a reason why the size of the data would matter, but I may be missing something.

@mrocklin (Member)

There may have been some Cython thing at some point. I can't remember who brought this up originally. @jakirkham were you involved in this?

@demaheim (Contributor, Author)

I also don't understand why the array is readonly.
Can you reproduce the error on your HPC?
Is this maybe related to dask/distributed#1978?

@TomAugspurger (Member)

> Can you reproduce the error on your HPC?

I don't have access to an HPC machine.

dask/distributed#1978 does sound related. Does ensuring that all your dependencies are built against Cython 0.28 or newer fix things?

dask/distributed#1978 (comment) is using PCA as well. Let's continue the discussion over there.
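
Not from the thread, just a rough first check: the question above is about the Cython version the dependencies were compiled with, which the snippet below does not prove, but seeing the installed version is a quick starting point.

import Cython
print(Cython.__version__)  # installed Cython version; the suggestion above is 0.28 or newer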

@jakirkham (Member)

> @jakirkham were you involved in this?

IDK about involved. 😉 We did discuss a similar issue before that Tom has referenced.

@jakirkham (Member)

> I also don't understand why the array is readonly.

Because we send bytes over the wire and bytes are immutable.

In [1]: memoryview(b"abc").readonly                                             
Out[1]: True
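
An added illustration, not from the thread: the same immutability surfaces at the NumPy level. An array constructed over a bytes buffer is not writeable, so an in-place operation on it raises exactly the ValueError seen in the traceback, which is what happens to v inside svd_flip.

import numpy as np

a = np.frombuffer(b"\x00" * 8, dtype="float64")  # array over an immutable bytes buffer
print(a.flags.writeable)  # False
a *= 2                    # ValueError: output array is read-only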
