When using cuML with cudf.pandas, I (or code in third-party libraries) will sometimes call .values on a DataFrame/Series before sending it to a cuML operator. For example, the getting started example in the umap-learn documentation does this.
Because cuML provides input/output type consistency by default (which is great), with cudf.pandas active I end up getting a raw CuPy array out instead of a cudf.pandas-wrapped proxy NumPy array. This can cause problems downstream, because I'm now "unexpectedly" holding objects in device memory: any code that expects a NumPy array (or calls np.asarray, as scikit-learn does) will fail.
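Here is a minimal sketch of how this plays out (assuming cudf.pandas and cuML are installed; the data and the UMAP call are illustrative placeholders, not the exact umap-learn example):

import cudf.pandas
cudf.pandas.install()

import numpy as np
import pandas as pd  # proxied by cudf.pandas
from cuml.manifold import UMAP

df = pd.DataFrame(np.random.default_rng(0).random((100, 4)))

# .values hands a device-backed array to cuML; because cuML mirrors the
# input type on output, the embedding comes back as a raw CuPy array
# rather than a cudf.pandas-wrapped proxy NumPy array.
embedding = UMAP(n_components=2).fit_transform(df.values)

# Anything that assumes host memory then breaks, e.g. scikit-learn
# internals calling np.asarray on the result:
np.asarray(embedding)  # raises TypeError (implicit device-to-host copy)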
I'm not sure what the right longer-term path forward here is, but I think this is something we may want to think about in the general case (even if we only address this specifically for cuML, for now).
I've also seen similar issues when mixing cudf.pandas and cuML like the following:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=50)
File "/datasets/bzaitlen/miniconda3/envs/foobar/lib/python3.10/site-packages/cuml/model_selection/_split.py", line 342, in train_test_split
raise TypeError(
TypeError: X needs to be either a cuDF DataFrame, Series or a cuda_array_interface compliant array.
and
File "/datasets/bzaitlen/miniconda3/envs/foobar/lib/python3.10/site-packages/cudf/core/frame.py", line 403, in array
raise TypeError(
TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy()
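For reference, here is a hypothetical sketch of how the first traceback can arise (the code that actually produced it isn't shown above; x and y are placeholders). With cudf.pandas active, the proxied objects are presumably not recognized by cuML's input validation as cuDF objects or __cuda_array_interface__-compliant arrays, which produces the TypeError above:

import cudf.pandas
cudf.pandas.install()

import pandas as pd  # proxied by cudf.pandas
from cuml.model_selection import train_test_split

x = pd.DataFrame({"f0": range(100), "f1": range(100)})
y = pd.Series(range(100))

# cuML's train_test_split validates its inputs against cuDF types and
# __cuda_array_interface__; the cudf.pandas proxies can fail that check.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=50
)

The second traceback is the flip side of the same mismatch: something in the call chain ended up calling np.asarray on a cudf object, which cudf refuses to convert to host memory implicitly.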
cc @dantegd @quasiben @shwina @galipremsagar