[FEA] Enhance cuML/cudf.pandas interop such that cudf.pandas wrapped numpy arrays don't cause cuML operators to return raw CuPy arrays #5784

Closed
beckernick opened this issue Feb 23, 2024 · 2 comments · Fixed by #5861
Labels: ? - Needs Triage, feature request

beckernick commented Feb 23, 2024

When using cuML with cudf.pandas, I (or code in third-party libraries) will sometimes call .values on a DataFrame/Series before passing it to a cuML operator. For example, the getting-started example in the umap-learn documentation does this.

Because cuML provides input/output type consistency by default (which is great), with cudf.pandas active I end up getting a raw CuPy array out instead of a cudf.pandas-wrapped proxy NumPy array. I'm now "unexpectedly" holding an object in device memory, which can break downstream code: anything that expects a NumPy array (or calls np.asarray, as scikit-learn does) will fail.

I'm not sure what the right long-term path forward is, but I think we should consider the general case, even if we only address this for cuML for now.

import cudf.pandas
cudf.pandas.install()
import pandas as pd
from cuml.manifold import umap

df = pd.DataFrame({"a":range(5), "b":range(5)})
reducer = umap.UMAP()
print(type(reducer.fit_transform(df)))

reducer = umap.UMAP()
print(type(reducer.fit_transform(df.values)))

Output:

<class 'pandas.core.frame.DataFrame'>
<class 'cupy.ndarray'>
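
The type mirroring in this example is cuML's default, but cuML estimators also accept an `output_type` argument. As a possible stopgap (not the fix this issue asks for), the sketch below requests a host NumPy array regardless of what was passed in. The helper name `fit_transform_to_numpy` is made up for illustration, and the behaviour has not been checked here against every cudf.pandas proxy type.

```python
def fit_transform_to_numpy(data):
    """Run cuML UMAP and ask for a host NumPy array back, whatever the input type."""
    # Imported lazily so this sketch can be loaded on a machine without a GPU.
    from cuml.manifold import UMAP

    # output_type="numpy" overrides cuML's default of mirroring the input type.
    reducer = UMAP(output_type="numpy")
    return reducer.fit_transform(data)


# Usage, assuming a GPU environment with cuML installed:
#   transformed = fit_transform_to_numpy(df.values)
```

This only changes what the caller receives. Third-party code that constructs the estimator itself would still get the default behaviour.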

Passing the raw CuPy array returned for df.values to matplotlib then fails:

import cudf.pandas
cudf.pandas.install()
import pandas as pd
from cuml.manifold import umap
import matplotlib.pyplot as plt

df = pd.DataFrame({"a":range(5), "b":range(5)})
reducer = umap.UMAP()
transformed = reducer.fit_transform(df.values)
print(type(transformed))

plt.scatter(
    transformed[:, 0],
    transformed[:, 1],
)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 12
      9 transformed = reducer.fit_transform(df.values)
     10 print(type(transformed))
---> 12 plt.scatter(
     13     transformed[:, 0],
     14     transformed[:, 1],
     15 )

File /raid/nicholasb/miniconda3/envs/rapids-24.02/lib/python3.10/site-packages/matplotlib/pyplot.py:3699, in scatter(x, y, s, c, marker, cmap, norm, vmin, vmax, alpha, linewidths, edgecolors, plotnonfinite, data, **kwargs)
   3680 @_copy_docstring_and_deprecators(Axes.scatter)
   3681 def scatter(
   3682     x: float | ArrayLike,
   (...)
   3697     **kwargs,
   3698 ) -> PathCollection:
-> 3699     __ret = gca().scatter(
   3700         x,
   3701         y,
   3702         s=s,
   3703         c=c,
   3704         marker=marker,
   3705         cmap=cmap,
   3706         norm=norm,
   3707         vmin=vmin,
   3708         vmax=vmax,
   3709         alpha=alpha,
   3710         linewidths=linewidths,
   3711         edgecolors=edgecolors,
   3712         plotnonfinite=plotnonfinite,
   3713         **({"data": data} if data is not None else {}),
   3714         **kwargs,
   3715     )
   3716     sci(__ret)
   3717     return __ret

File /raid/nicholasb/miniconda3/envs/rapids-24.02/lib/python3.10/site-packages/matplotlib/__init__.py:1465, in _preprocess_data.<locals>.inner(ax, data, *args, **kwargs)
   1462 @functools.wraps(func)
   1463 def inner(ax, *args, data=None, **kwargs):
   1464     if data is None:
-> 1465         return func(ax, *map(sanitize_sequence, args), **kwargs)
   1467     bound = new_sig.bind(ax, *args, **kwargs)
   1468     auto_label = (bound.arguments.get(label_namer)
   1469                   or bound.kwargs.get(label_namer))

File /raid/nicholasb/miniconda3/envs/rapids-24.02/lib/python3.10/site-packages/matplotlib/axes/_axes.py:4652, in Axes.scatter(self, x, y, s, c, marker, cmap, norm, vmin, vmax, alpha, linewidths, edgecolors, plotnonfinite, **kwargs)
   4649 x, y = self._process_unit_info([("x", x), ("y", y)], kwargs)
   4650 # np.ma.ravel yields an ndarray, not a masked array,
   4651 # unless its argument is a masked array.
-> 4652 x = np.ma.ravel(x)
   4653 y = np.ma.ravel(y)
   4654 if x.size != y.size:

File /raid/nicholasb/miniconda3/envs/rapids-24.02/lib/python3.10/site-packages/numpy/ma/core.py:6852, in _frommethod.__call__(self, a, *args, **params)
   6849     args = list(args)
   6850     a, args[0] = args[0], a
-> 6852 marr = asanyarray(a)
   6853 method_name = self.__name__
   6854 method = getattr(type(marr), method_name, None)

File /raid/nicholasb/miniconda3/envs/rapids-24.02/lib/python3.10/site-packages/numpy/ma/core.py:8132, in asanyarray(a, dtype)
   8130 if isinstance(a, MaskedArray) and (dtype is None or dtype == a.dtype):
   8131     return a
-> 8132 return masked_array(a, dtype=dtype, copy=False, keep_mask=True, subok=True)

File /raid/nicholasb/miniconda3/envs/rapids-24.02/lib/python3.10/site-packages/numpy/ma/core.py:2820, in MaskedArray.__new__(cls, data, mask, dtype, copy, subok, ndmin, fill_value, keep_mask, hard_mask, shrink, order)
   2811 """
   2812 Create a new masked array from scratch.
   2813 
   (...)
   2817 
   2818 """
   2819 # Process data.
-> 2820 _data = np.array(data, dtype=dtype, copy=copy,
   2821                  order=order, subok=True, ndmin=ndmin)
   2822 _baseclass = getattr(data, '_baseclass', type(_data))
   2823 # Check that we're not erasing the mask.

File cupy/_core/core.pyx:1475, in cupy._core.core._ndarray_base.__array__()

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
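
The error message itself points at the manual workaround: call `.get()` before handing the array to host-only code. A minimal sketch of that idea follows, assuming the result is an array-like object. `to_host` and `FakeDeviceArray` are hypothetical names. The fake class only stands in for `cupy.ndarray` so the snippet runs without a GPU, and a plain list stands in for the NumPy array that CuPy's `.get()` would return.

```python
class FakeDeviceArray:
    """Hypothetical stand-in for cupy.ndarray so the sketch runs without a GPU."""

    # Real CuPy arrays expose this attribute; the sketch only checks that it exists.
    __cuda_array_interface__ = {}

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, *args, **kwargs):
        # Mirrors CuPy's refusal to convert to NumPy implicitly.
        raise TypeError("Implicit conversion to a NumPy array is not allowed.")

    def get(self):
        # CuPy's .get() returns a NumPy array; a plain list stands in for it here.
        return list(self._values)


def to_host(arr):
    """Copy device arrays to host via .get(); return anything else unchanged."""
    is_device = hasattr(arr, "__cuda_array_interface__")
    if is_device and callable(getattr(arr, "get", None)):
        return arr.get()
    return arr


transformed = FakeDeviceArray([0.1, 0.2, 0.3])
host = to_host(transformed)
print(type(host).__name__)
```

With real CuPy output, the plotting call would become `plt.scatter(to_host(transformed[:, 0]), to_host(transformed[:, 1]))`. This does not help when the conversion happens inside a third-party library, which is the case this issue is really about.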

cc @dantegd @quasiben @shwina @galipremsagar

beckernick added the feature request and ? - Needs Triage labels on Feb 23, 2024
@quasiben

I've also seen similar issues when mixing cudf.pandas and cuML like the following:

    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=50)
  File "/datasets/bzaitlen/miniconda3/envs/foobar/lib/python3.10/site-packages/cuml/model_selection/_split.py", line 342, in train_test_split
    raise TypeError(
TypeError: X needs to be either a cuDF DataFrame, Series or                             a cuda_array_interface compliant array.

and

File "/datasets/bzaitlen/miniconda3/envs/foobar/lib/python3.10/site-packages/cudf/core/frame.py", line 403, in __array__
raise TypeError(
TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy()

@betatim

betatim commented Apr 23, 2024

I've had a look at this, will open a PR.
