When using cuML with cudf.pandas, I (or code in third-party libraries) will sometimes call .values on a DataFrame/Series before sending it to a cuML operator. For example, the getting started example in the umap-learn documentation does this.
Because cuML provides input/output type consistency by default (which is great), with cudf.pandas active I end up getting a raw CuPy array out instead of a cudf.pandas-wrapped proxy NumPy array. This can cause problems downstream, because I'm now "unexpectedly" holding objects in device memory: any code that expects a NumPy array (or calls np.asarray, as scikit-learn does) will fail.
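Here is a minimal sketch of how this plays out (assuming cudf.pandas and cuML are installed; the data and the UMAP call are illustrative placeholders, not the exact umap-learn example):

import cudf.pandas
cudf.pandas.install()

import numpy as np
import pandas as pd  # proxied by cudf.pandas
from cuml.manifold import UMAP

df = pd.DataFrame(np.random.default_rng(0).random((100, 4)))

# .values hands a device-backed array to cuML; because cuML mirrors the
# input type on output, the embedding comes back as a raw CuPy array
# rather than a cudf.pandas-wrapped proxy NumPy array.
embedding = UMAP(n_components=2).fit_transform(df.values)

# Anything that assumes host memory then breaks, e.g. scikit-learn
# internals calling np.asarray on the result:
np.asarray(embedding)  # raises TypeError (implicit device-to-host copy)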
I'm not sure what the right longer-term path forward here is, but I think this is something we may want to think about in the general case (even if we only address this specifically for cuML, for now).
I've also seen similar issues when mixing cudf.pandas and cuML like the following:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=50)
File "/datasets/bzaitlen/miniconda3/envs/foobar/lib/python3.10/site-packages/cuml/model_selection/_split.py", line 342, in train_test_split
raise TypeError(
TypeError: X needs to be either a cuDF DataFrame, Series or a cuda_array_interface compliant array.
and
File "/datasets/bzaitlen/miniconda3/envs/foobar/lib/python3.10/site-packages/cudf/core/frame.py", line 403, in array
raise TypeError(
TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy()
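For reference, here is a hypothetical sketch of how the first traceback can arise (the code that actually produced it isn't shown above; x and y are placeholders). With cudf.pandas active, the proxied objects are presumably not recognized by cuML's input validation as cuDF objects or __cuda_array_interface__-compliant arrays, which produces the TypeError above:

import cudf.pandas
cudf.pandas.install()

import pandas as pd  # proxied by cudf.pandas
from cuml.model_selection import train_test_split

x = pd.DataFrame({"f0": range(100), "f1": range(100)})
y = pd.Series(range(100))

# cuML's train_test_split validates its inputs against cuDF types and
# __cuda_array_interface__; the cudf.pandas proxies can fail that check.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=50
)

The second traceback is the flip side of the same mismatch: something in the call chain ended up calling np.asarray on a cudf object, which cudf refuses to convert to host memory implicitly.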
cc @dantegd @quasiben @shwina @galipremsagar