[KeyError: 'dtype' and AttributeError: DataFrame object has no attribute dtype when using cuML's ColumnTransformer with cuDF DataFrame] #6183

Open
allisond-nvidia opened this issue Dec 16, 2024 · 1 comment
Labels: bug (Something isn't working)

allisond-nvidia commented Dec 16, 2024

Describe the bug
When cuML's ColumnTransformer is given a cuDF.DataFrame as input, fit_transform raises KeyError: 'dtype', followed by AttributeError: 'DataFrame' object has no attribute 'dtype'. The failure appears to occur when a SimpleImputer is used as a transformer inside the ColumnTransformer.

Steps/Code to reproduce bug
Below is a minimal reproducible example:

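import pandas as pd
from cuml.compose import ColumnTransformer
from cuml.preprocessing import SimpleImputer

# Not shown in this report: num_cols_zero, cat_cols_none, and cols_mode
# (presumably lists of column names) and all_data_tofill_copy (the input DataFrame).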

num_imputer = SimpleImputer(strategy = 'constant', fill_value = 0)
cat_imputer = SimpleImputer(strategy = 'constant', fill_value = 'None', missing_values = pd.NA)
mode_imputer = SimpleImputer(strategy = 'most_frequent', missing_values = pd.NA)

preprocessor = ColumnTransformer(
    transformers = [
        ('num', num_imputer, num_cols_zero),
        ('cat', cat_imputer, cat_cols_none),
        ('mod', mode_imputer, cols_mode)
    ]
)

processed_data = preprocessor.fit_transform(all_data_tofill_copy)

processed_df = pd.DataFrame(processed_data, columns = num_cols_zero + cat_cols_none + cols_mode)

processed_df[num_cols_zero] = processed_df[num_cols_zero].astype('float64')
processed_df[cat_cols_none + cols_mode] = processed_df[cat_cols_none + cols_mode].astype('object')

processed_df.head()

Expected behavior
The ColumnTransformer should process the cuDF.DataFrame correctly without raising errors. Missing values should be imputed using the defined SimpleImputer strategies, and the output should be consistent with the column definitions in the ColumnTransformer.

Actual behavior
KeyError: 'dtype'
AttributeError: 'DataFrame' object has no attribute dtype
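
The two errors are linked rather than independent. The traceback in this report shows an attribute lookup falling back to a column lookup (`__getattr__` calling `self[key]`), with the resulting KeyError converted into an AttributeError. The following is a minimal pure-Python sketch of that pattern, not cuDF code: the class `TinyFrame` and its columns are hypothetical and exist only for illustration.

```python
# Hypothetical stand-in for a frame class; this is NOT cuDF code.
class TinyFrame:
    """Toy container that exposes its columns as attributes."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, key):
        # Raises KeyError when no column has this label.
        return self._columns[key]

    def __getattr__(self, key):
        # Python calls this only after normal attribute lookup has failed.
        if key.startswith("_"):
            raise AttributeError(key)
        try:
            return self[key]
        except KeyError:
            raise AttributeError(
                f"{type(self).__name__} object has no attribute {key}"
            )


frame = TinyFrame({"a": [1, 2], "b": [3, 4]})
column_a = frame.a  # resolved through __getattr__ as a column lookup

try:
    frame.dtype  # no column is named "dtype"
except AttributeError as err:
    message = str(err)
    # The original KeyError survives as the implicit context, which is why
    # the traceback prints "During handling of the above exception...".
    chained = err.__context__
```

In this sketch, any caller that asks the frame for `dtype` sees an AttributeError whose context is a KeyError for the same name, which matches the order of the errors reported above.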

Environment details

  • Environment location: Bare-metal
  • Linux Distro/Architecture: Ubuntu 20.04 amd64
  • GPU Model/Driver: NVIDIA A100 and driver 470.57
  • CUDA Version: 11.8
  • Method of cuML install: conda
    • conda list output
      Output from conda list truncated for brevity
      cudf 23.02
      cuml 23.02

Error traceback

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/.venv/lib/python3.12/site-packages/cudf/utils/utils.py:228, in GetAttrGetItemMixin.__getattr__(self, key)
    227 try:
--> 228     return self[key]
    229 except KeyError:

File ~/.venv/lib/python3.12/site-packages/cudf/utils/performance_tracking.py:51, in _performance_tracking.<locals>.wrapper(*args, **kwargs)
     44     stack.enter_context(
     45         nvtx.annotate(
     46             message=func.__qualname__,
   (...)
     49         )
     50     )
---> 51 return func(*args, **kwargs)

File ~/.venv/lib/python3.12/site-packages/cudf/core/dataframe.py:1347, in DataFrame.__getitem__(self, arg)
   1346 if _is_scalar_or_zero_d_array(arg) or isinstance(arg, tuple):
-> 1347     out = self._get_columns_by_label(arg)
   1348     if is_scalar(arg):

File ~/.venv/lib/python3.12/site-packages/cudf/utils/performance_tracking.py:51, in _performance_tracking.<locals>.wrapper(*args, **kwargs)
     44     stack.enter_context(
     45         nvtx.annotate(
     46             message=func.__qualname__,
   (...)
     49         )
     50     )
---> 51 return func(*args, **kwargs)

File ~/.venv/lib/python3.12/site-packages/cudf/core/frame.py:358, in Frame._get_columns_by_label(self, labels)
    353 """
    354 Returns columns of the Frame specified by `labels`.
    355 
    356 Akin to cudf.DataFrame(...).loc[:, labels]
    357 """
--> 358 return self._from_data_like_self(self._data.select_by_label(labels))

File ~/.venv/lib/python3.12/site-packages/cudf/core/column_accessor.py:401, in ColumnAccessor.select_by_label(self, key)
    400         return self._select_by_label_with_wildcard(key)
--> 401 return self._select_by_label_grouped(key)

File ~/.venv/lib/python3.12/site-packages/cudf/core/column_accessor.py:563, in ColumnAccessor._select_by_label_grouped(self, key)
    562 def _select_by_label_grouped(self, key: abc.Hashable) -> Self:
--> 563     result = self._grouped_data[key]
    564     if isinstance(result, column.ColumnBase):
    565         # self._grouped_data[key] = self._data[key] so skip validation

KeyError: 'dtype'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
File ~/.venv/lib/python3.12/site-packages/cudf/pandas/fast_slow_proxy.py:928, in _fast_slow_function_call(func, *args, **kwargs)
    927 fast_args, fast_kwargs = _fast_arg(args), _fast_arg(kwargs)
--> 928 result = func(*fast_args, **fast_kwargs)
    929 if result is NotImplemented:
    930     # try slow path

File ~/.venv/lib/python3.12/site-packages/cudf/pandas/fast_slow_proxy.py:26, in call_operator(fn, args, kwargs)
     25 def call_operator(fn, args, kwargs):
---> 26     return fn(*args, **kwargs)

File ~/.venv/lib/python3.12/site-packages/cudf/utils/utils.py:230, in GetAttrGetItemMixin.__getattr__(self, key)
    229 except KeyError:
--> 230     raise AttributeError(
    231         f"{type(self).__name__} object has no attribute {key}"
    232     )

AttributeError: DataFrame object has no attribute dtype

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
Cell In[65], line 19
      9 mode_imputer = SimpleImputer(strategy = 'most_frequent', missing_values = pd.NA)
     11 preprocessor = ColumnTransformer(
     12     transformers = [
     13         ('num', num_imputer, num_cols_zero),
   (...)
     16     ]
     17 )
---> 19 processed_data = preprocessor.fit_transform(all_data_tofill_copy)
     21 processed_df = pd.DataFrame(processed_data, columns = num_cols_zero + cat_cols_none + cols_mode)
     23 processed_df[num_cols_zero] = processed_df[num_cols_zero].astype('float64')

File ~/.venv/lib/python3.12/site-packages/cuml/internals/api_decorators.py:188, in _make_decorator_function.<locals>.decorator_function.<locals>.decorator_closure.<locals>.wrapper(*args, **kwargs)
    185     set_api_output_dtype(output_dtype)
    187 if process_return:
--> 188     ret = func(*args, **kwargs)
    189 else:
    190     return func(*args, **kwargs)

File ~/.venv/lib/python3.12/site-packages/cuml/_thirdparty/sklearn/preprocessing/_column_transformer.py:894, in ColumnTransformer.fit_transform(self, X, y)
    891 self._validate_column_callables(X)
...
   6297 ):
   6298     return self[name]
-> 6299 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'dtype'

Additional context
This issue appears to be caused by an incompatibility between cuDF.DataFrame and the underlying sklearn-based SimpleImputer, which expects a pandas-like interface. Specifically:

  • The SimpleImputer attempts to access the dtype attribute, which a cuDF.DataFrame does not expose directly.
  • The ColumnTransformer does not handle this case, so the error propagates to the caller.
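
As a point of comparison, a plain pandas DataFrame also has no frame-level `dtype` attribute: it exposes per-column types through `dtypes`, and only a single column (a Series) has `dtype`. The sketch below assumes pandas is installed and uses a made-up two-column frame. It does not call cuML and is not a reproduction of this bug; it only illustrates the attribute difference.

```python
import pandas as pd

# Illustrative data only; not taken from the report.
df = pd.DataFrame({"x": [1.0, None], "y": ["a", None]})

# A DataFrame has no single dtype, so this lookup fails.
has_frame_dtype = hasattr(df, "dtype")

# Per-column types are available through the plural attribute.
per_column = df.dtypes

# A single column is a Series and does carry a dtype.
column_dtype = df["x"].dtype
```

Code that needs a dtype from a DataFrame therefore has to go through `dtypes` or work column by column.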

allisond-nvidia added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels Dec 16, 2024

wphicks commented Dec 16, 2024

Confirmed this is a cuML bug. It reproduces with pandas alone.

wphicks removed the "? - Needs Triage" (Need team to review and classify) label Dec 16, 2024