Error with string features (pandas) #131

marcoslbueno · 2021-11-01T13:01:36Z

I am using a classification dataset with a mixture of string and category features in a pandas dataframe, and this breaks GAMA (see MRE below).

import openml 
from sklearn.model_selection import train_test_split
import gama

if __name__ == '__main__':
    did = 42530
    data = openml.datasets.get_dataset(did)
    X, y, _, _ = data.get_data(dataset_format='dataframe', target=data.default_target_attribute)

    # Drop rows whose target is missing (X is filtered first, while y still has its original index).
    X = X[y.notnull()]
    y = y[y.notnull()]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    print("loaded data")
    
    time_fold = 5*60
    metric = 'accuracy'
    
    clf = gama.GamaClassifier(max_total_time=time_fold, 
                            random_state = 1,
                            scoring=metric, 
                            n_jobs=1, 
                            store='nothing')

    clf.fit(X_train, y_train)
    print("finished fit.")

    proba_predictions = clf.predict_proba(X_test)
    print("finished predictions test data.")

The error I get is

loaded data
Traceback (most recent call last):
  File "mre_gama.py", line 39, in <module>
    clf.fit(X_train, y_train)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/gama/GamaClassifier.py", line 134, in fit
    super().fit(x, y, *args, **kwargs)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/gama/gama.py", line 549, in fit
    self.model = self._post_processing.post_process(
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/gama/postprocessing/best_fit.py", line 27, in post_process
    return self._selected_individual.pipeline.fit(x, y)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/joblib/memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/base.py", line 702, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/impute/_base.py", line 288, in fit
    X = self._validate_input(X, in_fit=True)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/impute/_base.py", line 260, in _validate_input
    raise new_ve from None
ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float: 'Midwest'

The problem is solved when I convert the string features (in this case, 0 and 22) to category. I would think it would be best if GAMA could do this automatically, since it is an apparently simple conversion.
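For reference, a minimal sketch of that workaround. It assumes the string features are stored with `object` dtype; the helper name `object_columns_to_category` and the toy data are hypothetical and not part of GAMA or the dataset above.

```python
import pandas as pd


def object_columns_to_category(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every object-dtype column cast to 'category'."""
    converted = df.copy()
    # select_dtypes picks out the columns pandas stores as generic Python objects.
    object_cols = converted.select_dtypes(include="object").columns
    for col in object_cols:
        converted[col] = converted[col].astype("category")
    return converted


# Toy stand-in for the real data: one string feature (object dtype), one numeric.
demo = pd.DataFrame(
    {
        "region": pd.Series(["Midwest", "South", None], dtype=object),
        "income": [1.0, 2.5, 3.0],
    }
)
demo_converted = object_columns_to_category(demo)
print(demo_converted.dtypes)
```

Applying the conversion to the full `X` before `train_test_split` should keep the category sets of the train and test frames consistent.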


PGijsbers · 2021-11-01T14:45:35Z

Thanks for raising the issue! This error stems from a design assumption: because DataFrames carry type annotations (their dtype), GAMA expects those annotations to be correct (use unannotated numpy arrays otherwise). By providing a feature that is explicitly non-categorical (technically object), you go against this assumption. This raises an error (although a bad and late one, see #132) because GAMA can't work with an object-dtype series.

If you want feature type inference, consider passing the data in numpy format:

- clf.fit(X_train, y_train)
+ clf.fit(X_train.values, y_train.values)

- proba_predictions = clf.predict_proba(X_test)
+ proba_predictions = clf.predict_proba(X_test.values)

By design, I think it is good to assume that the user is an expert on the data: they can help the AutoML system with data type annotation. However, expanding the interface to allow inference for pandas object series when explicitly requested (e.g. infer_objects=True) sounds reasonable to me. What do you think?
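To make the proposal concrete, here is a rough sketch of what such an opt-in flag could do. Everything in it is an assumption for illustration: `prepare_features` is a hypothetical helper, not GAMA's API, and `infer_objects` is only the parameter name floated above. Without the flag it fails early with a readable message; with the flag it casts object columns to category.

```python
import pandas as pd


def prepare_features(x: pd.DataFrame, infer_objects: bool = False) -> pd.DataFrame:
    """Hypothetical pre-fit check; not part of GAMA."""
    # Columns whose dtype is the generic Python object type (e.g. raw strings).
    object_cols = list(x.columns[x.dtypes == object])
    if not object_cols:
        return x
    if not infer_objects:
        # Fail before any search starts, and name the offending columns.
        raise ValueError(
            f"Columns with object dtype are not supported: {object_cols}. "
            "Convert them to 'category' or pass infer_objects=True."
        )
    # astype with a mapping returns a new frame; the input is left unchanged.
    return x.astype({col: "category" for col in object_cols})


frame = pd.DataFrame(
    {
        "region": pd.Series(["Midwest", "South"], dtype=object),
        "age": [30, 41],
    }
)
prepared = prepare_features(frame, infer_objects=True)
print(prepared.dtypes)
```

Keeping the default at `False` preserves the current assumption that dtype annotations are deliberate, while still surfacing the problem before the search rather than after it.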

marcoslbueno · 2021-11-03T18:21:05Z

Thanks for replying! Indeed, with your suggestion GAMA was able to finish without errors.

I think that adding a parameter like infer_objects=True makes a lot of sense, since users might be unsure about the column types of the dataset (even when using dataframes) and/or may not want to check them.

PGijsbers mentioned this issue Nov 1, 2021

Error in preprocessing pipeline may go undetected until after search #132

Open

PGijsbers added this to the v22.1+ milestone Jul 27, 2022

PGijsbers modified the milestones: v22.1+, v22.1 Sep 16, 2022
