I am using a classification dataset with a mixture of string and category features in a pandas DataFrame, and GAMA fails on it (see the MRE below).
import openml
from sklearn.model_selection import train_test_split
import gama

if __name__ == '__main__':
    did = 42530
    data = openml.datasets.get_dataset(did)
    X, y, _, _ = data.get_data(dataset_format='dataframe', target=data.default_target_attribute)
    # Drop rows whose target value is missing.
    X = X[y.notnull()]
    y = y[y.notnull()]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    print("loaded data")

    time_fold = 5 * 60  # search budget in seconds
    metric = 'accuracy'
    clf = gama.GamaClassifier(max_total_time=time_fold,
                              random_state=1,
                              scoring=metric,
                              n_jobs=1,
                              store='nothing')
    clf.fit(X_train, y_train)
    print("finished fit.")
    proba_predictions = clf.predict_proba(X_test)
    print("finished predictions test data.")
The error I get is
loaded data
Traceback (most recent call last):
  File "mre_gama.py", line 39, in <module>
    clf.fit(X_train, y_train)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/gama/GamaClassifier.py", line 134, in fit
    super().fit(x, y, *args, **kwargs)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/gama/gama.py", line 549, in fit
    self.model = self._post_processing.post_process(
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/gama/postprocessing/best_fit.py", line 27, in post_process
    return self._selected_individual.pipeline.fit(x, y)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/joblib/memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/base.py", line 702, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/impute/_base.py", line 288, in fit
    X = self._validate_input(X, in_fit=True)
  File "/Users/marcoslpbueno/automlpy/gaenv/lib/python3.8/site-packages/sklearn/impute/_base.py", line 260, in _validate_input
    raise new_ve from None
ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float: 'Midwest'
The problem is solved when I convert the string features (in this case, columns 0 and 22) to category. I think it would be best if GAMA did this automatically, since it appears to be a simple conversion.
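The workaround described above can be sketched with pandas alone. This is a minimal sketch and not part of GAMA: the helper name object_columns_to_category is hypothetical, and it assumes that every object-dtype column really holds categorical strings.

```python
import pandas as pd


def object_columns_to_category(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every object-dtype column cast to 'category'."""
    converted = df.copy()
    # Columns holding raw Python strings are stored with dtype 'object'.
    object_columns = converted.select_dtypes(include=["object"]).columns
    for column in object_columns:
        converted[column] = converted[column].astype("category")
    return converted


# Toy data standing in for the OpenML dataset; 'region' mimics a string feature.
X_toy = pd.DataFrame({
    "region": pd.Series(["Midwest", "South", "Midwest"], dtype=object),
    "age": [31, 45, 27],
})
X_fixed = object_columns_to_category(X_toy)
print(X_fixed.dtypes)
```

With the MRE's names, the call would be X = object_columns_to_category(X) before train_test_split.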
Thanks for raising the issue! GAMA assumes that the type annotations a DataFrame provides (its dtypes) are correct; if you do not want to annotate, use unannotated numpy data instead. By providing a feature that is explicitly non-categorical (technically of dtype object), you go against this assumption. GAMA can't work with an object-type Series, so this raises an error, although a bad and late one (#132).
If you want feature type inference, consider passing the data in numpy format.
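A minimal sketch of that suggestion, assuming (as the reply states) that GAMA accepts NumPy arrays in fit and infers feature types from them. The helper fit_on_numpy is hypothetical and works with any estimator that exposes fit(X, y); DataFrame.to_numpy() drops the dtype annotations.

```python
import pandas as pd


def fit_on_numpy(estimator, X: pd.DataFrame, y: pd.Series):
    """Fit estimator on plain NumPy arrays, discarding pandas dtype annotations."""
    X_array = X.to_numpy()
    y_array = y.to_numpy()
    estimator.fit(X_array, y_array)
    return estimator


# Usage with the MRE's names (not run here; requires gama to be installed):
# clf = gama.GamaClassifier(max_total_time=5 * 60, scoring='accuracy', n_jobs=1)
# fit_on_numpy(clf, X_train, y_train)
# proba_predictions = clf.predict_proba(X_test.to_numpy())
```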
By design, I think it is good to assume that the user is an expert on the data: they can help the AutoML system with data type annotation. However, expanding the interface to infer the types of pandas object Series when explicitly requested (e.g. infer_objects=True) sounds reasonable to me. What do you think?
Thanks for replying! Indeed, with your suggestion GAMA was able to finish without errors.
I think that adding a parameter like infer_objects=True makes a lot of sense, since users might be unsure about the column types of the dataset (even when using dataframes) and/or might not want to check them.
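One way such an opt-in could behave, as a rough sketch only: infer_objects is the name floated in this thread, not an existing GAMA parameter, and the function check_feature_types is hypothetical. Without the flag it fails early with an explicit message instead of the late imputer error; with the flag it casts object columns to category.

```python
import pandas as pd


def check_feature_types(X: pd.DataFrame, infer_objects: bool = False) -> pd.DataFrame:
    """Validate feature dtypes up front; optionally cast object columns to category."""
    object_columns = list(X.select_dtypes(include=["object"]).columns)
    if not object_columns:
        # Nothing to do: every column already carries a usable annotation.
        return X
    if not infer_objects:
        # Fail before any search starts, naming the offending columns.
        raise TypeError(
            f"Columns {object_columns} have dtype 'object'. Annotate them "
            "(e.g. as 'category') or pass infer_objects=True."
        )
    # Opt-in path: treat every object column as categorical.
    return X.astype({column: "category" for column in object_columns})
```

Keeping the default at False preserves the stance that the user knows the data best, while the explicit error addresses the complaint that the current failure is late and hard to read.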