Issue creating custom configuration for multi output regression #810

Closed
robertritz opened this issue Nov 28, 2018 · 2 comments

@robertritz

Could you provide a bit more help with the custom configuration dictionary? I'm attempting to set up a simple custom configuration using the SelectFromModel example you gave. Here is my current config:

import numpy as np

tpot_config = {
    'sklearn.multioutput.MultiOutputRegressor': {
        'estimator': {
            'sklearn.ensemble.ExtraTreesRegressor': {
                'n_estimators': [100],
                'max_features': np.arange(0.05, 1.01, 0.05)
            }
        }
    }
}

And here is my code to run TPOT:

from tpot import TPOTRegressor

pipeline_optimizer = TPOTRegressor(
    generations=5, population_size=20, max_time_mins=480, n_jobs=-1,
    verbosity=2, random_state=12345, config_dict=tpot_config,
)
pipeline_optimizer.fit(X_train, y_train)
print(pipeline_optimizer.score(X_test, y_test))
pipeline_optimizer.export('tpot_exported_pipeline.py')

I receive an error:
ValueError: Error: Input data is not in a valid format. Please confirm that the input data is scikit-learn compatible. For example, the features must be a 2-D array and target labels must be a 1-D array.

Is it necessary to specify the parameters to search for each algorithm? Before reading the documentation and your example, I naively passed in a list of algorithms, like so:

tpot_config = {
    'sklearn.multioutput.MultiOutputRegressor': {
        'estimator': ['ExtraTreesRegressor']
    }
}

There are sklearn algorithms that are inherently multioutput, but with MultiOutputRegressor I get many more options. Thanks!
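
For reference, the nested config above describes wrapping one ExtraTreesRegressor per target column inside a MultiOutputRegressor. A minimal sketch of that wrapper in plain scikit-learn, outside TPOT, is shown here. The synthetic data and the single `max_features` value are assumptions chosen for illustration, not part of the original report.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic data (an assumption for illustration): 60 samples, 4 features, 2 targets.
rng = np.random.RandomState(12345)
X = rng.rand(60, 4)
y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3]])

# MultiOutputRegressor clones the base estimator and fits one copy per target column.
model = MultiOutputRegressor(
    ExtraTreesRegressor(n_estimators=100, max_features=0.5, random_state=12345)
)
model.fit(X, y)

predictions = model.predict(X[:5])
print(predictions.shape)  # (5, 2): one column per target
```

Fitting this wrapper directly accepts a 2-D target, which helps separate config problems from input validation problems.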

@robertritz

Wow, I completely misinterpreted the error message. The issue is not with my config at all, but with the input data: I'm passing in a DataFrame instead of an array. This is what happens when you don't get enough sleep.
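
One way to see what is actually being handed to `fit` is to convert both inputs to arrays and look at their shapes. The sketch below assumes only NumPy; `summarize_inputs` is a hypothetical helper written for this example, and the small lists stand in for real data.

```python
import numpy as np

def summarize_inputs(features, target):
    """Convert both inputs to arrays and report their shapes (hypothetical helper)."""
    X = np.asarray(features)
    y = np.asarray(target)
    return {"X_shape": X.shape, "y_shape": y.shape, "y_is_1d": y.ndim == 1}

# Stand-in data (assumptions): 3 samples, 2 features.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
single_target = [1.0, 2.0, 3.0]
multi_target = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

print(summarize_inputs(features, single_target))
print(summarize_inputs(features, multi_target))
```

A multi-output target shows up as a 2-D `y_shape`, which is the case the error message says is not accepted.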

@aayux

aayux commented Dec 12, 2018

@robertritz how does passing an array resolve this error?

The error is raised in _check_dataset() when it catches an AssertionError or ValueError from sklearn's check_X_y() function (see linked).

For 2-dimensional y, the function would require an additional argument multi_output=True to work as expected. Or am I missing something here?

Here's a partial trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.5/dist-packages/tpot/base.py in _check_dataset(self, features, target, sample_weight)
   1069             if target is not None:
-> 1070                 X, y = check_X_y(features, target, accept_sparse=True, dtype=np.float64)
   1071                 return X, y

/usr/local/lib/python3.5/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    751     else:
--> 752         y = column_or_1d(y, warn=True)
    753         _assert_all_finite(y)

/usr/local/lib/python3.5/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
    787 
--> 788     raise ValueError("bad input shape {0}".format(shape))
    789 
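
The behaviour described above can be reproduced with scikit-learn alone. This is a minimal sketch, assuming a small made-up array in place of the real data: it calls `check_X_y` with the same arguments as the `_check_dataset()` line in the trace, then again with `multi_output=True`. The exact wording of the error message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.utils import check_X_y

# Made-up data (an assumption): 6 samples, 2 features, 2 target columns.
X = np.arange(12, dtype=np.float64).reshape(6, 2)
y_2d = np.arange(12, dtype=np.float64).reshape(6, 2)

# Same arguments as the call shown in the trace: a 2-D y is rejected.
try:
    check_X_y(X, y_2d, accept_sparse=True, dtype=np.float64)
    default_call_failed = False
except ValueError as err:
    default_call_failed = True
    print("default call rejected y:", err)

# With multi_output=True the 2-D target is validated and returned unchanged in shape.
X_checked, y_checked = check_X_y(
    X, y_2d, accept_sparse=True, dtype=np.float64, multi_output=True
)
print(y_checked.shape)  # (6, 2)
```

This only illustrates the validation step; whether TPOT's own check should pass `multi_output=True` is the open question raised in the comment.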
