Context of the issue

My dataset is very small: at most 380 samples and about 800 features. Naturally, this means the results are highly noisy and depend strongly on the CV splits as well as the random_state of the model. Through another issue I learned that fitted_pipeline_ is different from the best pipeline found during the search process, so it cannot be used either.
My idea therefore was to write a custom scorer that does not score on the training data but instead uses a fixed, separate validation set to compute the score. It also keeps track of the best pipeline and saves an independent copy of it in its trained state.
There are some issues, e.g., the scorer object presumably being copied when using n_jobs > 1, but since my case is an edge case that is okay; I can use n_jobs=1.
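For illustration, a minimal sketch of such a scorer, assuming accuracy as the metric and that TPOT accepts a callable with the scikit-learn scorer signature scorer(estimator, X, y) (class and variable names here are illustrative, not from TPOT). As noted above, the per-worker copies made when n_jobs > 1 would break the best-pipeline tracking, so this only works with n_jobs=1:

```python
import copy

from sklearn.metrics import accuracy_score


class FixedValidationScorer:
    """Scores every candidate pipeline on a fixed validation set and
    keeps an independent, fitted copy of the best one seen so far."""

    def __init__(self, X_val, y_val):
        self.X_val = X_val
        self.y_val = y_val
        self.best_score = float("-inf")
        self.best_pipeline = None

    def __call__(self, estimator, X, y):
        # Ignore the CV fold data (X, y) and score on the fixed validation set.
        score = accuracy_score(self.y_val, estimator.predict(self.X_val))
        if score > self.best_score:
            self.best_score = score
            # Deep-copy so the pipeline is saved in its trained state,
            # independent of whatever TPOT does with the estimator next.
            self.best_pipeline = copy.deepcopy(estimator)
        return score
```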
However, I would really like to deactivate cross-validation in my case, since I want the estimator to be trained on the full training data and then scored by the scorer on the full validation data. Using CV with explicit training/validation indices would no longer keep my validation set independent of the training process, which is what I want to avoid.
Is there a workaround to make this happen?
I think you could use custom validation set(s) via the cv parameter (similar to issue #767).
Please check the cv parameter in the TPOT API. You can merge the validation set with the training set for fitting in tpot_obj.fit(X, y) and then specify the train/test split via an iterable of index pairs (see the example for GridSearchCV), as sketched below.
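A minimal sketch of this workaround; the synthetic dataset, TPOTClassifier, and the hyperparameter values are placeholders for the example, and the essential part is the single explicit split passed as cv:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# Synthetic stand-in for the small, wide dataset described above.
X, y = make_classification(n_samples=380, n_features=800, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                  random_state=0)

# Merge the fixed validation set back onto the training set for fit().
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])

# A single explicit (train_indices, test_indices) pair: every candidate
# pipeline is fitted on the training rows only and scored on the
# validation rows only, so no k-fold cross-validation takes place.
train_idx = np.arange(len(X_train))
val_idx = np.arange(len(X_train), len(X_all))

tpot_obj = TPOTClassifier(generations=5, population_size=20,
                          cv=[(train_idx, val_idx)], random_state=42)
tpot_obj.fit(X_all, y_all)
```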