Custom CV strategy #380
Hi @drorhilman, looks like there are two issues:
thanks a lot!
@drorhilman we can do both options:
# below defines 3 folds for 6 example samples
folds_array = np.array([0,0,1,1,2,2])
automl = AutoML(validation_strategy={"validation_type": "custom", "folds": folds_array})
automl.fit(X, y)

For such custom validation, I will enable …
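A sketch of the grouped k-fold semantics implied by the folds array above; this keyword-based API was only a proposal in this thread, and the snippet below simply expands the array into train/validation index pairs for illustration:

import numpy as np

folds_array = np.array([0, 0, 1, 1, 2, 2])
for fold_id in np.unique(folds_array):
    # each unique fold id is held out once as the validation set
    validation_indices = np.where(folds_array == fold_id)[0]
    train_indices = np.where(folds_array != fold_id)[0]
    print(fold_id, train_indices, validation_indices)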
This sounds good. But I am not sure about the exact format. What about something like:

test_indexes = [np.array([0, 7, 6, 12, 45]), np.array([56, 71, 2, 9, 129]), ...]

where the training loop iterates over the list, using the indexes as test data and the rest as training data? So the reported RMSE score will be for this scheme.
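A minimal sketch of the loop described above, assuming plain numpy data; the synthetic X, y and LinearRegression are placeholders, not part of mljar-supervised:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.rand(150, 5)
y = rng.rand(150)
test_indexes = [np.array([0, 7, 6, 12, 45]), np.array([56, 71, 2, 9, 129])]

scores = []
for test_idx in test_indexes:
    # rows listed in test_idx are the test data, all remaining rows are training data
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    rmse = np.sqrt(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    scores.append(rmse)

print(np.mean(scores))  # the RMSE reported for this scheme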
But actually, your suggested strategy can work as well (it is pretty easy to convert between the two).
The …
@drorhilman one more thing, do you have a time dependency in your data such that you can't use future data for training? For example, you have data from 2015, 2016, 2017, 2018, and when you use 2016 as a test fold then you can use only 2015 data for training (you can't use 2017 and 2018 data). Maybe it would be good to also add an explicit definition of train and test fold combinations? Code proposition:

folds = [2015, 2015, ..., 2016, 2016, ...]
# (train, test) splits definition for folds
splits = [
([2015], [2016]),
([2015, 2016], [2017]),
([2015, 2016, 2017], [2018])
]
automl = AutoML(validation_strategy={"validation_type": "custom", "folds": folds, "splits": splits})
automl.fit(X, y)

If splits are not defined, the default split would be:

[
([2016, 2017, 2018], [2015]),
([2015, 2017, 2018], [2016]),
([2015, 2016, 2018], [2017]),
([2015, 2016, 2017], [2018])
]
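To make the proposal above concrete, a hypothetical expansion of the year-based splits into row indices, assuming a year column that marks each sample's fold:

import pandas as pd

df = pd.DataFrame({"year": [2015, 2015, 2016, 2016, 2017, 2017, 2018, 2018]})
splits = [
    ([2015], [2016]),
    ([2015, 2016], [2017]),
    ([2015, 2016, 2017], [2018]),
]
for train_years, test_years in splits:
    # only past years end up in the training set, the next year is the test fold
    train_idx = df.index[df["year"].isin(train_years)]
    test_idx = df.index[df["year"].isin(test_years)]
    print(train_years, "->", test_years, list(train_idx), list(test_idx))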
This sounds right.
@drorhilman I just found out that if it is implemented in the way you suggested (based on array indices), then it will be compatible with all sklearn cross-validation splitters. It will be similar to the cv argument in scikit-learn.
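For example (a sketch, assuming the index-based interface and the cv argument shown later in this thread), any scikit-learn splitter already produces such (train_indices, validation_indices) pairs:

import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.random.rand(20, 3)
# a shuffled k-fold scheme
cv = list(KFold(n_splits=4, shuffle=True, random_state=0).split(X))
# or an expanding-window scheme for time-ordered data
cv_time = list(TimeSeriesSplit(n_splits=4).split(X))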
Thank you very much @pplonski. When will it be added?
@drorhilman I hope that it will be today.
@drorhilman changes are in the dev branch. You can install them with:

pip install -q -U git+https://github.com/mljar/mljar-supervised.git@dev

The example use case:

import numpy as np
import pandas as pd
from supervised.automl import AutoML
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn import datasets
X, y = datasets.make_classification(
n_samples=100,
n_features=5,
n_informative=4,
n_redundant=1,
n_classes=2,
n_clusters_per_class=3,
n_repeated=0,
shuffle=False,
random_state=0,
)
X = pd.DataFrame(X)
y = pd.Series(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
folds = pd.Series(X_train.index % 4)
splits = [
([0], [1,2,3]),
([0,1], [2,3]),
([0,1,2], [3]),
]
# define train and validation indices
cv = []
for split in splits:
train_indices = X_train.index[folds.isin(split[0])]
validation_indices = X_train.index[folds.isin(split[1])]
cv += [(train_indices, validation_indices)]
automl = AutoML(
mode="Compete",
algorithms=["Xgboost"],
eval_metric="accuracy",
start_random_models=1,
validation_strategy={
"validation_type": "custom"
}
)
automl.fit(X_train, y_train, cv=cv)

There is an additional cv argument in the fit() method. For the custom validation, I switched off the stacking and boost-on-errors steps. They should be enabled by the user only. @drorhilman looking forward to your feedback, I hope it will work for you!
Thanks. I will try this on Monday.
Thank you for adding this! I'll give it a shot tomorrow as well.
I wanted to update you that it seems to work OK for me. Thanks!
@drorhilman thank you for your feedback! Closing the issue.
@pplonski I think the folds should be defined as follows (see the sketch below).
The previous folds create a repeating sequence [0, 1, 2, 3, 0, 1, 2, 3, ...] instead of a split into quarters.
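Presumably something along the lines of the expression used in the next comment, i.e. contiguous blocks of rows rather than a repeating pattern:

num_splits = 4
# integer-divide the row index so consecutive rows share a fold (quarter-sized blocks)
folds = pd.Series(X_train.index // (len(X_train) / num_splits))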
@pplonski Also, I have an issue with this code:

X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
num_splits = 3
folds = pd.Series(X_train.index // (len(X_train) / num_splits))
splits = [
([0], [1,2]),
([0,1], [2]),
]
cv = []
for split in splits:
train_indices = X_train.index[folds.isin(split[0])]
validation_indices = X_train.index[folds.isin(split[1])]
cv += [(train_indices, validation_indices)]

This outputs an error, while I checked and the data should work. CV splits:

[(Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64'), …
Hi @wmotkowska-inpost! Thanks for reporting the issue. Could you please provide the full code to reproduce this issue? Are you using sample weights in the training? Are there no models trained at all?
I recreated it on your data with the custom split and model as in my implementation.

import numpy as np
import pandas as pd
from supervised.automl import AutoML
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn import datasets
X, y = datasets.make_classification(
n_samples=100,
n_features=5,
n_informative=4,
n_redundant=1,
n_classes=2,
n_clusters_per_class=3,
n_repeated=0,
shuffle=False,
random_state=0,
)
X = pd.DataFrame(X)
y = pd.Series(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
num_splits = 3
folds = pd.Series(X_train.index // (len(X_train) / num_splits))
splits = [
([0], [1, 2]),
([0, 1], [2]),
]
cv = []
for split in splits:
train_indices = X_train.index[folds.isin(split[0])]
validation_indices = X_train.index[folds.isin(split[1])]
cv += [(train_indices, validation_indices)]
automl = AutoML(
random_state=123,
algorithms=[
"Baseline",
"Linear",
"Decision Tree",
"Random Forest",
"Extra Trees",
"Xgboost",
"LightGBM",
"CatBoost",
"Nearest Neighbors",
],
ml_task="regression",
start_random_models=3,
golden_features=False,
features_selection=True,
hill_climbing_steps=3,
top_models_to_improve=3,
train_ensemble=True,
explain_level=1,
validation_strategy={"validation_type": "custom"},
boost_on_errors=True,
verbose=0,
)
automl.fit(
X_train,
y_train,
cv=cv
)
predictions = automl.predict(X_test)

The predictions look like this:
It has produced the ensemble model and predictions, and the run has not stopped. It is just confusing that I get this error info despite the fact that it seems to work. This is the info under the deployment:
Thank you @wmotkowska-inpost, I can reproduce the issue. The problem is with …
@wmotkowska-inpost I released version …
Thanks for reporting the issue. Do you still have the issue after updating scikit-learn to the latest version?
scikit-learn = "1.5.0"
The same?
Are there any other dependencies I should check?
No, thanks. It looks like there is an issue on our side. The error message means that feature importance can't be produced for kNN models. We will fix it. Are you able to perform the analysis and build models despite this error? Do you get good results? May I ask what your use case is?
Thanks :) Yes, it seems that I get the full set of models and an ensemble. I also get the predictions from kNN, but no permutation importance output for this model. I can work with the current state. :) I use the AutoML for stacking.
I have trained an AutoML model, where the ensemble seems to work well with the test set.
Thus, I want to try my own CV scheme in a 'leave one year out' way (removing a year, training on the other years, and testing on the selected year).
For this, I need to be able to re-train the ensemble again, like a scikit-learn pipeline.
How can I retrain the ensemble itself?
The '.fit' function does not seem to work like the sklearn estimator convention (taking a numpy array as input).
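Given the cv argument to AutoML.fit() that was added later in this thread, a sketch of how the 'leave one year out' scheme could be expressed; the synthetic X, y and years array below are stand-ins for the real data:

import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut
from supervised.automl import AutoML

# synthetic stand-in data; in practice X, y and years come from the real dataset
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 5))
y = pd.Series(rng.randint(0, 2, size=200))
years = np.repeat([2015, 2016, 2017, 2018], 50)

# each year is held out once as the validation fold, the rest is training data
cv = list(LeaveOneGroupOut().split(X, y, groups=years))

automl = AutoML(validation_strategy={"validation_type": "custom"})
automl.fit(X, y, cv=cv)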