
Custom CV strategy #380

Closed
drorhilman opened this issue Apr 19, 2021 · 29 comments

Comments

@drorhilman

I have trained an AutoML model, and the ensemble seems to work well on the test set.
Now I want to try my own CV scheme in a 'leave one year out' way (holding out one year, training on the other years, and testing on the held-out year).

For this, I need to be able to re-train the ensemble like a scikit-learn pipeline.
How can I retrain the ensemble itself?
The '.fit' function does not seem to follow the sklearn estimator convention (taking a numpy array as input).

@pplonski
Contributor

Hi @drorhilman,

Looks like there are two issues:

  1. Would you like to keep the AutoML parameters unchanged and train on other data? Or do you want to keep only the ensemble unchanged? Right now there is no option to keep the AutoML parameters and just retrain all models on other data.
  2. AutoML's fit() should work with numpy data; this might be a bug.

@drorhilman
Author


Thanks a lot!
I guess I would like to do one of the following:

  1. Either get the structure of the final ensemble, including the hyperparameters and preprocessing, and somehow wrap it as a scikit-learn pipeline where I can fit/predict in a specific way.
  2. Be able to run a custom CV scheme with the entire automl.fit workflow (where I provide the train/test splits), so the final RMSE reflects the custom CV; see the sketch after this list.
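
For reference, the 'leave one year out' scheme can be written as a list of (train, test) index pairs; a minimal sketch, assuming a hypothetical per-row year label:

import pandas as pd

# hypothetical year labels, one per row of the training data
years = pd.Series([2015, 2015, 2016, 2016, 2017, 2017])

# one (train, test) index pair per held-out year
splits = [
    (years.index[years != yr], years.index[years == yr])
    for yr in sorted(years.unique())
]
for train_idx, test_idx in splits:
    print("train:", list(train_idx), "test:", list(test_idx))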

@pplonski
Contributor

@drorhilman we can do both options:

  1. All details about the models and preprocessing are saved in the model directory, in the framework.json and README.md files, but it might be a little inconvenient to recreate the models manually.
  2. We can add an option to run custom CV. The user will define a vector with numbers indicating the fold. Proposed code:
# below defines 3 folds for 6 example samples 
folds_array = np.array([0,0,1,1,2,2]) 
automl = AutoML(validation_strategy={"validation_type": "custom", "folds": folds_array})
automl.fit(X, y)

For such custom validation, I will enable the Ensemble by default but disable model stacking. It will be the user's responsibility to enable stacking when it is applicable (i.e. there is no leakage between the custom folds). @drorhilman does the second option work for you?

@drorhilman
Author

This sounds good, but I am not sure what the folds_array stands for: can I provide a list of index arrays to be used as test sets?
Something like:

test_indexes = [np.array([0,7,6,12,45]),  np.array([56,71,2,9,129]) ... ]  

The training loop would iterate over the list, using the indexes as test data and the rest as training data, so the reported RMSE score reflects this scheme.

@drorhilman
Author

But actually, your suggested strategy can work as well; it is pretty easy to convert between the two (see the sketch below).
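
A minimal sketch of that conversion, with hypothetical names, turning a list of test-index arrays into a single fold vector:

import numpy as np

# hypothetical example: each test-index array becomes one fold id
test_indexes = [np.array([0, 3]), np.array([1, 4]), np.array([2, 5])]
n_samples = 6

folds_array = np.empty(n_samples, dtype=int)
for fold_id, test_idx in enumerate(test_indexes):
    folds_array[test_idx] = fold_id

print(folds_array)  # [0 1 2 0 1 2]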

@pplonski
Contributor

The folds_array will be a vector indicating the folds. It doesn't need to start from 0. You mentioned that you would like to use the year as the fold indicator; if you have a year column in your data, you will be able to pass that column as folds_array.

@pplonski
Contributor

@drorhilman one more thing: do you have a time dependency in your data, such that you can't use future data for training? For example, if you have data from 2015, 2016, 2017, 2018 and you use 2016 as the test fold, then you can only use 2015 data for training (you can't use 2017 and 2018 data). Maybe it would be good to also add an explicit definition of the train/test fold combinations?

Proposed code:

folds = [2015, 2015, ..., 2016, 2016, ...]
# (train, test) splits definition for folds
splits = [
    ([2015], [2016]),
    ([2015, 2016], [2017]),
    ([2015, 2016, 2017], [2018])
]
automl = AutoML(validation_strategy={"validation_type": "custom", "folds": folds, "splits": splits})
automl.fit(X, y)

If splits is not defined, then each fold from folds is used in turn for testing, with all other folds used for training.

Default splits if splits is undefined:

 [
    ([2016, 2017, 2018], [2015]),
    ([2015, 2017, 2018], [2016]),
    ([2015, 2016, 2018], [2017]),
    ([2015, 2016, 2017], [2018])
]
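
For illustration, the default splits could be derived from the folds vector roughly like this (a sketch, not the library's code):

# derive the default (train, test) splits: each fold is tested once, the rest train
folds = [2015, 2015, 2016, 2016, 2017, 2017, 2018, 2018]
unique_folds = sorted(set(folds))
default_splits = [
    ([f for f in unique_folds if f != test_fold], [test_fold])
    for test_fold in unique_folds
]
print(default_splits)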

@drorhilman
Author

This sounds right.

@pplonski pplonski changed the title How can evaluate the trained ensamble myself in a specialized CV strategy Custom CV strategy Apr 19, 2021
@pplonski pplonski self-assigned this Apr 19, 2021
@pplonski pplonski added the enhancement New feature or request label Apr 19, 2021
@pplonski pplonski added this to the 0.10.4 milestone Apr 19, 2021
@pplonski
Contributor

@drorhilman I just realized that if it is implemented in the way you suggested (based on array indices), it will be compatible with all sklearn model_selection validation strategies: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection

It will be similar to the cv parameter in GridSearchCV (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), but it will only accept a list of (train, test) splits given as arrays of indices.
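
A minimal sketch of how any scikit-learn splitter could produce such a list of (train, test) index arrays (TimeSeriesSplit is used here only as an example):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# any sklearn splitter yields (train, test) index arrays matching this interface
X = np.random.rand(20, 3)
y = np.random.randint(0, 2, size=20)

cv = list(TimeSeriesSplit(n_splits=4).split(X, y))
for train_idx, test_idx in cv:
    print(len(train_idx), "train /", len(test_idx), "test")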

@drorhilman
Author

Thank you very much @pplonski. When will it be added?

@pplonski
Contributor

@drorhilman I hope that it will be today.

pplonski added a commit that referenced this issue Apr 23, 2021
@pplonski
Contributor

pplonski commented Apr 23, 2021

@drorhilman the changes are in the dev branch. To install the package from the dev branch:

pip install -q -U git+https://github.com/mljar/mljar-supervised.git@dev

The example use case:

import numpy as np
import pandas as pd
from supervised.automl import AutoML
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

from sklearn import datasets

X, y = datasets.make_classification(
    n_samples=100,
    n_features=5,
    n_informative=4,
    n_redundant=1,
    n_classes=2,
    n_clusters_per_class=3,
    n_repeated=0,
    shuffle=False,
    random_state=0,
)

X = pd.DataFrame(X)
y = pd.Series(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)

folds = pd.Series(X_train.index % 4)

splits = [
    ([0], [1,2,3]),
    ([0,1], [2,3]),
    ([0,1,2], [3]),
]

# define train and validation indices
cv = []
for split in splits:
    train_indices = X_train.index[folds.isin(split[0])]
    validation_indices = X_train.index[folds.isin(split[1])]
    cv += [(train_indices, validation_indices)]


automl = AutoML(
    mode="Compete",
    algorithms=["Xgboost"],
    eval_metric="accuracy",
    start_random_models=1,
    validation_strategy={"validation_type": "custom"},
)
automl.fit(X_train, y_train, cv=cv)

There is an additional cv argument in fit(). If validation is set to custom (validation_strategy={"validation_type": "custom"}), then the cv parameter is used for validation. The cv should be a list of tuples; each tuple defines the train and validation indices.

For the custom validation, I switched off the stacking and boost-on-errors steps. They should be enabled explicitly by the user.

@drorhilman I'm looking forward to your feedback, I hope it will work for you!
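
As a usage sketch, the leave-one-year-out scheme that started this issue could be expressed with the cv argument roughly as follows, assuming synthetic data and a hypothetical per-row year label:

import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut
from supervised.automl import AutoML

# synthetic regression data with an assumed "year" label used as the group
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(40, 3), columns=["f1", "f2", "f3"])
y = pd.Series(rng.rand(40))
years = pd.Series(np.repeat([2015, 2016, 2017, 2018], 10))

# one (train, validation) index pair per held-out year
cv = list(LeaveOneGroupOut().split(X, y, groups=years))

automl = AutoML(
    algorithms=["Xgboost"],
    validation_strategy={"validation_type": "custom"},
)
automl.fit(X, y, cv=cv)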

@drorhilman
Author

Thanks. I will try this on Monday.

@DevDavey

Thank you for adding this! I'll give it a shot tomorrow as well.

@drorhilman
Author

I wanted to update you that it seems to work OK for me. Thanks!

@pplonski
Contributor

pplonski commented May 4, 2021

@drorhilman thank you for your feedback! Closing the issue.

@pplonski pplonski closed this as completed May 4, 2021
@wmotkowska-inpost

wmotkowska-inpost commented May 29, 2024

@pplonski I think the folds should be defined as follows:

folds = pd.Series(X_train.index // (len(X_train) / num_splits))

The previous folds definition creates a repeating sequence [0,1,2,3,0,1,2,3,...] instead of splitting the data into contiguous quarters.
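
A tiny comparison of the two definitions, for 8 rows and 4 splits (illustration only):

import pandas as pd

index = pd.RangeIndex(8)
num_splits = 4

print(list(index % num_splits))                  # [0, 1, 2, 3, 0, 1, 2, 3] -> round-robin folds
print(list(index // (len(index) / num_splits)))  # [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0] -> contiguous blocks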

@wmotkowska-inpost

wmotkowska-inpost commented May 29, 2024

@pplonski I also have an issue with this code:

X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)

num_splits = 3
folds = pd.Series(X_train.index // (len(X_train) / num_splits))

splits = [
    ([0], [1, 2]),
    ([0, 1], [2]),
]

cv = []
for split in splits:
    train_indices = X_train.index[folds.isin(split[0])]
    validation_indices = X_train.index[folds.isin(split[1])]
    cv += [(train_indices, validation_indices)]

This outputs an error:
ERROR Problem with custom validation. positional indexers are out-of-bounds
IndexError: index 18 is out of bounds for axis 0 with size 18
validation_data["sample_weight"] = sample_weight.iloc[validation_index]
in https://github.com/mljar/mljar-supervised/blob/master/supervised/validation/validator_custom.py

I checked, and the data should work:

CV splits: [(Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64'),
Int64Index([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], dtype='int64')),
(Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], dtype='int64'),
Int64Index([19, 20, 21, 22, 23, 24, 25, 26, 27], dtype='int64'))]
X_train: RangeIndex(start=0, stop=28, step=1)

@pplonski
Contributor

Hi @wmotkowska-inpost!

Thanks for reporting the issue. Could you please provide the full code to reproduce it? Are you using sample weights in training? Are no models trained at all?

@pplonski pplonski reopened this May 29, 2024
@wmotkowska-inpost

wmotkowska-inpost commented May 29, 2024

I recreated it on your data, with the custom split and model setup from my implementation.

import numpy as np
import pandas as pd
from supervised.automl import AutoML
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

from sklearn import datasets

X, y = datasets.make_classification(
    n_samples=100,
    n_features=5,
    n_informative=4,
    n_redundant=1,
    n_classes=2,
    n_clusters_per_class=3,
    n_repeated=0,
    shuffle=False,
    random_state=0,
)

X = pd.DataFrame(X)
y = pd.Series(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)

num_splits = 3
folds = pd.Series(X_train.index // (len(X_train) / num_splits))
splits = [
    ([0], [1, 2]),
    ([0, 1], [2]),
]
cv = []
for split in splits:
    train_indices = X_train.index[folds.isin(split[0])]
    validation_indices = X_train.index[folds.isin(split[1])]
    cv += [(train_indices, validation_indices)]

automl = AutoML(
    random_state=123,
    algorithms=[
        "Baseline",
        "Linear",
        "Decision Tree",
        "Random Forest",
        "Extra Trees",
        "Xgboost",
        "LightGBM",
        "CatBoost",
        "Nearest Neighbors",
    ],
    ml_task="regression",
    start_random_models=3,
    golden_features=False,
    features_selection=True,
    hill_climbing_steps=3,
    top_models_to_improve=3,
    train_ensemble=True,
    explain_level=1,
    validation_strategy={"validation_type": "custom"},
    boost_on_errors=True,
    verbose=0,
)

automl.fit(X_train, y_train, cv=cv)

predictions = automl.predict(X_test)

The predictions look like this:

array([0.59385095, 0.32481747, 0.08021932, 0.86107466, 0.80197426,
       0.54666968, 0.56548217, 0.27991819, 0.64660518, 0.86241633,
       0.54489969, 0.60784996, 1.00091712, 0.48493497, 0.34176998,
       0.66428256, 0.39565906, 0.30089507, 0.25014441, 0.45435476])

It produced the ensemble model and predictions, and the run did not stop.

It is just confusing that I get this error message even though everything seems to work.

This is the info from the run:

Custom validation strategy
Split 0.
Train 27 samples.
Validation 53 samples.
Split 1.
Train 54 samples.
Validation 26 samples.
Drop features ['random_feature']
2024-05-29 16:24:02,437 supervised.exceptions ERROR Problem with custom validation. positional indexers are out-of-bounds

Traceback (most recent call last):
  File "...\.venv\lib\site-packages\pandas\core\indexing.py", line 1587, in _get_list_axis
    return self.obj._take_with_is_copy(key, axis=axis)
  File "c...\.venv\lib\site-packages\pandas\core\series.py", line 945, in _take_with_is_copy
    return self.take(indices=indices, axis=axis)
  File "c...\.venv\lib\site-packages\pandas\core\series.py", line 930, in take
    new_index = self.index.take(indices)
  File "...\.venv\lib\site-packages\pandas\core\indexes\base.py", line 1183, in take
    taken = algos.take(
  File "...\.venv\lib\site-packages\pandas\core\algorithms.py", line 1577, in take
    result = arr.take(indices, axis=axis)
IndexError: index 53 is out of bounds for axis 0 with size 53

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "...\.venv\lib\site-packages\supervised\validation\validator_custom.py", line 101, in get_split
    validation_data["sample_weight"] = sample_weight.iloc[validation_index]
  File "...\.venv\lib\site-packages\pandas\core\indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "...\.venv\lib\site-packages\pandas\core\indexing.py", line 1616, in _getitem_axis
    return self._get_list_axis(key, axis=axis)
  File "...\.venv\lib\site-packages\pandas\core\indexing.py", line 1590, in _get_list_axis
    raise IndexError("positional indexers are out-of-bounds") from err
IndexError: positional indexers are out-of-bounds

@pplonski
Contributor

pplonski commented Jun 3, 2024

Thank you @wmotkowska-inpost, I can reproduce the issue. The problem is with the boost-on-errors step. It is a step in which we manipulate the sample weight of each row based on its out-of-fold prediction error. AutoML was still working because only the boost-on-errors models were not trained. I'm working on a fix.
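
Roughly, the idea is that rows with larger out-of-fold prediction errors receive larger sample weights in a later training step; a minimal sketch of that idea (not mljar-supervised's actual implementation):

import numpy as np

# illustration only: boost sample weights for rows with large out-of-fold errors
y_true = np.array([1.0, 2.0, 3.0, 4.0])
oof_pred = np.array([1.1, 2.5, 2.0, 4.0])

errors = np.abs(y_true - oof_pred)
sample_weight = 1.0 + errors / (errors.mean() + 1e-12)
print(sample_weight)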

@pplonski
Contributor

pplonski commented Jun 3, 2024

@wmotkowska-inpost I released version 1.1.9 on PyPI with the fix. Thank you for reporting the issue.

@pplonski pplonski closed this as completed Jun 3, 2024
@wmotkowska-inpost

Hi @pplonski, thank you for finding the source of the issue. I have changed the version from 1.1.4 to 1.1.9 and now get this message: 'KNeighborsRegressorAlgorithm' object has no attribute 'classes_'
Problem during computing permutation importance. Skipping ...
#669

@pplonski
Contributor

pplonski commented Jun 4, 2024

Hi @wmotkowska-inpost,

Thanks for reporting the issue. Do you still see it after updating scikit-learn to the latest version?

@wmotkowska-inpost

scikit-learn = "1.5.0"
sktime = "^0.30.0"

@pplonski
Contributor

pplonski commented Jun 4, 2024

Do you get the same error with those versions?

@wmotkowska-inpost

Are there any other dependencies I should check?

@pplonski
Contributor

pplonski commented Jun 4, 2024

No, thanks; it looks like there is an issue on our side. The error message means that feature importance can't be computed for kNN models. We will fix it.

Are you able to perform the analysis and build models despite this error? Do you get good results? May I ask what your use case is?

@wmotkowska-inpost

Thanks :)

Yes, it seems that I get the full set of models and an ensemble. I also get the predictions from kNN, but no permutation importance output for this model. I can work with the current state. :) I use the AutoML for stacking.
