
Custom CV strategy #380

Closed
drorhilman opened this issue Apr 19, 2021 · 29 comments

Comments

@drorhilman

I have trained an AutoML model, and the ensemble seems to work well on the test set.
Now I want to try my own CV scheme in a 'leave one year out' way (holding out one year, training on the other years, and testing on the held-out year).

For this, I need to be able to re-train the ensemble like a scikit-learn pipeline.
How can I retrain the ensemble itself?
The '.fit' function does not seem to follow the sklearn estimator convention (taking a numpy array as input).

@pplonski
Contributor

Hi @drorhilman,

Looks like there are two issues:

  1. Would you like to keep the AutoML parameters unchanged and train on other data? Or do you want to keep only the ensemble unchanged? Right now there is no option to keep the AutoML parameters and just retrain all models on other data.
  2. AutoML's fit() should work with numpy data; this might be a bug.

@drorhilman
Author


Thanks a lot!
I guess I would like to do one of the following:

  1. Either get the structure of the final ensemble, including the hyperparameters and preprocessing, and somehow wrap it as a scikit-learn pipeline where I can fit/predict in a specific way.
  2. Be able to run a custom CV scheme with the entire automl.fit workflow (where I provide the train/test splits), so the final RMSE reflects the custom CV; see the sketch after this list.
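
For reference, the 'leave one year out' scheme can be written as a list of (train, test) index pairs; a minimal sketch, assuming a hypothetical per-row year label:

import pandas as pd

# hypothetical year labels, one per row of the training data
years = pd.Series([2015, 2015, 2016, 2016, 2017, 2017])

# one (train, test) index pair per held-out year
splits = [
    (years.index[years != yr], years.index[years == yr])
    for yr in sorted(years.unique())
]
for train_idx, test_idx in splits:
    print("train:", list(train_idx), "test:", list(test_idx))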

@pplonski
Contributor

@drorhilman we can do both options:

  1. All details about the models and preprocessing are saved in the model directory, in the framework.json and README.md files, but it might be a little inconvenient to recreate the models manually.
  2. We can add an option to run custom CV. The user will define a vector with numbers indicating the fold. Proposed code:
# below defines 3 folds for 6 example samples 
folds_array = np.array([0,0,1,1,2,2]) 
automl = AutoML(validation_strategy={"validation_type": "custom", "folds": folds_array})
automl.fit(X, y)

For such custom validation, I will enable the Ensemble by default but disable model stacking. It will be the user's responsibility to enable stacking when it is applicable (i.e. there is no leakage between the custom folds). @drorhilman does the second option work for you?

@drorhilman
Author

This sounds good, but I am not sure what the folds_array stands for: can I provide a list of index arrays to be used as test sets?
Something like:

test_indexes = [np.array([0,7,6,12,45]),  np.array([56,71,2,9,129]) ... ]  

The training loop would iterate over the list, using the indexes as test data and the rest as training data, so the reported RMSE score reflects this scheme.

@drorhilman
Author

But actually, your suggested strategy can work as well; it is pretty easy to convert between the two (see the sketch below).
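
A minimal sketch of that conversion, with hypothetical names, turning a list of test-index arrays into a single fold vector:

import numpy as np

# hypothetical example: each test-index array becomes one fold id
test_indexes = [np.array([0, 3]), np.array([1, 4]), np.array([2, 5])]
n_samples = 6

folds_array = np.empty(n_samples, dtype=int)
for fold_id, test_idx in enumerate(test_indexes):
    folds_array[test_idx] = fold_id

print(folds_array)  # [0 1 2 0 1 2]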

@pplonski
Contributor

The folds_array will be a vector indicating the folds. It doesn't need to start from 0. You mentioned that you would like to use the year as the fold indicator; if you have a year column in your data, you will be able to pass that column as folds_array.

@pplonski
Contributor

@drorhilman one more thing: do you have a time dependency in your data, such that you can't use future data for training? For example, if you have data from 2015, 2016, 2017, 2018 and you use 2016 as the test fold, then you can only use 2015 data for training (you can't use 2017 and 2018 data). Maybe it would be good to also add an explicit definition of the train/test fold combinations?

Proposed code:

folds = [2015, 2015, ..., 2016, 2016, ...]
# (train, test) splits definition for folds
splits = [
    ([2015], [2016]),
    ([2015, 2016], [2017]),
    ([2015, 2016, 2017], [2018])
]
automl = AutoML(validation_strategy={"validation_type": "custom", "folds": folds, "splits": splits})
automl.fit(X, y)

If splits is not defined, then each fold from folds is used in turn for testing, with all other folds used for training.

Default splits if splits is undefined:

 [
    ([2016, 2017, 2018], [2015]),
    ([2015, 2017, 2018], [2016]),
    ([2015, 2016, 2018], [2017]),
    ([2015, 2016, 2017], [2018])
]
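
For illustration, the default splits could be derived from the folds vector roughly like this (a sketch, not the library's code):

# derive the default (train, test) splits: each fold is tested once, the rest train
folds = [2015, 2015, 2016, 2016, 2017, 2017, 2018, 2018]
unique_folds = sorted(set(folds))
default_splits = [
    ([f for f in unique_folds if f != test_fold], [test_fold])
    for test_fold in unique_folds
]
print(default_splits)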

@drorhilman
Author

This sounds right.

@pplonski pplonski changed the title How can evaluate the trained ensamble myself in a specialized CV strategy Custom CV strategy Apr 19, 2021
@pplonski pplonski self-assigned this Apr 19, 2021
@pplonski pplonski added the enhancement New feature or request label Apr 19, 2021
@pplonski pplonski added this to the 0.10.4 milestone Apr 19, 2021
@pplonski
Contributor

@drorhilman I just realized that if it is implemented in the way you suggested (based on array indices), it will be compatible with all sklearn model_selection validation strategies: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection

It will be similar to the cv parameter in GridSearchCV (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), but it will only accept a list of (train, test) splits given as arrays of indices.
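
A minimal sketch of how any scikit-learn splitter could produce such a list of (train, test) index arrays (TimeSeriesSplit is used here only as an example):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# any sklearn splitter yields (train, test) index arrays matching this interface
X = np.random.rand(20, 3)
y = np.random.randint(0, 2, size=20)

cv = list(TimeSeriesSplit(n_splits=4).split(X, y))
for train_idx, test_idx in cv:
    print(len(train_idx), "train /", len(test_idx), "test")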

@drorhilman
Author

Thank you very much @pplonski. When will it be added?

@pplonski
Contributor

@drorhilman I hope that it will be today.

pplonski added a commit that referenced this issue Apr 23, 2021
@pplonski
Contributor

pplonski commented Apr 23, 2021

@drorhilman the changes are in the dev branch. To install the package from the dev branch:

pip install -q -U git+https://github.com/mljar/mljar-supervised.git@dev

The example use case:

import numpy as np
import pandas as pd
from supervised.automl import AutoML
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

from sklearn import datasets

X, y = datasets.make_classification(
    n_samples=100,
    n_features=5,
    n_informative=4,
    n_redundant=1,
    n_classes=2,
    n_clusters_per_class=3,
    n_repeated=0,
    shuffle=False,
    random_state=0,
)

X = pd.DataFrame(X)
y = pd.Series(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)

folds = pd.Series(X_train.index % 4)

splits = [
    ([0], [1,2,3]),
    ([0,1], [2,3]),
    ([0,1,2], [3]),
]

# define train and validation indices
cv = []
for split in splits:
    train_indices = X_train.index[folds.isin(split[0])]
    validation_indices = X_train.index[folds.isin(split[1])]
    cv += [(train_indices, validation_indices)]


automl = AutoML(
    mode="Compete",
    algorithms=["Xgboost"],
    eval_metric="accuracy",
    start_random_models=1,
    validation_strategy={"validation_type": "custom"},
)
automl.fit(X_train, y_train, cv=cv)

There is an additional cv argument in fit(). If validation is set to custom (validation_strategy={"validation_type": "custom"}), then the cv parameter is used for validation. The cv should be a list of tuples; each tuple defines the train and validation indices.

For the custom validation, I switched off the stacking and boost-on-errors steps. They should be enabled explicitly by the user.

@drorhilman I'm looking forward to your feedback, I hope it will work for you!
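
As a usage sketch, the leave-one-year-out scheme that started this issue could be expressed with the cv argument roughly as follows, assuming synthetic data and a hypothetical per-row year label:

import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut
from supervised.automl import AutoML

# synthetic regression data with an assumed "year" label used as the group
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(40, 3), columns=["f1", "f2", "f3"])
y = pd.Series(rng.rand(40))
years = pd.Series(np.repeat([2015, 2016, 2017, 2018], 10))

# one (train, validation) index pair per held-out year
cv = list(LeaveOneGroupOut().split(X, y, groups=years))

automl = AutoML(
    algorithms=["Xgboost"],
    validation_strategy={"validation_type": "custom"},
)
automl.fit(X, y, cv=cv)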

@drorhilman
Author

Thanks. I will try this on Monday.

@DevDavey

Thank you for adding this! I'll give it a shot tomorrow as well.

@drorhilman
Author

I wanted to update you that it seems to work OK for me. Thanks!

@pplonski
Contributor

pplonski commented May 4, 2021

@drorhilman thank you for your feedback! Closing the issue.

@pplonski pplonski closed this as completed May 4, 2021
@wmotkowska-inpost

wmotkowska-inpost commented May 29, 2024

@pplonski I think the folds should be defined as follows:

folds = pd.Series(X_train.index // (len(X_train) / num_splits))

The previous folds definition creates a repeating sequence [0,1,2,3,0,1,2,3,...] instead of splitting the data into contiguous quarters.
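
A tiny comparison of the two definitions, for 8 rows and 4 splits (illustration only):

import pandas as pd

index = pd.RangeIndex(8)
num_splits = 4

print(list(index % num_splits))                  # [0, 1, 2, 3, 0, 1, 2, 3] -> round-robin folds
print(list(index // (len(index) / num_splits)))  # [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0] -> contiguous blocks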

@wmotkowska-inpost

wmotkowska-inpost commented May 29, 2024

@pplonski I also have an issue with this code:

X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)

num_splits = 3
folds = pd.Series(X_train.index // (len(X_train) / num_splits))

splits = [
    ([0], [1, 2]),
    ([0, 1], [2]),
]

cv = []
for split in splits:
    train_indices = X_train.index[folds.isin(split[0])]
    validation_indices = X_train.index[folds.isin(split[1])]
    cv += [(train_indices, validation_indices)]

This outputs an error:
ERROR Problem with custom validation. positional indexers are out-of-bounds
IndexError: index 18 is out of bounds for axis 0 with size 18
validation_data["sample_weight"] = sample_weight.iloc[validation_index]
in https://github.com/mljar/mljar-supervised/blob/master/supervised/validation/validator_custom.py

I checked, and the data should work:

CV splits: [(Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64'),
Int64Index([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], dtype='int64')),
(Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], dtype='int64'),
Int64Index([19, 20, 21, 22, 23, 24, 25, 26, 27], dtype='int64'))]
X_train: RangeIndex(start=0, stop=28, step=1)

@pplonski
Contributor

Hi @wmotkowska-inpost!

Thanks for reporting the issue. Could you please provide the full code to reproduce it? Are you using sample weights in training? Are no models trained at all?

@pplonski pplonski reopened this May 29, 2024
@wmotkowska-inpost

wmotkowska-inpost commented May 29, 2024

I recreated it on your data, with the custom split and model setup from my implementation.

import numpy as np
import pandas as pd
from supervised.automl import AutoML
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

from sklearn import datasets

X, y = datasets.make_classification(
    n_samples=100,
    n_features=5,
    n_informative=4,
    n_redundant=1,
    n_classes=2,
    n_clusters_per_class=3,
    n_repeated=0,
    shuffle=False,
    random_state=0,
)

X = pd.DataFrame(X)
y = pd.Series(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)

num_splits = 3
folds = pd.Series(X_train.index // (len(X_train) / num_splits))
splits = [
    ([0], [1, 2]),
    ([0, 1], [2]),
]
cv = []
for split in splits:
    train_indices = X_train.index[folds.isin(split[0])]
    validation_indices = X_train.index[folds.isin(split[1])]
    cv += [(train_indices, validation_indices)]

automl = AutoML(
    random_state=123,
    algorithms=[
        "Baseline",
        "Linear",
        "Decision Tree",
        "Random Forest",
        "Extra Trees",
        "Xgboost",
        "LightGBM",
        "CatBoost",
        "Nearest Neighbors",
    ],
    ml_task="regression",
    start_random_models=3,
    golden_features=False,
    features_selection=True,
    hill_climbing_steps=3,
    top_models_to_improve=3,
    train_ensemble=True,
    explain_level=1,
    validation_strategy={"validation_type": "custom"},
    boost_on_errors=True,
    verbose=0,
)

automl.fit(X_train, y_train, cv=cv)

predictions = automl.predict(X_test)

The predictions look like this:

array([0.59385095, 0.32481747, 0.08021932, 0.86107466, 0.80197426,
       0.54666968, 0.56548217, 0.27991819, 0.64660518, 0.86241633,
       0.54489969, 0.60784996, 1.00091712, 0.48493497, 0.34176998,
       0.66428256, 0.39565906, 0.30089507, 0.25014441, 0.45435476])

It produced the ensemble model and predictions, and the run did not stop.

It is just confusing that I get this error message even though everything seems to work.

This is the info from the run:

Custom validation strategy
Split 0.
Train 27 samples.
Validation 53 samples.
Split 1.
Train 54 samples.
Validation 26 samples.
Drop features ['random_feature']
2024-05-29 16:24:02,437 supervised.exceptions ERROR Problem with custom validation. positional indexers are out-of-bounds

Traceback (most recent call last):
  File "...\.venv\lib\site-packages\pandas\core\indexing.py", line 1587, in _get_list_axis
    return self.obj._take_with_is_copy(key, axis=axis)
  File "c...\.venv\lib\site-packages\pandas\core\series.py", line 945, in _take_with_is_copy
    return self.take(indices=indices, axis=axis)
  File "c...\.venv\lib\site-packages\pandas\core\series.py", line 930, in take
    new_index = self.index.take(indices)
  File "...\.venv\lib\site-packages\pandas\core\indexes\base.py", line 1183, in take
    taken = algos.take(
  File "...\.venv\lib\site-packages\pandas\core\algorithms.py", line 1577, in take
    result = arr.take(indices, axis=axis)
IndexError: index 53 is out of bounds for axis 0 with size 53

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "...\.venv\lib\site-packages\supervised\validation\validator_custom.py", line 101, in get_split
    validation_data["sample_weight"] = sample_weight.iloc[validation_index]
  File "...\.venv\lib\site-packages\pandas\core\indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "...\.venv\lib\site-packages\pandas\core\indexing.py", line 1616, in _getitem_axis
    return self._get_list_axis(key, axis=axis)
  File "...\.venv\lib\site-packages\pandas\core\indexing.py", line 1590, in _get_list_axis
    raise IndexError("positional indexers are out-of-bounds") from err
IndexError: positional indexers are out-of-bounds

@pplonski
Contributor

pplonski commented Jun 3, 2024

Thank you @wmotkowska-inpost, I can reproduce the issue. The problem is with the boost-on-errors step. It is a step in which we manipulate the sample weight of each row based on its out-of-fold prediction error. AutoML was still working because only the boost-on-errors models were not trained. I'm working on a fix.
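
Roughly, the idea is that rows with larger out-of-fold prediction errors receive larger sample weights in a later training step; a minimal sketch of that idea (not mljar-supervised's actual implementation):

import numpy as np

# illustration only: boost sample weights for rows with large out-of-fold errors
y_true = np.array([1.0, 2.0, 3.0, 4.0])
oof_pred = np.array([1.1, 2.5, 2.0, 4.0])

errors = np.abs(y_true - oof_pred)
sample_weight = 1.0 + errors / (errors.mean() + 1e-12)
print(sample_weight)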

@pplonski
Contributor

pplonski commented Jun 3, 2024

@wmotkowska-inpost I released version 1.1.9 on PyPI with the fix. Thank you for reporting the issue.

@pplonski pplonski closed this as completed Jun 3, 2024
@wmotkowska-inpost

Hi @pplonski, thank you for finding the source of the issue. I have changed the version from 1.1.4 to 1.1.9 and now get this message: 'KNeighborsRegressorAlgorithm' object has no attribute 'classes_'
Problem during computing permutation importance. Skipping ...
#669

@pplonski
Contributor

pplonski commented Jun 4, 2024

Hi @wmotkowska-inpost,

Thanks for reporting the issue. Do you still see it after updating scikit-learn to the latest version?

@wmotkowska-inpost

scikit-learn = "1.5.0"
sktime = "^0.30.0"

@pplonski
Contributor

pplonski commented Jun 4, 2024

Do you get the same error with those versions?

@wmotkowska-inpost

Are there any other dependencies I should check?

@pplonski
Contributor

pplonski commented Jun 4, 2024

No, thanks; it looks like there is an issue on our side. The error message means that feature importance can't be computed for kNN models. We will fix it.

Are you able to perform the analysis and build models despite this error? Do you get good results? May I ask what your use case is?

@wmotkowska-inpost

Thanks :)

Yes, it seems that I get the full set of models and an ensemble. I also get the predictions from kNN, but no permutation importance output for this model. I can work with the current state. :) I use the AutoML for stacking.
