
Better warm starting with automatically converting pipeline or pipeline string to gama individual string #156

Open
prabhant opened this issue Jun 13, 2022 · 6 comments
Labels
enhancement: An improvement to an existing component, e.g. optimization.
feature: Something new, expands what you can do with GAMA.
Milestone
v22.1+

Comments

@prabhant

Currently it is a lot of work for a user to convert a model string to GAMA's individual string format. It would be great if we had a function for that, or if GAMA could automatically accept a pipeline string for warm starting.
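
For illustration, a minimal sketch of the conversion being requested; the target individual string format is inferred from the example later in this thread, and the exact grammar may differ between GAMA versions:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import ExtraTreesClassifier

# What the user has: an ordinary sklearn pipeline.
p = Pipeline([('scaler', RobustScaler()),
              ('clf', ExtraTreesClassifier(min_samples_leaf=2))])

# What GAMA's warm_start expects: an individual string such as
individual = "ExtraTreesClassifier(RobustScaler(data), ExtraTreesClassifier.min_samples_leaf=2)"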

prabhant added the enhancement and feature labels Jun 13, 2022
PGijsbers added this to the v22.1+ milestone Jul 27, 2022
@prabhant
Author

Here is the gist from my last experiment, where I still had to remove part of the search space to make it work: https://gist.github.com/prabhant/ebc0f4f9eb17fec4a80047f2aeb4b184

@WmWessels

WmWessels commented Jul 13, 2023

I have tried working with the code posted by @prabhant; however, when I try to warm-start GAMA, I get an error. The code to reproduce the error is listed below:

from sklearn.decomposition import FastICA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from gama.configuration.classification import clf_config

p = Pipeline(steps=[
    ('imputation', SimpleImputer(strategy='median')),
    ('2', RobustScaler()),
    ('1', FastICA(tol=0.75, whiten='unit-variance')),
    ('0', ExtraTreesClassifier(max_features=0.8, min_samples_leaf=2, min_samples_split=5)),
])

# Drop the leading imputation step if present; it is not converted to a terminal below.
if 'imputation' in p.named_steps:
    p = p[1:]

# Collect the estimators and their class names from the pipeline.
steps = [est for _, est in p.steps]
names = [type(est).__name__ for est in steps]

s = []
# Nest the steps from the final estimator outwards: "Clf(Prep2(Prep1(" ...
for name in reversed(names):
    s.append(f"{name}(")
# ... wrapped around the innermost "data" terminal.
s.append("data")
# Append the hyperparameter terminals, one pipeline step at a time.
for est, name in zip(steps, names):
    # Keep only hyperparameters that also appear in GAMA's search space.
    keys = est.__dict__.keys() & clf_config[est.__class__].keys()
    for j in keys:
        value = est.__dict__[j]
        terminal = f"{name}.{j}='{value}'" if isinstance(value, str) else f"{name}.{j}={value}"
        s.append(terminal if j == list(keys)[-1] else terminal + ", ")
    s.append('), ')
# The final separator closes the outermost call.
s[-1] = ')'

# Printing s at this point still shows the tokens in an incorrect, un-joined format:
# print(s)
['ExtraTreesClassifier(', 'FastICA(', 'RobustScaler(', 'data', '), ', 'FastICA.tol=0.75, ', "FastICA.whiten='unit-variance'", '), ', 'ExtraTreesClassifier.min_samples_leaf=2, ', "ExtraTreesClassifier.criterion='gini', ", 'ExtraTreesClassifier.min_samples_split=5, ', 'ExtraTreesClassifier.bootstrap=False, ', 'ExtraTreesClassifier.n_estimators=100, ', 'ExtraTreesClassifier.max_features=0.8', ')']

# But when I join the tokens:

warm_starting_candidates = [''.join(s)]

# I think this is the correct format:
# print(warm_starting_candidates)

["ExtraTreesClassifier(FastICA(RobustScaler(data), FastICA.tol=0.75, FastICA.whiten='unit-variance'), ExtraTreesClassifier.min_samples_leaf=2, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.min_samples_split=5, ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.n_estimators=100, ExtraTreesClassifier.max_features=0.8)"]

# However, in the context of warm starting, I get the following error:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from gama import GamaClassifier

if __name__ == '__main__':
    X, y = load_breast_cancer(return_X_y=True)  
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    automl = GamaClassifier(max_total_time=180, store="nothing")
    print("Starting `fit` which will take roughly 3 minutes.")
    automl.fit(X_train, y_train, warm_start=warm_starting_candidates)

# Error message:
KeyError: "Could not find Terminal of type 'ExtraTreesClassifier.min_samples_leaf=2'."

@prabhant
Author

@Wman1001 This means you might not have this value in your search space; can you check it?
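
For illustration, a minimal sketch of one way to check, assuming the dict-of-dicts layout of clf_config used earlier in this thread (the exact layout may differ between GAMA versions):

from sklearn.ensemble import ExtraTreesClassifier
from gama.configuration.classification import clf_config

# Inspect which values the default search space allows for this hyperparameter.
# An empty list (or a missing key) means no terminal such as
# 'ExtraTreesClassifier.min_samples_leaf=2' exists, which explains the KeyError above.
print(clf_config[ExtraTreesClassifier].get("min_samples_leaf"))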

@WmWessels

Correct. I added them to the search space, which does fix the issue. I was wondering whether it is intended that the tree-based models have an empty list for min_samples_leaf and min_samples_split by default?

@prabhant
Author

The search space depends on your needs, but if you do not define any values, GAMA only uses the default value for that parameter.

@PGijsbers
Member

Actually, a

Classifier: {
    "hyperparameter": []
}

definition means that the hyperparameter is defined at the search-space level instead of at the classifier level, which allows certain hyperparameters to be "shared" across different classifiers. See also 1 and 2
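
For illustration, a minimal sketch of that pattern, assuming the dict-based format of clf_config above (the value range here is made up):

from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

config = {
    # Defined once at the search-space level ...
    "min_samples_leaf": range(1, 21),
    # ... and shared by every classifier that lists it with an empty list.
    ExtraTreesClassifier: {"min_samples_leaf": []},
    RandomForestClassifier: {"min_samples_leaf": []},
}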
