init checkin to add LassoCV and RERF to optimizers #263

Merged

Conversation

@edcthayer (Contributor) commented Aug 6, 2021

Added LassoCrossValidated (LassoCV) and RegressionEnhancedRandomForest (RERF) regression models to the list of surrogate models available for optimizers. This required creating MultiObjective versions for each of these regression models. Fixed some bugs found via testing with random surrogate_model parameters. Details below by file added/changed:

source/Mlos.Python/mlos/Optimizers/BayesianOptimizerConfigStore.py:

  1. Added LassoCV, MultiObjectiveLassoCV, RERF, and MultiObjectiveRERF to the surrogate_model_implementation list and expanded the resulting hyper grid. The default remains the historical HomogeneousRandomForest.

source/Mlos.Python/mlos/Optimizers/BayesianOptimizer.py:

  1. Added if-elif-else code to correctly instantiate the surrogate_model based on the optimizer_config.surrogate_model_implementation.
  2. Extended assert test to include new surrogate models.
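
The if-elif-else dispatch described above might look roughly like the following. This is an illustrative sketch only: the class names come from the PR, but the real constructors take input/output spaces and model configs, so stand-in classes are used here to make the control flow runnable.

```python
# Stand-ins for the real MLOS multi-objective surrogate model classes.
class MultiObjectiveHomogeneousRandomForest: pass
class MultiObjectiveLassoCrossValidated: pass
class MultiObjectiveRegressionEnhancedRandomForest: pass

def instantiate_surrogate_model(implementation_name):
    # Mirrors the if-elif-else added in BayesianOptimizer.__init__ and the
    # extended assert: unknown implementations fail fast.
    if implementation_name == "HomogeneousRandomForestRegressionModel":
        return MultiObjectiveHomogeneousRandomForest()
    elif implementation_name == "LassoCrossValidatedRegressionModel":
        return MultiObjectiveLassoCrossValidated()
    elif implementation_name == "RegressionEnhancedRandomForestRegressionModel":
        return MultiObjectiveRegressionEnhancedRandomForest()
    else:
        raise ValueError(f"Unknown surrogate model implementation: {implementation_name}")
```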

source/Mlos.Python/mlos/Optimizers/RegressionModels/LassoCrossValidatedConfigStore.py:

  1. Corrected dimension type for LassoCV cv parameter (continuous --> discrete).
  2. Restricted ranges on some model_configs to avoid Windows faults discovered in random config tests.
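
The cv fix above matters because scikit-learn's LassoCV interprets cv as an integer fold count, so sampling it from a continuous range yields invalid floats. A minimal sketch, using a stand-in for the MLOS DiscreteDimension class (the real class lives in the MLOS Spaces module; its exact API is not shown in this diff):

```python
import random

class DiscreteDimension:
    # Stand-in: samples integers from an inclusive [min, max] range,
    # the key property the corrected config dimension needs.
    def __init__(self, name, min, max):
        self.name, self.min, self.max = name, min, max
    def random(self):
        return random.randint(self.min, self.max)

# cv must be a whole number of cross-validation folds.
cv_dimension = DiscreteDimension(name="cv", min=2, max=10)
sample = cv_dimension.random()
assert isinstance(sample, int) and 2 <= sample <= 10
```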

source/Mlos.Python/mlos/Optimizers/RegressionModels/MultiObjectiveLassoCrossValidated.py:
New class to allow LassoCV for multi-objective optimizations.

source/Mlos.Python/mlos/Optimizers/RegressionModels/MultiObjectiveRegressionEnhancedRandomForest.py:
New class to allow RERF for multi-objective optimizations.
Note: the .copy() on line 41 is needed because the model_config.perform_initial_random_forest_hyper_parameter_search value is changed (True -> False) once the grid search for the random forest fit() completes.
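
The shared-mutable-config hazard behind that .copy() can be shown in a few lines. The field name comes from the PR; SimpleNamespace stands in for the real model config class:

```python
from types import SimpleNamespace
import copy

# Without a copy, every per-objective model shares one config object, so the
# first fit() flipping the flag disables the grid search for its siblings.
base = SimpleNamespace(perform_initial_random_forest_hyper_parameter_search=True)
shared = [base, base]
shared[0].perform_initial_random_forest_hyper_parameter_search = False  # simulate fit()
assert shared[1].perform_initial_random_forest_hyper_parameter_search is False  # clobbered

# With a copy per objective, each model keeps its own state.
template = SimpleNamespace(perform_initial_random_forest_hyper_parameter_search=True)
per_objective = [copy.copy(template) for _ in range(2)]
per_objective[0].perform_initial_random_forest_hyper_parameter_search = False
assert per_objective[1].perform_initial_random_forest_hyper_parameter_search is True
```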

source/Mlos.Python/mlos/Optimizers/RegressionModels/MultiObjectiveRegressionEnhancedRandomForest.py:

  1. Correctly capture the random forest hyper parameters returned from the grid search (line 320).
  2. Cleaned up some initializations.

source/Mlos.Python/mlos/Optimizers/RegressionModels/unit_tests/TestMultiObjectiveLassoCrossValidated.py:
New unit tests for new class.

source/Mlos.Python/mlos/Optimizers/RegressionModels/unit_tests/TestMultiObjectiveRegressionEnhancedRandomForest.py:
New unit tests for new class.

@edcthayer edcthayer requested a review from sergiy-k August 6, 2021 22:19
@@ -59,20 +61,47 @@ def __init__(

# Now let's put together the surrogate model.
#
print(f'self.optimizer_config.surrogate_model_implementation: {self.optimizer_config.surrogate_model_implementation}')

Suggested change
print(f'self.optimizer_config.surrogate_model_implementation: {self.optimizer_config.surrogate_model_implementation}')
self.logger.info(f'self.optimizer_config.surrogate_model_implementation: {self.optimizer_config.surrogate_model_implementation}')

CategoricalDimension(name="fit_intercept", values=[False, True]),
CategoricalDimension(name="normalize", values=[False, True]),
CategoricalDimension(name="precompute", values=[False, True]),
DiscreteDimension(name="max_iter", min=0, max=10 ** 5),
ContinuousDimension(name="tol", min=0, max=2 ** 10),
DiscreteDimension(name="max_iter", min=100, max=5 * 10 **3),

Suggested change
DiscreteDimension(name="max_iter", min=100, max=5 * 10 **3),
DiscreteDimension(name="max_iter", min=100, max=5 * (10 ** 3)),

@@ -89,6 +91,10 @@ def __init__(
self.partial_hat_matrix_ = 0
self.regressor_standard_error_ = 0

# THE HACK

We may need to explain a little more here. If I remember right:

When LassoCV is used as part of RERF, it cannot reasonably compute the upper and lower bounds on its input space dimensions, as they are a polynomial combination of inputs to RERF. Thus, it approximates them with the empirical min and max. These approximations are biased: the lower bound is too large, the upper bound is too small. Consequently, during scoring, LassoCV is likely to see input outside of these bounds, but we still want LassoCV to produce predictions for those points. So we introduce a little hack: whenever LassoCV is instantiated as part of RERF, it should skip input filtering on predict. This field controls this behavior.

Feel free to just copy-paste that in, or polish it to your liking!
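
The behavior described above can be sketched as follows. The class, field, and bound names here are illustrative, not the actual MLOS API; the point is only the flag's effect on predict:

```python
class LassoSketch:
    # Stand-in model with empirical input bounds and the "hack" flag.
    def __init__(self, lower, upper, skip_input_filtering_on_predict=False):
        self.lower, self.upper = lower, upper
        self.skip_input_filtering_on_predict = skip_input_filtering_on_predict

    def predict(self, xs):
        if not self.skip_input_filtering_on_predict:
            # Normal path: drop inputs outside the (biased-inward) bounds.
            xs = [x for x in xs if self.lower <= x <= self.upper]
        return [2.0 * x for x in xs]  # stand-in for the real prediction

standalone = LassoSketch(lower=0.0, upper=1.0)
inside_rerf = LassoSketch(lower=0.0, upper=1.0, skip_input_filtering_on_predict=True)
assert standalone.predict([0.5, 1.5]) == [1.0]        # out-of-bounds point dropped
assert inside_rerf.predict([0.5, 1.5]) == [1.0, 3.0]  # hack: predict anyway
```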

# add small noise to x to remove singularity,
# expect prediction confidence to be reduced (wider intervals) by doing this
self.logger.info(
f"Adding noise to design matrix used for prediction confidence due to condition number {condition_number} > 10^10."
f"Adding noise to design matrix used for prediction confidence due to condition number {condition_number} > 10^4."

10**4

Suggested change
f"Adding noise to design matrix used for prediction confidence due to condition number {condition_number} > 10^4."
f"Adding noise to design matrix used for prediction confidence due to condition number {condition_number} > 10**4."


It's clear what you mean... but my CDO strongly suggests that we should stick to the Python exponentiation operator :)
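
The jitter the log message describes can be sketched like this: if the design matrix used for prediction confidence is near-singular, small noise restores numerical stability at the cost of slightly wider intervals. The 10**4 threshold comes from the diff; the function name and noise scale are assumptions:

```python
import numpy as np

def stabilize_design_matrix(x, threshold=10**4, noise_scale=1e-6, seed=0):
    # Add small Gaussian noise when the matrix is ill-conditioned, so the
    # downstream hat-matrix computation does not blow up.
    if np.linalg.cond(x) > threshold:
        rng = np.random.default_rng(seed)
        x = x + rng.normal(scale=noise_scale, size=x.shape)
    return x

# A rank-deficient matrix has an enormous condition number; jitter reduces it.
x = np.array([[1.0, 2.0], [2.0, 4.0]])  # second row = 2 * first row
assert np.linalg.cond(x) > 10**4
x_jittered = stabilize_design_matrix(x)
assert np.linalg.cond(x_jittered) < np.linalg.cond(x)
```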



class MultiObjectiveLassoCrossValidated(NaiveMultiObjectiveRegressionModel):
"""Maintains multiple HomogeneousRandomForestRegressionModels each predicting a different objective.
@byte-sculptor (Contributor) commented Sep 8, 2021

Suggested change
"""Maintains multiple HomogeneousRandomForestRegressionModels each predicting a different objective.
"""Maintains multiple LassoCrossValidatedRegressionModels each predicting a different objective.

)


# We just need to assert that the model config belongs in homogeneous_random_forest_config_store.parameter_space.

Suggested change
# We just need to assert that the model config belongs in homogeneous_random_forest_config_store.parameter_space.
# We just need to assert that the model config belongs in lasso_cross_validated_config_store.parameter_space.



class MultiObjectiveRegressionEnhancedRandomForest(NaiveMultiObjectiveRegressionModel):
"""Maintains multiple HomogeneousRandomForestRegressionModels each predicting a different objective.

Suggested change
"""Maintains multiple HomogeneousRandomForestRegressionModels each predicting a different objective.
"""Maintains multiple RegressionEnhancedRandomForestRegressionModels each predicting a different objective.

)


# We just need to assert that the model config belongs in homogeneous_random_forest_config_store.parameter_space.

Suggested change
# We just need to assert that the model config belongs in homogeneous_random_forest_config_store.parameter_space.
# We just need to assert that the model config belongs in regression_enhanced_random_forest_config_store.parameter_space.

for output_dimension in output_space.dimensions:
print(f'output_dimension.name: {output_dimension.name}')
lasso_model = LassoCrossValidatedRegressionModel(
model_config=model_config,

You copy the model_config in multi-objective RERF, but not here. Why?

@edcthayer (Author) replied

Values in the model config are altered by the random forest GridSearchCV in RERF. When these configs are assigned to different objectives, they stomp all over each other. I'll track down the lines in the RERF model that alter the model_config and explain this in the MultiObjectiveRERF code where you've spotted this difference.

# TODO : determine min sample needed to fit based on model configs
random_forest_should_fit = True
return root_base_model_should_fit and random_forest_should_fit
# since polynomial basis functions decrease the degrees of freedom (TODO: add reference),

This is neat :)
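
The degrees-of-freedom point above can be made concrete: a degree-d polynomial basis in n inputs has comb(n + d, d) terms, so the lasso stage needs at least that many samples before a fit is meaningful. A minimal sketch of such a gate (helper names are illustrative, not the MLOS API):

```python
from math import comb

def num_polynomial_features(num_inputs, degree):
    # Count of monomials of total degree <= degree in num_inputs variables,
    # including the bias term: C(num_inputs + degree, degree).
    return comb(num_inputs + degree, degree)

# 2 inputs, degree 2 -> 1, x1, x2, x1^2, x1*x2, x2^2
assert num_polynomial_features(2, 2) == 6

def should_fit(num_samples, num_inputs, degree):
    # Require at least as many samples as basis terms.
    return num_samples >= num_polynomial_features(num_inputs, degree)

assert should_fit(10, 2, 2) is True
assert should_fit(5, 2, 2) is False
```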

num_testing_samples = 10
elif objective_function_config_name == '5_mutually_exclusive_polynomials':
num_training_samples = 100
num_testing_samples = 50

Suggested change
num_testing_samples = 50
num_testing_samples = 50
else:
assert False

num_testing_samples = 10
elif objective_function_config_name == '5_mutually_exclusive_polynomials':
num_training_samples = 100
num_testing_samples = 50

Suggested change
num_testing_samples = 50
num_testing_samples = 50
else:
assert False

@edcthayer edcthayer merged commit 791d670 into microsoft:main Sep 30, 2021