A robust hyperparameter tuner for high-dimensional, categorically and/or conditionally-parameterized, intractable optimization problems (Ex. Neural Network)
Neural network hyperparameter optimization is an especially challenging task for a few reasons:
- Parameters can be highly codependent. Adjusting a single parameter may not be enough to get over a saddle point; you will likely have to adjust many parameters simultaneously to escape local minima.
- Adjusting one parameter can completely alter the performance of other parameters, which makes it risky to sample historically better values more often: those values may have been modeled on a suboptimal region of the parameter space.
- Modeling which parameter values are more likely to be better requires many trials to overcome this level of variance/noise. Even then, as alluded to above, the best parameter value on average will not always be the best parameter value overall.
- The search space can be highly non-convex, with many categorical, discrete-valued, conditional, and nested parameters. This sort of parameter space makes it very difficult to generate any sort of quantitative probability model.
- For high-end performance where a local minimum is not good enough and we want the best possible results, or for domains without extensive prior research into which choices work better than others, we may want to tune many parameters at once. The dimensionality of the search space can then become very large, to the point where Bayesian Optimization-related methods are not very effective.
Recent research suggests there is not much reproducible evidence that any of today's state-of-the-art techniques significantly beat a plain random search with some form of early stopping: https://arxiv.org/pdf/1902.07638.pdf
All of the issues mentioned above make it very difficult, if not impossible, to do any sort of intelligently guided search for NN architecture/training hyperparameters. That is why this tuner opts against building a surrogate function or gradient-based method to model the probability of the search space, and instead aims for something simpler and hopefully more robust to the problems we're facing. The user shouldn't expect a magic algorithm that takes the fewest possible steps to reach the global minimum, but they should be able to expect something more efficient than random search.
The StoRM tuner can be thought of intuitively as a combination of grid search and random search, where the "distance" between the current best configuration and the next evaluation candidate is probability based. We randomly mutate the current best configuration along different axes (and sometimes even multiple times along the same axis). The number of mutations made for the next evaluation candidate is based on a user-defined probability.
The default value for randomize_axis_factor is 0.5, which means there is a 50% chance that just one mutation will be made, a 25% chance that two mutations will be made, a 12.5% chance that three mutations will be made, and so on.
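As an illustration of this probability pattern only (a minimal sketch, not the library's actual implementation), the number of mutations could be drawn like this:

import random

def sample_num_mutations(randomize_axis_factor=0.5):
    # Start with one mutation; keep adding another while a draw below
    # randomize_axis_factor succeeds, giving 50%/25%/12.5%/... for 1/2/3/... mutations.
    num_mutations = 1
    while random.random() < randomize_axis_factor:
        num_mutations += 1
    return num_mutations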
This approach aims to address the issues stated above by allowing enough freedom so that we do respect the complexity of the search space, while also probabilistically restricting how different the next evaluation candidate is from the current best, to provide some level of guidance and locality to the search.
pip install storm-tuner
Here we define our hyperparameter space by providing our own configuration building method.
NOTE: The configuration building method is an important component of StoRM's functionality. Even though parameters can be accessed elsewhere, for example when the model is trained or during data preprocessing, all parameters must be defined in this method. By executing the model building function prior to any trial execution logic, we draw a configuration based upon which parameters are actually in use, and ensure we do not waste resources testing virtually identical configurations.
After we define our HP space, it will usually make the most sense for the function to return an untrained model at this point. However, one may opt to return more than a model in some circumstances (for example, an optimizer as well), or even to return nothing at all and build the model later. This is entirely up to the user.
All parameters take the form: hp.Param('parameter_name', [value1, value2...], ordered=False). Setting a parameter to ordered=True ensures the tuner can only select adjacent values in a single mutation step. This is an important feature for parameters with ordinality.
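For instance, with the (hypothetical) ordered parameter below, a configuration currently at 100 can only move to 50 or 200 in a single mutation, whereas an unordered parameter could jump to any other value:

hp.Param('units', [50, 100, 200, 400], ordered=True)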
Let's take a look at an example of a configuration building method.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import SGD

def build_model(hp, *args):
    model = Sequential()
    # we can define a train-time param in the build model function to be used later on in run_trial
    hp.Param('batch_size', [32, 64, 128, 256], ordered=True)
    # storm works easily with loops as well
    for x in range(hp.Param('num_layers', [1, 2, 3, 4], ordered=True)):
        model.add(Dense(hp.Param('kernel_size_' + str(x), [50, 100, 200], ordered=True)))
        # here is a categorical parameter that most tuners do not do well with
        activation_choices = ['tanh', 'softsign', 'selu', 'relu', 'elu', 'softplus']
        model.add(Activation(hp.Param('activation', activation_choices)))
        # example of a nested (conditional) parameter
        if hp.Param('dropout', [True, False]):
            model.add(Dropout(hp.Param('dropout_val', [0.1, 0.2, 0.3, 0.4, 0.5], ordered=True)))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer=SGD(momentum=0.9))
    return model
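For reference, a single configuration drawn from this space might look something like the following (purely illustrative values, not actual library output); note that dropout_val only appears because dropout was sampled as True:

# one hypothetical trial's hp.values
{'batch_size': 128, 'num_layers': 2,
 'kernel_size_0': 100, 'kernel_size_1': 50,
 'activation': 'selu',
 'dropout': True, 'dropout_val': 0.2}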
We are required to override the run_trial() method in our own Tuner implementation; this encapsulates the execution of a single trial. All the run_trial method needs to do is assign a score to the trial via self.score_trial(trial, score), using the supplied parameter configuration trial.hyperparameters. How the user generates a score for that configuration is entirely up to them.
The self.build_fn(hp) function called in run_trial is what supplies us with a blank model (as mentioned above).
As we can see, any arguments you provide to the search() entry method can be accessed in your run_trial() method. The same is true for build_model, if any parameters need to be passed in at that scope.
Now let's take a look at an example of a run_trial method.
from storm_tuner import Tuner

class MyTuner(Tuner):

    def run_trial(self, trial, *args):
        # retrieve hyperparameters
        hp = trial.hyperparameters
        # retrieve any parameters supplied via the main search method
        X_train, y_train, X_test, y_test = args[0], args[1], args[2], args[3]
        # create our model/configuration
        model = self.build_fn(hp)
        # train model
        history = model.fit(X_train,
                            y_train,
                            epochs=25,
                            validation_data=(X_test, y_test),
                            # here we access a parameter at train time
                            batch_size=hp.values['batch_size'])
        # calculate score
        score = history.history['val_loss'][-1]
        # assign score to the trial
        self.score_trial(trial, score)
We initialize our Tuner and provide our training data:
tuner = MyTuner(objective_direction='min', build_fn=build_model)
tuner.search(X_train, y_train, X_test, y_test)
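As a hedged sketch of the argument passing described above (input_dim is a made-up extra argument), any additional values supplied to search() arrive in run_trial via *args, and because run_trial calls self.build_fn itself, the user can forward them on to build_model:

from storm_tuner import Tuner

class MyTunerWithExtraArg(Tuner):

    def run_trial(self, trial, *args):
        hp = trial.hyperparameters
        # the fifth argument is the extra value passed to search() below
        X_train, y_train, X_test, y_test, input_dim = args
        # forward it to build_model, which receives it through its own *args
        model = self.build_fn(hp, input_dim)
        history = model.fit(X_train, y_train, epochs=25,
                            validation_data=(X_test, y_test),
                            batch_size=hp.values['batch_size'])
        self.score_trial(trial, history.history['val_loss'][-1])

extra_arg_tuner = MyTunerWithExtraArg(objective_direction='min', build_fn=build_model)
extra_arg_tuner.search(X_train, y_train, X_test, y_test, X_train.shape[1])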
One of the main goals of this library is to save the user from having to tune the tuner itself. An excess of tunable variables can confuse the user, make intuitive documentation more difficult, and even have a substantial negative effect on the tuner's performance if they are not set correctly.
With this tuner, we have two main adjustable parameters to customize your search preferences...
def __init__(self,
             build_fn=build_model,
             objective_direction='min',
             init_random=10,
             max_trials=100,
             randomize_axis_factor=0.5)
init_random: How many initial iterations of random search to perform. This is helpful for getting the search to a decent configuration, so that we don't waste too much time descending from a suboptimal starting point. Recommended range: between 3 and 10.
randomize_axis_factor: The main exploitation/exploration tradeoff parameter (0 -> 1). A value closer to 1 means steps will generally contain more mutations; a value closer to 0 means steps are more likely to make only a single mutation. A value between 0.5 and 0.75 is reasonable in most cases and will almost always be good enough.
NOTE: For both of these parameters, the more parameter codependency you expect and/or the more parameters you are optimizing, the higher it may be beneficial to set these values.
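For example, a search over a large, highly codependent space might be configured with somewhat more aggressive settings (the values here are illustrative, not recommendations):

tuner = MyTuner(build_fn=build_model,
                objective_direction='min',
                init_random=10,
                max_trials=200,
                randomize_axis_factor=0.75)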
The StoRM tuner is designed to be as simple as possible. The tuner supplies a parameter configuration and the user assigns this configuration a score. We leave it up to the user to implement any number of methodologies that might fit their goals and use cases. These can include:
- Techniques to reduce the impact of variance on optimization (k-fold cross validation, trailing average of epoch loss, average of multiple trains)
- Techniques where we abandon training of the current model if there is high enough certainty that it will not beat the best score by the end of training. Because the tuner only cares whether we beat the best score, not by how much a trial lost, we can safely discard the configuration by simply returning from the trial at that point. This causes the trial's score to default to None so the configuration is not tested again. (Note: if we decide to run metrics on variables across all trials after tuning is complete, this may skew the results.) A sketch of this pattern follows below.
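As a hedged sketch only (the best-score bookkeeping here is managed by the user, not by any StoRM API, and the abandonment rule is arbitrary), a run_trial could abandon a configuration after a short probe run if its validation loss is nowhere near the best result seen so far:

from storm_tuner import Tuner

class EarlyAbandonTuner(Tuner):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.best_seen = None  # user-managed bookkeeping, not part of StoRM itself

    def run_trial(self, trial, *args):
        hp = trial.hyperparameters
        X_train, y_train, X_test, y_test = args[0], args[1], args[2], args[3]
        model = self.build_fn(hp)

        # short probe run to see whether this configuration is even in contention
        history = model.fit(X_train, y_train, epochs=5,
                            validation_data=(X_test, y_test),
                            batch_size=hp.values['batch_size'])
        probe_loss = history.history['val_loss'][-1]

        # abandon if the probe is far worse than the best completed trial (factor is arbitrary)
        if self.best_seen is not None and probe_loss > self.best_seen * 2:
            return  # score defaults to None, so this configuration is not revisited

        # otherwise continue training (fit resumes from the probe epochs) and score the trial
        history = model.fit(X_train, y_train, epochs=20,
                            validation_data=(X_test, y_test),
                            batch_size=hp.values['batch_size'])
        score = history.history['val_loss'][-1]
        if self.best_seen is None or score < self.best_seen:
            self.best_seen = score
        self.score_trial(trial, score)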
StoRM is designed to be as generic as possible, and there is actually nothing specific to neural networks or any particular ML library coded into this project. This freedom allows the user to optimize parameters used at various stages of the experiment: data pre-processing, architecture, training, etc.
Because of this experiment-agnostic approach, StoRM can be even more advantageous in branches of ML that use NNs for the model but carry an additional set of hyperparameters to optimize, which can make the search space even trickier and harder for traditional approaches to handle; reinforcement learning is one example.
Of course, most of the success of StoRM revolves around the user's ability to parameterize the search space appropriately. StoRM will only work as well as the parameter space it operates on.
A few things to keep in mind when parameterizing your search space...
- For a parameter with ordinal values that can also be turned off entirely by setting the value to 0, such as dropout, weight decay, or noise, consider adding an extra boolean parameter that "unlocks" the ordinal parameter. If optimization initializes to a suboptimal value and the feature is not useful for this particular problem, it will likely take more iterations to traverse the value space than it would to simply turn the feature off and escape this minimum.
- For ordinal parameters where an additional "gateway" parameter is not an option, it is suggested to keep the number of values under 10, and ideally around 5, for the reasons explained above.
- For parameters that are coupled with one another (for example, learning rate and weight decay), one might parameterize weight decay as a factor of the learning rate instead of optimizing both separately. This way, we only search for the best step-size-to-weight-decay ratio, instead of forcing the tuner to find LR and WD values that meet at the right scale (a sketch of this appears after the example values below).
- Last but not least, most NN hyperparameters are not very sensitive, and it is far more important to find a good general area/scale for a parameter than it is, for example, to know that a learning rate of 1e-3 performs slightly better than 2e-3. We want a good distribution of values that captures the points a parameter is commonly experimented with, yet does not have such an over-abundance of ordinal values that the tuner must stochastically traverse a long stretch of the space if initialized to a poor value. StoRM leaves it to the user to provide the appropriate binning/sampling of values (log, exp, linear, etc.), which is very parameter-dependent.
In most cases the selection of values should be fairly intuitive...
batch size: [32, 64, 128, 256]
momentum: [1 - pow(10, -0.5), 1 - pow(10, -1), 1 - pow(10, -1.5), 1 - pow(10, -2)]
kernel size: [50, 100, 150, 200, 250]
lr: [1e-2, 1e-3, 1e-4]
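As a hedged sketch of the coupled-parameter tip above (parameter names and values are illustrative), inside build_model one could search the weight-decay-to-learning-rate ratio rather than an independent weight decay value:

lr = hp.Param('lr', [1e-2, 1e-3, 1e-4], ordered=True)
# search the WD-to-LR ratio instead of an absolute weight decay value
wd_factor = hp.Param('wd_factor', [1e-1, 1e-2, 1e-3], ordered=True)
weight_decay = lr * wd_factor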
At the end of the day, there is nothing stopping the user from re-parameterizing their search space after narrowing in on promising areas from an initial, broader-scope run of the StoRM tuner.
Although the examples here use TensorFlow/Keras, as mentioned, StoRM works with any library or algorithm (sklearn, PyTorch, etc.). One simply defines any parameters to be optimized in build_fn. The user can decide to return a model right there and utilize StoRM's inline parameterization, or they can opt to use the parameters later in run_trial.
StoRM should be used for optimization problems where the parameter space can be high dimensional and has many categorical/conditional variables. StoRM should also be used when parameters do not need to be optimized at very fine levels, but rather we need to find good general choices. In short, StoRM will be most effective when there are many codependent decisions to be made.
StoRM will probably not be the best tuner to use if you are optimizing many real-valued parameters that always have an effect on the target function, have low codependencies, and can be sensitive to small changes, such that we should offer the full real-valued spectrum rather than just a few bins to choose from. For these types of problems, Bayesian Optimization will still be more effective.
A StoRM Trial, like the one used in the run_trial method above, has a metrics dictionary that allows us to easily store any information pertinent to the trial for review later on.
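For example (assuming the dictionary is exposed as trial.metrics; the keys below are made up), run_trial could record extra diagnostics alongside the score:

# inside run_trial, after training
trial.metrics['epochs_trained'] = len(history.history['val_loss'])
trial.metrics['final_train_loss'] = history.history['loss'][-1]
self.score_trial(trial, score)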
Run compare_to_random_search.py from the examples directory to compare performance to random search for yourself.
Here we can see that over ten trials each, StoRM has a clear advantage.
tuned scores mean: 0.0013418583347811364 | stdev: 0.001806810901973602
random scores mean: 0.010490878883283586 | stdev: 0.006145158964894091
Benchmarking, examples, bug submissions, and PRs all welcome!