Dev hyperband (#405)
* support hyperband

* add example for hyperband

* register Hyperband in tuner

* after debug

* update doc

* trivial change

* update spec validation of yaml config

* modify nnictl launcher

* modify nnimanager and util to support advisor

* Quick fix nnictl config logic (#289)

* fix nnictl bug

* fix install.sh

* add desc for Dockerfile.build.base

* update document for Dockerfile

* update

* refactor port detect

* update

* refactor NNICTLDOC.md

* add document for pai and nnictl

* add default value for port

* add exception handling in trial_keeper.py

* fix port bug

* fix resume

* fix nnictl resume and fix nnictl stop

* fix document

* update

* refactor nnictl

* update

* update doc

* update

* update nnictl

* fix comment

* revert dockerfile

* update

* update

* update

* fix nnictl error hit

* fix comments

* fix bash-completion

* fix paramiko install

* quick fix resume logic

* update

* quick fix nnictl

* refactor sdk main

* update unit test accordingly

* update example's config file

* update restserver validation

* PR merge to 0.3 (#297)

* refactor doc

* update with Mao's suggestions

* Set theme jekyll-theme-dinky

* update doc

* fix links

* fix links

* fix links

* merge

* fix links and doc errors

* merge

* merge

* merge

* merge

* Update README.md (#288)

added License badge

* merge

* updated the "Contribute" part (merged Gems' wiki in, updated ReadMe)

* fix link

* fix doc mistakes and broken links. (#271)

* refactor doc

* update with Mao's suggestions

* Set theme jekyll-theme-dinky

* updated the "Contribute" part (merged Gems' wiki in, updated ReadMe)

* fix link

* Update README.md

* Fix misspelling in examples/trials/ga_squad/README.md

* revise the installation cmd to v0.2

* revise to install v0.2

* remove files

* update

* remove enas readme (#292)

* support checkpoint directory

* Fix datastore performance issue (#301)

* fix pylint

* Fix nnictl in v0.3 (#299)

Fix old version of config file
fix sklearn requirements
Fix resume log logic

* modify log

* trivial changes

* update example

* update makefile

* update launcher.py to fix the problem of finding main.js

* debug

* add hyperparameter info into trial_end api

* fix bug and update example

* fix error induced by merge

* support initialize

* add doc for hyperband

* fix bugs and add config_pai

* fix bugs and add config_pai

* fix bugs and add config_pai

* fix bugs and add config_pai

* update doc

* add doc for advisor

* fit

* modification based on hui's comments

* update doc
QuanluZhang authored Nov 30, 2018
1 parent d2f0638 commit a387250
Showing 23 changed files with 1,067 additions and 132 deletions.
5 changes: 4 additions & 1 deletion docs/howto_2_CustomizedTuner.md
@@ -6,7 +6,7 @@ So, if a user wants to implement a customized Tuner, he/she only needs to:

1) Inherit from the base Tuner class
2) Implement the receive_trial_result and generate_parameters functions
3) Write a script to run Tuner
3) Configure your customized tuner in experiment yaml config file

Here is an example:

@@ -93,3 +93,6 @@ For more detailed examples, see:
> * [evolution-tuner](../src/sdk/pynni/nni/evolution_tuner)
> * [hyperopt-tuner](../src/sdk/pynni/nni/hyperopt_tuner)
> * [evolution-based-customized-tuner](../examples/tuners/ga_customer_tuner)
## Write a more advanced automl algorithm

The methods above are usually enough to write a general tuner. However, users may also want access to more information, such as intermediate results and trials' states (i.e., the methods available to an assessor), in order to build a more powerful automl algorithm. Therefore, we have another concept called `advisor`, which directly inherits from `MsgDispatcherBase` in [`src/sdk/pynni/nni/msg_dispatcher_base.py`](../src/sdk/pynni/nni/msg_dispatcher_base.py). Please refer to [How To - Customize Your Own Advisor](howto_3_CustomizedAdvisor.md) for how to write a customized advisor.
39 changes: 39 additions & 0 deletions docs/howto_3_CustomizedAdvisor.md
@@ -0,0 +1,39 @@
# **How To** - Customize Your Own Advisor

*Advisor targets the scenario where an automl algorithm needs the methods of both a tuner and an assessor. It is similar to a tuner in that it receives trial configuration requests and final results, and generates trial configurations. It is also similar to an assessor in that it receives intermediate results and trials' end states, and can send trial kill commands. Note that if you use an Advisor, you cannot use a tuner or an assessor at the same time.*

So, if a user wants to implement a customized Advisor, he/she only needs to:

1) Define an Advisor inheriting from the MsgDispatcherBase class
2) Implement the methods with prefix `handle_` except `handle_request`
3) Configure your customized Advisor in experiment yaml config file

Here is an example:

**1) Define an Advisor inheriting from the MsgDispatcherBase class**
```python
from nni.msg_dispatcher_base import MsgDispatcherBase

class CustomizedAdvisor(MsgDispatcherBase):
    def __init__(self, ...):
        ...
```

**2) Implement the methods with prefix `handle_` except `handle_request`**

Please refer to the implementation of Hyperband ([src/sdk/pynni/nni/hyperband_advisor/hyperband_advisor.py](../src/sdk/pynni/nni/hyperband_advisor/hyperband_advisor.py)) for how to implement the methods.
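
A minimal sketch of step 2 is shown below. It assumes that the `handle_` methods exposed by `MsgDispatcherBase` in your NNI version include the ones implemented by the Hyperband advisor (`handle_initialize`, `handle_update_search_space`, `handle_request_trial_jobs`, `handle_report_metric_data`, `handle_trial_end`); check `msg_dispatcher_base.py` for the authoritative list. The method bodies are placeholders, not working advisor logic.

```python
from nni.msg_dispatcher_base import MsgDispatcherBase

class MySkeletonAdvisor(MsgDispatcherBase):
    '''A skeleton only: method names follow the Hyperband advisor, bodies are stubs.'''

    def __init__(self, optimize_mode='maximize'):
        super().__init__()
        self.optimize_mode = optimize_mode
        self.search_space = None

    def handle_initialize(self, data):
        # 'data' carries the search space when the experiment starts.
        self.handle_update_search_space(data)

    def handle_update_search_space(self, data):
        # Remember the latest search space so new trials can be sampled from it.
        self.search_space = data

    def handle_request_trial_jobs(self, data):
        # 'data' is the number of trial configurations being requested;
        # generate that many hyperparameter sets and dispatch them here.
        pass

    def handle_report_metric_data(self, data):
        # Receives both intermediate and final metrics reported by trials.
        pass

    def handle_trial_end(self, data):
        # Called when a trial finishes (succeeded, failed, or was killed).
        pass
```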

**3) Configure your customized Advisor in experiment yaml config file**

This is similar to configuring a tuner or an assessor: NNI needs to locate your customized Advisor class and instantiate it, so you need to specify the location of the customized Advisor class and pass literal values as parameters to its \_\_init__ constructor.

```yaml
advisor:
  codeDir: /home/abc/myadvisor
  classFileName: my_customized_advisor.py
  className: CustomizedAdvisor
  # Any parameter you need to pass to your advisor class's __init__ constructor
  # can be specified in this optional classArgs field, for example
  classArgs:
    arg1: value1
```
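
As a rough mental model (illustrative, not NNI's actual loader code), the `classArgs` mapping ends up as keyword arguments of your constructor:

```python
# Roughly equivalent to what the config above asks NNI to do:
advisor = CustomizedAdvisor(arg1='value1')
```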
25 changes: 25 additions & 0 deletions examples/trials/mnist-hyperband/config.yml
@@ -0,0 +1,25 @@
authorName: default
experimentName: example_mnist
trialConcurrency: 2
maxExecDuration: 100h
maxTrialNum: 10000
#choice: local, remote, pai
trainingServicePlatform: local
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
advisor:
  #choice: Hyperband
  builtinAdvisorName: Hyperband
  classArgs:
    #R: the maximum STEPS (could be the number of mini-batches or epochs) that can be
    #   allocated to a trial. Each trial should use STEPS to control how long it runs.
    R: 100
    #eta: proportion of discarded trials
    eta: 3
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 0
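
For intuition on what `R` and `eta` imply, the sketch below reproduces the standard Hyperband bracket arithmetic (`s_max = floor(log_eta(R))`, followed by successive halving within each bracket). It is only an illustration of the schedule; NNI's built-in Hyperband advisor keeps its own bookkeeping and may round slightly differently.

```python
import math

def hyperband_schedule(R=100, eta=3):
    """Print the (trials, STEPS) pairs of each Hyperband bracket (illustrative only)."""
    s_max = int(math.floor(math.log(R, eta)))
    budget = (s_max + 1) * R                                  # total budget per bracket
    for s in range(s_max, -1, -1):
        n = int(math.ceil(budget / R * eta ** s / (s + 1)))   # trials started in bracket s
        r = R * eta ** (-s)                                   # initial STEPS per trial
        print(f'bracket s={s}:')
        for i in range(s + 1):
            n_i = int(math.floor(n * eta ** (-i)))            # trials kept in round i
            r_i = int(r * eta ** i)                           # STEPS per trial in round i
            print(f'  round {i}: {n_i} trials x {r_i} STEPS')

if __name__ == '__main__':
    hyperband_schedule(R=100, eta=3)
```

Assuming NNI is installed, the experiment itself is launched in the usual way, e.g. `nnictl create --config config.yml` from this directory.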
39 changes: 39 additions & 0 deletions examples/trials/mnist-hyperband/config_pai.yml
@@ -0,0 +1,39 @@
authorName: default
experimentName: example_mnist_hyperband
maxExecDuration: 1h
maxTrialNum: 10000
trialConcurrency: 10
#choice: local, remote, pai
trainingServicePlatform: pai
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
advisor:
  #choice: Hyperband
  builtinAdvisorName: Hyperband
  classArgs:
    #R: the maximum STEPS
    R: 100
    #eta: proportion of discarded trials
    eta: 3
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 0
  cpuNum: 1
  memoryMB: 8196
  #The docker image to run nni job on pai
  image: openpai/pai.example.tensorflow
  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
  dataDir: hdfs://10.10.10.10:9000/username/nni
  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
  outputDir: hdfs://10.10.10.10:9000/username/nni
paiConfig:
  #The username to login pai
  userName: username
  #The password to login pai
  passWord: password
  #The host of restful server of pai
  host: 10.10.10.10
236 changes: 236 additions & 0 deletions examples/trials/mnist-hyperband/mnist.py
@@ -0,0 +1,236 @@
"""A deep MNIST classifier using convolutional layers."""

import logging
import math
import tempfile
import tensorflow as tf

from tensorflow.examples.tutorials.mnist import input_data

import nni

FLAGS = None

logger = logging.getLogger('mnist_AutoML')


class MnistNetwork(object):
    '''
    MnistNetwork is for initializing and building a basic network for mnist.
    '''
    def __init__(self,
                 channel_1_num,
                 channel_2_num,
                 conv_size,
                 hidden_size,
                 pool_size,
                 learning_rate,
                 x_dim=784,
                 y_dim=10):
        self.channel_1_num = channel_1_num
        self.channel_2_num = channel_2_num
        self.conv_size = conv_size
        self.hidden_size = hidden_size
        self.pool_size = pool_size
        self.learning_rate = learning_rate
        self.x_dim = x_dim
        self.y_dim = y_dim

        self.images = tf.placeholder(tf.float32, [None, self.x_dim], name='input_x')
        self.labels = tf.placeholder(tf.float32, [None, self.y_dim], name='input_y')
        self.keep_prob = tf.placeholder(tf.float32, name='keep_prob')

        self.train_step = None
        self.accuracy = None

    def build_network(self):
        '''
        Building network for mnist
        '''

        # Reshape to use within a convolutional neural net.
        # Last dimension is for "features" - there is only one here, since images are
        # grayscale -- it would be 3 for an RGB image, 4 for RGBA, etc.
        with tf.name_scope('reshape'):
            try:
                input_dim = int(math.sqrt(self.x_dim))
            except:
                print(
                    'input dim cannot be sqrt and reshape. input dim: ' + str(self.x_dim))
                logger.debug(
                    'input dim cannot be sqrt and reshape. input dim: %s', str(self.x_dim))
                raise
            x_image = tf.reshape(self.images, [-1, input_dim, input_dim, 1])

        # First convolutional layer - maps one grayscale image to 32 feature maps.
        with tf.name_scope('conv1'):
            w_conv1 = weight_variable(
                [self.conv_size, self.conv_size, 1, self.channel_1_num])
            b_conv1 = bias_variable([self.channel_1_num])
            h_conv1 = tf.nn.relu(conv2d(x_image, w_conv1) + b_conv1)

        # Pooling layer - downsamples by 2X.
        with tf.name_scope('pool1'):
            h_pool1 = max_pool(h_conv1, self.pool_size)

        # Second convolutional layer -- maps 32 feature maps to 64.
        with tf.name_scope('conv2'):
            w_conv2 = weight_variable([self.conv_size, self.conv_size,
                                       self.channel_1_num, self.channel_2_num])
            b_conv2 = bias_variable([self.channel_2_num])
            h_conv2 = tf.nn.relu(conv2d(h_pool1, w_conv2) + b_conv2)

        # Second pooling layer.
        with tf.name_scope('pool2'):
            h_pool2 = max_pool(h_conv2, self.pool_size)

        # Fully connected layer 1 -- after 2 rounds of downsampling, our 28x28 image
        # is down to 7x7x64 feature maps -- maps this to 1024 features.
        last_dim = int(input_dim / (self.pool_size * self.pool_size))
        with tf.name_scope('fc1'):
            w_fc1 = weight_variable(
                [last_dim * last_dim * self.channel_2_num, self.hidden_size])
            b_fc1 = bias_variable([self.hidden_size])

            h_pool2_flat = tf.reshape(
                h_pool2, [-1, last_dim * last_dim * self.channel_2_num])
            h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, w_fc1) + b_fc1)

        # Dropout - controls the complexity of the model, prevents co-adaptation of features.
        with tf.name_scope('dropout'):
            h_fc1_drop = tf.nn.dropout(h_fc1, self.keep_prob)

        # Map the 1024 features to 10 classes, one for each digit
        with tf.name_scope('fc2'):
            w_fc2 = weight_variable([self.hidden_size, self.y_dim])
            b_fc2 = bias_variable([self.y_dim])
            y_conv = tf.matmul(h_fc1_drop, w_fc2) + b_fc2

        with tf.name_scope('loss'):
            cross_entropy = tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(labels=self.labels, logits=y_conv))
        with tf.name_scope('adam_optimizer'):
            self.train_step = tf.train.AdamOptimizer(
                self.learning_rate).minimize(cross_entropy)

        with tf.name_scope('accuracy'):
            correct_prediction = tf.equal(
                tf.argmax(y_conv, 1), tf.argmax(self.labels, 1))
            self.accuracy = tf.reduce_mean(
                tf.cast(correct_prediction, tf.float32))


def conv2d(x_input, w_matrix):
    """conv2d returns a 2d convolution layer with full stride."""
    return tf.nn.conv2d(x_input, w_matrix, strides=[1, 1, 1, 1], padding='SAME')


def max_pool(x_input, pool_size):
    """max_pool downsamples a feature map by 2X."""
    return tf.nn.max_pool(x_input, ksize=[1, pool_size, pool_size, 1],
                          strides=[1, pool_size, pool_size, 1], padding='SAME')


def weight_variable(shape):
    """weight_variable generates a weight variable of a given shape."""
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)


def bias_variable(shape):
    """bias_variable generates a bias variable of a given shape."""
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)


def main(params):
    '''
    Main function, build mnist network, run and send result to NNI.
    '''
    # Import data
    mnist = input_data.read_data_sets(params['data_dir'], one_hot=True)
    print('Mnist download data done.')
    logger.debug('Mnist download data done.')

    # Create the model
    # Build the graph for the deep net
    mnist_network = MnistNetwork(channel_1_num=params['channel_1_num'],
                                 channel_2_num=params['channel_2_num'],
                                 conv_size=params['conv_size'],
                                 hidden_size=params['hidden_size'],
                                 pool_size=params['pool_size'],
                                 learning_rate=params['learning_rate'])
    mnist_network.build_network()
    logger.debug('Mnist build network done.')

    # Write log
    graph_location = tempfile.mkdtemp()
    logger.debug('Saving graph to: %s', graph_location)
    train_writer = tf.summary.FileWriter(graph_location)
    train_writer.add_graph(tf.get_default_graph())

    test_acc = 0.0
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for i in range(params['batch_num']):
            batch = mnist.train.next_batch(params['batch_size'])
            mnist_network.train_step.run(feed_dict={mnist_network.images: batch[0],
                                                    mnist_network.labels: batch[1],
                                                    mnist_network.keep_prob: 1 - params['dropout_rate']}
                                         )

            if i % 10 == 0:
                test_acc = mnist_network.accuracy.eval(
                    feed_dict={mnist_network.images: mnist.test.images,
                               mnist_network.labels: mnist.test.labels,
                               mnist_network.keep_prob: 1.0})

                nni.report_intermediate_result(test_acc)
                logger.debug('test accuracy %g', test_acc)
                logger.debug('Pipe send intermediate result done.')

        test_acc = mnist_network.accuracy.eval(
            feed_dict={mnist_network.images: mnist.test.images,
                       mnist_network.labels: mnist.test.labels,
                       mnist_network.keep_prob: 1.0})

        nni.report_final_result(test_acc)
        logger.debug('Final result is %g', test_acc)
        logger.debug('Send final result done.')


def generate_default_params():
    '''
    Generate default parameters for mnist network.
    '''
    params = {
        'data_dir': '/tmp/tensorflow/mnist/input_data',
        'dropout_rate': 0.5,
        'channel_1_num': 32,
        'channel_2_num': 64,
        'conv_size': 5,
        'pool_size': 2,
        'hidden_size': 1024,
        'learning_rate': 1e-4,
        'batch_size': 32}
    return params


if __name__ == '__main__':
    try:
        # get parameters from tuner
        RCV_PARAMS = nni.get_next_parameter()
        logger.debug(RCV_PARAMS)
        # run
        params = generate_default_params()
        params.update(RCV_PARAMS)
        '''
        If you use Hyperband, among the hyperparameters (i.e., key-value pairs) received by a trial,
        there is one more key called `STEPS` besides the hyperparameters defined by the user.
        By using this `STEPS`, the trial can control how long it runs.
        '''
        params['batch_num'] = RCV_PARAMS['STEPS'] * 10
        main(params)
    except Exception as exception:
        logger.exception(exception)
        raise
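
One caveat about the `STEPS` handling above: `RCV_PARAMS['STEPS']` assumes the trial always runs under the Hyperband advisor, which injects that key. If the same script is reused with a plain tuner, a defensive lookup like the sketch below avoids a `KeyError`; the fallback of 20 STEPS is an arbitrary illustration, not something NNI prescribes.

```python
# Hedged variant of the STEPS handling above: fall back to a fixed budget
# when the trial is not driven by Hyperband (the default of 20 STEPS,
# i.e. 200 batches, is an arbitrary illustration).
params['batch_num'] = RCV_PARAMS.get('STEPS', 20) * 10
```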
7 changes: 7 additions & 0 deletions examples/trials/mnist-hyperband/search_space.json
@@ -0,0 +1,7 @@
{
"dropout_rate":{"_type":"uniform","_value":[0.5,0.9]},
"conv_size":{"_type":"choice","_value":[2,3,5,7]},
"hidden_size":{"_type":"choice","_value":[124, 512, 1024]},
"batch_size": {"_type":"choice","_value":[8, 16, 32, 64]},
"learning_rate":{"_type":"choice","_value":[0.0001, 0.001, 0.01, 0.1]}
}
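
For illustration, a single parameter set sampled from this space (plus the `STEPS` key injected by the Hyperband advisor) might look like the hypothetical dict below; the concrete values are made up.

```python
# Hypothetical output of nni.get_next_parameter() under Hyperband with this search space.
sampled_params = {
    'dropout_rate': 0.72,     # drawn uniformly from [0.5, 0.9]
    'conv_size': 5,           # one of the listed choices
    'hidden_size': 1024,
    'batch_size': 32,
    'learning_rate': 0.001,
    'STEPS': 11,              # injected by Hyperband, not defined in the JSON above
}
```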
2 changes: 1 addition & 1 deletion pylintrc
@@ -15,4 +15,4 @@ max-attributes=15
const-naming-style=any

disable=duplicate-code,
        super-init-not-called
        super-init-not-called
11 changes: 10 additions & 1 deletion src/nni_manager/common/manager.ts
@@ -35,7 +35,7 @@ interface ExperimentParams {
    trainingServicePlatform: string;
    multiPhase?: boolean;
    multiThread?: boolean;
    tuner: {
    tuner?: {
        className: string;
        builtinTunerName?: string;
        codeDir?: string;
@@ -53,6 +53,15 @@
        checkpointDir: string;
        gpuNum?: number;
    };
    advisor?: {
        className: string;
        builtinAdvisorName?: string;
        codeDir?: string;
        classArgs?: any;
        classFileName?: string;
        checkpointDir: string;
        gpuNum?: number;
    };
    clusterMetaData?: {
        key: string;
        value: string;