katib NAS trial created for feedforward architecture but no SUGGESTION was created #1561

Closed
Jaydeemourg opened this issue Jun 24, 2021 · 5 comments

Jaydeemourg commented Jun 24, 2021

/kind bug

What steps did you take and what happened:
I created a training script to train my DNN model, then packaged the script in an image. I then created a YAML file for the Neural Architecture Search (NAS) experiment. When I ran the experiment, the trials were created as shown below, but no suggestions were made. Am I missing something?

Output of kubectl logs ai-nalyze-tuning-nas-0-enas-6f5578d676-7cdnk -n ki-user -c suggestion:

Validate Algorithm Settings start
All Experiment Settings are Valid
----------------------------------------------------------------------------------------------------
Setting Up Suggestion for Experiment ai-nalyze-tuning-nas-0
----------------------------------------------------------------------------------------------------

>>> Search Space for Experiment ai-nalyze-tuning-nas-0
Operation ID: 
        0
Operation Type: 
        feedforward
Operations Parameters:
        num-layers: 1
        neurons: 32

Operation ID: 
        1
Operation Type: 
        feedforward
Operations Parameters:
        num-layers: 1
        neurons: 48

Operation ID: 
        2
Operation Type: 
        feedforward
Operations Parameters:
        num-layers: 1
        neurons: 64

Operation ID: 
        3
Operation Type: 
        feedforward
Operations Parameters:
        num-layers: 1
        neurons: 96

Operation ID: 
        4
Operation Type: 
        feedforward
Operations Parameters:
        num-layers: 1
        neurons: 128

Operation ID: 
        5
Operation Type: 
        feedforward
Operations Parameters:
        num-layers: 7
        neurons: 32

Operation ID: 
        6
Operation Type: 
        feedforward
Operations Parameters:
        num-layers: 7
        neurons: 48

Operation ID: 
        7
Operation Type: 
        feedforward
Operations Parameters:
        num-layers: 7
        neurons: 64

Operation ID: 
        8
Operation Type: 
        feedforward
Operations Parameters:
        num-layers: 7
        neurons: 96

Operation ID: 
        9
Operation Type: 
        feedforward
Operations Parameters:
        num-layers: 7
        neurons: 128

There are 10 operations in total.

>>> Parameters of LSTM Controller for Experiment ai-nalyze-tuning-nas-0

controller_hidden_size:         64
controller_temperature:         5.0
controller_tanh_const:          2.25
controller_entropy_weight:      1e-05
controller_baseline_decay:      0.999
controller_learning_rate:       5e-05
controller_skip_target:         0.4
controller_skip_weight:         0.8
controller_train_steps:         50
controller_log_every_steps:     10

>>> Building Controller

>>> Building Controller Parameters

>>> Controller has 42368 Trainable params

>>> Building Controller Sampler

>>> Suggestion for Experiment ai-nalyze-tuning-nas-0 has been initialized.

----------------------------------------------------------------------------------------------------
Suggestion Step 0 for Experiment ai-nalyze-tuning-nas-0
----------------------------------------------------------------------------------------------------

>>> RequestNumber:              2

>>> First time running suggestion for ai-nalyze-tuning-nas-0. Random architecture will be given.
2021-06-24 12:46:52.414608: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-06-24 12:46:52.419136: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599990000 Hz
2021-06-24 12:46:52.419496: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f5d98b89550 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-24 12:46:52.419517: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version

>>> New Neural Network Architecture Candidate #0 (internal representation):
[[3], [1, 0], [0, 0, 1], [9, 0, 1, 0], [2, 0, 0, 1, 1], [7, 1, 1, 0, 1, 1], [8, 1, 1, 0, 1, 1, 0]]

>>> Corresponding Seach Space Description:
{'num_layers': 7, 'input_sizes': [11], 'output_sizes': [2], 'embedding': {'3': {'opt_id': 3, 'opt_type': 'feedforward', 'opt_params': {'num-layers': '1', 'neurons': '96'}}, '1': {'opt_id': 1, 'opt_type': 'feedforward', 'opt_params': {'num-layers': '1', 'neurons': '48'}}, '0': {'opt_id': 0, 'opt_type': 'feedforward', 'opt_params': {'num-layers': '1', 'neurons': '32'}}, '9': {'opt_id': 9, 'opt_type': 'feedforward', 'opt_params': {'num-layers': '7', 'neurons': '128'}}, '2': {'opt_id': 2, 'opt_type': 'feedforward', 'opt_params': {'num-layers': '1', 'neurons': '64'}}, '7': {'opt_id': 7, 'opt_type': 'feedforward', 'opt_params': {'num-layers': '7', 'neurons': '64'}}, '8': {'opt_id': 8, 'opt_type': 'feedforward', 'opt_params': {'num-layers': '7', 'neurons': '96'}}}}

>>> New Neural Network Architecture Candidate #1 (internal representation):
[[3], [6, 0], [8, 1, 1], [9, 1, 0, 0], [4, 0, 0, 0, 1], [7, 1, 1, 0, 0, 1], [6, 0, 1, 1, 1, 0, 1]]

>>> Corresponding Seach Space Description:
{'num_layers': 7, 'input_sizes': [11], 'output_sizes': [2], 'embedding': {'3': {'opt_id': 3, 'opt_type': 'feedforward', 'opt_params': {'num-layers': '1', 'neurons': '96'}}, '6': {'opt_id': 6, 'opt_type': 'feedforward', 'opt_params': {'num-layers': '7', 'neurons': '48'}}, '8': {'opt_id': 8, 'opt_type': 'feedforward', 'opt_params': {'num-layers': '7', 'neurons': '96'}}, '9': {'opt_id': 9, 'opt_type': 'feedforward', 'opt_params': {'num-layers': '7', 'neurons': '128'}}, '4': {'opt_id': 4, 'opt_type': 'feedforward', 'opt_params': {'num-layers': '1', 'neurons': '128'}}, '7': {'opt_id': 7, 'opt_type': 'feedforward', 'opt_params': {'num-layers': '7', 'neurons': '64'}}}}

>>> 2 Trials were created for Experiment ai-nalyze-tuning-nas-0

YAML content:

apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: ki-user
  name: ai-nalyze-tuning-nas-0
spec:
  parallelTrialCount: 2
  maxTrialCount: 3
  maxFailedTrialCount: 2
  objective:
    type: maximize
    goal: 0.90
    objectiveMetricName: Categorical_accuracy_eval
  algorithm:
    algorithmName: enas
  nasConfig:
    graphConfig:
      numLayers: 7
      inputSizes:
        - 11
      outputSizes:
        - 2
    operations:
      - operationType: feedforward
        parameters:
          - name: num-layers
            parameterType: categorical
            feasibleSpace:
              list:
                - "1"
                - "7"
          - name: neurons
            parameterType: categorical
            feasibleSpace:
              list:
                - "32"
                - "48"
                - "64"
                - "96"
                - "128"
  trialTemplate:
    primaryContainerName: ai-nalyze-training-container
    trialParameters:
      - name: numberLayers
        description: Number of training model layers
        reference: num-layers
      - name: neurons
        description: Number of neurons
        reference: neurons
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: "false"
          spec:
            containers:
              - name: ai-nalyze-training-container
                image: localhost:32000/katibtrainerainalyzenas:v1.0
                command:
                  - python3
                  - -u
                  - aiNalyze_training_Nas.py
                  - --num-layers="${trialParameters.numberLayers}"
                  - --neurons="${trialParameters.neurons}"
            restartPolicy: Never

What did you expect to happen:
The experiment should run successfully.

Anything else you would like to add:
Environment:

Kubeflow version: 1.3
Kubernetes version: (use kubectl version --short): v1.20.7

@Jaydeemourg Jaydeemourg changed the title katib NAS trial on a feedforward architecture but it failed katib NAS trial created for a feedforward architecture but no SUGGESTION was created Jun 24, 2021
@Jaydeemourg Jaydeemourg changed the title katib NAS trial created for a feedforward architecture but no SUGGESTION was created katib NAS trial created for feedforward architecture but no SUGGESTION was created Jun 24, 2021
@Jaydeemourg
Author

This is the view in the Katib NAS UI:
image

@andreyvelich
Member

Hi @Jaydeemourg, and thank you for testing the ENAS algorithm.
From the Suggestion logs I can see that 2 candidates were created properly.

The ENAS Suggestion creates candidates, each consisting of a model architecture and an NN config, based on your operation search space. You can read more about it here: https://github.com/kubeflow/katib/tree/master/pkg/suggestion/v1beta1/nas/enas
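
As a rough illustration of how the operation search space could be enumerated, the sketch below builds one operation per combination of parameter values. It reproduces the operation IDs printed in the Suggestion logs in this thread, but it is not Katib's actual implementation, and `enumerate_operations` is a hypothetical helper introduced only for this example.

```python
from itertools import product


def enumerate_operations(op_type, feasible_values):
    """Return one operation per combination of parameter values.

    feasible_values maps a parameter name to its list of allowed values.
    Hypothetical helper for illustration; not part of Katib.
    """
    names = list(feasible_values)
    combos = product(*(feasible_values[name] for name in names))
    return [
        {"opt_id": op_id, "opt_type": op_type, "opt_params": dict(zip(names, combo))}
        for op_id, combo in enumerate(combos)
    ]


# Search space from the first experiment in this thread: 2 x 5 combinations.
ops = enumerate_operations(
    "feedforward",
    {"num-layers": ["1", "7"], "neurons": ["32", "48", "64", "96", "128"]},
)
print(len(ops))  # 10, matching "There are 10 operations in total."
print(ops[3])    # operation 3: num-layers 1, neurons 96, as in the logs
```

The number of operations grows multiplicatively with every added parameter, which is why the later experiment with four parameters reports operation IDs in the thousands.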

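Your trialParameters should therefore reference architecture and nn_config (not individual operation parameters such as num-layers or neurons), and your Trial container must accept them as --architecture and --nn_config.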
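
A minimal sketch of how a training script might accept these two arguments follows. It assumes the values arrive as Python-literal strings like the ones printed in the Suggestion logs (single-quoted dict keys, nested lists), so it uses `ast.literal_eval`. The function names are hypothetical, and stripping surrounding double quotes is a defensive assumption in case the quotes from the YAML command reach the script literally.

```python
import argparse
import ast


def _parse_literal(text):
    # Assumption: the value is a Python literal, possibly wrapped in
    # literal double quotes carried over from the YAML command line.
    return ast.literal_eval(text.strip().strip('"'))


def parse_trial_args(argv=None):
    """Parse --architecture and --nn_config into Python objects."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--architecture", type=str, required=True)
    parser.add_argument("--nn_config", type=str, required=True)
    args = parser.parse_args(argv)
    return _parse_literal(args.architecture), _parse_literal(args.nn_config)


if __name__ == "__main__":
    # Demo with a shortened candidate; a real Trial would call
    # parse_trial_args() with no arguments to read sys.argv.
    architecture, nn_config = parse_trial_args([
        '--architecture="[[3], [1, 0]]"',
        "--nn_config={'num_layers': 2, 'input_sizes': [11], "
        "'output_sizes': [2], 'embedding': {}}",
    ])
    print(architecture)             # [[3], [1, 0]]
    print(nn_config["num_layers"])  # 2
```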

In your Trial training container you should have a Model Constructor, like this one: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/nas/enas-cnn-cifar10/ModelConstructor.py, which builds the model from the architecture and NN config.
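
For a feedforward search space, the core of such a constructor could look like the framework-free sketch below, which only produces a layer plan rather than a real model. It rests on two assumptions that are not confirmed in this thread: each architecture entry has the form [operation ID, skip flags...], and skip flag i marks a connection from earlier node i, where node 0 is the network input. `plan_feedforward_layers` is a hypothetical name.

```python
def plan_feedforward_layers(architecture, nn_config):
    """Describe each layer: its operation parameters and its input nodes.

    Node 0 is the network input; layers are numbered from 1.
    Assumption: entry = [op_id, skip_flag_0, skip_flag_1, ...].
    """
    embedding = nn_config["embedding"]
    plan = []
    for layer_no, entry in enumerate(architecture, start=1):
        op_id, skip_flags = entry[0], entry[1:]
        # Every layer is fed by the layer directly before it.
        inputs = {layer_no - 1}
        # Add the earlier nodes whose skip flag is set.
        inputs.update(i for i, flag in enumerate(skip_flags) if flag == 1)
        plan.append({
            "layer": layer_no,
            "params": embedding[str(op_id)]["opt_params"],
            "inputs": sorted(inputs),
        })
    return plan


# First three layers of candidate #0 from the Suggestion logs.
architecture = [[3], [1, 0], [0, 0, 1]]
nn_config = {
    "embedding": {
        "3": {"opt_params": {"num-layers": "1", "neurons": "96"}},
        "1": {"opt_params": {"num-layers": "1", "neurons": "48"}},
        "0": {"opt_params": {"num-layers": "1", "neurons": "32"}},
    }
}
for step in plan_feedforward_layers(architecture, nn_config):
    print(step)
```

A real constructor would turn each plan entry into dense layers of the given width and merge the listed inputs, for example by concatenation.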

After that, you can train your model, and the metrics collector will collect the reward that the Suggestion uses to train the ENAS controller.
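
With a StdOut metrics collector, the training script typically reports the objective by printing it. The sketch below assumes the collector's default `name=value` line format; `format_metric` is a hypothetical helper, and the metric name is the objectiveMetricName from the experiment in this thread.

```python
def format_metric(name, value):
    """Format a metric as a name=value line for a StdOut collector.

    Assumption: the collector parses lines of the form name=value.
    """
    return f"{name}={value}"


# Print once per evaluation so the collector can pick up the value.
print(format_metric("Categorical_accuracy_eval", 0.72))
```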

@Jaydeemourg
Author

Hello @andreyvelich, thanks for your quick response. I now pass --architecture and --nn_config as inputs to my container, without a Model Constructor. My model building and training are in the .py script used as the entrypoint of the container. Now I can view the suggestions in the Katib NAS UI. Do I actually need the Model Constructor for a feedforward network like mine, since there are no special operations to be done? The example you posted uses it for a CNN.

Here is the best architecture:
Since I did not use GlobalAveragePooling in my code, why is it shown in the graph?

image

Output of kubectl get experiment ai-nalyze-tuning-nas-3 -n ki-user -o yaml:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  creationTimestamp: "2021-06-25T10:02:23Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  managedFields:
  - apiVersion: kubeflow.org/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:algorithm:
          .: {}
          f:algorithmName: {}
        f:maxFailedTrialCount: {}
        f:maxTrialCount: {}
        f:nasConfig:
          .: {}
          f:graphConfig:
            .: {}
            f:inputSizes: {}
            f:numLayers: {}
            f:outputSizes: {}
          f:operations: {}
        f:objective:
          .: {}
          f:goal: {}
          f:objectiveMetricName: {}
          f:type: {}
        f:parallelTrialCount: {}
        f:trialTemplate:
          .: {}
          f:primaryContainerName: {}
          f:trialParameters: {}
          f:trialSpec:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:spec:
              .: {}
              f:template:
                .: {}
                f:metadata:
                  .: {}
                  f:annotations:
                    .: {}
                    f:sidecar.istio.io/inject: {}
                f:spec:
                  .: {}
                  f:containers: {}
                  f:restartPolicy: {}
    manager: katib-ui
    operation: Update
    time: "2021-06-25T10:02:23Z"
  - apiVersion: kubeflow.org/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers: {}
      f:status:
        .: {}
        f:completionTime: {}
        f:conditions: {}
        f:currentOptimalTrial:
          .: {}
          f:bestTrialName: {}
          f:observation:
            .: {}
            f:metrics: {}
          f:parameterAssignments: {}
        f:startTime: {}
        f:succeededTrialList: {}
        f:trials: {}
        f:trialsSucceeded: {}
    manager: katib-controller
    operation: Update
    time: "2021-06-25T10:05:52Z"
  name: ai-nalyze-tuning-nas-3
  namespace: ki-user
  resourceVersion: "15211479"
  selfLink: /apis/kubeflow.org/v1beta1/namespaces/ki-user/experiments/ai-nalyze-tuning-nas-3
  uid: 80b1205c-d06d-4161-a45c-d7b78b0de95a
spec:
  algorithm:
    algorithmName: enas
  maxFailedTrialCount: 2
  maxTrialCount: 3
  metricsCollectorSpec:
    collector:
      kind: StdOut
  nasConfig:
    graphConfig:
      inputSizes:
      - 11
      numLayers: 7
      outputSizes:
      - 2
    operations:
    - operationType: feedforward
      parameters:
      - feasibleSpace:
          max: "7"
          min: "1"
          step: "1"
        name: num-layers
        parameterType: int
      - feasibleSpace:
          max: "50"
          min: "5"
          step: "1"
        name: neurons
        parameterType: int
      - feasibleSpace:
          max: "128"
          min: "32"
          step: "32"
        name: batch-size
        parameterType: int
      - feasibleSpace:
          list:
          - sgd
          - adam
          - ftrl
        name: optimizer
        parameterType: categorical
  objective:
    goal: 0.9
    metricStrategies:
    - name: Categorical_accuracy_eval
      value: max
    objectiveMetricName: Categorical_accuracy_eval
    type: maximize
  parallelTrialCount: 2
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: ai-nalyze-training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - description: NN architecture contains operations ID on each NN layer and skip
        connections between layers
      name: neuralNetworkArchitecture
      reference: architecture
    - description: Configuration contains NN number of layers, input and output sizes,
        description what each operation ID means
      name: neuralNetworkConfig
      reference: nn_config
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - command:
              - python3
              - -u
              - aiNalyze_training_Nas.py
              - --architecture="${trialParameters.neuralNetworkArchitecture}"
              - --nn_config="${trialParameters.neuralNetworkConfig}"
              image: localhost:32000/katibtrainerainalyzenas:v2.0
              name: ai-nalyze-training-container
            restartPolicy: Never
status:
  completionTime: "2021-06-25T10:05:52Z"
  conditions:
  - lastTransitionTime: "2021-06-25T10:02:25Z"
    lastUpdateTime: "2021-06-25T10:02:25Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-06-25T10:05:52Z"
    lastUpdateTime: "2021-06-25T10:05:52Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2021-06-25T10:05:52Z"
    lastUpdateTime: "2021-06-25T10:05:52Z"
    message: Experiment has succeeded because max trial count has reached
    reason: ExperimentMaxTrialsReached
    status: "True"
    type: Succeeded
  currentOptimalTrial:
    bestTrialName: ai-nalyze-tuning-nas-3-tbtk6d8h
    observation:
      metrics:
      - latest: "0.47846153378486633"
        max: "0.7200000286102295"
        min: "0.3061538338661194"
        name: Categorical_accuracy_eval
    parameterAssignments:
    - name: architecture
      value: '[[3611], [527, 0], [2687, 0, 1], [1980, 0, 1, 1], [1334, 0, 1, 1, 1],
        [1565, 0, 1, 0, 0, 1], [1310, 1, 1, 1, 1, 0, 0]]'
    - name: nn_config
      value: '{''num_layers'': 7, ''input_sizes'': [11], ''output_sizes'': [2], ''embedding'':
        {''3611'': {''opt_id'': 3611, ''opt_type'': ''feedforward'', ''opt_params'':
        {''num-layers'': 7, ''neurons'': 29, ''batch-size'': 128, ''optimizer'': ''ftrl''}},
        ''527'': {''opt_id'': 527, ''opt_type'': ''feedforward'', ''opt_params'':
        {''num-layers'': 1, ''neurons'': 48, ''batch-size'': 128, ''optimizer'': ''ftrl''}},
        ''2687'': {''opt_id'': 2687, ''opt_type'': ''feedforward'', ''opt_params'':
        {''num-layers'': 5, ''neurons'': 44, ''batch-size'': 128, ''optimizer'': ''ftrl''}},
        ''1980'': {''opt_id'': 1980, ''opt_type'': ''feedforward'', ''opt_params'':
        {''num-layers'': 4, ''neurons'': 32, ''batch-size'': 32, ''optimizer'': ''sgd''}},
        ''1334'': {''opt_id'': 1334, ''opt_type'': ''feedforward'', ''opt_params'':
        {''num-layers'': 3, ''neurons'': 24, ''batch-size'': 32, ''optimizer'': ''ftrl''}},
        ''1565'': {''opt_id'': 1565, ''opt_type'': ''feedforward'', ''opt_params'':
        {''num-layers'': 3, ''neurons'': 43, ''batch-size'': 64, ''optimizer'': ''ftrl''}},
        ''1310'': {''opt_id'': 1310, ''opt_type'': ''feedforward'', ''opt_params'':
        {''num-layers'': 3, ''neurons'': 22, ''batch-size'': 32, ''optimizer'': ''ftrl''}}}}'
  startTime: "2021-06-25T10:02:25Z"
  succeededTrialList:
  - ai-nalyze-tuning-nas-3-579b8675
  - ai-nalyze-tuning-nas-3-lbxd5dnv
  - ai-nalyze-tuning-nas-3-tbtk6d8h
  trials: 3
  trialsSucceeded: 3

@andreyvelich
Member

> Do I actually need the Model Constructor for a feedforward network like mine, since there are no special operations to be done?

If you have a feedforward network with different hyperparameters, like num-layers, batch-size, etc., why do you want to use ENAS?
You can use one of the hyperparameter tuning algorithms to search for the best parameters instead. ENAS is usually used for CNNs or RNNs, when you want to search over various operations. Check the paper: https://arxiv.org/pdf/1802.03268.pdf.
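
To make the contrast concrete, here is a minimal sketch of plain random search over the same four parameters used in the ai-nalyze-tuning-nas-3 experiment. In Katib this would normally be handled by selecting a hyperparameter tuning algorithm in the Experiment spec rather than by writing the loop yourself. The function names and the toy `evaluate` callback are hypothetical and stand in for a real training run.

```python
import random

# Feasible values taken from the nas-3 experiment spec in this thread.
SEARCH_SPACE = {
    "num-layers": list(range(1, 8)),
    "neurons": list(range(5, 51)),
    "batch-size": [32, 64, 96, 128],
    "optimizer": ["sgd", "adam", "ftrl"],
}


def sample_hyperparameters(space, rng):
    """Draw one value per parameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}


def random_search(space, evaluate, n_trials, seed=0):
    """Return the best (params, score) seen over n_trials random samples."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_hyperparameters(space, rng)
        score = evaluate(params)  # stands in for training and evaluation
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    # Toy objective for demonstration only: prefers about 29 neurons.
    best, score = random_search(
        SEARCH_SPACE, lambda p: -abs(p["neurons"] - 29), n_trials=20
    )
    print(best, score)
```

Each trial here receives ordinary scalar hyperparameters, so the training script needs no architecture decoding at all.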

> Since I did not use GlobalAveragePooling in my code, why is it shown in the graph?

By default, for each CNN we add these 2 layers at the end, following the paper. These layers are only added in the UI: https://github.com/kubeflow/katib/blob/master/pkg/ui/v1beta1/util.go#L282-L287.
Your training container holds the original network that you are training.

@Jaydeemourg
Author

@andreyvelich Thanks for your response. I will close this issue.
