experiment controller is not showing any events when fails to reconcile all trials #1663

Closed
henrysecond1 opened this issue Sep 11, 2021 · 0 comments · Fixed by #1706
@henrysecond1
Contributor

/kind bug

What steps did you take and what happened:

The experiment controller does not emit any events when it fails to reconcile all trials.

For example, suppose a trial parameter reference is misconfigured as shown below. The parameter is named num-layers, but its reference in trialParameters is mistyped as num-layer, so every trial fails to be created.

parameters:
  - feasibleSpace:
    ...
    name: num-layers
    parameterType: int
trialTemplate:
    ...
    trialParameters:
    - name: numberLayers
      reference: num-layer # typo
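
The mismatch above could in principle be caught before any trial is created. The following is a minimal sketch of such a check, not Katib's actual validation code: the `TrialParameter` type and the `unmatchedReferences` function are hypothetical names introduced for illustration, and the declared parameter names (lr, num-layers, optimizer) are taken from the controller log quoted in this issue.

```go
package main

import (
	"fmt"
	"sort"
)

// TrialParameter is a hypothetical, simplified stand-in for one entry of
// trialTemplate.trialParameters. It is not Katib's actual API type.
type TrialParameter struct {
	Name      string
	Reference string
}

// unmatchedReferences returns every reference that does not name a declared
// search-space parameter, sorted for stable output.
func unmatchedReferences(declared []string, trialParams []TrialParameter) []string {
	known := make(map[string]bool, len(declared))
	for _, name := range declared {
		known[name] = true
	}
	var missing []string
	for _, tp := range trialParams {
		if !known[tp.Reference] {
			missing = append(missing, tp.Reference)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Parameter names as they appear in the assignment map of the controller log.
	declared := []string{"lr", "num-layers", "optimizer"}
	trialParams := []TrialParameter{
		{Name: "numberLayers", Reference: "num-layer"}, // typo from the example above
	}
	for _, ref := range unmatchedReferences(declared, trialParams) {
		fmt.Printf("trialParameters reference %q matches no declared parameter\n", ref)
	}
}
```

A check like this only helps users who run it themselves. It does not replace the events requested in this issue.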

The reason for the failure appears in the controller log. However, users who are not authorized to access the controller cannot find out why their trials are not created, because the experiment controller emits no events.

$ kubectl describe experiment random-experiment -n user
...
Status:
  Completion Time:  <nil>
  Conditions:
    Message:               Experiment is created
    Reason:                ExperimentCreated
    Status:                True
    Type:                  Created
  Current Optimal Trial:
    Observation:
Events:              <none>

What did you expect to happen:

The experiment controller emits events when it fails to reconcile all trials.
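
One way the expected behavior might look is sketched below. It makes several assumptions. The `EventRecorder` interface here is a simplified stand-in for a recorder such as client-go's `record.EventRecorder` (the real interface takes a `runtime.Object`, not a string key). The `memoryRecorder`, `reconcileTrials`, and `errMissingParam` names, the `ReconcileFailed` reason, and the message text are all hypothetical. None of this is the fix that was eventually merged.

```go
package main

import (
	"errors"
	"fmt"
)

// EventRecorder is a simplified, hypothetical stand-in for a Kubernetes
// event recorder. The object is identified by a plain string key here.
type EventRecorder interface {
	Eventf(object, eventType, reason, messageFmt string, args ...interface{})
}

// memoryRecorder collects formatted events in memory so the sketch can run
// without a cluster.
type memoryRecorder struct {
	events []string
}

func (r *memoryRecorder) Eventf(object, eventType, reason, messageFmt string, args ...interface{}) {
	msg := fmt.Sprintf(messageFmt, args...)
	r.events = append(r.events, fmt.Sprintf("%s %s %s: %s", eventType, reason, object, msg))
}

// errMissingParam is a hypothetical error modeled on the controller log.
var errMissingParam = errors.New("unable to find parameter: num-layer")

// reconcileTrials runs createTrials and, if it fails, records a Warning
// event on the experiment before returning the error to the caller.
func reconcileTrials(rec EventRecorder, experiment string, createTrials func() error) error {
	if err := createTrials(); err != nil {
		// "ReconcileFailed" is an assumed reason string, not one defined by Katib.
		rec.Eventf(experiment, "Warning", "ReconcileFailed", "Failed to reconcile trials: %v", err)
		return err
	}
	return nil
}

func main() {
	rec := &memoryRecorder{}
	_ = reconcileTrials(rec, "user/random-experiment", func() error {
		return errMissingParam
	})
	for _, e := range rec.events {
		fmt.Println(e)
	}
}
```

With an event like this attached to the experiment, the failure reason would show up under Events in `kubectl describe experiment`, so users without access to the controller log could still see it.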

Anything else you would like to add:

Relevant logs from the Katib controller:

Fail to get RunSpec from experiment","Experiment":"user/random-experiment","error":"Unable to find parameter: num-layer in parameter assignment map[lr:0.026271422193467404 num-layers:5 optimizer:sgd

Environment:

  • Kubeflow version (kfctl version): v1.3
  • Kubernetes version: (use kubectl version): v1.18.10
  • OS (e.g. from /etc/os-release): CentOS 7.9

If it's okay, I'd like to contribute to solving the issue.
