"Metrics not available" problem with basic v1alpha3 deployment #1082

Closed
kunalyogenshah opened this issue Mar 11, 2020 · 7 comments
Labels: help wanted, kind/bug

Comments

@kunalyogenshah

/kind bug

What steps did you take and what happened:
I followed the steps in the README to create a Katib deployment using the kustomization in the manifests repo. (I have not applied Kubeflow or the PyTorch and TFJob manifests, only the Katib ones.) The deployment succeeded, but when I create the random-experiment example, its Trials fail to finish with the following error:

Events:
  Type     Reason              Age                    From              Message
  ----     ------              ----                   ----              -------
  Normal   JobCreated          2m17s                  trial-controller  Job random-example-6dszbxs5 has been created
  Normal   JobRunning          2m17s (x2 over 2m17s)  trial-controller  Job random-example-6dszbxs5 is running:
  Warning  MetricsUnavailable  117s (x2 over 117s)    trial-controller  Metrics are not available for Job random-example-6dszbxs5

The logs from the job pod for this trial show a complete run and contain the required metric information.

2020-03-11T15:10:03Z INFO     Epoch[9] Batch [800-900]	Speed: 51816.31 samples/sec	accuracy=0.111875
2020-03-11T15:10:03Z INFO     Epoch[9] Train-accuracy=0.112423
2020-03-11T15:10:03Z INFO     Epoch[9] Time cost=1.140
2020-03-11T15:10:03Z INFO     Epoch[9] Validation-accuracy=0.113854

How do I debug why this happened?
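
One quick sanity check for a `MetricsUnavailable` warning is whether the job's log lines contain the expected metric names in a `name=value` form at all. The sketch below runs a simplified regular expression over the log lines quoted above. The pattern, the `extract_metrics` helper, and the choice of `Validation-accuracy` and `Train-accuracy` as the metrics of interest are illustrative assumptions, not Katib's actual collector implementation or this experiment's configuration.

```python
import re

# Assumed, simplified pattern: a metric name, "=", then a decimal number.
# This is NOT taken from Katib's source; it only illustrates the idea.
METRIC_PATTERN = re.compile(r"([\w-]+)\s*=\s*(-?\d+(?:\.\d+)?)")


def extract_metrics(log_text, wanted):
    """Return {metric_name: [values, ...]} for each wanted name found in log_text."""
    found = {name: [] for name in wanted}
    for name, value in METRIC_PATTERN.findall(log_text):
        if name in found:
            found[name].append(float(value))
    return found


# Log lines copied from the trial's job pod output above.
SAMPLE_LOG = """\
2020-03-11T15:10:03Z INFO     Epoch[9] Batch [800-900]\tSpeed: 51816.31 samples/sec\taccuracy=0.111875
2020-03-11T15:10:03Z INFO     Epoch[9] Train-accuracy=0.112423
2020-03-11T15:10:03Z INFO     Epoch[9] Time cost=1.140
2020-03-11T15:10:03Z INFO     Epoch[9] Validation-accuracy=0.113854
"""

metrics = extract_metrics(SAMPLE_LOG, ["Validation-accuracy", "Train-accuracy"])
print(metrics)
```

If both lists come back non-empty, as they do for these lines, the training output itself is probably not the problem, and the next place to look is how the metrics are being collected from the pod.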

What did you expect to happen:
The metrics collector to run successfully and end the experiment.


Environment:

  • Kubeflow version: Not installed
  • Minikube version: Not on Minikube
  • Kubernetes version: (use kubectl version): v1.17
  • OS (e.g. from /etc/os-release):
@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
bug 0.98

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

issue-label-bot added the bug label Mar 11, 2020
@kunalyogenshah
Author

/help

@k8s-ci-robot

@kunalyogenshah:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the help wanted label Mar 11, 2020
jlewi removed the bug label Mar 11, 2020
@andreyvelich
Member

@kunalyogenshah Can you describe one of your training pods, please?
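
For reference, one way to describe the pods behind a trial is to select them by the `job-name` label that Kubernetes sets on pods created by a Job. The sketch below only builds and prints the `kubectl` command. The helper name `describe_trial_pods_command` is hypothetical, and the `kubeflow` namespace is an assumption, since the thread does not say which namespace the experiment runs in.

```python
import shlex


def describe_trial_pods_command(job_name, namespace):
    """Build a kubectl command that describes every pod created by the given Job."""
    # Pods created by a Kubernetes Job carry the label job-name=<job name>.
    return [
        "kubectl", "describe", "pods",
        "--namespace", namespace,
        "--selector", f"job-name={job_name}",
    ]


# Job name taken from the events above; the namespace is an assumption.
cmd = describe_trial_pods_command("random-example-6dszbxs5", "kubeflow")
print(shlex.join(cmd))
```

The printed command can be pasted into a shell. In its output, the container list and the events at the bottom are the parts most likely to explain why metrics were not reported.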

@kunalyogenshah
Author

Hey @andreyvelich. Sorry for the confusion; this was resolved in our other thread, #981, so this issue can be closed. Thank you for all the help!

@andreyvelich
Member

@kunalyogenshah Sure, I will close this.

/close

@k8s-ci-robot

@andreyvelich: Closing this issue.

In response to this:

@kunalyogenshah Sure, I will close this.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

4 participants