stdout metric collector failed #1576

Closed
chenwenjun-github opened this issue Jul 10, 2021 · 5 comments · Fixed by #1614
Labels
help wanted Extra attention is needed kind/bug priority/p1

Comments

@chenwenjun-github
Contributor

chenwenjun-github commented Jul 10, 2021

/kind bug

What steps did you take and what happened:
image

I use a TFJob as the trial's job. My TFJob has one ps, one chief, and one worker, and the metrics collector is stdout. I find that the metrics-logger-and-collector container sometimes fails with the output shown above; it does not happen on every run.
When it does happen, the worker's metrics can't be collected.

The error message comes from this code:
image

Can you give me some advice on how to avoid this problem?

What did you expect to happen:
This error should not appear.

Environment:

  • katib version (kfctl version): v0.10.1
  • Kubernetes version: (use kubectl version): 1.13
@chenwenjun-github
Contributor Author

I made some changes to the Katib code like this:
image
This resolves my problem, so I am closing this issue.

@andreyvelich
Member

Thank you for creating this @chenwenjun-github.
Yes, that should be an issue.
Maybe we should implement this change in the metrics collector source code?

WDYT @chenwenjun-github @gaocegege @johnugeorge ?

/priority p1
/help

@andreyvelich andreyvelich reopened this Jul 13, 2021
@google-oss-robot

@andreyvelich:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.


@google-oss-robot google-oss-robot added priority/p1 help wanted Extra attention is needed labels Jul 13, 2021
@johnugeorge
Member

Sorry, I didn't understand the context very well.

When will the metrics-logger-and-collector container error out?

@andreyvelich
Member

andreyvelich commented Jul 13, 2021

> Sorry, I didn't understand the context very well.
>
> When will the metrics-logger-and-collector container error out?

This happens very rarely: psutil.Pids() returns a PID, but when we try to create a process object from that PID with psutil.NewProcess(pid), it can't be created.
My assumption is that the process might have stopped or finished in between the two calls.

I think @chenwenjun-github's solution above might work, since we usually wait only until mainPid completes.
I would only verify that mainPid contains the PID after GetMainProcesses is finished.

We can discuss it at our upcoming WG meeting tomorrow.
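
The race described above (a PID is listed, then disappears before it can be inspected) can be illustrated with a minimal, standard-library-only Go sketch. This is not the Katib code or the change from the screenshot, which is not reproduced in this thread. The names `processInfo`, `lookupFunc`, `findMainProcess`, and `errProcessGone` are hypothetical, and `lookupFunc` only stands in for a call such as psutil.NewProcess(pid). The sketch assumes one plausible approach: skip PIDs whose lookup fails, then check afterwards that the main process was actually found.

```go
package main

import (
	"errors"
	"fmt"
)

// errProcessGone is a hypothetical stand-in for the error a process
// library returns when a PID no longer exists.
var errProcessGone = errors.New("process does not exist")

// processInfo is a hypothetical, minimal view of a process.
type processInfo struct {
	PID     int32
	Cmdline string
}

// lookupFunc stands in for a call such as psutil.NewProcess(pid).
type lookupFunc func(pid int32) (processInfo, error)

// findMainProcess scans pids and returns the first process whose command
// line equals wantCmd. A PID whose lookup fails is skipped instead of
// aborting the scan, because it may have exited after it was listed.
// If no match is found, an error is returned so the caller can verify
// that the main process was really located.
func findMainProcess(pids []int32, lookup lookupFunc, wantCmd string) (processInfo, error) {
	for _, pid := range pids {
		proc, err := lookup(pid)
		if err != nil {
			continue // process vanished between listing and lookup
		}
		if proc.Cmdline == wantCmd {
			return proc, nil
		}
	}
	return processInfo{}, fmt.Errorf("main process %q not found", wantCmd)
}

func main() {
	// Simulated process table: PID 20 was listed but exited before lookup.
	table := map[int32]string{10: "sh", 30: "python train.py"}
	lookup := func(pid int32) (processInfo, error) {
		cmd, ok := table[pid]
		if !ok {
			return processInfo{}, errProcessGone
		}
		return processInfo{PID: pid, Cmdline: cmd}, nil
	}

	proc, err := findMainProcess([]int32{10, 20, 30}, lookup, "python train.py")
	fmt.Println(proc.PID, err)
}
```

Skipping a vanished PID is only safe if the caller still checks the final result, which is why the function returns an error when the main process is never found.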


4 participants