add workqueue metric #1407

Closed
qiankunli opened this issue Sep 16, 2021 · 3 comments

@qiankunli
Contributor

tf-operator version: v1.2.1

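We found that some TFJobs took a long time to go from creation to the start of scheduling, up to 8 minutes. Our preliminary conclusion is that tf-operator was given too little CPU, so it could not consume the workqueue fast enough (400+ TFJobs and 3,800+ pods per day). After we increased the CPU and threadiness, things look roughly normal now.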

Sep 16, 2021 @ 17:28:51.301 {"filename":"common/job.go:144","level":"info","msg":"Reconciling for job v1-tensorflow-0916171919266","time":"2021-09-16T09:28:51Z"}
 Sep 16, 2021 @ 17:19:29.115 {"filename":"tensorflow/job.go:118","job":"xdl-system.v1-tensorflow-0916171919266","level":"info","msg":"TFJob v1-tensorflow-0916171919266 is created.","time":"2021-09-16T09:19:29Z","uid":"e8b0d2c4-f876-4c54-b1e5-380fc4b4f92f"}

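We therefore suggest adding the client-go workqueue metrics to tf-operator's metrics. That would make it easier to analyze the cause of the delay and to verify that the problem is fixed, and it would give users a basis for tuning CPU and threadiness.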
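To illustrate what such metrics would capture, here is a minimal, standard-library-only sketch of a queue that tracks depth and queue latency (time between add and get). It does not use client-go: `timedQueue`, `newTimedQueue`, and `demo` are hypothetical names for illustration only. In client-go the equivalent numbers are exposed by plugging a `MetricsProvider` in through `workqueue.SetProvider`. The two timestamps are taken from the log lines above as stand-ins for enqueue and dequeue times, which is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// timedQueue is a hypothetical, simplified stand-in for a controller work
// queue. It remembers when each key was added so the wait can be measured.
type timedQueue struct {
	mu      sync.Mutex
	items   []string
	addedAt map[string]time.Time
	now     func() time.Time // injectable clock, so the demo is deterministic
}

func newTimedQueue(now func() time.Time) *timedQueue {
	return &timedQueue{addedAt: map[string]time.Time{}, now: now}
}

// Add enqueues key unless it is already pending.
func (q *timedQueue) Add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if _, pending := q.addedAt[key]; pending {
		return
	}
	q.items = append(q.items, key)
	q.addedAt[key] = q.now()
}

// Get pops the oldest key and reports how long it waited in the queue.
func (q *timedQueue) Get() (key string, waited time.Duration, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return "", 0, false
	}
	key = q.items[0]
	q.items = q.items[1:]
	waited = q.now().Sub(q.addedAt[key])
	delete(q.addedAt, key)
	return key, waited, true
}

// Depth returns the number of keys waiting to be processed.
func (q *timedQueue) Depth() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.items)
}

// demo replays the two log timestamps as enqueue and dequeue times.
func demo() (depth int, waited time.Duration) {
	clock := time.Date(2021, 9, 16, 9, 19, 29, 0, time.UTC)
	q := newTimedQueue(func() time.Time { return clock })
	q.Add("xdl-system.v1-tensorflow-0916171919266")
	depth = q.Depth()
	clock = time.Date(2021, 9, 16, 9, 28, 51, 0, time.UTC)
	_, waited, _ = q.Get()
	return depth, waited
}

func main() {
	depth, waited := demo()
	fmt.Printf("depth=%d queue_latency=%s\n", depth, waited)
}
```

With depth and queue latency exported, a sustained high depth together with growing latency would point at the consumers (CPU or threadiness) rather than at the scheduler.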

@gaocegege
Member

SGTM.

/kind feature

We should add metrics for the workqueue

@gaocegege
Member

/help-wanted

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Mar 2, 2022
@stale stale bot closed this as completed Apr 16, 2022