add workqueue metric #1407

qiankunli · 2021-09-16T11:55:33Z

tf-operator version: v1.2.1

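We found that some TFJobs took a long time to go from creation to the start of scheduling, up to 8 minutes. Our preliminary conclusion is that tf-operator was given too little CPU, so it could not consume the workqueue fast enough (400+ TFJobs and 3800+ pods per day). We have since increased the CPU and threadiness, and things now look roughly normal.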

Sep 16, 2021 @ 17:28:51.301 {"filename":"common/job.go:144","level":"info","msg":"Reconciling for job v1-tensorflow-0916171919266","time":"2021-09-16T09:28:51Z"}
 Sep 16, 2021 @ 17:19:29.115 {"filename":"tensorflow/job.go:118","job":"xdl-system.v1-tensorflow-0916171919266","level":"info","msg":"TFJob v1-tensorflow-0916171919266 is created.","time":"2021-09-16T09:19:29Z","uid":"e8b0d2c4-f876-4c54-b1e5-380fc4b4f92f"}

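We therefore suggest adding the client-go workqueue metrics to tf-operator's metrics. This would make it easier to analyze the cause of the latency and to verify that the problem is fixed, and it would give users a basis for tuning CPU and threadiness.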
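To illustrate what such metrics would tell us, here is a minimal stdlib-only sketch of a queue that records depth, adds, and queue latency (the time an item waits between Add and Get). It is not the client-go API and not tf-operator code: `InstrumentedQueue`, `FakeClock`, and the job keys are hypothetical names made up for this example. In a real operator the signals would more likely come from the `MetricsProvider` hook in client-go's `workqueue` package (`workqueue.SetProvider`), backed by a Prometheus registry.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// FakeClock is a manually advanced clock so the demo is deterministic.
// Hypothetical helper; intended for single-goroutine demos only.
type FakeClock struct{ now time.Time }

func NewFakeClock() *FakeClock      { return &FakeClock{now: time.Unix(0, 0)} }
func (c *FakeClock) Now() time.Time { return c.now }
func (c *FakeClock) AdvanceSeconds(s float64) {
	c.now = c.now.Add(time.Duration(s * float64(time.Second)))
}

// InstrumentedQueue is a hypothetical FIFO of job keys that tracks the
// kinds of signals discussed in this issue: depth, adds, queue latency.
type InstrumentedQueue struct {
	mu        sync.Mutex
	clock     interface{ Now() time.Time }
	items     []string
	addedAt   map[string]time.Time // enqueue time of each pending key
	adds      int                  // number of accepted (non-duplicate) adds
	latencies []float64            // seconds each dequeued key waited
}

func NewInstrumentedQueue(clock interface{ Now() time.Time }) *InstrumentedQueue {
	return &InstrumentedQueue{clock: clock, addedAt: map[string]time.Time{}}
}

// Add enqueues key unless it is already pending.
func (q *InstrumentedQueue) Add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if _, pending := q.addedAt[key]; pending {
		return
	}
	q.addedAt[key] = q.clock.Now()
	q.items = append(q.items, key)
	q.adds++
}

// Get dequeues the oldest key and records how long it waited.
func (q *InstrumentedQueue) Get() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return "", false
	}
	key := q.items[0]
	q.items = q.items[1:]
	q.latencies = append(q.latencies, q.clock.Now().Sub(q.addedAt[key]).Seconds())
	delete(q.addedAt, key)
	return key, true
}

// Depth returns the number of keys still waiting.
func (q *InstrumentedQueue) Depth() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.items)
}

// Adds returns the number of accepted adds so far.
func (q *InstrumentedQueue) Adds() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.adds
}

// MaxLatencySeconds returns the longest recorded wait, or 0 if none.
func (q *InstrumentedQueue) MaxLatencySeconds() float64 {
	q.mu.Lock()
	defer q.mu.Unlock()
	max := 0.0
	for _, l := range q.latencies {
		if l > max {
			max = l
		}
	}
	return max
}

func main() {
	clock := NewFakeClock()
	q := NewInstrumentedQueue(clock)
	q.Add("xdl-system/job-a") // hypothetical job keys
	q.Add("xdl-system/job-b")
	fmt.Println("depth after adds:", q.Depth())

	// Same gap as between the two log lines above (09:19:29 -> 09:28:51).
	clock.AdvanceSeconds(562)
	if key, ok := q.Get(); ok {
		fmt.Printf("%s waited %.0fs in the queue\n", key, q.MaxLatencySeconds())
	}
}
```

A growing depth together with a long queue latency would point to the workers being too slow or too few, which is the signal we were missing when deciding whether to raise CPU and threadiness.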


gaocegege · 2021-09-17T02:27:23Z

SGTM.

/kind feature

We should add metrics for the workqueue.

gaocegege · 2021-09-17T02:27:36Z

/help-wanted

stale · 2022-03-02T11:12:23Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

google-oss-robot added the kind/feature label Sep 17, 2021

gaocegege added the help wanted label Sep 17, 2021

stale bot added the lifecycle/stale label Mar 2, 2022

stale bot closed this as completed Apr 16, 2022