Regarding whether the tf-job-operator v1.0 metrics can expose specific failed pods #2220

SecretSun · 2024-08-15T11:20:03Z

What you would like to be added?

I would like the tf-job-operator v1.0 metrics to expose which specific pods failed.

The relevant log entry is as follows:

content="{'filename':'record/event.go:221','level':'info','msg':'Event(v1.ObjectReference{Kind:\'TFJob\', Namespace:\'iem-trs-training\', Name:\'android-consume-v2-update-2024-08-20-053447\', UID:\'803a4aca-561b-4609-9ee3-8953f075b66c\', APIVersion:\'kubeflow.org/v1\', ResourceVersion:\'860397344\', FieldPath:\'\'}): type: 'Normal' reason: 'ExitedWithCode' Pod: iem-trs-training.android-consume-v2-update-2024-08-20-053447-worker-0 exited with code 1','time':'2024-08-20T22:35:42Z'}"

Why is this needed?

This is needed to quickly locate the specific training worker that has a problem.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

andreyvelich · 2024-08-28T16:45:41Z

Thank you for creating this @SecretSun!
Are you looking to improve the Failed message in TFJob events: https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/tensorflow/tfjob_controller.go#L501-L503 ?

/area monitoring
/remove-label lifecycle/needs-triage

SecretSun · 2024-09-10T02:49:53Z

content="{'filename':'record/event.go:221','level':'info','msg':'Event(v1.ObjectReference{Kind:\'TFJob\', Namespace:\'iem-trs-training\', Name:\'android-uni-item-eval-2024-09-09-043604\', UID:\'f31ff823-5904-41cb-b490-e368426e8061\', APIVersion:\'kubeflow.org/v1\', ResourceVersion:\'906983038\', FieldPath:\'\'}): type: 'Normal' reason: 'TFJobFailed' TFJob android-uni-item-eval-2024-09-09-043604 has failed because 1 Worker replica(s) failed.','time':'2024-09-09T20:52:45Z'}"

I may not have explained my meaning clearly. Could the failure message (or the metrics) be made specific to the worker that failed, rather than only reporting that 1 Worker replica failed?
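Until the operator reports the failing worker directly, one possible workaround is to pull the pod name out of the `ExitedWithCode` event message quoted in the issue description. The following is only a minimal sketch, not part of the operator: the names `parse_exited_with_code` and `PodExit` are hypothetical, and the pattern assumes the message wording `Pod: <namespace>.<pod> exited with code <N>` seen in that single log sample, which may differ in other operator versions.

```python
import re
from typing import NamedTuple, Optional

# Assumed message shape, taken from the one ExitedWithCode log sample in this thread:
#   "Pod: <namespace>.<pod-name> exited with code <N>"
# A namespace cannot contain dots, so the first dot separates it from the pod name.
_EXITED = re.compile(
    r"Pod: (?P<namespace>[^.\s]+)\.(?P<pod>\S+) exited with code (?P<code>-?\d+)"
)


class PodExit(NamedTuple):
    """Hypothetical container for the fields recovered from the event message."""
    namespace: str
    pod: str
    exit_code: int


def parse_exited_with_code(message: str) -> Optional[PodExit]:
    """Return the failing pod and exit code, or None if the message does not match."""
    match = _EXITED.search(message)
    if match is None:
        return None
    return PodExit(match["namespace"], match["pod"], int(match["code"]))


if __name__ == "__main__":
    sample = (
        "type: 'Normal' reason: 'ExitedWithCode' Pod: "
        "iem-trs-training.android-consume-v2-update-2024-08-20-053447-worker-0 "
        "exited with code 1"
    )
    print(parse_exited_with_code(sample))
```

A `TFJobFailed` message such as the one above carries no pod name, so the function returns `None` for it. That gap is exactly what this request is about.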

SecretSun added kind/feature lifecycle/needs-triage labels Aug 15, 2024

google-oss-prow bot added area/monitoring and removed lifecycle/needs-triage labels Aug 28, 2024
