Can the tf-job-operator v1.0 metrics expose the specific failed pods? #2220
Comments
Thank you for creating this @SecretSun! /area monitoring
In case the meaning is unclear: the question is whether the metrics can identify the specific worker that failed.
What would you like to be added?
tf-job-operator v1.0 metrics can expose specific failed pods
The relevant log entry is as follows:

```
{"filename":"record/event.go:221","level":"info","msg":"Event(v1.ObjectReference{Kind:'TFJob', Namespace:'iem-trs-training', Name:'android-consume-v2-update-2024-08-20-053447', UID:'803a4aca-561b-4609-9ee3-8953f075b66c', APIVersion:'kubeflow.org/v1', ResourceVersion:'860397344', FieldPath:''}): type: 'Normal' reason: 'ExitedWithCode' Pod: iem-trs-training.android-consume-v2-update-2024-08-20-053447-worker-0 exited with code 1","time":"2024-08-20T22:35:42Z"}
```
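For illustration, one possible shape for such a metric would carry the namespace, TFJob name, pod name, and exit code as labels. This is a sketch only; the metric name and label set below are assumptions, not an existing training-operator metric:

```
# Hypothetical Prometheus metric, sketched from the event above
tf_job_pod_exited_total{namespace="iem-trs-training",tfjob="android-consume-v2-update-2024-08-20-053447",pod="android-consume-v2-update-2024-08-20-053447-worker-0",exit_code="1"} 1
```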
Why is this needed?
When a training job fails, it is necessary to quickly locate the specific training node (pod) that caused the problem.
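As a workaround until such a metric exists, the failing pod can be recovered from the `ExitedWithCode` event message itself. A minimal sketch in Python, assuming the message format shown in the log above (the regex and helper name are my own, not part of the operator):

```python
import re

# Message format observed in the tf-job-operator 'ExitedWithCode' event above:
#   "Pod: <namespace>.<pod-name> exited with code <code>"
EXIT_EVENT = re.compile(
    r"Pod: (?P<namespace>[\w-]+)\.(?P<pod>[\w-]+) exited with code (?P<code>\d+)"
)

def parse_exit_event(message):
    """Return (namespace, pod, exit_code) from an event message, or None."""
    m = EXIT_EVENT.search(message)
    if m is None:
        return None
    return m.group("namespace"), m.group("pod"), int(m.group("code"))

msg = ("Pod: iem-trs-training.android-consume-v2-update-2024-08-20-053447-worker-0 "
       "exited with code 1")
print(parse_exit_event(msg))
```

This only covers the exact message shape in the log above; a metric with pod/exit-code labels would make this scraping unnecessary.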
Love this feature?
Give it a 👍 We prioritize the features with most 👍