[SDK] Get the correct TrainJob components using get_job() API #25

@andreyvelich

What you would like to be added?

As we discussed, the get_job() API can currently return multiple Pods for a single TrainJob component, such as initializer or trainer-node-0: kubeflow/trainer#2324 (comment). This can happen when Pods are re-created according to the Batch/Job restart policies.
As a result, users can see unexpected logs when using the Kubeflow Training SDK.

We should improve this API so that it shows users only the correct TrainJob components.
For example, when we list all of the Pods, we can select the most recently created Pod for each role (e.g. dataset-initializer).
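
A minimal sketch of that selection step, to make the idea concrete. It is not the SDK's implementation: `PodInfo` and `latest_pod_per_role` are hypothetical names, and the sketch assumes each listed Pod has already been reduced to its name, its role, and its creation timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable


@dataclass(frozen=True)
class PodInfo:
    """Hypothetical, simplified view of a Pod (not an SDK or Kubernetes type)."""

    name: str
    role: str  # e.g. "dataset-initializer" or "trainer-node-0"
    created_at: datetime


def latest_pod_per_role(pods: Iterable[PodInfo]) -> Dict[str, PodInfo]:
    """Keep only the most recently created Pod for each role."""
    latest: Dict[str, PodInfo] = {}
    for pod in pods:
        current = latest.get(pod.role)
        # Replace the stored Pod only if this one was created later.
        if current is None or pod.created_at > current.created_at:
            latest[pod.role] = pod
    return latest


# Illustrative data: trainer-node-0 was re-created once after a restart.
pods = [
    PodInfo("trainer-0-abc", "trainer-node-0",
            datetime(2025, 1, 1, 10, 0, tzinfo=timezone.utc)),
    PodInfo("trainer-0-xyz", "trainer-node-0",
            datetime(2025, 1, 1, 10, 5, tzinfo=timezone.utc)),
    PodInfo("init-data-123", "dataset-initializer",
            datetime(2025, 1, 1, 9, 55, tzinfo=timezone.utc)),
]

selected = latest_pod_per_role(pods)
```

With this data, `selected` holds one entry per role, and the `trainer-node-0` entry is the re-created Pod `trainer-0-xyz`. How the role is derived from a real Pod (labels, name, or something else) is left open here.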

Love this feature?

Give it a 👍 We prioritize the features with the most 👍
