@shaikmoeed

What this PR does / why we need it:
Add support to list/get namespaced TrainingRuntime.

Which issue(s) this PR fixes:

Fixes #88

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shaikmoeed shaikmoeed force-pushed the fix/namespace-trainingruntime-list branch from 24f00a7 to 1740535 Compare October 29, 2025 07:36
@shaikmoeed shaikmoeed changed the title from "feat(backend): Support namespaced TrainingRuntime in the SDK" to "feat(trainer): Support namespaced TrainingRuntime in the SDK" Oct 29, 2025
@kramaranya
Contributor

/ok-to-test

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>
Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>
@shaikmoeed shaikmoeed force-pushed the fix/namespace-trainingruntime-list branch from 8f0b6d5 to de2ad1b Compare November 3, 2025 13:51
Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>
@abhijeet-dhumal
Contributor

abhijeet-dhumal commented Nov 3, 2025

Thank you @shaikmoeed for this!
I left some nitpicks.


     def get_runtime(self, name: str) -> types.Runtime:
-        """Get the the Runtime object"""
+        """Get the the Runtime object prefer namespaced, fall-back to cluster-scoped"""
Contributor

Suggested change
-        """Get the the Runtime object prefer namespaced, fall-back to cluster-scoped"""
+        """Get the Runtime object prefer namespaced, fall-back to cluster-scoped"""

Contributor

The same change applies to each occurrence here.
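
For readers following this thread, the lookup order named in the docstring (prefer the namespaced runtime, fall back to the cluster-scoped one) could be sketched roughly as below. This is only an illustration over in-memory dictionaries, not the SDK's implementation: `NAMESPACED`, `CLUSTER`, `RuntimeNotFound`, the runtime names, and the `namespace` parameter are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical in-memory stand-ins for the two resource scopes.
# Keys for namespaced runtimes are (namespace, name); cluster runtimes use name only.
NAMESPACED: Dict[Tuple[str, str], dict] = {
    ("team-a", "my-runtime"): {"name": "my-runtime", "scope": "namespaced"},
}
CLUSTER: Dict[str, dict] = {
    "my-runtime": {"name": "my-runtime", "scope": "cluster"},
    "shared-runtime": {"name": "shared-runtime", "scope": "cluster"},
}


class RuntimeNotFound(Exception):
    """Hypothetical error raised when neither scope has the requested runtime."""


def get_runtime(name: str, namespace: str) -> dict:
    """Get the runtime, preferring the namespaced one and falling back to cluster scope."""
    namespaced = NAMESPACED.get((namespace, name))
    if namespaced is not None:
        return namespaced

    cluster = CLUSTER.get(name)
    if cluster is not None:
        return cluster

    raise RuntimeNotFound(
        f"runtime {name!r} not found in namespace {namespace!r} or at cluster scope"
    )
```

With this ordering, a namespaced runtime shadows a cluster-scoped runtime of the same name, and a lookup from a namespace without its own copy still resolves to the cluster-scoped one.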

)

cluster_runtime_list = models.TrainerV1alpha1ClusterTrainingRuntimeList.from_dict(
cluster_thread.get(constants.DEFAULT_TIMEOUT)
Contributor

Suggested change
-            cluster_thread.get(constants.DEFAULT_TIMEOUT)
+            cluster_thread.get(common_constants.DEFAULT_TIMEOUT)
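
As context for the `cluster_thread.get(...)` call above, here is a rough sketch of issuing the namespaced and cluster-scoped list calls concurrently and waiting on each with a timeout. It assumes `cluster_thread` is an async result handle whose `.get(timeout)` blocks until the result arrives; the sketch uses the standard library's `ThreadPool` to produce such handles. The two `list_*` functions, `TIMEOUT_SECONDS`, and the rule that a namespaced entry shadows a cluster-scoped entry of the same name are assumptions made for illustration, not behavior confirmed by this PR.

```python
from multiprocessing.pool import ThreadPool
from typing import Dict, List

TIMEOUT_SECONDS = 5  # hypothetical stand-in for a DEFAULT_TIMEOUT constant


def list_namespaced_runtimes(namespace: str) -> Dict[str, List[dict]]:
    """Hypothetical stand-in for an API call listing runtimes in one namespace."""
    return {"items": [{"name": "my-runtime", "scope": "namespaced"}]}


def list_cluster_runtimes() -> Dict[str, List[dict]]:
    """Hypothetical stand-in for an API call listing cluster-scoped runtimes."""
    return {
        "items": [
            {"name": "my-runtime", "scope": "cluster"},
            {"name": "shared-runtime", "scope": "cluster"},
        ]
    }


def list_runtimes(namespace: str) -> List[dict]:
    """List runtimes from both scopes, started concurrently and joined with a timeout."""
    with ThreadPool(processes=2) as pool:
        namespaced_thread = pool.apply_async(list_namespaced_runtimes, (namespace,))
        cluster_thread = pool.apply_async(list_cluster_runtimes)

        # AsyncResult.get raises multiprocessing.TimeoutError if the call
        # has not finished within the given number of seconds.
        namespaced_items = namespaced_thread.get(TIMEOUT_SECONDS)["items"]
        cluster_items = cluster_thread.get(TIMEOUT_SECONDS)["items"]

    # Assumed merge rule: cluster entries first, then namespaced entries
    # overwrite any cluster entry that has the same name.
    merged = {item["name"]: item for item in cluster_items}
    merged.update({item["name"]: item for item in namespaced_items})
    return sorted(merged.values(), key=lambda item: item["name"])
```

Starting both calls before waiting on either means the total wait is bounded by the slower call rather than the sum of the two.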

Comment on lines +503 to +533
def create_training_runtime(
    name: str,
    namespace: str = "default",
) -> models.TrainerV1alpha1TrainingRuntime:
    """Create a mock namespaced TrainingRuntime object (not cluster-scoped)."""
    return models.TrainerV1alpha1TrainingRuntime(
        apiVersion=constants.API_VERSION,
        kind="TrainingRuntime",
        metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta(
            name=name,
            namespace=namespace,
            labels={constants.RUNTIME_FRAMEWORK_LABEL: name},
        ),
        spec=models.TrainerV1alpha1TrainingRuntimeSpec(
            mlPolicy=models.TrainerV1alpha1MLPolicy(
                torch=models.TrainerV1alpha1TorchMLPolicySource(
                    numProcPerNode=models.IoK8sApimachineryPkgUtilIntstrIntOrString(2)
                ),
                numNodes=2,
            ),
            template=models.TrainerV1alpha1JobSetTemplateSpec(
                metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta(
                    name=name,
                    namespace=namespace,
                ),
                spec=models.JobsetV1alpha2JobSetSpec(replicatedJobs=[get_replicated_job()]),
            ),
        ),
    )


Member

Did you mean to create this in kubernetes/backend_test.py?
This is not a test function, and I believe it should be added to the TrainerClient and propagated to the different backends.

Development

Successfully merging this pull request may close these issues.

Support namespaced TrainingRuntime in the SDK

4 participants