@Fiona-Waters
Contributor

What this PR does / why we need it:
While testing a custom image with the container backend, I noticed that the image was not being used and the default one was picked up instead. This PR fixes that and ensures that TrainingRuntimeSource correctly uses the image specified in the user's ClusterTrainingRuntime YAML.

cc @andreyvelich @astefanutti @kramaranya

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #

Checklist:

  • Docs included if any changes are user facing

@Fiona-Waters force-pushed the image-issue branch 3 times, most recently from 002ffd5 to 07d86bf on November 4, 2025 10:21
Member

@szaher left a comment

/lgtm

Contributor

@kramaranya left a comment

Thank you @Fiona-Waters!
/lgtm

raise ValueError(f"Runtime {name} from {source} 'node' must specify containers[0].image")

# Extract the container image
image = containers[0].get("image")
Contributor

Might not be that important in practice, but maybe more robust to select the container named "node".

Contributor Author

Thanks, I've updated to select the container named "node". PTAL.
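
For readers following the thread, selecting by name rather than by position could look roughly like the sketch below. It is not the code merged in this PR: the helper names `find_container_image` and `extract_node_image` are hypothetical, and the sketch assumes the containers arrive as a list of plain dicts with `name` and `image` keys.

```python
from typing import Any, Optional


def find_container_image(
    containers: list[dict[str, Any]], container_name: str = "node"
) -> Optional[str]:
    """Return the image of the first container with the given name, else None."""
    for container in containers:
        if container.get("name") == container_name:
            return container.get("image")
    return None


def extract_node_image(
    runtime_name: str, source: str, containers: list[dict[str, Any]]
) -> str:
    """Hypothetical wrapper: fail loudly when the 'node' container has no image."""
    image = find_container_image(containers)
    if not image:
        raise ValueError(
            f"Runtime {runtime_name} from {source} must specify an image "
            "for the 'node' container"
        )
    return image
```

Unlike `containers[0]`, this still picks the right image when another container (a sidecar, for example) is listed before "node".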

…ckend

Signed-off-by: Fiona Waters <fiwaters6@gmail.com>
@coveralls

Pull Request Test Coverage Report for Build 19070400017

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 79.621%

Totals Coverage Status
Change from base Build 19068536754: 0.0%
Covered Lines: 168
Relevant Lines: 211

💛 - Coveralls

@astefanutti
Contributor

Thanks @Fiona-Waters!

/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

@google-oss-prow bot merged commit b87b81b into kubeflow:main on Nov 4, 2025
13 of 14 checks passed
@google-oss-prow bot added this to the v0.2 milestone on Nov 4, 2025

name: str
trainer: RuntimeTrainer
pretrained_model: Optional[str] = None
image: Optional[str] = None
Member

@astefanutti @Fiona-Waters, shall we move this image under RuntimeTrainer, since we also have an initializer in the Runtime?
Also, we might need to update the Kubernetes backend to populate this field as well.
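
As an illustration of the suggestion above, the field could sit on the trainer rather than on the runtime. This is only a sketch under assumptions: both dataclasses are trimmed-down stand-ins for the SDK types (the real classes carry more fields), and the runtime name and image string are made-up example values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuntimeTrainer:
    # Hypothetical stand-in: only the field under discussion is shown.
    image: Optional[str] = None


@dataclass
class Runtime:
    name: str
    trainer: RuntimeTrainer
    pretrained_model: Optional[str] = None


# Example values are illustrative, not taken from the PR.
runtime = Runtime(
    name="example-runtime",
    trainer=RuntimeTrainer(image="my-registry/custom:1.0"),
)
print(runtime.trainer.image)
```

Keeping the image on the trainer leaves room for an initializer in the same Runtime to carry its own image without the two being confused.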

Contributor

@andreyvelich Right, that makes sense.

Contributor Author

Sure, I can look at creating a follow-up PR to do that.

Contributor Author

I see you already got there :-)
