Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify GPU configuration process. #9

Merged
merged 1 commit into from
Aug 1, 2017
Merged

Simplify GPU configuration process. #9

merged 1 commit into from
Aug 1, 2017

Conversation

jlewi
Copy link
Contributor

@jlewi jlewi commented Jul 31, 2017

  • Job controller is configured with a list of volumes that need to be added to pods that use GPUs.

  • Controller looks at pods that specify GPU resources and adds volume mounts needed for GPUs.

@jlewi
Copy link
Contributor Author

jlewi commented Jul 31, 2017

@wbuchwalter Any interest in reviewing this before I merge it?

@jlewi jlewi mentioned this pull request Jul 31, 2017
@wbuchwalter
Copy link
Contributor

@jlewi Yes, doing that now :)

)

func init() {
// chaos level will be removed once we have a formal tool to inject failures.
flag.IntVar(&chaosLevel, "chaos-level", -1, "DO NOT USE IN PRODUCTION - level of chaos injected into the etcd clusters created by the operator.")
flag.BoolVar(&printVersion, "version", false, "Show version and quit")
flag.DurationVar(&gcInterval, "gc-interval", 10*time.Minute, "GC interval")
flag.StringVar(&controllerConfigFile, "controller_config_file", "", "Path to file containing the controller config.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace controller_config_file by controller-config-file to stick with convention?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the name in the config map to match this one.

return fmt.Errorf("Replica is missing Template; %v", util.Pformat(r))
}
for i, c := range r.Template.Spec.Containers {
if c.Name == string(TENSORFLOW) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand why we specifically need to ensure that the container is named "tensorflow", but as this is not related to this PR we can discuss later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to allow for more than 1 container in the pod. The original reason for that was to handle logging via a sidecar container. If we don't need to allow for more than 1 container than we can drop this requirement. Alternatively, we could introduce some other method to identify the container that is running TensorFlow.

@wbuchwalter
Copy link
Contributor

@jlewi Looks good! Left 2 minors comments.This is a really cool feature.
Only thing missing is some documentation to explain how to customize the tf-job-operator-config.

* Job controller is configured with a list of volumes that need to be added to pods that use GPUs.

* Controller looks at pods that specify GPU resources and adds volume mounts needed for GPUs.
@jlewi
Copy link
Contributor Author

jlewi commented Jul 31, 2017

@wbuchwalter Thank you. Do you want to take a look at my changes to the README?

@wbuchwalter
Copy link
Contributor

This looks good to me!

@jlewi
Copy link
Contributor Author

jlewi commented Aug 1, 2017

Thanks.

@jlewi jlewi merged commit dbc8009 into master Aug 1, 2017
@jhseu jhseu deleted the gpus branch November 3, 2017 23:29
ChenYi015 pushed a commit to ChenYi015/training-operator that referenced this pull request Sep 20, 2024
Modify the way of delete and create pod for gang jobs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants