-
Notifications
You must be signed in to change notification settings - Fork 699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify GPU configuration process. #9
Conversation
@wbuchwalter Any interest in reviewing this before I merge it? |
@jlewi Yes, doing that now :) |
) | ||
|
||
func init() { | ||
// chaos level will be removed once we have a formal tool to inject failures. | ||
flag.IntVar(&chaosLevel, "chaos-level", -1, "DO NOT USE IN PRODUCTION - level of chaos injected into the etcd clusters created by the operator.") | ||
flag.BoolVar(&printVersion, "version", false, "Show version and quit") | ||
flag.DurationVar(&gcInterval, "gc-interval", 10*time.Minute, "GC interval") | ||
flag.StringVar(&controllerConfigFile, "controller_config_file", "", "Path to file containing the controller config.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace controller_config_file
by controller-config-file
to stick with convention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the name in the config map to match this one.
return fmt.Errorf("Replica is missing Template; %v", util.Pformat(r)) | ||
} | ||
for i, c := range r.Template.Spec.Containers { | ||
if c.Name == string(TENSORFLOW) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand why we specifically need to ensure that the container is named "tensorflow", but as this is not related to this PR we can discuss later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to allow for more than 1 container in the pod. The original reason for that was to handle logging via a sidecar container. If we don't need to allow for more than 1 container than we can drop this requirement. Alternatively, we could introduce some other method to identify the container that is running TensorFlow.
@jlewi Looks good! Left 2 minors comments.This is a really cool feature. |
* Job controller is configured with a list of volumes that need to be added to pods that use GPUs. * Controller looks at pods that specify GPU resources and adds volume mounts needed for GPUs.
@wbuchwalter Thank you. Do you want to take a look at my changes to the README? |
This looks good to me! |
Thanks. |
Modify the way of delete and create pod for gang jobs.
Job controller is configured with a list of volumes that need to be added to pods that use GPUs.
Controller looks at pods that specify GPU resources and adds volume mounts needed for GPUs.