Add tf-operator design doc for API v1alpha2 #30
Conversation
/lgtm
A few comments, mostly requests for further discussion.
**E2E Test**

We can use this model from the TensorFlow [repo](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/dist_test) for the e2e test.
Can you describe an example e2e workflow? From the beginning (empty Jupyter) to the end (multi-node TensorFlow Serving)?
I think it would be useful to have a discussion (separate, dedicated doc?) about how we want to connect Jupyter to tf-operator with user experience in mind.
Another thing to consider, which might affect this design, is how we transfer artifacts (code and trained models) between steps in the workflow: I (a data scientist) write code in Jupyter -> somehow export it to a place where the tf-job (learning) can access it -> the tf-job runs the same code in a distributed and scaled fashion -> the tf-job saves the model somewhere -> tf-serving runs in a scaled fashion and consumes the trained model.
This proposal is for tf-operator, and tf-operator is not aware of Jupyter since Jupyter is not under its management. The operator only knows about the YAML config file, and I think the e2e test from Jupyter to serving should be done in kubeflow/kubeflow.
@inc0 Thanks for your high-level comments, they make total sense to me.
For the e2e workflow we can re-use kubernetes/test-infra:
- Deploy a local/remote kubernetes cluster.
- Deploy TFJob controller and create CRD.
- Run e2e test case:
- Load e2e test model from scratch.
- Create TFJob yaml and deploy it.
- Wait and check the result.
IMHO, putting the e2e design into a dedicated doc may be better.
And I think TensorFlow serving is a very important feature for tf-operator, but I haven't given it much thought yet :)
+ For example, `worker0.example.com:2222,worker1.example.com:2222,worker2.example.com:2222`
- `job_name`: job name: worker or ps.
- `task_index`: task index, should be >= 0.
+ For example, worker task_index = 0 is the `chief` worker task that performs the variable initialization.
I think it would be useful for these configurations to be env-dependent. For example, we could run the same job locally (notebook, single node) for smoke testing and first metrics, and then, ideally without code changes, run it distributed to do the actual learning.
Thanks for your suggestion. I think we already support this, since Kubernetes can schedule all tasks of the job onto one node if we only have one node.
**Auto-Generated Arguments:**

To make distributed TensorFlow work, the user **MUST** implement a parser in their code (e.g. `ArgumentParser`) to read the distributed TensorFlow configuration generated by `tf-operator`. The common built-in arguments are:
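For illustration only, here is a minimal Go sketch of how an operator could inject such auto-generated arguments into a replica's container; the flag names and helper functions are assumptions for this example, not something the proposal defines.

```go
// Hypothetical sketch: render the built-in arguments for one task and append
// them to the container's Args. Flag names here are illustrative only.
package sketch

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// buildClusterArgs renders the common built-in arguments for a single task.
func buildClusterArgs(workerHosts, psHosts []string, jobName string, taskIndex int) []string {
	return []string{
		fmt.Sprintf("--worker_hosts=%s", strings.Join(workerHosts, ",")),
		fmt.Sprintf("--ps_hosts=%s", strings.Join(psHosts, ",")),
		fmt.Sprintf("--job_name=%s", jobName),
		fmt.Sprintf("--task_index=%d", taskIndex),
	}
}

// injectArgs appends the generated arguments to an existing container spec.
func injectArgs(c *corev1.Container, args []string) {
	c.Args = append(c.Args, args...)
}
```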
I wonder if it's better to pass these in as environment variables instead of command line arguments. Seems it would be more flexible, as the user code can ignore these if they don't care about them, rather than being forced to parse them.
I'd say we could do both, or even add a config file into the mix: CLI overriding ENV overriding config file.
+1 to Rong's suggestion of using environment variables, and more specifically the environment variable TF_CONFIG.
TensorFlow has already adopted the environment variable TF_CONFIG
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py#L84
as a convention for passing runtime information (e.g. # of workers) to libraries like tf.Estimator that can process that information.
Why would we try to define a new convention in the form of a set of specific command line arguments?
Passing command line arguments also seems brittle. What if existing code uses different flag parsing conventions, e.g. "-" as opposed to "--"? What if code is using tf.Estimator (and therefore can automatically use TF_CONFIG) and strict argument parsing? Then the argument parser would raise an exception because of unrecognized command line arguments, so users would have to define arguments specific to the TfJob operator even though their code never uses these arguments.
We have some discussions in kubeflow/training-operator#369:
TF_CONFIG is a TensorFlow convention that TF APIs like the Estimator API use to get information about the runtime environment and configure the job appropriately.
If users want to use command line arguments, they can write a launcher script that parses TF_CONFIG and sets the command line arguments as needed. This is what we do in the TFCNN example; see https://github.com/kubeflow/kubeflow/blob/master/tf-controller-examples/tf-cnn/launcher.py
I think the problem with command line arguments is that everyone will use slightly different conventions.
@jlewi Hi, are there any official TensorFlow docs about TF_CONFIG?
I checked the example on the TensorFlow website: https://www.tensorflow.org/deploy/distributed.
And I notice that Run Config is deprecated (use tf.estimator.RunConfig instead)
here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py#L15
And what if the user does not use the estimator library?
https://www.tensorflow.org/api_docs/python/tf/estimator/RunConfig
The RunConfig docs say that it will automatically be populated based on the environment variable TF_CONFIG if it is set.
My hope is that most high level libraries will follow the convention of using the environment variable TF_CONFIG to automatically configure themselves.
As @gaocegege points out above, if a user is using a library that doesn't use TF_CONFIG then they have two options:
- Modify the code to read and parse TF_CONFIG
- Create a launcher script (possibly using an init container) to parse TF_CONFIG and set command line arguments appropriately.
I really think we should try to avoid introducing new conventions (e.g. standard command line flags) when the TF community already has an existing convention.
Furthermore, when I look at TF code I don't see broad agreement about what command line flags should be used or how they should be structured.
@jlewi Thanks, will take a deeper look at this later.
@jlewi After investigating: `TF_CONFIG` is used for `tf.estimators`, which is a high-level encapsulation. How about passing both `TF_CONFIG` and `ClusterSpec` together?

- Generate the `TF_CONFIG` JSON string and set it in the `Env` field.
- Generate the `ClusterSpec` args and set them in the `Args` field.

So both estimator models and other TensorFlow models can work.
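To make the `Env` option concrete, here is a rough Go sketch of generating a `TF_CONFIG` value and setting it on a replica's pod template; the struct and function names are assumptions, and the JSON layout follows the `{"cluster": ..., "task": ...}` convention read by `tf.estimator`.

```go
// Rough sketch: marshal a TF_CONFIG-style document and expose it as an
// environment variable on every container in the replica's pod template.
package sketch

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// TFConfig mirrors the {"cluster": ..., "task": ...} JSON layout read by tf.estimator.
type TFConfig struct {
	Cluster map[string][]string `json:"cluster"`
	Task    struct {
		Type  string `json:"type"`
		Index int    `json:"index"`
	} `json:"task"`
}

// SetTFConfig injects TF_CONFIG into the pod template for one task.
func SetTFConfig(tmpl *corev1.PodTemplateSpec, cluster map[string][]string, taskType string, index int) error {
	cfg := TFConfig{Cluster: cluster}
	cfg.Task.Type = taskType
	cfg.Task.Index = index

	raw, err := json.Marshal(cfg)
	if err != nil {
		return err
	}
	for i := range tmpl.Spec.Containers {
		tmpl.Spec.Containers[i].Env = append(tmpl.Spec.Containers[i].Env,
			corev1.EnvVar{Name: "TF_CONFIG", Value: string(raw)})
	}
	return nil
}
```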
@jlewi For `TF_CONFIG` I mean the `Env` field, sorry for the typo.
### Error Handling

To make the system robust, the tf-operator should be able to locally and automatically recover from errors.
What about retryable vs. permanent errors? The current implementation uses exit codes to distinguish between retryable and permanent errors. Should we continue to use that?
Hi, I think this is open for discussion. IMHO, separating retryable and permanent errors may not be easy, so letting the user decide may be better.
So I agree the user should decide whether an error is retryable or not, but that needs to be communicated to the TFJob operator so that it can determine what to do.
For example, here are two different cases where TF could be terminated:
- The VM running the pod is terminated or becomes unhealthy.
- TF exits with an exception because the user tries to read a file that doesn't exist.
My claim is that in the first case the TF operator should restart TF since it's a retryable error, but in the second case the error is most likely permanent (the file is unlikely to exist if we keep retrying), so the job should fail.
Exit codes provide a mechanism for distinguishing different types of errors. The user has some control over the exit code. For example, they can wrap their code in a try/catch block and turn different exceptions into exit codes corresponding to retryable or permanent errors as needed.
This isn't great but it's fairly straightforward.
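As a rough illustration of the exit-code idea, the operator could classify a terminated container along these lines; the specific code ranges below are placeholders chosen for the sketch, not a decided contract.

```go
// Sketch of exit-code based classification; the ranges are assumptions.
package sketch

import corev1 "k8s.io/api/core/v1"

type outcome int

const (
	outcomeSucceeded outcome = iota
	outcomeRetryable
	outcomePermanent
)

// classifyTermination decides how the operator should treat a finished container.
func classifyTermination(t corev1.ContainerStateTerminated) outcome {
	switch {
	case t.ExitCode == 0:
		return outcomeSucceeded
	case t.ExitCode >= 128:
		// Killed by a signal (OOM, node drain, ...): likely transient, retry.
		return outcomeRetryable
	default:
		// Ordinary non-zero exit from user code: treat as permanent by default.
		return outcomePermanent
	}
}
```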
Hi,

- In case 1, AFAIK this will put the node into a `NotReady` condition, and Kubernetes will re-schedule all pods running on that node to another node after a while. This is handled automatically by the Kubernetes system, not controlled by any user-defined config (e.g. RestartPolicy or exit code).
- In case 2, this is an error inside the container, so letting the user handle this error may be better, by setting RestartPolicy.
In case 1, I don't think K8s will reschedule the pods. Pods aren't durable:
https://kubernetes.io/docs/concepts/workloads/pods/pod/#durability-of-pods-or-lack-thereof
I think it's the controller that creates a new pod, but in our case we are creating/managing the pods ourselves.
In case 2, I don't think we can do this with the built-in RestartPolicy because the only options are Never or Always, and I don't think that's flexible enough.
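One possible shape for a more flexible, per-replica setting is sketched below; the names and values are hypothetical and only show how the API could express restart behavior explicitly instead of relying on the pod-level policy.

```go
// Hypothetical per-replica restart policy for the TFJob API; illustrative only.
package sketch

type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
	// RestartPolicyExitCode would restart a replica only when its exit code
	// marks the failure as retryable (see the classification sketch above).
	RestartPolicyExitCode RestartPolicy = "ExitCode"
)
```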
// TFJobStatus represents the current observed state of the TFJob.
type TFJobStatus struct {
    // Phase is the recently observed lifecycle phase of the TFJob.
    Phase TFJobPhase `json:"phase"`
Do we need Phase if we have conditions? I thought @gaocegege or someone else suggested that K8s was moving away from using Phase toward using conditions instead?
Phase is required, but we should use conditions to determine the state of the TFJob. Both fields are needed, IMO.
@gaocegege Required by K8s?
@jlewi By convention, I think
For reference: kubeflow/training-operator#223, as quoted in there from the docs:
Some resources in the v1 API contain fields called phase, and associated message, reason, and other status fields. The pattern of using phase is deprecated.
My interpretation of this is that not all v1 resources contain phase. So I don't think we should include phase.
I think the information in phase overlaps with conditions. So by providing both phase and conditions, our API isn't as clean as it could be, because users will need to figure out which one to look at.
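For concreteness, a condition-only status could look roughly like the sketch below; the condition type names are assumptions that loosely follow other Kubernetes APIs, not the agreed design.

```go
// Sketch of a condition-based TFJobStatus with no Phase field; illustrative only.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type TFJobConditionType string

const (
	TFJobCreated   TFJobConditionType = "Created"
	TFJobRunning   TFJobConditionType = "Running"
	TFJobSucceeded TFJobConditionType = "Succeeded"
	TFJobFailed    TFJobConditionType = "Failed"
)

// TFJobCondition records one observed aspect of the job's state.
type TFJobCondition struct {
	Type               TFJobConditionType     `json:"type"`
	Status             corev1.ConditionStatus `json:"status"`
	Reason             string                 `json:"reason,omitempty"`
	Message            string                 `json:"message,omitempty"`
	LastUpdateTime     metav1.Time            `json:"lastUpdateTime,omitempty"`
	LastTransitionTime metav1.Time            `json:"lastTransitionTime,omitempty"`
}

// TFJobStatus would then expose only Conditions; the latest condition plays
// the role that Phase used to play.
type TFJobStatus struct {
	Conditions []TFJobCondition `json:"conditions"`
}
```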
- Define the structure of API `v1alpha2`.
  + Cover most of the refactoring requests we have discussed.
  + Simplify the API definition.
- Define an `event-driven` mechanism for TFJob life-cycle management.
Would it make sense to split the implementation into its own proposal as opposed to using 1 proposal for the implementation and API?
I think they are bundled. If we do not migrate the implementation to the new design, we have to implement twice: once for the etcd-operator style and once for the event-driven style. I think that is unnecessary. 🤔
Hi, this proposal follows the proposal-template, so it makes sense to me to put the API and the design together in one proposal :)
@gaocegege @ScorpioCPH Splitting the proposals doesn't mean implementing them twice. The motivation for splitting the proposals is to narrow the scope of each proposal, which should speed up agreement. Actual work doesn't need to begin until both proposals are approved.
I'd expect the API to largely be independent of the implementation. So if we move implementation into its own proposal, that will focus discussion on the API and help us converge.
@jlewi Hi, I don't think it is a good idea to split one design doc into two pieces.
Keeping this in one doc makes it easier for readers to find the details in one place.
Sure.
// "PS": TFReplicaSpec, | ||
// "Worker": TFReplicaSpec, | ||
// } | ||
TFReplicaSpecs map[TFReplicaType]*TFReplicaSpec `json:"tfReplicaSpecs"` |
Why not get rid of TFReplicaType and use arbitrary strings for the replica name?
The suggestion in various discussions was to replace TFReplicaType with various properties on the Replica as opposed to inferring them based on the replica type.
As an example, restart behavior is currently inferred based on TFReplicaType. But that could instead be based on specific properties like RestartPolicy.
Sorry, I'm not very clear about your comment: get rid of `TFReplicaType`. IMO, `TFReplicaType` is a key value for distributed TensorFlow.
Why not just make it
`TFReplicaSpecs map[string]*TFReplicaSpec`
and let the user pick what keys to use, and by extension how many different replicas to have? This would allow the user to easily add a new set of replicas to do something like evaluation.
Or, in the case of reinforcement learning, they might use a set of replicas to run simulations to generate training data.
But if we let the user define this field, how does tf-operator know how to handle it?
Why does TFOperator need to do anything differently based on the name? The TFOperator should be able to handle all replicas the same way. Any differences should be controlled by properties added to the Replica.
For example, in v1alpha1 TFOperator assigns different restart policies based on TFReplicaType.
Instead of depending on the name/Type of the replica to determine the restart behavior we could add a RestartPolicy property to TFReplicaSpec.
kubeflow/training-operator#61 is an example of an issue that would be fixed by using strings instead of TFReplicaType. If we make it a string, then a user could just call the replica "chief" or "master" based on the TF version.
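For reference, the string-keyed alternative under discussion would look roughly like this sketch; `RestartPolicy` here is the hypothetical per-replica property from the error-handling sketch above, and none of this is the agreed API.

```go
// Sketch of string-keyed replica specs with per-replica properties; illustrative only.
package sketch

import corev1 "k8s.io/api/core/v1"

// TFReplicaSpec carries the properties that would replace type-based inference.
type TFReplicaSpec struct {
	Replicas      *int32                 `json:"replicas,omitempty"`
	Template      corev1.PodTemplateSpec `json:"template"`
	RestartPolicy RestartPolicy          `json:"restartPolicy,omitempty"`
}

// TFJobSpec lets the user choose the replica names, e.g. "chief", "worker",
// "ps", "eval", or something custom like "simulator" for RL-style workloads.
type TFJobSpec struct {
	TFReplicaSpecs map[string]*TFReplicaSpec `json:"tfReplicaSpecs"`
}
```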
Review status: 0 of 1 files reviewed at latest revision, 12 unresolved discussions.

proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):
What about TerminationPolicy? How does a user specify the termination criterion for the job?

proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):
Per the comment above, should we get rid of TFReplicaType and instead add properties to TFReplicaSpec to specify the relevant behavior? I think the only thing we use TFReplicaType for is restart behavior.

proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):
Should we get rid of TFReplicaType?

proposals/tf-operator-design-v1alpha2.md, line 211 at r1 (raw file):
Can a condition be triggered more than once?

proposals/tf-operator-design-v1alpha2.md, line 303 at r1 (raw file): Previously, inc0 (Michał Jastrzębski) wrote…
Why do you need this to be env-dependent in the API to do this? Right now the way you make this env-dependent is in ksonnet: your TFJob would have different parameters based on the environment. So you might have parameters whose values could then be set differently for each environment, so that you could run a single worker locally and distributed in the cloud.

proposals/tf-operator-design-v1alpha2.md, line 347 at r1 (raw file):
The current implementation does the following: for any event, run reconcile. With the current implementation, though, we don't invoke different logic for different events. This is less efficient but perhaps more reliable, since we have a single code path that is always invoked. Do we have any idea how expensive calling reconcile is and whether we are saving a whole lot of resources by trying to add different logic for different events? The expensive part is probably calling the APIs to list pods/services. Can we use the informer to cache the status of pods/services and reuse them across different TFJobs so that each call to Reconcile doesn't actually send a list request to the API?

Comments from Reviewable
Review status: 0 of 1 files reviewed at latest revision, 12 unresolved discussions.

proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Hi, can you show us some user stories or use cases for this?

proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
We can use …

proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Is there any benefit to using arbitrary strings?

proposals/tf-operator-design-v1alpha2.md, line 211 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Yes, will update the LastUpdateTime field.

proposals/tf-operator-design-v1alpha2.md, line 347 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
As described in this proposal, …

Comments from Reviewable
Review status: 0 of 1 files reviewed at latest revision, 12 unresolved discussions.

proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
Yes. It allows us to support things like eval workers and other use cases more easily because we don't need to introduce new replica types. Instead we can have properties that control different behavior, e.g. restart behavior, and the user can create however many replicas they need/want.

Comments from Reviewable
Review status: 0 of 1 files reviewed at latest revision, 12 unresolved discussions.

proposals/tf-operator-design-v1alpha2.md, line 347 at r1 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
SGTM. As long as Reconcile provides a backup I'm satisfied.

Comments from Reviewable
Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions.

proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
Some TF programs have a "chief" replica. Other TF programs use "worker 0" as the chief. See: kubeflow/training-operator#192. I think tf_cnn_benchmarks is an example of one program that doesn't use a chief. So a field like TerminationPolicy gives the user the ability to configure the operator to match their TF program.

proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
We can and still need to generate the TF cluster spec. Right now we generate the TF cluster spec by converting the TFReplicaType enum to a string. If we make it a map[string]TFReplica we could just use the string names in the cluster config. This would help with such issues because we would no longer have to change the TFReplicaType enum every time TF changes the naming.

proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
See also …

proposals/tf-operator-design-v1alpha2.md, line 416 at r1 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
I think this is covered by our CUJs (critical user journeys).

Comments from Reviewable
Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions.

proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):
Sorry, I'm not very clear about your point; it seems like a parameter-passing issue: telling the user code which worker is the chief.

proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Hi, I'm not a TensorFlow expert :) and just heard the …

proposals/tf-operator-design-v1alpha2.md, line 126 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Yes, phases are like state machines with explicit, enumerated states, and they aren't extensible.

proposals/tf-operator-design-v1alpha2.md, line 392 at r1 (raw file):
Yes, you got the point.

Comments from Reviewable
Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions.

proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
The TFJob operator needs to know when a job is done. There are lots of different ways a user might signal that their program is done.
These are different examples of termination policies. The TFJob operator needs to know which termination policy a user wants to use so that it can correctly determine when a job is finished.

proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
Sorry, I don't understand the latest reply?

proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
What do others think? I'm on the fence about this. The advantage of making it a string is that the user can add replicas we didn't think about. Validation could potentially be done in other places. For example, a user could parse TF_CONFIG and make sure the host names match the expected types. Arguably this should be done by the API that imposes the convention. For example, if tf.Estimator only understands "worker", "eval", "master", then if TF_CONFIG contains some other name, e.g. "workr", it should raise an error.

proposals/tf-operator-design-v1alpha2.md, line 126 at r1 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
So can we get rid of Phase from our API? The linked issue indicates Brian would like to get rid of Phase and that its existence leads to users thinking incorrectly about controllers. So why not remove it from our API now, when it's easy to do so?

proposals/tf-operator-design-v1alpha2.md, line 392 at r1 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
So what does that mean in terms of the API? Does our API need to include a way for users to specify whether container terminations are retryable or not?

Comments from Reviewable
@jlewi Reviewable is offline on my laptop. I'll try to summarize the comments now:
Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions.

proposals/tf-operator-design-v1alpha2.md, line 295 at r1 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
If you pass command line arguments, how do you avoid the brittleness issues mentioned above? For example, suppose my program uses TF_CONFIG and therefore doesn't take the arguments you mentioned. Furthermore, suppose I want to treat unrecognized arguments as errors. Now the extra arguments cause my program to crash.
More generally, I think an API is cleaner if we avoid unnecessary redundancy. I think there are good and flexible solutions for programs that want to take the cluster spec as command line arguments. I think it's better if there is a launcher script that turns TF_CONFIG into whatever environment variables people want. This way people can use whatever conventions they want for command line arguments. Using an init container, it would be very easy to inject one or more automatic launcher scripts. So I think using a launcher script with/without an init container is a better pattern than us messing with command line arguments.

Comments from Reviewable
@ScorpioCPH Agreed, there are lots of models out there that don't use TF_CONFIG, like our TF_CNN example in Kubeflow. Those models can easily be made to run in Kubeflow just by running a launcher script like we do for TF_CNN.
Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions.

proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Can you update per our discussion? To summarize what we agreed in Slack, there are two use cases we want to support: jobs with an explicit "chief" replica, and jobs where worker 0 acts as the chief.
We can infer which one is the "master worker" by the following logic: if a "chief" replica is specified, it is the master; otherwise worker 0 is the chief. So we don't need any fields in the spec. Can we update the proposal to document the fact that we will infer whether to use worker:0 or chief automatically?

proposals/tf-operator-design-v1alpha2.md, line 83 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Per our discussion offline, we agreed to keep this as an enum with allowed values "chief", "worker", "ps", "eval".

proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
We agreed to keep it as an enum with those types. Does that match your understanding?

proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
As noted above, we will keep this an enum with types "chief", "worker", "eval", "ps", agreed?

proposals/tf-operator-design-v1alpha2.md, line 126 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
I think the agreement in Slack was to get rid of phase.

proposals/tf-operator-design-v1alpha2.md, line 295 at r1 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
I think we agreed to only use TF_CONFIG; is that right?

proposals/tf-operator-design-v1alpha2.md, line 303 at r1 (raw file): Previously, gaocegege (Ce Gao) wrote…
@inc0 can we resolve this thread?

proposals/tf-operator-design-v1alpha2.md, line 392 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Please update the proposal with your latest thoughts after our discussion.

Comments from Reviewable
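A minimal Go sketch of that inference rule (assuming the `TFReplicaType` constants agreed below; the helper name is illustrative):

```go
// Sketch: decide whether a given replica/index acts as the chief, following the
// rule "use the Chief replica if present, otherwise worker 0". Illustrative only.
func isChief(replicas map[TFReplicaType]*TFReplicaSpec, rtype TFReplicaType, index int) bool {
	if _, hasChief := replicas[TFReplicaTypeChief]; hasChief {
		return rtype == TFReplicaTypeChief
	}
	return rtype == TFReplicaTypeWorker && index == 0
}
```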
Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions.

proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):
Update this:

const (
    // TFReplicaTypePS is the type for parameter servers of distributed TensorFlow.
    TFReplicaTypePS TFReplicaType = "PS"
    // TFReplicaTypeWorker is the type for workers of distributed TensorFlow.
    TFReplicaTypeWorker TFReplicaType = "Worker"
    // TFReplicaTypeChief is the type for chief worker of distributed TensorFlow.
    // If there is "chief" replica type, it's the "chief worker". Else, worker:0 is the chief worker.
    TFReplicaTypeChief TFReplicaType = "Chief"
    // TFReplicaTypeEval is the type for evaluation replica in TensorFlow.
    TFReplicaTypeEval TFReplicaType = "Eval"
)

proposals/tf-operator-design-v1alpha2.md, line 83 at r1 (raw file):
Agreed.

proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Yes.

proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Yes.

proposals/tf-operator-design-v1alpha2.md, line 126 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Sure.

proposals/tf-operator-design-v1alpha2.md, line 295 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Yes, will update the proposal doc.

proposals/tf-operator-design-v1alpha2.md, line 392 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Sure.

Comments from Reviewable
@jlewi @lluunn @DjangoPeng @gaocegege @ddysher
/lgtm
per this design proposal kubeflow/community#30. Update API to v1alpha2
ReplicaType part looks good to me.
Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions.

proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
Looks good. I see it in the doc below.

Comments from Reviewable
Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions.

proposals/tf-operator-design-v1alpha2.md, line 93 at r2 (raw file):
Why isn't RestartPolicy a property of TFReplicaSpecs? I don't think we will want all replica specs to have the same restart policy. For example, for PS it makes sense to have a restart policy of Always because these are just TF servers and don't run any user code, whereas a restart policy of OnError might only apply to the workers.

proposals/tf-operator-design-v1alpha2.md, line 153 at r2 (raw file):
What does "it is not guaranteed to be set in happens-before order" mean?

proposals/tf-operator-design-v1alpha2.md, line 333 at r2 (raw file):
The UID is per job, not per resource? So all the items listed here would use the same value for UID.

Comments from Reviewable
Thanks for making RestartPolicy a property of the ReplicaSpec.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: gaocegege, jlewi
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Hi folks,
This is a proposal for the tf-operator API `v1alpha2`, PTAL, thanks!

List of issues related to the API:
kubeflow/training-operator#64
kubeflow/training-operator#223

/cc @jlewi @gaocegege @DjangoPeng @ddysher