
Add tf-operator design doc for API v1alpha2 #30

Merged

Conversation


@ScorpioCPH ScorpioCPH commented Mar 5, 2018

Hi folks,

This is a proposal for tf-operator API v1alpha2, PTAL, thanks!

List of issues related to the API
kubeflow/training-operator#64
kubeflow/training-operator#223

/cc @jlewi @gaocegege @DjangoPeng @ddysher



@gaocegege
Member

/lgtm

@inc0 inc0 left a comment

A few comments, mostly requests for further discussion.


**E2E Test**

We can use this model from the TensorFlow [repo](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/dist_test) for the e2e test.

Can you describe an example e2e workflow? From the beginning (empty Jupyter) to the end (multi-node TensorFlow Serving)?

I think it would be useful to have a discussion (in a separate, dedicated doc?) about how we want to connect Jupyter to tf-operator with the user experience in mind.

Another thing to consider, which might affect this design, is how we transfer artifacts (code and trained models) between steps in the workflow: I (a data scientist) write code in Jupyter -> somehow export it to a place where the tf-job (training) can access it -> the tf-job runs the same code in a distributed and scaled fashion -> the tf-job saves the model somewhere -> tf-serving runs in a scaled fashion and consumes the trained model.

@gaocegege gaocegege Mar 6, 2018

This proposal is for tf-operator, and tf-operator does not know about Jupyter, since Jupyter is not under tf-operator's management. The operator only knows the YAML config file, and I think the e2e test from Jupyter to serving should be done in kubeflow/kubeflow.

Member Author

@inc0 Thanks for your high-level comments, they make total sense to me.

For the e2e workflow we can reuse kubernetes/test-infra:

  • Deploy a local/remote kubernetes cluster.
  • Deploy TFJob controller and create CRD.
  • Run e2e test case:
    • Load e2e test model from scratch.
    • Create TFJob yaml and deploy it.
    • Wait and check the result.

IMHO, putting the e2e design into a dedicated doc may be better.
And I think TensorFlow Serving is a very important feature for tf-operator, but I haven't thought it through yet :)
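
To make the test-case steps above concrete, here is a minimal sketch using the Kubernetes Python client. The CRD group/version/plural (`kubeflow.org`/`v1alpha2`/`tfjobs`), the manifest file name, and the `Succeeded`/`Failed` condition names are assumptions for illustration only, not something this proposal fixes.

```python
import time

import yaml
from kubernetes import client, config


def run_tfjob_e2e(manifest_path="tfjob-mnist.yaml", namespace="default", timeout_s=1200):
    """Create a TFJob from a YAML manifest and poll its status until it finishes."""
    config.load_kube_config()  # inside the test cluster, use load_incluster_config() instead
    api = client.CustomObjectsApi()

    with open(manifest_path) as f:
        body = yaml.safe_load(f)

    # "Create TFJob yaml and deploy it."
    api.create_namespaced_custom_object(
        "kubeflow.org", "v1alpha2", namespace, "tfjobs", body)

    # "Wait and check the result." -- poll the (assumed) status conditions.
    name = body["metadata"]["name"]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        job = api.get_namespaced_custom_object(
            "kubeflow.org", "v1alpha2", namespace, "tfjobs", name)
        for cond in job.get("status", {}).get("conditions", []):
            if cond.get("type") == "Succeeded" and cond.get("status") == "True":
                return
            if cond.get("type") == "Failed" and cond.get("status") == "True":
                raise RuntimeError("TFJob failed: %s" % cond.get("message"))
        time.sleep(10)
    raise TimeoutError("TFJob did not finish within %ds" % timeout_s)
```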

+ For example, `worker0.example.com:2222,worker1.example.com:2222,worker2.example.com:2222`
- `job_name`: job name: worker or ps.
- `task_index`: task index, should be >= 0.
+ For example, worker task_index = 0 is the `chief` worker task that performs the variable initialization.

I think it would be useful for these configurations to be environment-dependent. For example, we could run the same job locally (in a notebook, single node) for smoke testing and first metrics, and then, ideally without code changes, run it distributed to do the actual training.

Member

Thanks for your suggestion. I think we already support this, since Kubernetes can schedule all tasks of the job onto one node if we only have one node.


**Auto-Generated Arguments:**

To make distributed TensorFlow work, the user **MUST** implement a parser in their code (e.g. `ArgumentParser`) to read the distributed TensorFlow configuration generated by `tf-operator`. The common built-in arguments are:
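
For illustration, a minimal sketch of such a parser in user code, assuming flags named `--ps_hosts`, `--worker_hosts`, `--job_name`, and `--task_index` in line with the examples above; the exact flag names would be whatever convention tf-operator settles on.

```python
import argparse

import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument("--ps_hosts", default="", help="comma-separated host:port pairs")
parser.add_argument("--worker_hosts", default="", help="comma-separated host:port pairs")
parser.add_argument("--job_name", choices=["ps", "worker"], required=True)
parser.add_argument("--task_index", type=int, default=0)
args, _ = parser.parse_known_args()

# Build the low-level ClusterSpec/Server from the injected arguments.
cluster = tf.train.ClusterSpec({
    "ps": args.ps_hosts.split(",") if args.ps_hosts else [],
    "worker": args.worker_hosts.split(",") if args.worker_hosts else [],
})
server = tf.train.Server(cluster, job_name=args.job_name, task_index=args.task_index)

if args.job_name == "ps":
    server.join()  # parameter servers only serve variables
```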
Member

I wonder if it's better to pass these in as environment variables instead of command line arguments. Seems it would be more flexible, as the user code can ignore these if they don't care about them, rather than being forced to parse them.


I'd say we could do both, or even add a config file into the mix: CLI overriding ENV overriding config file.

Contributor

+1 to Rong's suggestion of using environment variables, and more specifically the environment variable TF_CONFIG.

TensorFlow has already adopted the environment variable TF_CONFIG
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py#L84

as a convention for passing runtime information (e.g. the number of workers) to libraries like tf.Estimator that can process that information.

Why would we try to define a new convention in the form of a set of specific command line arguments?

Passing command line arguments also seems brittle. What if existing code uses different flag-parsing conventions, e.g. "-" as opposed to "--"? What if the code uses tf.Estimator (and therefore can automatically use TF_CONFIG) together with strict argument parsing? Then the argument parser would raise an exception because of unrecognized command line arguments, so users would have to define arguments specific to the TFJob operator even though their code never uses them.
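For reference, this is roughly what a TF_CONFIG value looks like and how code reads it back; the host names and task assignment below are purely illustrative.

```python
import json
import os

# Example of what the operator could put into TF_CONFIG for worker 1 of a
# 2-PS / 3-worker job (host names are made up for illustration).
example_tf_config = {
    "cluster": {
        "ps": ["tf-job-ps-0:2222", "tf-job-ps-1:2222"],
        "worker": ["tf-job-worker-0:2222",
                   "tf-job-worker-1:2222",
                   "tf-job-worker-2:2222"],
    },
    "task": {"type": "worker", "index": 1},
    "environment": "cloud",
}
os.environ.setdefault("TF_CONFIG", json.dumps(example_tf_config))

# User code (or a library such as tf.estimator) reads it back from the environment.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
task_type = tf_config["task"]["type"]
task_index = tf_config["task"]["index"]
cluster = tf_config["cluster"]
```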

Member

We have had some discussion in kubeflow/training-operator#369:

TF_CONFIG is a TensorFlow convention that TF APIs like the Estimator API use to get information about the runtime environment and configure the job appropriately.

If users want to use command line arguments, they can write a launcher script that parses TF_CONFIG and sets the command line arguments as needed. This is what we do in the TFCNN example; see https://github.com/kubeflow/kubeflow/blob/master/tf-controller-examples/tf-cnn/launcher.py

I think the problem with command line arguments is that everyone will use slightly different conventions.
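
A sketch of that launcher pattern, loosely modeled on the linked launcher.py; the wrapped script name (`train.py`) and the flag names it expects are placeholders, not taken from the actual example.

```python
#!/usr/bin/env python
"""Translate TF_CONFIG into command line flags for a script that expects them."""
import json
import os
import subprocess
import sys

tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster = tf_config.get("cluster", {})
task = tf_config.get("task", {})

# Map TF_CONFIG onto whatever flag convention the wrapped script happens to use.
cmd = [
    sys.executable, "train.py",
    "--ps_hosts=" + ",".join(cluster.get("ps", [])),
    "--worker_hosts=" + ",".join(cluster.get("worker", [])),
    "--job_name=" + task.get("type", ""),
    "--task_index=" + str(task.get("index", 0)),
]
sys.exit(subprocess.call(cmd))
```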

Member Author

@jlewi Hi, are there any official TensorFlow docs about TF_CONFIG? I checked the example on the TensorFlow website: https://www.tensorflow.org/deploy/distributed.

And I notice that RunConfig is marked "(deprecated, use tf.estimator.RunConfig instead)" here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py#L15

@ScorpioCPH ScorpioCPH Mar 6, 2018

And what if the user does not use the Estimator library?

Contributor

https://www.tensorflow.org/api_docs/python/tf/estimator/RunConfig

The RunConfig docs say that it will automatically be populated based on the environment variable TF_CONFIG if it is set.

My hope is that most high level libraries will follow the convention of using the environment variable TF_CONFIG to automatically configure themselves.

As @gaocegege points out above, if a user is using a library that doesn't use TF_CONFIG, then they have two options:

  1. Modify the code to read and parse TF_CONFIG
  2. Create a launcher script (possibly using an init container) to parse TF_CONFIG and set command line arguments appropriately.

I really think we should try to avoid introducing new conventions (e.g. standard command line flags) when the TF community already has an existing convention.

Furthermore, when I look at TF code I don't see broad agreement about what command line flags should be used or how they should be structured.

Member Author

@jlewi Thanks, will take a deeper look at this later.

@ScorpioCPH ScorpioCPH Mar 8, 2018

@jlewi After investigating, TF_CONFIG is used by tf.estimator, which is a high-level encapsulation. How about passing both TF_CONFIG and the ClusterSpec together?

  • Generate TF_CONFIG JSON string and set it in Env field.
  • Generate ClusterSpec args and set it in Args field.

So both Estimator models and other TensorFlow models can work.
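
For comparison, a model that does not use tf.estimator can still consume TF_CONFIG directly instead of relying on generated command line arguments; a rough sketch with the TF 1.x low-level API, assuming the operator has set TF_CONFIG:

```python
import json
import os

import tensorflow as tf

tf_config = json.loads(os.environ["TF_CONFIG"])  # assumed to be set by the operator
task = tf_config["task"]

# Build the same ClusterSpec/Server objects that the generated CLI arguments
# would have described, straight from TF_CONFIG.
cluster = tf.train.ClusterSpec(tf_config["cluster"])
server = tf.train.Server(cluster,
                         job_name=task["type"],
                         task_index=task["index"])

if task["type"] == "ps":
    server.join()
else:
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task["index"],
            cluster=cluster)):
        pass  # build the training graph here
```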

Member Author

@jlewi For TF_CONFIG I mean the Env field, sorry for the typo.


### Error Handling

To make the system robust, the tf-operator should be able to locally and automatically recover from errors.
Contributor

What about retryable vs. permanent errors? The current implementation uses exit codes to distinguish between retryable and permanent errors. Should we continue to use that?

Member Author

Hi, I think this is open for discussion. IMHO, separating retryable and permanent errors may not be easy, so letting the user decide may be better.

Contributor

So I agree the user should decide whether an error is retryable or not, but that decision needs to be communicated to the TFJob operator so that it can determine what to do.

For example, here are two different cases where TF could be terminated

  1. The VM running the pod is terminated or becomes unhealthy
  2. TF exits with an exception because user tries to read a file that doesn't exist.

My claim is that in the first case the TF operator should restart TF since it's a retryable error, but in the second case the error is most likely permanent (the file is unlikely to exist if we keep retrying), so the job should fail.

Exit codes provide a mechanism for distinguishing different types of errors. The user has some control over the exit code. For example, they can wrap their code in a try/catch block and turn different exceptions into exit codes corresponding to retryable or permanent errors as needed.

This isn't great, but it's fairly straightforward.
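
A sketch of the wrapping described here; the specific exit-code values, and which ones the operator would treat as retryable, are assumptions for illustration only.

```python
import sys

# Hypothetical exit codes; which values the operator treats as retryable vs.
# permanent is exactly the open question in this thread.
EXIT_PERMANENT = 1
EXIT_RETRYABLE = 2


def train():
    """Placeholder for the user's actual training loop."""


if __name__ == "__main__":
    try:
        train()
    except (IOError, ValueError):
        sys.exit(EXIT_PERMANENT)   # e.g. an input file does not exist: retrying won't help
    except Exception:
        sys.exit(EXIT_RETRYABLE)   # e.g. a transient error: safe to restart
```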

@ScorpioCPH ScorpioCPH Mar 7, 2018

Hi,

  • In case 1, AFAIK this will produce a node NotReady condition, and Kubernetes will reschedule all pods running on that node to another node after a while. It is handled automatically by the Kubernetes system and is not controlled by any user-defined config (e.g. RestartPolicy or exit code).

  • In case 2, this is an error inside the container, so it may be better to let the user handle this error by setting RestartPolicy.

Contributor

In case 1, I don't think K8s will reschedule the pods. Pods aren't durable
https://kubernetes.io/docs/concepts/workloads/pods/pod/#durability-of-pods-or-lack-thereof
I think it's the controller that creates a new pod, but in our case we are creating/managing the pods ourselves.

In case 2, I don't think we can do this with the built-in RestartPolicy, because the only options are Never or Always and I don't think that's flexible enough.

// TFJobStatus represents the current observed state of the TFJob.
type TFJobStatus struct {
// Phase is the recently observed lifecycle phase of the TFJob.
Phase TFJobPhase `json:"phase"`
Contributor

Do we need Phase if we have conditions? I thought @gaocegege or someone else suggested that K8s was moving away from using Phase and using conditions instead?

@gaocegege gaocegege Mar 6, 2018

Phase is required, but we should use conditions to determine the state of the TFJob. Both fields are needed, IMO.

Contributor

@gaocegege Required by K8s?

Member

@jlewi By convention, I think

Contributor

For reference: kubeflow/training-operator#223, as quoted there from the docs:

Some resources in the v1 API contain fields called phase, and associated message, reason, and other status fields. The pattern of using phase is deprecated.

My interpretation of this is that not all v1 resources contain phase. So I don't think we should include phase.

I think the information in phase overlaps with conditions. So by providing both phase and conditions, our API isn't as clean as it could be, because users will need to figure out which one to look at.

- Define the structure of API `v1alpha2`.
+ Cover most of the refactoring requests we have discussed.
+ Simplify the API definition.
- Define an `event-driven` mechanism for TFJob life-cycle management.
Contributor

Would it make sense to split the implementation into its own proposal as opposed to using 1 proposal for the implementation and API?

Member

I think they are bundled. If we do not migrate the implementation to the new design, we have to implement it twice: once for the etcd-operator style and once for the event-driven style. I think that is unnecessary. 🤔

Member Author

Hi, this proposal follows the proposal template; it makes sense to me to put the API and the design together in one proposal :)

Contributor

@gaocegege @ScorpioCPH Splitting the proposals doesn't mean implementing them twice. The motivation for splitting the proposals is to narrow the scope of each proposal, which should speed up agreement. Actual work doesn't need to begin until both proposals are approved.

I'd expect the API to largely be independent of the implementation. So if we move implementation into its own proposal, that will focus discussion on the API and help us converge.

@ScorpioCPH ScorpioCPH Mar 7, 2018

@jlewi Hi, I don't think it's a good idea to split one design doc into two pieces.
Keeping this in one doc makes it easier for readers to find the details in one place.

Contributor

Sure.

// "PS": TFReplicaSpec,
// "Worker": TFReplicaSpec,
// }
TFReplicaSpecs map[TFReplicaType]*TFReplicaSpec `json:"tfReplicaSpecs"`
Contributor

Why not get rid of TFReplicaType and use arbitrary strings for the replica name?
The suggestion in various discussions was to replace TFReplicaType with various properties on the Replica as opposed to inferring them based on the replica type.

As an example, restart behavior is currently inferred based on TFReplicaType. But that could instead be based on specific properties like RestartPolicy.

Member Author

Sorry, I'm not very clear on your comment about getting rid of TFReplicaType. IMO, TFReplicaType is a key concept for distributed TensorFlow.

Contributor

Why not just make it

TFReplicaSpecs map[string]*TFReplicaSpec 

And let the user pick what keys to use and by extension how many different replicas to have. This would allow the user to easily add a new set of replicas to do something like evaluation.

Or in the case of reinforcement learning, they might use a set of replicas to run simulations to generate training data.

Member Author

But if we let the user define this field, how does tf-operator know how to handle it?

Contributor

Why does TFOperator need to do anything differently based on the name? The TFOperator should be able to handle all replicas the same way. Any differences should be controlled by properties added to the Replica.

For example, in v1alpha1 TFOperator assigns different restart policies based on TFReplicaType.

Instead of depending on the name/Type of the replica to determine the restart behavior we could add a RestartPolicy property to TFReplicaSpec.

kubeflow/training-operator#61 is an example of an issue that would be fixed by using strings instead of TFReplicaType. If we make it a string, then a user could just call the replica "chief" or "master" based on the TF version.

@jlewi
Contributor

jlewi commented Mar 5, 2018



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

// TFJobSpec is a desired state description of the TFJob.
type TFJobSpec struct {
    // TFReplicaSpecs is map of TFReplicaType and TFReplicaSpec

What about TerminationPolicy?
https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/tensorflow/v1alpha1/types.go#L58

How does a user specify the termination criterion for the job?


proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

    // to describe how the containers within the pod should be restarted.
    // Please set this restart policy carefully according to your code.
    Template *v1.PodTemplateSpec `json:"template,omitempty"`

Per the comment above, should we get rid of TFReplicaType and instead add properties to TFReplicaSpec to specify the relevant behavior?

I think the only thing we use TFReplicaType for is restart behavior.
Right now we always restart parameter servers; we don't look at the exit code to determine if an error is permanent or retryable.
Should we make this configurable so that for each replica the user can specify whether to always retry? Or only retry certain exit codes?


proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

```go
// TFReplicaType is the type for TFReplica.
type TFReplicaType string
```

Should we get rid of TFReplicaType?


proposals/tf-operator-design-v1alpha2.md, line 211 at r1 (raw file):

    // which specified by user in v1.PodTemplateSpec.
    // The training is freezing/pending.
    TFJobRestarting TFJobConditionType = "Restarting"

Can a condition be triggered more than once?


proposals/tf-operator-design-v1alpha2.md, line 303 at r1 (raw file):

Previously, inc0 (Michał Jastrzębski) wrote…

I think it would be useful for these configurations to be env-dependant. For example we could run same job in local (notebook) single node for smoke testing and first metrics and then, ideally without code change, run it distributed to do actual learning.

Why do you need this to be env-dependent in the API to do this? Right now, the way you make this env-dependent is that in ksonnet your TFJob would have different parameters based on the environment. So you might have parameters:
num_workers
num_ps
num_master

The values of these could then be set differently for each environment. So that you could run single worker locally and distributed in the cloud.


proposals/tf-operator-design-v1alpha2.md, line 347 at r1 (raw file):

### Reconciler

More than that, we should provide a `Reconciler` mechanism to reconcile observed and desired states and repair discrepancies as a double check for `Event-Driven` mechanism.

The current implementation does the following: for any event, run reconcile.
An event could be a specific event (e.g. TFJob created/deleted/updated). We also use the informer to generate a periodic update event to ensure reconcile is called regularly.

With the current implementation though, we don't invoke different logic for different events. This is less efficient but perhaps more reliable since we have a single code path that is always invoked.

Do we have any idea how expensive calling reconcile is and whether we are saving a whole lot of resources by trying to add different logic for different events?

The expensive part is probably calling the APIs to list pods/services. Can we use the informer to cache the status of pods/services and reuse them across different TFJobs, so that each call to Reconcile doesn't actually send a list request to the API?



gaocegege previously approved these changes Mar 6, 2018
@gaocegege gaocegege dismissed their stale review March 6, 2018 02:31

Misoperation

@ScorpioCPH
Member Author



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

What about TerminationPolicy?
https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/tensorflow/v1alpha1/types.go#L58

How does a user specify the termination criterion for the job?

Hi, can you show us some user stories or use cases showing what TerminationPolicy is used for?


proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Per the comment above, should we get rid of TFReplicaType and instead add properties to TFReplicaSpec to specify the relevant behavior?

I think the only thing we use TFReplicaType for is restart behavior.
Right now we always restart parameter servers; we don't look at the exit code to determine if an error is permanent or retryable.
Should we make this configurable so that for each replica the user can specify whether to always retry? Or only retry certain exit codes?

We can use TFReplicaType for different handling logic, for example, generating the TF cluster spec configuration :)


proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Should we get rid of TFReplicaType?

Is there any benefit to using arbitrary strings?


proposals/tf-operator-design-v1alpha2.md, line 211 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Can a condition be triggered more than once?

Yes, will update the LastUpdateTime field.


proposals/tf-operator-design-v1alpha2.md, line 347 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

The current implementation does the following: for any event, run reconcile.
An event could be a specific event (e.g. TFJob created/deleted/updated). We also use the informer to generate a periodic update event to ensure reconcile is called regularly.

With the current implementation though, we don't invoke different logic for different events. This is less efficient but perhaps more reliable since we have a single code path that is always invoked.

Do we have any idea how expensive calling reconcile is and whether we are saving a whole lot of resources by trying to add different logic for different events?

The expensive part is probably calling the APIs to list pods/services. Can we use the informer to cache the status of pods/services and reuse them across different TFJobs, so that each call to Reconcile doesn't actually send a list request to the API?

As described in this proposal, Event-Driven is our first choice; we should trust the observed state reported by the informer and use the Reconciler as a double check (it may be deprecated in the future).



@jlewi
Contributor

jlewi commented Mar 6, 2018



proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

Is there any benefit to use arbitrary strings?

Yes. It allows us to support things like eval workers and other use cases more easily, because we don't need to introduce new replica types. Instead, we can have properties that control different behavior (e.g. restart behavior), and the user can create however many replicas they need/want.



@jlewi
Contributor

jlewi commented Mar 6, 2018



proposals/tf-operator-design-v1alpha2.md, line 347 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

As described in this proposal, Event-Driven is our first choice, we should trust the reported observed state from informer and use Reconciler as a double-check (maybe deprecated in the future)

SGTM. As long as Reconcile provides a backup I'm satisfied.



@jlewi
Contributor

jlewi commented Mar 7, 2018



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

Hi, can you show us some user stories or use cases what TerminationPolicy is used for?

Some TF programs have a "chief" replica. Other TF programs use "worker 0" as the chief.

See: kubeflow/training-operator#192

I think tf_cnn_benchmarks is an example of one program that doesn't use a chief.

So a field like TerminationPolicy gives the user the ability to configure the operator to match their TF program.


proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

We can use TFRepliaType for different handling logic, for example, generate TF cluster spec configuration :)

We can and still need to generate the TF cluster spec. Right now we generate the TF cluster spec by converting the TFReplicaType enum to a string. If we make it a map[string]TFReplica we could just use the string names in the cluster config.

This would help with issues like
kubeflow/training-operator#61

Because we would no longer have to change the TFReplicaType enum every time TF changes the naming.


proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Yes. It allows us to support things like eval workers and other use cases more easily because we don't need to introduce new replica types. Instead we can have properties that control different behavior e.g. restart behavior and user can create however many replicas they need/want.

See also
kubeflow/training-operator#61


proposals/tf-operator-design-v1alpha2.md, line 416 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

@inc0 Thanks for your high-level comments, it makes totally sense for me.

For e2e workflow we can re-use kubernetes/test-infra:

  • Deploy a local/remote kubernetes cluster.
  • Deploy TFJob controller and create CRD.
  • Run e2e test case:
    • Load e2e test model from scratch.
    • Create TFJob yaml and deploy it.
    • Wait and check the result.

IMHO, put e2e design into a dedicated doc maybe better.
And i think TensorFlow serving is a very important feature of tf-operator, but i don't take something to think about it now :)

I think this is covered by our CUJs (critical user journeys)
kubeflow/kubeflow#87



@ScorpioCPH
Member Author



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

configure the operator to match their TF program.

Sorry, I'm not very clear about your point; it seems like a parameter-passing issue: telling user code which worker is the chief.
What is the relationship with termination?


proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

enumerated


proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

See also
kubeflow/training-operator#61

Hi, I'm not a TensorFlow expert :) and this is the first time I've heard of eval workers.
How about enumerating all the types we support now, and adding new types in the next version?
Yes, arbitrary strings are free and flexible, but more flexibility means more mistakes/errors:
for example, if a user puts a typo like workr in this field, it is hard to debug.
With an enumeration we can report a type error immediately.


proposals/tf-operator-design-v1alpha2.md, line 126 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

For reference: kubeflow/training-operator#223 as quoted in their from the docs

Some resources in the v1 API contain fields called phase, and associated message, reason, and other status fields. The pattern of using phase is deprecated.

My interpretation of this is that not all V1 resources contain phase. So I don't think we should include phase.

I think the information in phase overlaps with conditions. So by providing phase and conditions are API isn't as clean as it could be because users will need to figure out which one to look at.

Yes, phases are like state machines with explicit, enumerated states and they aren't extensible.
There are Phases and Conditions in Pod/Node spec now.
And FYI, there is a discussion about Phases and Conditions in kubernetes/kubernetes#7856.


proposals/tf-operator-design-v1alpha2.md, line 392 at r1 (raw file):

I think its the controller that creates a new pod but in our case we are creating/managing the pods ourselves.

Yes, you got the point.



@jlewi
Contributor

jlewi commented Mar 7, 2018



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

configure the operator to match their TF program.

Sorry i'm not very clear about your point, it seems like a parameter passing issue: tell user code which worker is chief.
Is this any relationship with Termination?

The TFJob operator needs to know when a job is done. There are lots of different ways a user might signal that their program is done.

  1. A user program might have a chief replica and when that chief exits with exit 0 that should be interpreted as the job completing successfully
  2. The user might use worker 0 as the chief in which case TFJob should be marked completed when the worker 0 exits with exit code zero.
  3. The user might want the TFJob to run until all workers have completed with exit code 0.

These are different examples of termination policies. The TFJob operator needs to know which termination policy a user wants so that it can correctly determine when a job is finished.


proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

enumerated

Sorry I don't understand the latest reply?


proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

Hi, I not a TensorFlow expert :) and just heard the eval workers first time.
How about enumerate the all type we support now, and add new types in the next version.
Yes, arbitrary strings are free and flexible, but more flexible means increasing mistakes/errors:
For example, user put a typo workr in this filed, it hard to debug this typo.
We can report a type error immediately by using enumeration.

What do others think?

I'm on the fence about this. The advantage of making it a string is that the user can add replicas we didn't think about.

Validation could potentially be done in other places. For example, a user could parse TF_CONFIG and make sure the host names match the expected types.

Arguably this should be done by the API that imposes the convention. For example, if tf.Estimator only understands "worker", "eval", and "master", then if TF_CONFIG contains some other name, e.g. "workr", it should raise an error.


proposals/tf-operator-design-v1alpha2.md, line 126 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

Yes, phases are like state machines with explicit, enumerated states and they aren't extensible.
There are Phases and Conditions in Pod/Node spec now.
And FYI, kubernetes/kubernetes#7856, there is a discussion about Phases and Conditions.

So can we get rid of Phase from our API? The linked issue indicates Brian would like to get rid of Phase and that its existence leads to users thinking incorrectly about controllers.

So why not remove it from our API now, when it's easy to do so?


proposals/tf-operator-design-v1alpha2.md, line 392 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

I think its the controller that creates a new pod but in our case we are creating/managing the pods ourselves.

Yes, you got the point.

So what does that mean in terms of the API? Does our API need to include a way for users to specify whether container terminations are retryable or not?



@ScorpioCPH
Member Author

@jlewi Reviewable is offline on my laptop. I'll try to summarize the comments here:

  • TF_CONFIG & ClusterSpec

    • we can generate them both for different use cases.
    • pass them both into the container.
    • users can pass user-defined arguments by themselves.
  • TerminationPolicy & exit code

    • we should get rid of TerminationPolicy, as it is used to identify which worker is the chief in API v1alpha1.
      • use worker 0 as the chief worker by default and keep it as a convention to make the API simple.
      • users who are not TensorFlow experts should not change the default chief worker (worker 0).
    • I strongly recommend not using any exit codes to interrupt the user's code.
    • restarting will have no effect unless the user adds specific code (e.g. reloading the checkpoint file themselves).
      • so letting the user make this decision makes more sense.
    • how about adding WorkerRestartPolicy to TFJob:
      • RestartAll: the user knows that they have written the reloading code, so restarting will take effect.
        • we will set all Pods' RestartPolicy to OnFailure.
      • RestartChief: only set Pod worker-0's RestartPolicy to OnFailure.
      • Never: do nothing.
  • TFReplicaType

    • using an enumeration for this field makes it clear and explicit.
    • this proposal is focused on distributed TensorFlow training ("ps" and "worker"), so "eval" is out of scope. I filed an issue for supporting "eval" workers and we can maybe add this feature in the next version :)

@jlewi
Contributor

jlewi commented Mar 8, 2018



proposals/tf-operator-design-v1alpha2.md, line 295 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

@jlewi After investigating, TF_CONFIG is used for tf.estimators, which is a high level encapsulation. How about pass the both TF_CONFIG and ClusterSpec together?

  • Generate TF_CONFIG JSON string and set it in Command field.
  • Generate ClusterSpec args and set it in Args field.

So both estimators model and other tensorflow model can work together.

If you pass command line arguments, how do you avoid the brittleness issues mentioned above? For example, suppose my program uses TF_CONFIG and therefore doesn't take the arguments you mentioned. Furthermore, suppose I want to treat unrecognized arguments as errors. Now the extra arguments cause my program to crash.

More generally, I think an API is cleaner if we avoid unnecessary redundancy.

I think there are good and flexible solutions for programs that want to take the cluster spec as command line arguments.

I think it's better if there is a launcher script that turns TF_CONFIG into whatever environment variables people want. This way people can use whatever conventions they want for command line arguments.

Using an init container it would be very easy to inject one or more automatic launcher scripts.

So I think using a launcher script, with or without an init container, is a better pattern than us messing with command line arguments.



@ScorpioCPH
Member Author

@jlewi Please take a look at this model, which doesn't use tf.estimator; it doesn't know what TF_CONFIG is.

@jlewi
Contributor

jlewi commented Mar 9, 2018

@ScorpioCPH Agreed, there are lots of models out there that don't use TF_CONFIG, such as our TF_CNN example in Kubeflow. Those models can easily be made to run in Kubeflow just by running a launcher script like we do for TF_CNN:
https://github.com/kubeflow/kubeflow/blob/master/tf-controller-examples/tf-cnn/launcher.py

@jlewi
Contributor

jlewi commented Mar 9, 2018



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

The TFJob operator needs to know when a job is done. There are lots of different ways a user might signal that their program is done.

  1. A user program might have a chief replica and when that chief exits with exit 0 that should be interpreted as the job completing successfully
  2. The user might use worker 0 as the chief in which case TFJob should be marked completed when the worker 0 exits with exit code zero.
  3. The user might want TFJob to run untill all workers have completed with exit code 0.

These are different examples of termination policies. TFJob operator needs to know which termination policy a user wants to use so that it can determine correctly when a job is finished..

Can you update this per our discussion?

To summarize what we agreed in slack.

There are two use cases we want to support:

  1. some users use worker:0 as the "master worker".
  2. Others use a "chief" replica type as the "master worker".
    (Although the Caicloud folks only use 1, there are many users who use 2, so we want to support both.)

We can infer which one is the "master worker" by the following logic:
if there is "chief" replica type, it's the "master worker". Else, worker:0 is the master worker.

So we don't need any fields in the spec.

Can we update the proposal to document the fact that we will infer whether to use worker:0 or chief automatically?
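
For illustration, the same inference rule as seen from user code reading TF_CONFIG (a sketch; the operator would apply the equivalent check on the replica specs when deciding whose exit terminates the job):

```python
import json
import os

tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster = tf_config.get("cluster", {})
task = tf_config.get("task", {})

# If a "chief" replica type exists, it is the master worker; otherwise worker:0
# plays that role.
if "chief" in cluster:
    is_chief = task.get("type") == "chief"
else:
    is_chief = task.get("type") == "worker" and task.get("index", 0) == 0
```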


proposals/tf-operator-design-v1alpha2.md, line 83 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Why does TFOperator need to do anything differently based on the name? The TFOperator should be able to handle all replicas the same way. Any differences should be controlled by properties added to the Replica.

For example, in v1alpha1 TFOperator assigns different restart policies based on TFReplicaType.

Instead of depending on the name/Type of the replica to determine the restart behavior we could add a RestartPolicy property to TFReplicaSpec.

kubeflow/training-operator#61 is an example of an issue that would be fixed by using strings instead of TFReplicaType. If we make it a string, then a user could just call the replica "chief" or "master" based on the TF version.

Per our discussion offline.

We agreed to keep this as enum with allowed values

"chief", "worker", "ps", "eval"


proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Sorry I don't understand the latest reply?

We agreed to keep it as an enum with types
chief, worker, ps, eval

Does that match your understanding?


proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

What do others think?

I'm on the fence about this. The advantage of making it a string is that the user can add replicas we didn't think about.

Validation could potentially be done in other places. For example, a user could parse TF_CONFIG and make sure the host names match the expected types.

Arguably this should be done by the API that imposes the convention. For example, if tf.Estimator only allows understands "worker", "eval", "master" then if the TF_CONFIG contains some other name e.g. "workr" then it should raise an error.

As noted above we will keep this an enum with types

"chief", "worker", "eval", "ps"

agreed?


proposals/tf-operator-design-v1alpha2.md, line 126 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

So can we get rid of Phase from our API? The linked issue indicates Brian would like to get rid of Phase and that its existence leads to users thinking incorrectly about controllers.

So why not remove it from our API now when its easy to do so?

I think the agreement in slack was to get rid of phase.


proposals/tf-operator-design-v1alpha2.md, line 295 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

@jlewi For TF_CONFIG i mean Env filed, sorry for the typo.

I think we agreed to only use TF_CONFIG; is that right?


proposals/tf-operator-design-v1alpha2.md, line 303 at r1 (raw file):

Previously, gaocegege (Ce Gao) wrote…

Thanks for your suggestion, and I think we already implement it since the Kubernetes could schedule all tasks of the job in one node if we only have one node.

@inc0 can we resolve this thread?


proposals/tf-operator-design-v1alpha2.md, line 392 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

So what does that mean in terms of the API? Does our API need to include a way for users to specify whether container terminations are retryable or not?

Please update the proposal with your latest thoughts after our discussion.



@ScorpioCPH
Member Author



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

if there is "chief" replica type, it's the "master worker". Else, worker:0 is the master worker.

Update this TFReplicaType to:

const (
    // TFReplicaTypePS is the type for parameter servers of distributed TensorFlow.
    TFReplicaTypePS TFReplicaType = "PS"

    // TFReplicaTypeWorker is the type for workers of distributed TensorFlow.
    TFReplicaTypeWorker TFReplicaType = "Worker"

    // TFReplicaTypeChief is the type for chief worker of distributed TensorFlow.
    // If there is "chief" replica type, it's the "chief worker". Else, worker:0 is the chief worker.
    TFReplicaTypeChief TFReplicaType = "Chief"

    // TFReplicaTypeEval is the type for evaluation replica in TensorFlow.
    TFReplicaTypeEval TFReplicaType = "Eval"
)

proposals/tf-operator-design-v1alpha2.md, line 83 at r1 (raw file):

We agreed to keep this as enum with allowed values
"chief", "worker", "ps", "eval"

Agreed.


proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

We agreed to keep it as an enum with types
chief, worker, ps eval

Does that match your understanding?

Yes.


proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

As noted above we will keep this an enum with types

"chief", "worker", "eval", "ps"

agreed?

Yes.


proposals/tf-operator-design-v1alpha2.md, line 126 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

I think the agreement in slack was to get rid of phase.

Sure.


proposals/tf-operator-design-v1alpha2.md, line 295 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

I think we agreed to only use TF_CONFIG; is that right?

Yes, will update the proposal doc.


proposals/tf-operator-design-v1alpha2.md, line 392 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Please update the proposal with your latest thoughts after our discussion.

Sure.



@ScorpioCPH
Member Author

@jlewi @lluunn @DjangoPeng @gaocegege @ddysher
I updated this doc after the offline discussion, PTAL, thanks!

@gaocegege
Member

/lgtm

gaocegege pushed a commit to kubeflow/training-operator that referenced this pull request Mar 12, 2018
per this design proposal kubeflow/community#30.
Update API to v1alpha2
@lluunn
Contributor

lluunn commented Mar 12, 2018

ReplicaType part looks good to me.

@jlewi
Contributor

jlewi commented Mar 17, 2018



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

if there is "chief" replica type, it's the "master worker". Else, worker:0 is the master worker.

Update this TFReplicaType to:

const (
    // TFReplicaTypePS is the type for parameter servers of distributed TensorFlow.
    TFReplicaTypePS TFReplicaType = "PS"

    // TFReplicaTypeWorker is the type for workers of distributed TensorFlow.
    TFReplicaTypeWorker TFReplicaType = "Worker"

    // TFReplicaTypeChief is the type for chief worker of distributed TensorFlow.
    // If there is "chief" replica type, it's the "chief worker". Else, worker:0 is the chief worker.
    TFReplicaTypeChief TFReplicaType = "Chief"

    // TFReplicaTypeEval is the type for evaluation replica in TensorFlow.
    TFReplicaTypeEval TFReplicaType = "Eval"
)

Looks good. I see it in the doc below.



@jlewi
Contributor

jlewi commented Mar 17, 2018



proposals/tf-operator-design-v1alpha2.md, line 93 at r2 (raw file):

// RestartPolicy describes how the TFReplicas should be restarted.
// Only one of the following restart policies may be specified.
// If none of the following policies is specified, the default one

Why isn't RestartPolicy a property of TFReplicaSpecs? I don't think we will want all replica specs to have the same restart policy.

For example, for PS it makes sense to have a restart policy of Always, because these are just TF servers and don't run any user code.

Whereas a restart policy of OnError might only apply to the workers.


proposals/tf-operator-design-v1alpha2.md, line 153 at r2 (raw file):

    // Represents time when the TFJob was acknowledged by the TFJob controller.
    // It is not guaranteed to be set in happens-before order across separate operations.

What does "it is not guaranteed to be set in happens-before order" mean?


proposals/tf-operator-design-v1alpha2.md, line 333 at r2 (raw file):

      We will create:
      - `two` pair pods/services for PSs:
        - tf-job-name-ps-1-uid

The UID is per job not per resource? So all the items listed here would use the same value for UID.



@jlewi
Contributor

jlewi commented Mar 17, 2018

Thanks for making RestartPolicy a property of the ReplicaSpec.
/lgtm

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege, jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 099c277 into kubeflow:master Mar 17, 2018
@ScorpioCPH ScorpioCPH deleted the tf-operator-design-v1alpha2 branch March 19, 2018 06:27
gaocegege pushed a commit to kubeflow/training-operator that referenced this pull request Mar 19, 2018
per this design proposal kubeflow/community#30.
Update API to v1alpha2
woop pushed a commit to woop/community that referenced this pull request Nov 16, 2020