
Add tf-operator design doc for API v1alpha2 #30

Merged

Conversation


@ScorpioCPH ScorpioCPH commented Mar 5, 2018

Hi folks,

This is a proposal for tf-operator API v1alpha2, PTAL, thanks!

List of issues related to the API
kubeflow/training-operator#64
kubeflow/training-operator#223

/cc @jlewi @gaocegege @DjangoPeng @ddysher



@gaocegege
Member

/lgtm

@inc0 inc0 left a comment

A few comments, mostly requests for further discussion.


**E2E Test**

We can use this model from the TensorFlow [repo](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/dist_test) for the e2e test.

Can you describe an example e2e workflow? From the beginning (empty Jupyter) to the end (multi-node TensorFlow Serving)?

I think it would be useful to have a discussion (in a separate, dedicated doc?) about how we want to connect Jupyter to tf-operator with the user experience in mind.

Another thing to consider, which might affect this design, is how we transfer artifacts (code and trained models) between steps in the workflow: I (a data scientist) write code in Jupyter -> somehow export it to a place where the tf-job (training) can access it -> the tf-job runs the same code in a distributed and scaled fashion -> the tf-job saves the model somewhere -> tf-serving runs in a scaled fashion and consumes the trained model.

@gaocegege gaocegege Mar 6, 2018

This proposal is for tf-operator, and tf-operator does not know about Jupyter, since Jupyter is not under tf-operator's management. The operator only knows the YAML config file, and I think the e2e test from Jupyter to serving should be done in kubeflow/kubeflow.

Member Author

@inc0 Thanks for your high-level comments, they make total sense to me.

For the e2e workflow we can reuse kubernetes/test-infra:

  • Deploy a local/remote kubernetes cluster.
  • Deploy TFJob controller and create CRD.
  • Run e2e test case:
    • Load e2e test model from scratch.
    • Create TFJob yaml and deploy it.
    • Wait and check the result.

IMHO, putting the e2e design into a dedicated doc may be better.
And I think TensorFlow Serving is a very important feature for tf-operator, but I haven't thought it through yet :)
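
To make the test-case steps above concrete, here is a minimal sketch using the Kubernetes Python client. The CRD group/version/plural (`kubeflow.org`/`v1alpha2`/`tfjobs`), the manifest file name, and the `Succeeded`/`Failed` condition names are assumptions for illustration only, not something this proposal fixes.

```python
import time

import yaml
from kubernetes import client, config


def run_tfjob_e2e(manifest_path="tfjob-mnist.yaml", namespace="default", timeout_s=1200):
    """Create a TFJob from a YAML manifest and poll its status until it finishes."""
    config.load_kube_config()  # inside the test cluster, use load_incluster_config() instead
    api = client.CustomObjectsApi()

    with open(manifest_path) as f:
        body = yaml.safe_load(f)

    # "Create TFJob yaml and deploy it."
    api.create_namespaced_custom_object(
        "kubeflow.org", "v1alpha2", namespace, "tfjobs", body)

    # "Wait and check the result." -- poll the (assumed) status conditions.
    name = body["metadata"]["name"]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        job = api.get_namespaced_custom_object(
            "kubeflow.org", "v1alpha2", namespace, "tfjobs", name)
        for cond in job.get("status", {}).get("conditions", []):
            if cond.get("type") == "Succeeded" and cond.get("status") == "True":
                return
            if cond.get("type") == "Failed" and cond.get("status") == "True":
                raise RuntimeError("TFJob failed: %s" % cond.get("message"))
        time.sleep(10)
    raise TimeoutError("TFJob did not finish within %ds" % timeout_s)
```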

+ For example, `worker0.example.com:2222,worker1.example.com:2222,worker2.example.com:2222`
- `job_name`: job name: worker or ps.
- `task_index`: task index, should be >= 0.
+ For example, worker task_index = 0 is the `chief` worker task that performs the variable initialization.

I think it would be useful for these configurations to be environment-dependent. For example, we could run the same job locally (in a notebook, single node) for smoke testing and first metrics, and then, ideally without code changes, run it distributed to do the actual training.

Member

Thanks for your suggestion. I think we already support this, since Kubernetes can schedule all tasks of the job onto one node if we only have one node.


**Auto-Generated Arguments:**

To make distributed TensorFlow work, the user **MUST** implement a parser in their code (e.g. `ArgumentParser`) to read the distributed TensorFlow configuration generated by `tf-operator`. The common built-in arguments are:
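
For illustration, a minimal sketch of such a parser in user code, assuming flags named `--ps_hosts`, `--worker_hosts`, `--job_name`, and `--task_index` in line with the examples above; the exact flag names would be whatever convention tf-operator settles on.

```python
import argparse

import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument("--ps_hosts", default="", help="comma-separated host:port pairs")
parser.add_argument("--worker_hosts", default="", help="comma-separated host:port pairs")
parser.add_argument("--job_name", choices=["ps", "worker"], required=True)
parser.add_argument("--task_index", type=int, default=0)
args, _ = parser.parse_known_args()

# Build the low-level ClusterSpec/Server from the injected arguments.
cluster = tf.train.ClusterSpec({
    "ps": args.ps_hosts.split(",") if args.ps_hosts else [],
    "worker": args.worker_hosts.split(",") if args.worker_hosts else [],
})
server = tf.train.Server(cluster, job_name=args.job_name, task_index=args.task_index)

if args.job_name == "ps":
    server.join()  # parameter servers only serve variables
```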
Member

I wonder if it's better to pass these in as environment variables instead of command line arguments. Seems it would be more flexible, as the user code can ignore these if they don't care about them, rather than being forced to parse them.


I'd say we could do both, or even add a config file into the mix: CLI overriding ENV overriding config file.

Contributor

+1 to Rong's suggestion of using environment variables, and more specifically the environment variable TF_CONFIG.

TensorFlow has already adopted the environment variable TF_CONFIG
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py#L84

as a convention for passing runtime information (e.g. the number of workers) to libraries like tf.Estimator that can process that information.

Why would we try to define a new convention in the form of a set of specific command line arguments?

Passing command line arguments also seems brittle. What if existing code uses different flag-parsing conventions, e.g. "-" as opposed to "--"? What if the code uses tf.Estimator (and therefore can automatically use TF_CONFIG) together with strict argument parsing? Then the argument parser would raise an exception because of unrecognized command line arguments, so users would have to define arguments specific to the TFJob operator even though their code never uses them.
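For reference, this is roughly what a TF_CONFIG value looks like and how code reads it back; the host names and task assignment below are purely illustrative.

```python
import json
import os

# Example of what the operator could put into TF_CONFIG for worker 1 of a
# 2-PS / 3-worker job (host names are made up for illustration).
example_tf_config = {
    "cluster": {
        "ps": ["tf-job-ps-0:2222", "tf-job-ps-1:2222"],
        "worker": ["tf-job-worker-0:2222",
                   "tf-job-worker-1:2222",
                   "tf-job-worker-2:2222"],
    },
    "task": {"type": "worker", "index": 1},
    "environment": "cloud",
}
os.environ.setdefault("TF_CONFIG", json.dumps(example_tf_config))

# User code (or a library such as tf.estimator) reads it back from the environment.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
task_type = tf_config["task"]["type"]
task_index = tf_config["task"]["index"]
cluster = tf_config["cluster"]
```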

Member

We have had some discussion in kubeflow/training-operator#369:

TF_CONFIG is a TensorFlow convention that TF APIs like the Estimator API use to get information about the runtime environment and configure the job appropriately.

If users want to use command line arguments, they can write a launcher script that parses TF_CONFIG and sets the command line arguments as needed. This is what we do in the TFCNN example; see https://github.com/kubeflow/kubeflow/blob/master/tf-controller-examples/tf-cnn/launcher.py

I think the problem with command line arguments is that everyone will use slightly different conventions.
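
A sketch of that launcher pattern, loosely modeled on the linked launcher.py; the wrapped script name (`train.py`) and the flag names it expects are placeholders, not taken from the actual example.

```python
#!/usr/bin/env python
"""Translate TF_CONFIG into command line flags for a script that expects them."""
import json
import os
import subprocess
import sys

tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster = tf_config.get("cluster", {})
task = tf_config.get("task", {})

# Map TF_CONFIG onto whatever flag convention the wrapped script happens to use.
cmd = [
    sys.executable, "train.py",
    "--ps_hosts=" + ",".join(cluster.get("ps", [])),
    "--worker_hosts=" + ",".join(cluster.get("worker", [])),
    "--job_name=" + task.get("type", ""),
    "--task_index=" + str(task.get("index", 0)),
]
sys.exit(subprocess.call(cmd))
```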

Member Author

@jlewi Hi, are there any official TensorFlow docs about TF_CONFIG? I checked the example on the TensorFlow website: https://www.tensorflow.org/deploy/distributed.

And I notice that RunConfig is marked "(deprecated, use tf.estimator.RunConfig instead)" here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py#L15

@ScorpioCPH ScorpioCPH Mar 6, 2018

And what if the user does not use the Estimator library?

Contributor

https://www.tensorflow.org/api_docs/python/tf/estimator/RunConfig

The RunConfig docs say that it will automatically be populated based on the environment variable TF_CONFIG if it is set.

My hope is that most high level libraries will follow the convention of using the environment variable TF_CONFIG to automatically configure themselves.

As @gaocegege points out above, if a user is using a library that doesn't use TF_CONFIG, then they have two options:

  1. Modify the code to read and parse TF_CONFIG
  2. Create a launcher script (possibly using an init container) to parse TF_CONFIG and set command line arguments appropriately.

I really think we should try to avoid introducing new conventions (e.g. standard command line flags) when the TF community already has an existing convention.

Furthermore, when I look at TF code I don't see broad agreement about what command line flags should be used or how they should be structured.

Member Author

@jlewi Thanks, will take a deeper look at this later.

@ScorpioCPH ScorpioCPH Mar 8, 2018

@jlewi After investigating, TF_CONFIG is used by tf.estimator, which is a high-level encapsulation. How about passing both TF_CONFIG and the ClusterSpec together?

  • Generate TF_CONFIG JSON string and set it in Env field.
  • Generate ClusterSpec args and set it in Args field.

So both Estimator models and other TensorFlow models can work.
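
For comparison, a model that does not use tf.estimator can still consume TF_CONFIG directly instead of relying on generated command line arguments; a rough sketch with the TF 1.x low-level API, assuming the operator has set TF_CONFIG:

```python
import json
import os

import tensorflow as tf

tf_config = json.loads(os.environ["TF_CONFIG"])  # assumed to be set by the operator
task = tf_config["task"]

# Build the same ClusterSpec/Server objects that the generated CLI arguments
# would have described, straight from TF_CONFIG.
cluster = tf.train.ClusterSpec(tf_config["cluster"])
server = tf.train.Server(cluster,
                         job_name=task["type"],
                         task_index=task["index"])

if task["type"] == "ps":
    server.join()
else:
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task["index"],
            cluster=cluster)):
        pass  # build the training graph here
```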

Member Author

@jlewi For TF_CONFIG I mean the Env field, sorry for the typo.


### Error Handling

To make the system robust, the tf-operator should be able to locally and automatically recover from errors.
Contributor

What about retryable vs. permanent errors? The current implementation uses exit codes to distinguish between retryable and permanent errors. Should we continue to use that?

Member Author

Hi, I think this is open for discussion. IMHO, separating retryable and permanent errors may not be easy, so letting the user decide may be better.

Contributor

So I agree the user should decide whether an error is retryable or not, but that decision needs to be communicated to the TFJob operator so that it can determine what to do.

For example, here are two different cases where TF could be terminated

  1. The VM running the pod is terminated or becomes unhealthy
  2. TF exits with an exception because user tries to read a file that doesn't exist.

My claim is that in the first case the TF operator should restart TF since it's a retryable error, but in the second case the error is most likely permanent (the file is unlikely to exist if we keep retrying), so the job should fail.

Exit codes provide a mechanism for distinguishing different types of errors. The user has some control over the exit code. For example, they can wrap their code in a try/catch block and turn different exceptions into exit codes corresponding to retryable or permanent errors as needed.

This isn't great, but it's fairly straightforward.
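
A sketch of the wrapping described here; the specific exit-code values, and which ones the operator would treat as retryable, are assumptions for illustration only.

```python
import sys

# Hypothetical exit codes; which values the operator treats as retryable vs.
# permanent is exactly the open question in this thread.
EXIT_PERMANENT = 1
EXIT_RETRYABLE = 2


def train():
    """Placeholder for the user's actual training loop."""


if __name__ == "__main__":
    try:
        train()
    except (IOError, ValueError):
        sys.exit(EXIT_PERMANENT)   # e.g. an input file does not exist: retrying won't help
    except Exception:
        sys.exit(EXIT_RETRYABLE)   # e.g. a transient error: safe to restart
```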

@ScorpioCPH ScorpioCPH Mar 7, 2018

Hi,

  • In case 1, AFAIK this will produce a node NotReady condition, and Kubernetes will reschedule all pods running on that node to another node after a while. It is handled automatically by the Kubernetes system and is not controlled by any user-defined config (e.g. RestartPolicy or exit code).

  • In case 2, this is an error inside the container, so it may be better to let the user handle this error by setting RestartPolicy.

Contributor

In case 1, I don't think K8s will reschedule the pods. Pods aren't durable
https://kubernetes.io/docs/concepts/workloads/pods/pod/#durability-of-pods-or-lack-thereof
I think it's the controller that creates a new pod, but in our case we are creating/managing the pods ourselves.

In case 2, I don't think we can do this with the built-in RestartPolicy, because the only options are Never or Always and I don't think that's flexible enough.

// TFJobStatus represents the current observed state of the TFJob.
type TFJobStatus struct {
// Phase is the recently observed lifecycle phase of the TFJob.
Phase TFJobPhase `json:"phase"`
Contributor

Do we need Phase if we have conditions? I thought @gaocegege or someone else suggested that K8s was moving away from using Phase and using conditions instead?

@gaocegege gaocegege Mar 6, 2018

Phase is required, but we should use conditions to determine the state of the TFJob. Both fields are needed, IMO.

Contributor

@gaocegege Required by K8s?

Member

@jlewi By convention, I think

Contributor

For reference: kubeflow/training-operator#223, as quoted there from the docs:

Some resources in the v1 API contain fields called phase, and associated message, reason, and other status fields. The pattern of using phase is deprecated.

My interpretation of this is that not all v1 resources contain phase. So I don't think we should include phase.

I think the information in phase overlaps with conditions. So by providing both phase and conditions, our API isn't as clean as it could be, because users will need to figure out which one to look at.

- Define the structure of API `v1alpha2`.
+ Cover most of the refactoring requests we have discussed.
+ Simplify the API definition.
- Define an `event-driven` mechanism for TFJob life-cycle management.
Contributor

Would it make sense to split the implementation into its own proposal as opposed to using 1 proposal for the implementation and API?

Member

I think they are bundled. If we do not migrate the implementation to the new design, we have to implement it twice: once for the etcd-operator style and once for the event-driven style. I think that is unnecessary. 🤔

Member Author

Hi, this proposal follows the proposal template; it makes sense to me to put the API and the design together in one proposal :)

Contributor

@gaocegege @ScorpioCPH Splitting the proposals doesn't mean implementing them twice. The motivation for splitting the proposals is to narrow the scope of each proposal, which should speed up agreement. Actual work doesn't need to begin until both proposals are approved.

I'd expect the API to largely be independent of the implementation. So if we move implementation into its own proposal, that will focus discussion on the API and help us converge.

@ScorpioCPH ScorpioCPH Mar 7, 2018

@jlewi Hi, I don't think it's a good idea to split one design doc into two pieces.
Keeping this in one doc makes it easier for readers to find the details in one place.

Contributor

Sure.

// "PS": TFReplicaSpec,
// "Worker": TFReplicaSpec,
// }
TFReplicaSpecs map[TFReplicaType]*TFReplicaSpec `json:"tfReplicaSpecs"`
Contributor

Why not get rid of TFReplicaType and use arbitrary strings for the replica name?
The suggestion in various discussions was to replace TFReplicaType with various properties on the Replica as opposed to inferring them based on the replica type.

As an example, restart behavior is currently inferred based on TFReplicaType. But that could instead be based on specific properties like RestartPolicy.

Member Author

Sorry, I'm not very clear on your comment about getting rid of TFReplicaType. IMO, TFReplicaType is a key concept for distributed TensorFlow.

Contributor

Why not just make it

TFReplicaSpecs map[string]*TFReplicaSpec 

And let the user pick what keys to use and by extension how many different replicas to have. This would allow the user to easily add a new set of replicas to do something like evaluation.

Or in the case of reinforcement learning, they might use a set of replicas to run simulations to generate training data.

Member Author

But if we let the user define this field, how does tf-operator know how to handle it?

Contributor

Why does TFOperator need to do anything differently based on the name? The TFOperator should be able to handle all replicas the same way. Any differences should be controlled by properties added to the Replica.

For example, in v1alpha1 TFOperator assigns different restart policies based on TFReplicaType.

Instead of depending on the name/Type of the replica to determine the restart behavior we could add a RestartPolicy property to TFReplicaSpec.

kubeflow/training-operator#61 is an example of an issue that would be fixed by using strings instead of TFReplicaType. If we make it a string, then a user could just call the replica "chief" or "master" based on the TF version.

@jlewi
Contributor

jlewi commented Mar 5, 2018



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

// TFJobSpec is a desired state description of the TFJob.
type TFJobSpec struct {
    // TFReplicaSpecs is map of TFReplicaType and TFReplicaSpec

What about TerminationPolicy?
https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/tensorflow/v1alpha1/types.go#L58

How does a user specify the termination criterion for the job?


proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

    // to describe how the containers within the pod should be restarted.
    // Please set this restart policy carefully according to your code.
    Template *v1.PodTemplateSpec `json:"template,omitempty"`

Per the comment above, should we get rid of TFReplicaType and instead add properties to TFReplicaSpec to specify the relevant behavior?

I think the only thing we use TFReplicaType for is restart behavior.
Right now we always restart parameter servers; we don't look at the exit code to determine if an error is permanent or retryable.
Should we make this configurable so that for each replica the user can specify whether to always retry? Or only retry certain exit codes?


proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

```go
// TFReplicaType is the type for TFReplica.
type TFReplicaType string
```

Should we get rid of TFReplicaType?


proposals/tf-operator-design-v1alpha2.md, line 211 at r1 (raw file):

    // which specified by user in v1.PodTemplateSpec.
    // The training is freezing/pending.
    TFJobRestarting TFJobConditionType = "Restarting"

Can a condition be triggered more than once?


proposals/tf-operator-design-v1alpha2.md, line 303 at r1 (raw file):

Previously, inc0 (Michał Jastrzębski) wrote…

I think it would be useful for these configurations to be env-dependant. For example we could run same job in local (notebook) single node for smoke testing and first metrics and then, ideally without code change, run it distributed to do actual learning.

Why do you need this to be env-dependent in the API to do this? Right now, the way you make this env-dependent is that in ksonnet your TFJob would have different parameters based on the environment. So you might have parameters:
num_workers
num_ps
num_master

The values of these could then be set differently for each environment. So that you could run single worker locally and distributed in the cloud.


proposals/tf-operator-design-v1alpha2.md, line 347 at r1 (raw file):

### Reconciler

More than that, we should provide a `Reconciler` mechanism to reconcile observed and desired states and repair discrepancies as a double check for `Event-Driven` mechanism.

The current implementation does the following: for any event, run reconcile.
An event could be a specific event (e.g. TFJob created/deleted/updated). We also use the informer to generate a periodic update event to ensure reconcile is called regularly.

With the current implementation though, we don't invoke different logic for different events. This is less efficient but perhaps more reliable since we have a single code path that is always invoked.

Do we have any idea how expensive calling reconcile is and whether we are saving a whole lot of resources by trying to add different logic for different events?

The expensive part is probably calling the APIs to list pods/services. Can we use the informer to cache the status of pods/services and reuse them across different TFJobs, so that each call to Reconcile doesn't actually send a list request to the API?



gaocegege previously approved these changes Mar 6, 2018
@gaocegege gaocegege dismissed their stale review March 6, 2018 02:31

Misoperation

@ScorpioCPH
Member Author



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

What about TerminationPolicy?
https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/tensorflow/v1alpha1/types.go#L58

How does a user specify the termination criterion for the job?

Hi, can you show us some user stories or use cases showing what TerminationPolicy is used for?


proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Per the comment above, should we get rid of TFReplicaType and instead add properties to TFReplicaSpec to specify the relevant behavior?

I think the only thing we use TFReplicaType for is restart behavior.
Right now we always restart parameter servers; we don't look at the exit code to determine if an error is permanent or retryable.
Should we make this configurable so that for each replica the user can specify whether to always retry? Or only retry certain exit codes?

We can use TFReplicaType for different handling logic, for example, generating the TF cluster spec configuration :)


proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Should we get rid of TFReplicaType?

Is there any benefit to using arbitrary strings?


proposals/tf-operator-design-v1alpha2.md, line 211 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Can a condition be triggered more than once?

Yes, will update the LastUpdateTime field.


proposals/tf-operator-design-v1alpha2.md, line 347 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

The current implementation does the following: for any event, run reconcile.
An event could be a specific event (e.g. TFJob created/deleted/updated). We also use the informer to generate a periodic update event to ensure reconcile is called regularly.

With the current implementation though, we don't invoke different logic for different events. This is less efficient but perhaps more reliable since we have a single code path that is always invoked.

Do we have any idea how expensive calling reconcile is and whether we are saving a whole lot of resources by trying to add different logic for different events?

The expensive part is probably calling the APIs to list pods/services. Can we use the informer to cache the status of pods/services and reuse them across different TFJobs, so that each call to Reconcile doesn't actually send a list request to the API?

As described in this proposal, Event-Driven is our first choice; we should trust the observed state reported by the informer and use the Reconciler as a double check (it may be deprecated in the future).



@jlewi
Contributor

jlewi commented Mar 6, 2018



proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

Is there any benefit to use arbitrary strings?

Yes. It allows us to support things like eval workers and other use cases more easily, because we don't need to introduce new replica types. Instead, we can have properties that control different behavior (e.g. restart behavior), and the user can create however many replicas they need/want.



@jlewi
Contributor

jlewi commented Mar 6, 2018



proposals/tf-operator-design-v1alpha2.md, line 347 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

As described in this proposal, Event-Driven is our first choice, we should trust the reported observed state from informer and use Reconciler as a double-check (maybe deprecated in the future)

SGTM. As long as Reconcile provides a backup I'm satisfied.



@jlewi
Contributor

jlewi commented Mar 7, 2018



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

Hi, can you show us some user stories or use cases what TerminationPolicy is used for?

Some TF programs have a "chief" replica. Other TF programs use "worker 0" as the chief.

See: kubeflow/training-operator#192

I think tf_cnn_benchmarks is an example of one program that doesn't use a chief.

So a field like TerminationPolicy gives the user the ability to configure the operator to match their TF program.


proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

We can use TFRepliaType for different handling logic, for example, generate TF cluster spec configuration :)

We can and still need to generate the TF cluster spec. Right now we generate the TF cluster spec by converting the TFReplicaType enum to a string. If we make it a map[string]TFReplica we could just use the string names in the cluster config.

This would help with issues like
kubeflow/training-operator#61

Because we would no longer have to change the TFReplicaType enum every time TF changes the naming.


proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Yes. It allows us to support things like eval workers and other use cases more easily because we don't need to introduce new replica types. Instead we can have properties that control different behavior e.g. restart behavior and user can create however many replicas they need/want.

See also
kubeflow/training-operator#61


proposals/tf-operator-design-v1alpha2.md, line 416 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

@inc0 Thanks for your high-level comments, it makes totally sense for me.

For e2e workflow we can re-use kubernetes/test-infra:

  • Deploy a local/remote kubernetes cluster.
  • Deploy TFJob controller and create CRD.
  • Run e2e test case:
    • Load e2e test model from scratch.
    • Create TFJob yaml and deploy it.
    • Wait and check the result.

IMHO, put e2e design into a dedicated doc maybe better.
And i think TensorFlow serving is a very important feature of tf-operator, but i don't take something to think about it now :)

I think this is covered by our CUJs (critical user journeys)
kubeflow/kubeflow#87



@ScorpioCPH
Member Author



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

configure the operator to match their TF program.

Sorry, I'm not very clear about your point; it seems like a parameter-passing issue: telling user code which worker is the chief.
What is the relationship with termination?


proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

enumerated


proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

See also
kubeflow/training-operator#61

Hi, I'm not a TensorFlow expert :) and this is the first time I've heard of eval workers.
How about enumerating all the types we support now, and adding new types in the next version?
Yes, arbitrary strings are free and flexible, but more flexibility means more mistakes/errors:
for example, if a user puts a typo like workr in this field, it is hard to debug.
With an enumeration we can report a type error immediately.


proposals/tf-operator-design-v1alpha2.md, line 126 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

For reference: kubeflow/training-operator#223 as quoted in their from the docs

Some resources in the v1 API contain fields called phase, and associated message, reason, and other status fields. The pattern of using phase is deprecated.

My interpretation of this is that not all V1 resources contain phase. So I don't think we should include phase.

I think the information in phase overlaps with conditions. So by providing phase and conditions are API isn't as clean as it could be because users will need to figure out which one to look at.

Yes, phases are like state machines with explicit, enumerated states and they aren't extensible.
There are Phases and Conditions in Pod/Node spec now.
And FYI, there is a discussion about Phases and Conditions in kubernetes/kubernetes#7856.


proposals/tf-operator-design-v1alpha2.md, line 392 at r1 (raw file):

I think its the controller that creates a new pod but in our case we are creating/managing the pods ourselves.

Yes, you got the point.



@jlewi
Contributor

jlewi commented Mar 7, 2018



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

configure the operator to match their TF program.

Sorry i'm not very clear about your point, it seems like a parameter passing issue: tell user code which worker is chief.
Is this any relationship with Termination?

The TFJob operator needs to know when a job is done. There are lots of different ways a user might signal that their program is done.

  1. A user program might have a chief replica and when that chief exits with exit 0 that should be interpreted as the job completing successfully
  2. The user might use worker 0 as the chief in which case TFJob should be marked completed when the worker 0 exits with exit code zero.
  3. The user might want the TFJob to run until all workers have completed with exit code 0.

These are different examples of termination policies. The TFJob operator needs to know which termination policy a user wants so that it can correctly determine when a job is finished.


proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

enumerated

Sorry I don't understand the latest reply?


proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

Hi, I not a TensorFlow expert :) and just heard the eval workers first time.
How about enumerate the all type we support now, and add new types in the next version.
Yes, arbitrary strings are free and flexible, but more flexible means increasing mistakes/errors:
For example, user put a typo workr in this filed, it hard to debug this typo.
We can report a type error immediately by using enumeration.

What do others think?

I'm on the fence about this. The advantage of making it a string is that the user can add replicas we didn't think about.

Validation could potentially be done in other places. For example, a user could parse TF_CONFIG and make sure the host names match the expected types.

Arguably this should be done by the API that imposes the convention. For example, if tf.Estimator only understands "worker", "eval", and "master", then if TF_CONFIG contains some other name, e.g. "workr", it should raise an error.


proposals/tf-operator-design-v1alpha2.md, line 126 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

Yes, phases are like state machines with explicit, enumerated states and they aren't extensible.
There are Phases and Conditions in Pod/Node spec now.
And FYI, kubernetes/kubernetes#7856, there is a discussion about Phases and Conditions.

So can we get rid of Phase from our API? The linked issue indicates Brian would like to get rid of Phase and that its existence leads to users thinking incorrectly about controllers.

So why not remove it from our API now, when it's easy to do so?


proposals/tf-operator-design-v1alpha2.md, line 392 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

I think its the controller that creates a new pod but in our case we are creating/managing the pods ourselves.

Yes, you got the point.

So what does that mean in terms of the API? Does our API need to include a way for users to specify whether container terminations are retryable or not?



@ScorpioCPH
Member Author

@jlewi Reviewable is offline on my laptop. I'll try to summarize the comments here:

  • TF_CONFIG & ClusterSpec

    • we can generate them both for different use cases.
    • pass them both into the container.
    • users can pass user-defined arguments by themselves.
  • TerminationPolicy & exit code

    • we should get rid of TerminationPolicy, as it is used to identify which worker is the chief in API v1alpha1.
      • use worker 0 as the chief worker by default and keep it as a convention to make the API simple.
      • users who are not TensorFlow experts should not change the default chief worker (worker 0).
    • I strongly recommend not using any exit codes to interrupt the user's code.
    • restarting will have no effect unless the user adds specific code (e.g. reloading the checkpoint file themselves).
      • so letting the user make this decision makes more sense.
    • how about adding WorkerRestartPolicy to TFJob:
      • RestartAll: the user knows that they have written the reloading code, so restarting will take effect.
        • we will set all Pods' RestartPolicy to OnFailure.
      • RestartChief: only set Pod worker-0's RestartPolicy to OnFailure.
      • Never: do nothing.
  • TFReplicaType

    • using an enumeration for this field makes it clear and explicit.
    • this proposal is focused on distributed TensorFlow training ("ps" and "worker"), so "eval" is out of scope. I filed an issue for supporting "eval" workers and we can maybe add this feature in the next version :)

@jlewi
Contributor

jlewi commented Mar 8, 2018



proposals/tf-operator-design-v1alpha2.md, line 295 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

@jlewi After investigating, TF_CONFIG is used for tf.estimators, which is a high level encapsulation. How about pass the both TF_CONFIG and ClusterSpec together?

  • Generate TF_CONFIG JSON string and set it in Command field.
  • Generate ClusterSpec args and set it in Args field.

So both estimators model and other tensorflow model can work together.

If you pass command line arguments, how do you avoid the brittleness issues mentioned above? For example, suppose my program uses TF_CONFIG and therefore doesn't take the arguments you mentioned. Furthermore, suppose I want to treat unrecognized arguments as errors. Now the extra arguments cause my program to crash.

More generally, I think an API is cleaner if we avoid unnecessary redundancy.

I think there are good and flexible solutions for programs that want to take the cluster spec as command line arguments.

I think it's better if there is a launcher script that turns TF_CONFIG into whatever environment variables people want. This way people can use whatever conventions they want for command line arguments.

Using an init container it would be very easy to inject one or more automatic launcher scripts.

So I think using a launcher script, with or without an init container, is a better pattern than us messing with command line arguments.



@ScorpioCPH
Member Author

@jlewi Please take a look at this model, which doesn't use tf.estimator; it doesn't know what TF_CONFIG is.

@jlewi
Contributor

jlewi commented Mar 9, 2018

@ScorpioCPH Agreed, there are lots of models out there that don't use TF_CONFIG, such as our TF_CNN example in Kubeflow. Those models can easily be made to run in Kubeflow just by running a launcher script like we do for TF_CNN:
https://github.com/kubeflow/kubeflow/blob/master/tf-controller-examples/tf-cnn/launcher.py

@jlewi
Contributor

jlewi commented Mar 9, 2018



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

The TFJob operator needs to know when a job is done. There are lots of different ways a user might signal that their program is done.

  1. A user program might have a chief replica and when that chief exits with exit 0 that should be interpreted as the job completing successfully
  2. The user might use worker 0 as the chief in which case TFJob should be marked completed when the worker 0 exits with exit code zero.
  3. The user might want TFJob to run untill all workers have completed with exit code 0.

These are different examples of termination policies. TFJob operator needs to know which termination policy a user wants to use so that it can determine correctly when a job is finished..

Can you update this per our discussion?

To summarize what we agreed in slack.

There are two use cases we want to support:

  1. some users use worker:0 as the "master worker".
  2. Others use a "chief" replica type as the "master worker".
    (Although the Caicloud folks only use 1, there are many users who use 2, so we want to support both.)

We can infer which one is the "master worker" by the following logic:
if there is "chief" replica type, it's the "master worker". Else, worker:0 is the master worker.

So we don't need any fields in the spec.

Can we update the proposal to document the fact that we will infer whether to use worker:0 or chief automatically?
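
For illustration, the same inference rule as seen from user code reading TF_CONFIG (a sketch; the operator would apply the equivalent check on the replica specs when deciding whose exit terminates the job):

```python
import json
import os

tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster = tf_config.get("cluster", {})
task = tf_config.get("task", {})

# If a "chief" replica type exists, it is the master worker; otherwise worker:0
# plays that role.
if "chief" in cluster:
    is_chief = task.get("type") == "chief"
else:
    is_chief = task.get("type") == "worker" and task.get("index", 0) == 0
```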


proposals/tf-operator-design-v1alpha2.md, line 83 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Why does TFOperator need to do anything differently based on the name? The TFOperator should be able to handle all replicas the same way. Any differences should be controlled by properties added to the Replica.

For example, in v1alpha1 TFOperator assigns different restart policies based on TFReplicaType.

Instead of depending on the name/Type of the replica to determine the restart behavior we could add a RestartPolicy property to TFReplicaSpec.

kubeflow/training-operator#61 is an example of an issue that would be fixed by using strings instead of TFReplicaType. If we make it a string, then a user could just call the replica "chief" or "master" based on the TF version.

Per our discussion offline.

We agreed to keep this as enum with allowed values

"chief", "worker", "ps", "eval"


proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Sorry I don't understand the latest reply?

We agreed to keep it as an enum with types
chief, worker, ps, eval

Does that match your understanding?


proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

What do others think?

I'm on the fence about this. The advantage of making it a string is that the user can add replicas we didn't think about.

Validation could potentially be done in other places. For example, a user could parse TF_CONFIG and make sure the host names match the expected types.

Arguably this should be done by the API that imposes the convention. For example, if tf.Estimator only allows understands "worker", "eval", "master" then if the TF_CONFIG contains some other name e.g. "workr" then it should raise an error.

As noted above we will keep this an enum with types

"chief", "worker", "eval", "ps"

agreed?


proposals/tf-operator-design-v1alpha2.md, line 126 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

So can we get rid of Phase from our API? The linked issue indicates Brian would like to get rid of Phase and that its existence leads to users thinking incorrectly about controllers.

So why not remove it from our API now when its easy to do so?

I think the agreement in slack was to get rid of phase.


proposals/tf-operator-design-v1alpha2.md, line 295 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

@jlewi For TF_CONFIG i mean Env filed, sorry for the typo.

I think we agreed to only use TF_CONFIG; is that right?


proposals/tf-operator-design-v1alpha2.md, line 303 at r1 (raw file):

Previously, gaocegege (Ce Gao) wrote…

Thanks for your suggestion, and I think we already implement it since the Kubernetes could schedule all tasks of the job in one node if we only have one node.

@inc0 can we resolve this thread?


proposals/tf-operator-design-v1alpha2.md, line 392 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

So what does that mean in terms of the API? Does our API need to include a way for users to specify whether container terminations are retryable or not?

Please update the proposal with your latest thoughts after our discussion.



@ScorpioCPH
Member Author



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

if there is "chief" replica type, it's the "master worker". Else, worker:0 is the master worker.

Update this TFReplicaType to:

const (
    // TFReplicaTypePS is the type for parameter servers of distributed TensorFlow.
    TFReplicaTypePS TFReplicaType = "PS"

    // TFReplicaTypeWorker is the type for workers of distributed TensorFlow.
    TFReplicaTypeWorker TFReplicaType = "Worker"

    // TFReplicaTypeChief is the type for chief worker of distributed TensorFlow.
    // If there is "chief" replica type, it's the "chief worker". Else, worker:0 is the chief worker.
    TFReplicaTypeChief TFReplicaType = "Chief"

    // TFReplicaTypeEval is the type for evaluation replica in TensorFlow.
    TFReplicaTypeEval TFReplicaType = "Eval"
)

proposals/tf-operator-design-v1alpha2.md, line 83 at r1 (raw file):

We agreed to keep this as enum with allowed values
"chief", "worker", "ps", "eval"

Agreed.


proposals/tf-operator-design-v1alpha2.md, line 100 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

We agreed to keep it as an enum with types
chief, worker, ps eval

Does that match your understanding?

Yes.


proposals/tf-operator-design-v1alpha2.md, line 107 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

As noted above we will keep this an enum with types

"chief", "worker", "eval", "ps"

agreed?

Yes.


proposals/tf-operator-design-v1alpha2.md, line 126 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

I think the agreement in slack was to get rid of phase.

Sure.


proposals/tf-operator-design-v1alpha2.md, line 295 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

I think we agreed to only use TF_CONFIG; is that right?

Yes, will update the proposal doc.


proposals/tf-operator-design-v1alpha2.md, line 392 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Please update the proposal with your latest thoughts after our discussion.

Sure.



@ScorpioCPH
Member Author

@jlewi @lluunn @DjangoPeng @gaocegege @ddysher
I updated this doc after the offline discussion, PTAL, thanks!

@gaocegege
Member

/lgtm

gaocegege pushed a commit to kubeflow/training-operator that referenced this pull request Mar 12, 2018
per this design proposal kubeflow/community#30.
Update API to v1alpha2
@lluunn
Contributor

lluunn commented Mar 12, 2018

ReplicaType part looks good to me.

@jlewi
Contributor

jlewi commented Mar 17, 2018



proposals/tf-operator-design-v1alpha2.md, line 76 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

if there is "chief" replica type, it's the "master worker". Else, worker:0 is the master worker.

Update this TFReplicaType to:

const (
    // TFReplicaTypePS is the type for parameter servers of distributed TensorFlow.
    TFReplicaTypePS TFReplicaType = "PS"

    // TFReplicaTypeWorker is the type for workers of distributed TensorFlow.
    TFReplicaTypeWorker TFReplicaType = "Worker"

    // TFReplicaTypeChief is the type for chief worker of distributed TensorFlow.
    // If there is "chief" replica type, it's the "chief worker". Else, worker:0 is the chief worker.
    TFReplicaTypeChief TFReplicaType = "Chief"

    // TFReplicaTypeEval is the type for evaluation replica in TensorFlow.
    TFReplicaTypeEval TFReplicaType = "Eval"
)

Looks good. I see it in the doc below.



@jlewi
Contributor

jlewi commented Mar 17, 2018



proposals/tf-operator-design-v1alpha2.md, line 93 at r2 (raw file):

// RestartPolicy describes how the TFReplicas should be restarted.
// Only one of the following restart policies may be specified.
// If none of the following policies is specified, the default one

Why isn't RestartPolicy a property of TFReplicaSpecs? I don't think we will want all replica specs to have the same restart policy.

For example, for PS it makes sense to have a restart policy of Always, because these are just TF servers and don't run any user code.

Whereas a restart policy of OnError might only apply to the workers.


proposals/tf-operator-design-v1alpha2.md, line 153 at r2 (raw file):

    // Represents time when the TFJob was acknowledged by the TFJob controller.
    // It is not guaranteed to be set in happens-before order across separate operations.

What does "it is not guaranteed to be set in happens-before order" mean?


proposals/tf-operator-design-v1alpha2.md, line 333 at r2 (raw file):

      We will create:
      - `two` pair pods/services for PSs:
        - tf-job-name-ps-1-uid

The UID is per job not per resource? So all the items listed here would use the same value for UID.



@jlewi
Contributor

jlewi commented Mar 17, 2018

Thanks for making RestartPolicy a property of the ReplicaSpec.
/lgtm

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege, jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 099c277 into kubeflow:master Mar 17, 2018
@ScorpioCPH ScorpioCPH deleted the tf-operator-design-v1alpha2 branch March 19, 2018 06:27
gaocegege pushed a commit to kubeflow/training-operator that referenced this pull request Mar 19, 2018
per this design proposal kubeflow/community#30.
Update API to v1alpha2
woop pushed a commit to woop/community that referenced this pull request Nov 16, 2020