Refactor TFJobStatus in CRD API #333
Comments
@jlewi |
Yes. I generally agree we might want to let the user decide what the restart behavior should be. |
@jlewi @gaocegege How about this structure for TFJobStatus?

type TFJobStatus struct {
    // Status for the chief worker.
    // The number of actively running chief workers.
    ActiveChief int32 `json:"activeChief"`
    // The number of chief workers that have completed.
    CompletedChief int32 `json:"completedChief"`
    // The number of chief workers that have failed.
    FailedChief int32 `json:"failedChief"`

    // Status for workers (includes the chief worker).
    // The number of actively running workers.
    ActiveWorkers int32 `json:"activeWorkers"`
    // The number of workers that have completed.
    CompletedWorkers int32 `json:"completedWorkers"`
    // The number of workers that have failed.
    FailedWorkers int32 `json:"failedWorkers"`

    // Status for PSs.
    // The number of actively running PSs.
    ActivePSs int32 `json:"activePSs"`
    // The number of PSs that have failed.
    FailedPSs int32 `json:"failedPSs"`
} |
@ScorpioCPH How about using |
LGTM |
Yes. And the same for the local job:

// Status for the local job.
// The number of actively running local jobs.
ActiveLocalJob int32 `json:"activeLocalJob"`
// The number of local jobs that have completed.
CompletedLocalJob int32 `json:"completedLocalJob"`
// The number of local jobs that have failed.
FailedLocalJob int32 `json:"failedLocalJob"` |
What is a LocalJob? Why not
This seems more flexible in terms of supporting changes to the Replica type (see #64). |
I'm not sure it is a good pattern to keep information in deeper layers. I think it might be a bit hard to get the information directly and explicitly. And about |
The proposal in #64 is to get rid of ReplicaType and introduce different properties to control different behaviors, such as restart behavior, rather than inferring this based on ReplicaType. One motivation for this is to add replicas to do evaluation, which I believe is part of the Estimator API (#61). If we define an enum of replica types and explicitly have fields for each replica type, then any time TF introduces a new type of process (e.g. eval worker) our API has to change.
Can you explain? If we have an array or map of Replica status, then it's much easier to programmatically check all replicas, because you can just iterate over the map or list. I think using a container like a map or list makes it clear that all items are the same and should be treated identically. If you have distinct fields for each ReplicaType, that makes it harder to process programmatically and doesn't make it clear that they are identical. Also, we should be consistent with the TFSpec. In the Spec we treat Replicas as a list of TFReplicaSpec, so I'd expect the status to have a ReplicaStatus field that is a list of TFReplicaStatus. In the same way that a Pod is an arbitrary list of containers, I think we should make TFJob an arbitrary list of replicated pods. What's the advantage of limiting TFJob to workers, ps, and chief? |
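For illustration, a list-based status consistent with that Spec layout might look like the sketch below; the type and field names (TFReplicaStatus, ReplicaStatuses, totalFailed) are assumptions for this example, not settled API.

```go
package example

// Hypothetical list-based status, mirroring a Spec that holds a list of
// TFReplicaSpec. All names here are illustrative, not the final API.
type TFReplicaStatus struct {
	// Name identifies the replica group, e.g. "worker" or "ps".
	Name string `json:"name"`

	Active    int32 `json:"active"`
	Completed int32 `json:"completed"`
	Failed    int32 `json:"failed"`
}

type TFJobStatus struct {
	ReplicaStatuses []TFReplicaStatus `json:"replicaStatuses"`
}

// totalFailed walks every replica group with a single loop, no matter how
// many replica types exist.
func totalFailed(status TFJobStatus) int32 {
	var failed int32
	for _, rs := range status.ReplicaStatuses {
		failed += rs.Failed
	}
	return failed
}
```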
Sure, each time I want to get how many worker Pods are running, I must iterate over the whole array, filter for type == worker, and return the value I want. This is
A map is better than an array/list; we get many map operations (e.g. set/get) for free.
But the containers share the same template (there is no distinct type per container). How about using a map?

type TFJobStatus struct {
    // The key is the replica type as a string.
    ReplicaStatus map[string]ReplicaStatus
}

type ReplicaStatus struct {
    Active    int32
    Completed int32
    Failed    int32
} |
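With that map-based shape, looking up a particular role is a direct keyed access rather than a scan. A hypothetical usage sketch follows; the "worker"/"ps" keys and the surrounding code are assumptions for the example, with the types repeated so it compiles on its own.

```go
package main

import "fmt"

// Types repeated from the comment above so this sketch is self-contained.
type ReplicaStatus struct {
	Active    int32
	Completed int32
	Failed    int32
}

type TFJobStatus struct {
	ReplicaStatus map[string]ReplicaStatus
}

func main() {
	status := TFJobStatus{
		ReplicaStatus: map[string]ReplicaStatus{
			"ps":     {Active: 2},
			"worker": {Active: 3, Completed: 1},
		},
	}

	// Direct keyed lookup; no iteration or type filtering is needed.
	if ws, ok := status.ReplicaStatus["worker"]; ok {
		fmt.Printf("active workers: %d\n", ws.Active)
	}
}
```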
I think the spec and status should be consistent: either both lists or both maps. So option 1
Option 2
I think I prefer lists, for the following reasons:
It's true that if you are dealing with a list you have to do more work to find the Spec/Status for a particular Replica, but the same is true for getting a container's spec/status in a pod with multiple containers. |
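For comparison, this is roughly what that extra work looks like when finding a container's status in a Pod today. The helper function below is a hypothetical name, but corev1.Pod and Status.ContainerStatuses are the real core/v1 fields.

```go
package example

import corev1 "k8s.io/api/core/v1"

// containerStatusByName scans the Pod's ContainerStatuses list for the
// container with the given name, the same kind of lookup a list-based
// replica status would require.
func containerStatusByName(pod *corev1.Pod, name string) (corev1.ContainerStatus, bool) {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name == name {
			return cs, true
		}
	}
	return corev1.ContainerStatus{}, false
}
```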
TFJobs are not the same as Pods; we have
Why should we store the name? I think it is read-only. Is there any chance we will modify it? Another case: we create |
I prefer a map, for the following reasons:

ps:
  # spec
worker:
  # spec

And a Pod has lists for containers and volumes because each container in one pod is equivalent, IMO. In our case, PS and workers are different roles, so I think the map is more appropriate. |
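In Go, a spec keyed that way might look roughly like the sketch below; the names (TFReplicaType, TFReplicaSpecs) are assumptions for illustration, not what was eventually merged.

```go
package example

import corev1 "k8s.io/api/core/v1"

// TFReplicaType names a replica role; a typed string keeps map keys explicit.
// All names here are illustrative.
type TFReplicaType string

const (
	TFReplicaTypePS     TFReplicaType = "PS"
	TFReplicaTypeWorker TFReplicaType = "Worker"
)

type TFReplicaSpec struct {
	Replicas *int32                 `json:"replicas,omitempty"`
	Template corev1.PodTemplateSpec `json:"template,omitempty"`
}

type TFJobSpec struct {
	// Keyed by role, mirroring the map-style YAML above.
	TFReplicaSpecs map[TFReplicaType]*TFReplicaSpec `json:"tfReplicaSpecs"`
}
```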
@jlewi WDYT? Maybe we can start this change in |
@ScorpioCPH yes, let's start api/v1alpha2.go and use that to iterate on the API.
I'm not sure I understand your point about ReplicaType. My suggestion is to get rid of ReplicaType and just have
At this point a TFJob consists of many Replicas, each identified by a Name. The pattern in K8s, e.g. with volumes and containers, is that when you have an object containing many items, a list of named entities is used rather than a map from name to entity. So if we use a map and not a list, we aren't following K8s conventions.
If you don't store the name, then I think it makes the code more complicated, because ReplicaSpec is
we'd have to do
If you put
I agree this might be a little more convenient, but I think the API issues mentioned above are more important because they aren't hidden. In this case we can define a simple func
I don't understand this. Each container can run a different binary and have different properties (e.g. resource requirements). That seems analogous to the situation we have with replicas. |
Sorry for the vague comment. My opinion is that the containers could be handled by the same logic, but in our case PS and worker have different execution paths. Because PS and worker are different roles, we have to deal with them in different ways. The logic to handle container creation may be:

func HandleContainerCreation(pod *api.Pod) {
    for _, container := range pod.Spec.Containers {
        // Same logic here for every container.
        _ = container // placeholder for the per-container handling
    }
}

But we cannot use the same code for PS and worker, so we have to:

func HandleTFJobCreation(tfJob *api.TFJob) {
    for _, spec := range tfJob.Specs {
        if isPS(spec) {
            // PS logic
        } else {
            // Worker logic
        }
    }
}

For example, we will check whether all workers are finished, because if they are, the TFJob is considered to be finished. But we do not check the PS, since PS are never finished unless we kill them. When I am trying to implement an event-driven operator for TFJob, I find that the function
I think we are similar to ResourceList, which is a map in Kubernetes core v1. But it also works for me if we use a list; I am not strongly against it. |
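A rough sketch of that worker-only completion check, written against the map-keyed status discussed above; the "worker" key, field names, and helper function are assumptions, not existing operator code.

```go
package example

// ReplicaStatus mirrors the map-based status sketched earlier in the thread.
type ReplicaStatus struct {
	Active    int32
	Completed int32
	Failed    int32
}

// workersFinished reports whether all desired workers have completed.
// PS replicas are deliberately not consulted: they keep running until
// the operator tears them down.
func workersFinished(statuses map[string]ReplicaStatus, desiredWorkers int32) bool {
	ws, ok := statuses["worker"]
	if !ok {
		return false
	}
	return ws.Completed >= desiredWorkers
}
```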
You make a good point with ResourceList. The fact that we need different logic for different types of containers seems unrelated to whether it's a map or a list; we will still need the different logic. However, with the proposed changes to get rid of ReplicaType, the logic will depend instead on the TerminationPolicy of the replica. @gaocegege @ScorpioCPH you convinced me that a map is better. Let's go with a map from replica name to spec and a map from replica name to status. |
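A minimal sketch of the direction agreed on here, assuming names the thread does not pin down; in particular, TerminationPolicySpec is only a placeholder for the TerminationPolicy idea referenced above.

```go
package example

// All names below are illustrative assumptions, not the merged v1alpha2 API.

// TerminationPolicySpec stands in for per-replica behavior that would
// otherwise have been inferred from ReplicaType.
type TerminationPolicySpec struct{}

type ReplicaSpec struct {
	Replicas *int32 `json:"replicas,omitempty"`
	// TerminationPolicy, rather than a ReplicaType enum, drives behavior
	// such as when the whole job counts as finished.
	TerminationPolicy *TerminationPolicySpec `json:"terminationPolicy,omitempty"`
}

type ReplicaStatus struct {
	Active    int32 `json:"active"`
	Completed int32 `json:"completed"`
	Failed    int32 `json:"failed"`
}

// Spec and status are both maps keyed by replica name.
type TFJobSpec struct {
	ReplicaSpecs map[string]*ReplicaSpec `json:"replicaSpecs"`
}

type TFJobStatus struct {
	ReplicaStatuses map[string]*ReplicaStatus `json:"replicaStatuses"`
}
```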
@ScorpioCPH @gaocegege Any update on api/v1alpha2? I checked the refactor branch, but I didn't see a [v1alpha2](https://github.com/kubeflow/tf-operator/tree/refactor/pkg/apis/tensorflow) directory. |
@jlewi I think we are going to implement v1alpha2 these days. |
I'm working on a design doc for v1alpha2. Let's design first then implement it :)
|
Closed by #492 |
Hi, this is a separate tracker/discussion issue from #283.

Motivation
As we discussed, TFJobStatus needs to be more specific so that we can track the status of an individual TF job. Here are our considerations:

What we have now
There is a little confusion about TFReplicaStatus as we have it now:

Proposed Design
Add TFClusterStatus in TFJob.Status:

This topic is open for discussion, so please share your ideas and let's make it clearer together :)
@jlewi @jimexist @DjangoPeng @gaocegege @ddysher @mqliang @mitake WDYT?