v1alpha2: Add implementation #526
Conversation
The test coverage of v1alpha2 should be 70%+, PTAL

// ServerOption is the main context object for the controller manager.
type ServerOption struct {
How about adding the flag enableGangScheduling, as in https://github.com/kubeflow/tf-operator/blob/master/cmd/tf-operator/app/options/options.go#L29? Probably I'm the most suitable person for doing this, so please let me know if you want me to add the field for gang scheduling after merging this PR :)
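For illustration, a minimal self-contained sketch of what such a flag could look like; the field name EnableGangScheduling, the flag spelling, and the help text are assumptions modeled on the v1alpha1 options linked above:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// ServerOption mirrors the struct under discussion; only the assumed
// EnableGangScheduling field is shown.
type ServerOption struct {
	EnableGangScheduling bool
}

// AddFlags registers the assumed gang-scheduling flag on the FlagSet.
func (s *ServerOption) AddFlags(fs *flag.FlagSet) {
	fs.BoolVar(&s.EnableGangScheduling, "enable-gang-scheduling", false,
		"Set true to enable gang scheduling (e.g. via kube-arbitrator).")
}

func main() {
	s := &ServerOption{}
	fs := flag.NewFlagSet("tf-operator", flag.ExitOnError)
	s.AddFlags(fs)
	fs.Parse(os.Args[1:])
	fmt.Println("gang scheduling enabled:", s.EnableGangScheduling)
}
```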
#490
We have an issue to keep track of the progress of syncing commits with v1alpha1. We would really appreciate it if you could help us implement this in v1alpha2. And as the kube-arbitrator maintainers said, maybe we do not need to import PDB now.
}
go run(stopCh)

var key string
I'm not fully sure, but wouldn't this style of communication be flagged by the race detector? How about adding a mutex or passing the key via a channel?
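As a self-contained sketch (all names hypothetical), passing the key through a buffered channel gives a synchronized hand-off that the race detector will not flag, unlike writes to a shared package-level variable:

```go
package main

import "fmt"

func main() {
	// Hand the key to the observer through a channel instead of
	// writing it into a shared variable.
	keyCh := make(chan string, 1)
	syncHandler := func(key string) (bool, error) {
		keyCh <- key // channel operations are synchronized
		return true, nil
	}

	if _, err := syncHandler("default/example-tfjob"); err != nil {
		fmt.Println("sync failed:", err)
	}
	fmt.Println("observed key:", <-keyCh)
}
```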
Yeah, I think so. I will open an issue to keep track of it. I think it is better not to introduce more enhancements in this PR, since it is copied from the upstream/v1alpha2 branch; we could record the potential issues and fix them after it is merged.
WDYT?
Yes, opening an issue and tracking would be ok.
Would you mind submitting a separate PR for the signals package just to cut down the size of this PR a bit?
@jlewi Sure, I will do it.
@jlewi But the test will fail because the code cannot be built.
thanks for the PR!
I only briefly scanned the PR, will walk through the code again later :)
var (
-	Version = "0.3.0+git"
+	Version = "v0.1.0-alpha"
is this still the right version?
We released a version v0.1.0 and I am not sure if we should set the version to it, since it is behind master.
pkg/controller.v2/controller.go
Outdated
// To allow injection of syncTFJob for testing.
syncHandler func(tfJobKey string) (bool, error)

updateStatusHandler func(tfjob *tfv1alpha2.TFJob) error
why is this saved as a handler?
Because we need to mock it for tests.
can you add a comment?
🆗
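To illustrate the point of this thread, a self-contained sketch of handler injection (TFJob here is a stand-in for tfv1alpha2.TFJob, and the wiring is simplified): storing the update function as a field lets a test swap in a recorder instead of making a real API call.

```go
package main

import "fmt"

// TFJob stands in for tfv1alpha2.TFJob to keep the sketch self-contained.
type TFJob struct{ Name string }

// TFJobController stores updateStatusHandler as a field so tests can
// inject a fake implementation.
type TFJobController struct {
	updateStatusHandler func(tfjob *TFJob) error
}

func main() {
	tc := &TFJobController{}
	// Production wiring would point at the real API client; a test can
	// replace it with a recorder:
	var updated []string
	tc.updateStatusHandler = func(tfjob *TFJob) error {
		updated = append(updated, tfjob.Name)
		return nil // record the call instead of hitting the API server
	}
	tc.updateStatusHandler(&TFJob{Name: "example"})
	fmt.Println("status updates recorded for:", updated)
}
```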
// AddFlags adds flags for a specific CMServer to the specified FlagSet.
func (s *ServerOption) AddFlags(fs *flag.FlagSet) {
fs.StringVar(&s.Kubeconfig, "kubeconfig", "~/.kube/config", |
per discussion in #522, should this default flag be removed?
Yeah, I will update it.
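A minimal sketch of the flag without the hard-coded default; the fallback order (flag value, then the KUBECONFIG environment variable, then in-cluster config) is an assumption based on the discussion in #522:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	fs := flag.NewFlagSet("tf-operator", flag.ExitOnError)
	// No hard-coded "~/.kube/config" default: an empty value lets the
	// operator fall back to $KUBECONFIG or to in-cluster config.
	kubeconfig := fs.String("kubeconfig", "",
		"Path to a kubeconfig. Only required if out-of-cluster.")
	fs.Parse(os.Args[1:])

	if *kubeconfig == "" {
		*kubeconfig = os.Getenv("KUBECONFIG")
	}
	fmt.Println("kubeconfig:", *kubeconfig)
}
```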
// getPodsForTFJob returns the set of pods that this tfjob should manage.
// It also reconciles ControllerRef by adopting/orphaning.
// Note that the returned Pods are pointers into the cache.
func (tc *TFJobController) getPodsForTFJob(tfjob *tfv1alpha2.TFJob) ([]*v1.Pod, error) { |
correct me if i'm wrong, but getPodsForTFJob doesn't seem to capture the intent of this method: claiming pods (i.e., patching ownership) is already done inside the method, and returning a list of pods is just a side effect of calling it.
Yeah, I think we should change the name to getAndClaimPodsForTFJob. And I think it is better to remove the claiming logic, since there is no need to claim the pods every time. I can open an issue for it.
WDYT @ScorpioCPH
this function name follows the style of the other controllers, like this: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller.go#L401.
SGTM
Ports: []v1.ServicePort{
{
Name: genGeneralName(tfjobKey, rt, index),
Port: 2222,
why is the port hard coded here?
I will update it.
We removed the field TFPort, thus we should reuse the ports in the PodTemplateSpec; I will open an issue to keep track. Currently we use a hard-coded port and do not support customization.
I do not intend to implement this in this PR since it is copied from upstream/v1alpha2. I will file a new PR for this feature.
WDYT?
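As a small illustration, a named constant in place of the bare literal (the constant name is an assumption; reading the port from the PodTemplateSpec is the follow-up tracked above):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// defaultTFPort replaces the bare 2222 literal; per-job customization via
// the pod template is left to the follow-up issue.
const defaultTFPort = 2222

func main() {
	port := v1.ServicePort{Name: "example-worker-0", Port: defaultTFPort}
	fmt.Printf("service port %s -> %d\n", port.Name, port.Port)
}
```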
limitations under the License.
*/

// Note(CPH): this file is copied from k8s.io/kubernetes/pkg/controller
how hard would it be to vendor this?
We have not vendored k8s.io/kubernetes/, and CPH thinks there is no need to import such a vendor just for this struct.
WDYT @ddysher
Yes, k8s.io/kubernetes is too large for us.
The test will fail since there is no signals package.
Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
cmd/tf-operator.v2/app/server.go
Outdated
const RecommendedKubeConfigPathEnv = "KUBECONFIG"

func Run(opt *options.ServerOption) error {
Please remove the blank line.
OK
Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
Rebased upstream/master and addressed some comments. PTAL
Signed-off-by: Ce Gao <gaoce@caicloud.io>
Review status: 0 of 15 files reviewed at latest revision, 8 unresolved discussions.

cmd/tf-operator.v2/app/options/options.go, line 38 at r2 (raw file): Previously, gaocegege (Ce Gao) wrote…
Don't we get the path of the KubeConfig from an environment variable? So is the plan to get rid of this flag altogether?

pkg/controller.v2/controller.go, line 152 at r7 (raw file):
Can you explain this to me? Why do we track expectations in the form of aggregated statistics? Why wouldn't we keep track of expectations for each replica, e.g. have an expectation for each individual service? If we just look at aggregated statistics, won't we have problems? E.g. suppose we expect 2 workers but both observed create events were for the same worker; then we haven't actually created both services even though we observed 2 create events.
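For illustration only, hypothetical per-replica expectation keys of the kind this comment suggests (the names are assumptions, not the PR's actual API):

```go
package main

import "fmt"

// expectationKey builds one key per replica instead of one aggregated
// counter per job, so a duplicate create event for worker-0 cannot
// satisfy the expectation for worker-1.
func expectationKey(jobKey, replicaType string, index int) string {
	return fmt.Sprintf("%s/%s-%d", jobKey, replicaType, index)
}

func main() {
	fmt.Println(expectationKey("default/example", "worker", 0))
	fmt.Println(expectationKey("default/example", "worker", 1))
}
```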
pkg/controller.v2/controller_pod.go, line 54 at r7 (raw file):

Why are you only checking the number of pods? Why aren't you checking that each individual replica has a pod in the expected state? Not only will checking each replica be more robust, it also positions us to do a better job debugging problems, because we will know what state each replica is in.

pkg/controller.v2/controller_pod.go, line 61 at r7 (raw file):
Here and everywhere else please use a context logger, and log fields for the job (e.g. its name and UID). This will ensure these fields are stored as metadata in the JSON log entry, which makes it easier to filter the log messages for a particular job to figure out what happened.

pkg/controller.v2/controller_pod.go, line 94 at r7 (raw file):
Per comments above; can we check that every replica is in the expected state?

pkg/controller.v2/controller_pod.go, line 109 at r7 (raw file):
Please use logrus to log this with rt and index as metadata fields so we can easily filter log messages by replica. This will make it really easy to filter logs to find out what happened during a replica's life.
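A self-contained sketch of the requested logging style; the field names and values are placeholders, not the PR's actual keys:

```go
package main

import log "github.com/sirupsen/logrus"

func main() {
	log.SetFormatter(&log.JSONFormatter{})
	// A context logger carrying job identity plus replica type and
	// index, so the JSON entries can be filtered per job or per replica.
	logger := log.WithFields(log.Fields{
		"tfjob":         "default/example",
		"uid":           "1234-abcd", // placeholder UID
		"replica-type":  "worker",
		"replica-index": 0,
	})
	logger.Info("reconciling pod")
}
```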
pkg/controller.v2/controller_pod.go, line 148 at r7 (raw file):

nit; Define a named constant

pkg/controller.v2/controller_pod.go, line 203 at r7 (raw file):
Should we log a warning if the index isn't found in desiredIndexes? That would seem to indicate a bug of some sort, since that shouldn't happen.

pkg/controller.v2/controller_pod.go, line 218 at r7 (raw file):
What does adopting/orphaning pods mean? I assume it means changing the owner ref on the pod to point to this TFJob controller. Do we actually expect this to happen? Why would the labels or selector get changed on the pod?

pkg/controller.v2/controller_pod.go, line 295 at r7 (raw file):
Let's add appropriate metadata fields to the log entry, e.g. tfjob name and uuid, and maybe replica type and index?
cmd/tf-operator.v2/app/options/options.go, line 38 at r2 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Makes sense. I could remove the flag later. Personally, I think it is better not to introduce more enhancements into the PR, since it is a sync with upstream/v1alpha2. We are using it to implement some systems, and I hope we could use this commit instead of the branch.
pkg/controller.v2/controller_pod.go, line 54 at r7 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Please see above: could we merge it and enhance the implementation iteratively?
pkg/controller.v2/controller_pod.go, line 94 at r7 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
No, I think it is a big change and I am not sure a map is the best way. Thus we could merge it and enhance the implementation iteratively.
pkg/controller.v2/controller_pod.go, line 203 at r7 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Because we need to import the logger first and then do it. I think we could do it after the PR is merged. Or I could file a new PR based on this one to fix it.
Review status: 3 of 15 files reviewed at latest revision, 12 unresolved discussions.

cmd/tf-operator.v2/app/options/options.go, line 38 at r2 (raw file): Previously, gaocegege (Ce Gao) wrote…
I don't understand your point about doing more changes in this PR. The point of this PR is to start checking code into master. Why would we treat the work going on in the v1alpha2 branch any differently from work going on in a local branch? I think we should make all the changes in this PR such that the resulting code checked into master is in a good state and the source of truth. So I think we should make this and other straightforward changes in this PR. I don't think we should check in code only to remove it in a follow-on PR.

pkg/controller.v2/controller_pod.go, line 94 at r7 (raw file): Previously, gaocegege (Ce Gao) wrote…
If we're uncertain about how to implement it, then why check it in? Why not remove the code that we are uncertain about (replace it with a TODO)? Alternatively, is there a simpler, less efficient but correct implementation we could check in? I don't like the idea of checking in code that we don't think works. Could we just iterate over all the replicas and check for each one whether we need to do any work?

pkg/controller.v2/controller_pod.go, line 203 at r7 (raw file): Previously, gaocegege (Ce Gao) wrote…
Isn't controller_pod.go already importing log?
Review status: 3 of 15 files reviewed at latest revision, 13 unresolved discussions.

pkg/controller.v2/controller_pod.go, line 94 at r7 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Hi, getDiffPodIndexes() will check which replica should be handled.

pkg/controller.v2/controller_pod.go, line 218 at r7 (raw file): Previously, gaocegege (Ce Gao) wrote…
We can't assume that only our controller ever changes the labels or selectors on these pods.
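For context, a small sketch of what "adopting" amounts to: pointing the pod's ControllerRef at the owning object. The owner object and GroupVersionKind below are stand-ins, not the PR's actual wiring:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// Stand-in owner; in the controller this would be the TFJob object.
	owner := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "example-tfjob", UID: "1234"}}
	gvk := schema.GroupVersionKind{Group: "kubeflow.org", Version: "v1alpha2", Kind: "TFJob"}

	pod := &v1.Pod{}
	// "Adopting" sets the pod's ControllerRef to the owner so the
	// controller manages it; "orphaning" removes that reference.
	pod.OwnerReferences = append(pod.OwnerReferences, *metav1.NewControllerRef(owner, gvk))
	fmt.Printf("controller ref: %s/%s\n", pod.OwnerReferences[0].Kind, pod.OwnerReferences[0].Name)
}
```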
Signed-off-by: Ce Gao <gaoce@caicloud.io>
cmd/tf-operator.v2/app/options/options.go, line 38 at r2 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Done.
pkg/controller.v2/controller_pod.go, line 94 at r7 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
@ScorpioCPH I think Jeremy means that we use an integer instead of a map to represent the succeeded/failed pods.
Review status: 3 of 15 files reviewed at latest revision, 12 unresolved discussions.

pkg/controller.v2/controller_pod.go, line 94 at r7 (raw file): Previously, gaocegege (Ce Gao) wrote…
I think the problem is that we check if diff < 0 to decide whether there is any work to do. As noted in my earlier comment, I think that check is insufficient under certain conditions, e.g. the case where you have multiple pods for a given replica. You already get a list of all pods for this job, so I think you want to group these pods by replica. You can then correctly check the state of each replica.
pkg/controller.v2/controller_pod.go, line 218 at r7 (raw file): Previously, ScorpioCPH (Penghao Cen) wrote…
Thanks.
@jlewi @gaocegege v1alpha2 is not just an API change? Are we doing a total refactoring? Can we still keep compatibility with v1alpha1?
pkg/controller.v2/controller_pod.go, line 203 at r7 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Done.
pkg/controller.v2/controller_pod.go, line 94 at r7 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
I am not sure if I understand you. Do you mean code similar to this?

func (tc *TFJobController) syncPods(pods []*v1.Pod, replicas int, logger *log.Entry) {
	podSlices := getPodSlices(pods, replicas, logger)
	for index, podSlice := range podSlices {
		if len(podSlice) > 1 {
			logger.Warningf("We have too many pods for worker %d", index)
			// TODO: kill some
		} else if len(podSlice) == 0 {
			// TODO: create one
		} else {
			// We already have exactly one; check whether it has
			// succeeded, failed, etc.
			pod := podSlice[0]
			_ = pod
		}
	}
}

func getPodSlices(pods []*v1.Pod, replicas int, logger *log.Entry) [][]*v1.Pod {
	podSlices := make([][]*v1.Pod, replicas)
	for _, pod := range pods {
		if _, ok := pod.Labels[tfReplicaIndexLabel]; !ok {
			logger.Warning("The pod does not have the index label.")
			continue
		}
		index, err := strconv.Atoi(pod.Labels[tfReplicaIndexLabel])
		if err != nil {
			logger.Warningf("Error when strconv.Atoi: %v", err)
			continue
		}
		if index < 0 || index >= replicas {
			logger.Warningf("The label index is not expected: %d", index)
			continue
		}
		podSlices[index] = append(podSlices[index], pod)
	}
	return podSlices
}
Review status: 3 of 17 files reviewed at latest revision, 11 unresolved discussions.

cmd/tf-operator.v2/app/options/options.go, line 38 at r2 (raw file): Previously, gaocegege (Ce Gao) wrote…
Thanks.

pkg/controller.v2/controller_pod.go, line 94 at r7 (raw file): Previously, gaocegege (Ce Gao) wrote…
Exactly. I'd be fine merging that snippet with TODOs to flesh out the implementation.
Signed-off-by: Ce Gao <gaoce@caicloud.io>
pkg/controller.v2/controller_pod.go, line 94 at r7 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
I added the snippet to the code and added a TODO in reconcilePods, PTAL.
@gaocegege Great job! Looking forward to seeing this PR merged.
Thanks!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
* *: Add implementation for v1alpha2
* linter: Fix
* options: Remove default config for kubeconfig
* service: Add const
* server.go: Remove new lines
* *: Fix import styles
* controller.go: Add comment for handler
* controller_pod: Fix name
* options: Remove kubeconfig
* *: Update
* controller_pod: Add TODO

Signed-off-by: Ce Gao <gaoce@caicloud.io>
/assign @ddysher @DjangoPeng @jlewi @mitake @ScorpioCPH
This PR is the implementation of v1alpha2. It is mainly contributed by @ScorpioCPH and @gaocegege.
Signed-off-by: Ce Gao <gaoce@caicloud.io>
This change is Reviewable.