Some suggestions about engineering optimization #1703
Thanks @HeGaoYuan for being a power user of the training operator. We really appreciate your feedback.
We love contributions, and it would be great if you could upstream some of these fixes.
/cc @kubeflow/wg-training-leads @kubeflow/common-team /cc @alculquicondor
@johnugeorge Thanks for your quick and constructive reply.
(Referencing training-operator/pkg/controller.v1/pytorch/pytorchjob_controller.go, lines 385 to 389 at 82af677.)
To your point 3: currently, the job goes into the Running state if at least one pod is running (the master in the master-worker case, or any worker in the worker-only case). To handle your case of one worker pod getting blocked, would using a gang scheduler like Volcano help?
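For readers unfamiliar with the thread, a minimal sketch of the behavior described above, in Go. This is not the actual code at the referenced controller lines; the helper name and the surrounding wiring are hypothetical and only illustrate "mark the job Running once any replica pod is Running":

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// anyPodRunning reports whether at least one of the job's pods has reached
// phase Running. Under the behavior described above, the job's condition is
// flipped to Running as soon as this returns true, even if other replicas
// are still Pending.
func anyPodRunning(pods []*corev1.Pod) bool {
	for _, pod := range pods {
		if pod.Status.Phase == corev1.PodRunning {
			return true
		}
	}
	return false
}
```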
I think a gang scheduler can only guarantee there are enough resources to schedule all of a job's pods. After all the pods are scheduled, a pod may still block for many reasons; for example, setting up a pod's volume may fail.
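A sketch of the stricter check this comment implies, again hypothetical rather than the operator's actual API: gang scheduling only ensures the pods can be scheduled together, so a readiness gate would still need to verify that every expected replica actually reached phase Running before the job condition flips.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// allReplicasRunning reports whether every one of the expectedReplicas pods
// exists and is in phase Running. A pod stuck after scheduling (for example,
// because its volume failed to attach) keeps the job out of the Running state.
func allReplicasRunning(pods []*corev1.Pod, expectedReplicas int) bool {
	if len(pods) < expectedReplicas {
		return false
	}
	running := 0
	for _, pod := range pods {
		if pod.Status.Phase == corev1.PodRunning {
			running++
		}
	}
	return running >= expectedReplicas
}
```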
Re 2: The mpi-operator has minimal usage of kf/common. Primarily, it just uses constants. So I would generally welcome cleanups in the repo. Keep me in the loop, please.
While it could be useful for some, I have anecdotal evidence that HPC users would prefer a simple controller for their MPI workloads, without having to install all of Kubeflow. Still, this is something we can discuss.
@alculquicondor I was referring to training-operator, not all of KF
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, Kubeflow team
First of all, thanks for your great work. I have used your training operator in a production environment for some time, and I found some engineering problems during that use.
Looking forward to your reply.