Some suggestions about engineering optimization #1703
Thanks @HeGaoYuan for being a power user of the training operator. We really appreciate your feedback.
We love contributions, and it would be great if you could upstream some of these fixes.
/cc @kubeflow/wg-training-leads @kubeflow/common-team /cc @alculquicondor
@johnugeorge Thanks for your quick and constructive reply.
(Referencing training-operator/pkg/controller.v1/pytorch/pytorchjob_controller.go, lines 385 to 389 at 82af677.)
To your point 3: currently, the job goes into the Running state if at least one pod is running (the master in the master-worker case, or any worker in the worker-only case). To handle your case of one worker pod getting blocked, would using a gang scheduler like Volcano help?
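For readers unfamiliar with the thread, a minimal sketch of the behavior described above, in Go. This is not the actual code at the referenced controller lines; the helper name and the surrounding wiring are hypothetical and only illustrate "mark the job Running once any replica pod is Running":

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// anyPodRunning reports whether at least one of the job's pods has reached
// phase Running. Under the behavior described above, the job's condition is
// flipped to Running as soon as this returns true, even if other replicas
// are still Pending.
func anyPodRunning(pods []*corev1.Pod) bool {
	for _, pod := range pods {
		if pod.Status.Phase == corev1.PodRunning {
			return true
		}
	}
	return false
}
```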
I think a gang scheduler can only guarantee there are enough resources to schedule all of a job's pods. After all the pods are scheduled, a pod may still block for many reasons; for example, setting up a pod's volume may fail.
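A sketch of the stricter check this comment implies, again hypothetical rather than the operator's actual API: gang scheduling only ensures the pods can be scheduled together, so a readiness gate would still need to verify that every expected replica actually reached phase Running before the job condition flips.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// allReplicasRunning reports whether every one of the expectedReplicas pods
// exists and is in phase Running. A pod stuck after scheduling (for example,
// because its volume failed to attach) keeps the job out of the Running state.
func allReplicasRunning(pods []*corev1.Pod, expectedReplicas int) bool {
	if len(pods) < expectedReplicas {
		return false
	}
	running := 0
	for _, pod := range pods {
		if pod.Status.Phase == corev1.PodRunning {
			running++
		}
	}
	return running >= expectedReplicas
}
```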
Re 2: The mpi-operator has minimal usage of kf/common. Primarily, it just uses constants. So I would generally welcome cleanups in the repo. Keep me in the loop, please.
While it could be useful for some, I have anecdotal evidence that HPC users would prefer a simple controller for their MPI workloads, without having to install all of Kubeflow. Still, this is something we can discuss.
@alculquicondor I was referring to training-operator, not all of KF
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, Kubeflow team
First of all, thanks for your great work. I have used your training operator in a production environment for some time, and I found some engineering problems during that use.
Looking forward to your reply.