HPA support for PyTorch Elastic #1751
Comments
Which image are you using? Is this always reproducible?
I built the image myself from the master branch. Yes, it is always reproducible.
Does this happen only when you hit a "NotEnoughResources" situation? What about normal situations? Is it the case that the underlying elastic job completed but the PyTorchJob does not show Succeeded?
1. Yes.
When we use PyTorch Elastic (e.g. https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/imagenet/imagenet.yaml):
Supporting successPolicy on PyTorchJob can fix it:
This is related to a bug in reporting success for Elastic runs. Related: #1711 (comment)
For Elastic mode, if worker 0 completes, should we default to setting the PyTorchJob to Succeeded?
Yes. I think that if any worker succeeds, we can mark the job Succeeded. Worker 0 is the safer choice, as it runs the c10d rendezvous backend.
Yes.
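To make the proposal above concrete, here is a minimal sketch comparing two success rules on the scenario from the report (workers 0-2 finished, worker 3 scheduled late). It is not the operator's code: the function names and the plain dictionary of pod phases are hypothetical, and the "all workers" rule is only an assumed baseline for comparison.

```python
from typing import Dict

# Hypothetical model: pod phase keyed by worker index (not the operator's data model).
Phases = Dict[int, str]


def succeeded_if_all_workers_done(phases: Phases) -> bool:
    """Assumed baseline: the job is Succeeded only when every worker has Succeeded."""
    return all(phase == "Succeeded" for phase in phases.values())


def succeeded_if_worker0_done(phases: Phases) -> bool:
    """Proposed rule: the job is Succeeded as soon as worker 0 has Succeeded."""
    return phases.get(0) == "Succeeded"


# Scenario from the report: worker 3 was added by the HPA and started late.
phases = {0: "Succeeded", 1: "Succeeded", 2: "Succeeded", 3: "Running"}

all_done = succeeded_if_all_workers_done(phases)
worker0_done = succeeded_if_worker0_done(phases)
print(f"all-workers rule: {all_done}, worker-0 rule: {worker0_done}")
```

Under the assumed baseline the late worker keeps the job from ever being marked Succeeded, while the worker 0 rule does not depend on pods that joined after training finished.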
@tsuiot Is there any problem with the test?
1. The PyTorchJob changes to Succeeded with #1752.
PytorchJob.RunPolicy.CleanPodPolicy defines the policy for killing pods after the job completes (the default value is None for PyTorchJob). training-operator/pkg/apis/kubeflow.org/v1/pytorch_defaults.go Lines 66 to 72 in aae672f
But in the comments of common, the default value is documented as Running, which is ambiguous.
@tsuiot you can set PytorchJob.RunPolicy.CleanPodPolicy to Running and try again. @johnugeorge should we set the PyTorchJob CleanPodPolicy default to Running?
/LGTM
We can fix the comment for now. It is better to keep this consistent across all jobs.
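For readers unfamiliar with CleanPodPolicy, the following is a rough sketch of what each value would mean for the leftover worker in this issue. The helper `pods_to_clean` and the pod names are hypothetical, and treating Pending pods the same as Running ones is an assumption about the operator's behaviour, not something confirmed in this thread. The `patch` dictionary shows the field suggested above as a merge-patch body.

```python
from typing import Dict, List


def pods_to_clean(policy: str, phases: Dict[str, str]) -> List[str]:
    """Return the pod names to delete once the job has completed (illustrative only)."""
    if policy == "None":
        return []
    if policy == "Running":
        # Assumption: Pending pods are cleaned too, since they may start running later.
        return sorted(name for name, phase in phases.items()
                      if phase in ("Running", "Pending"))
    if policy == "All":
        return sorted(phases)
    raise ValueError(f"unknown CleanPodPolicy: {policy!r}")


# Hypothetical pod names; worker-3 is the pod added late by the HPA.
phases = {
    "worker-0": "Succeeded",
    "worker-1": "Succeeded",
    "worker-2": "Succeeded",
    "worker-3": "Running",
}

# Merge-patch body that sets the policy on an existing PyTorchJob.
patch = {"spec": {"runPolicy": {"cleanPodPolicy": "Running"}}}

for policy in ("None", "Running", "All"):
    print(policy, pods_to_clean(policy, phases))
```

With None, the late worker is left behind after completion; with Running, only that pod is removed and the Succeeded pods remain available for log inspection.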
1. Background
Use training-operator/examples/pytorch/elastic/imagenet/imagenet.yaml to run HPA for PyTorch Elastic.
2. Scale-up leaves the PyTorchJob permanently in the Running state
Jobs list:
PytorchJob.Status:
HPA.status:
In this case, worker3, which was scaled up by the HPA, is initially Pending due to NotEnoughResources and is only scheduled after worker0-2 have Succeeded. This leaves the PyTorchJob permanently in the Running state, with worker3 Running and restarting as it tries to communicate with worker0.