Reconcile PyTorch Job error: Operation cannot be fulfilled on pytorchjobs.kubeflow.org #1492
Comments
It should not affect the logic since we will retry. Could you please show us the full log? Which version are you using?
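For readers unfamiliar with why a retry makes this error harmless: the "object has been modified" message is an optimistic-concurrency conflict, meaning a write was based on a stale read of the object. Below is a minimal sketch of the re-read-and-retry pattern. It is not the training-operator's code; `InMemoryStore`, `ConflictError`, `update_with_retry`, and the `before_write` hook are all hypothetical names made up for illustration.

```python
class ConflictError(Exception):
    """Raised when an update carries a stale resource version."""


class InMemoryStore:
    """Toy stand-in for an API server holding one object (hypothetical)."""

    def __init__(self, obj):
        self._obj = dict(obj)
        self._version = 1

    def get(self):
        # Return a copy together with the version it was read at.
        return dict(self._obj), self._version

    def update(self, obj, version):
        # Reject writes that are based on an outdated read.
        if version != self._version:
            raise ConflictError("the object has been modified")
        self._obj = dict(obj)
        self._version += 1
        return self._version


def update_with_retry(store, mutate, max_attempts=5, before_write=None):
    """Re-read and re-apply `mutate` until the write succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        obj, version = store.get()
        mutate(obj)
        if before_write is not None:
            before_write(attempt)  # hook used only to simulate a concurrent writer
        try:
            store.update(obj, version)
            return attempt
        except ConflictError:
            continue  # stale read: loop again with a fresh copy
    raise ConflictError(f"gave up after {max_attempts} attempts")
```

In this sketch the first attempt fails if another writer updates the object between the read and the write, and the second attempt succeeds because it starts from the latest version. That is why a single conflict message in the log does not, by itself, indicate a broken reconcile.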
Yes, it keeps "Reconciling", but the job pods hang in "Pending" status, and there are no logs about this job in the volcano-scheduler pod.
May I ask which version you are using?
The container image of the training operator is "public.ecr.aws/j1r0q0g6/training/training-operator:760ac1171dd30039a7363ffa03c77454bd714da5". I am not sure whether that is enough to determine the version. Is there any other way to get the version?
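As a side note on reading a version out of that image reference: the tag here is 40 hexadecimal characters, which is the shape of a full git commit SHA rather than a release number. A small sketch that splits an image reference and guesses what kind of tag it carries is shown below. The function name `describe_image_tag` and the three categories are assumptions for illustration, and digest references of the form `@sha256:...` are not handled.

```python
import re


def describe_image_tag(image):
    """Split an image reference into repository and tag, and guess the tag kind."""
    repository, _, tag = image.rpartition(":")
    if not repository or "/" in tag:
        # No tag present (a colon, if any, belonged to a registry port).
        return image, None, "untagged"
    if re.fullmatch(r"[0-9a-f]{40}", tag):
        kind = "commit-sha"  # looks like a full git commit hash
    elif re.fullmatch(r"v?\d+\.\d+\.\d+.*", tag):
        kind = "release"  # looks like a semantic version
    else:
        kind = "other"
    return repository, tag, kind


image = "public.ecr.aws/j1r0q0g6/training/training-operator:760ac1171dd30039a7363ffa03c77454bd714da5"
print(describe_image_tag(image)[2])
```

If the tag is a commit SHA, the corresponding source revision can usually be looked up in the project's repository history, assuming the image was built from that repository.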
/cc @Jeffwan @PatrickXYS Do you know about it? Are you using it in AWS?
If it is not the official version, I can switch to the official version and try again.
Is there any pod created?
Can you please show
I could not find any clue in the pod describe output. It is difficult for me to attach large images directly, so please open them via the link URLs. master-0 pod: worker-0 pod:
I do not understand why there are no events for the pods. It's weird. But I do not think it is related to the operator. It seems that the pods have already been created.
Yes, it is weird. Some things I noticed may be helpful:
Personally, I think it may be related to Volcano, since it works after restarting Volcano.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
After creating a PyTorchJob, the status of the job pods is always Pending, and the training-operator controller throws the error below:
Reconcile PyTorch Job error Operation cannot be fulfilled on pytorchjobs.kubeflow.org "xxx-pytorchjob": the object has been modified; please apply your changes to the latest version and try again
Gang scheduling is enabled with Volcano.
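As background on why gang scheduling can leave every pod of a job Pending: a gang scheduler admits a job only when a minimum number of its pods can be placed at the same time, so one pod that does not fit can hold back all the others. The sketch below illustrates that all-or-nothing check. It is not Volcano's algorithm; `gang_admit`, its parameters, and the single abstract resource unit are hypothetical.

```python
def gang_admit(min_member, pod_requests, free_capacity):
    """Return True only if at least `min_member` pods fit at the same time.

    pod_requests: resource units requested by each pod (hypothetical unit, e.g. GPUs).
    free_capacity: free units on each node.
    """
    free = sorted(free_capacity, reverse=True)
    placed = 0
    # Greedy first-fit, largest requests first; a real scheduler is far more elaborate.
    for request in sorted(pod_requests, reverse=True):
        for i, capacity in enumerate(free):
            if capacity >= request:
                free[i] -= request
                placed += 1
                break
    return placed >= min_member


# A master and a worker each need one unit, but only one unit is free:
print(gang_admit(2, [1, 1], [1]))     # False: neither pod is started
print(gang_admit(2, [1, 1], [1, 1]))  # True: both pods can start together
```

Under this model, pods staying Pending with no scheduler activity points toward the scheduler side (capacity, queue state, or the scheduler itself) rather than the conflict message from the operator.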