Distributive Gloo PyTorchJob example doesn't work #1358
Comments
cc @kubeflow/wg-training-leads |
I can reproduce the issue. I SSHed into the pods, and the env injection looks correct.
Let me double-check the port and service settings. Update: the rank setting is incorrect.
|
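For anyone debugging a similar hang, here is a minimal sketch of the kind of rank check described above. It assumes the job uses the standard `torch.distributed` environment-variable rendezvous (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`); the function name `check_replica_env` and the replica names are hypothetical, and the sketch does not reproduce the specific misconfiguration found in this issue. It only verifies that the ranks seen by the replicas are unique and cover `0..WORLD_SIZE-1`, since a rendezvous generally cannot complete otherwise.

```python
# Variables read by torch.distributed when initializing from the environment.
REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def check_replica_env(envs):
    """Return a list of problems found in per-replica environments.

    `envs` maps a replica name (hypothetical, e.g. "master-0") to a dict
    of the environment variables observed inside that pod.
    """
    problems = []
    ranks = {}          # rank -> names of replicas claiming it
    world_sizes = set()

    for name, env in sorted(envs.items()):
        missing = [key for key in REQUIRED if key not in env]
        if missing:
            problems.append(f"{name}: missing {', '.join(missing)}")
            continue
        try:
            rank = int(env["RANK"])
            world_size = int(env["WORLD_SIZE"])
        except ValueError:
            problems.append(f"{name}: RANK or WORLD_SIZE is not an integer")
            continue
        world_sizes.add(world_size)
        ranks.setdefault(rank, []).append(name)

    if len(world_sizes) > 1:
        problems.append(f"replicas disagree on WORLD_SIZE: {sorted(world_sizes)}")

    # Every rank must be claimed by exactly one replica.
    for rank, names in sorted(ranks.items()):
        if len(names) > 1:
            problems.append(f"rank {rank} assigned to several replicas: {', '.join(names)}")

    # With a single agreed WORLD_SIZE, ranks must be exactly 0..WORLD_SIZE-1.
    if len(world_sizes) == 1:
        expected = set(range(next(iter(world_sizes))))
        absent = sorted(expected - set(ranks))
        if absent:
            problems.append(f"no replica has rank(s) {absent}")
        out_of_range = sorted(set(ranks) - expected)
        if out_of_range:
            problems.append(f"rank(s) out of range: {out_of_range}")

    return problems
```

To use it, collect the four variables from each pod (for example by printing them from inside the container) and pass them in as dictionaries; an empty result means the rank assignment is at least self-consistent.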
Figured out the issue. Originally, it uses real In kubeflow/common mode, it passes PyTorch was not migrated to kubeflow/common, so this is the first time we see the issue. /cc @zw0610 |
This is misleading: we mix use Another way is to manually convert |
Sounds good, thank you for the investigation @Jeffwan! |
I decided to change @andreyvelich Please use this image for testing. |
/priority p0 |
@Jeffwan Yes, this image is working. |
The fix has been merged and the manifest is up to date. We can close the issue. |
/kind bug
I tried to run this PyTorch distributed Gloo backend example with the new Training Controller.
It gets stuck at this step: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/pytorch-mnist/mnist.py#L132
This is the example YAML:
This is source code for this example: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/pytorch-mnist/mnist.py.
This is the output of
kubectl describe pod pytorch-master-0 -n kubeflow
This is the output of
kubectl describe pod pytorch-worker-0 -n kubeflow