I have been using the mmpretrain project (https://github.com/open-mmlab/mmpretrain), which includes a rich set of classification scripts. However, these scripts use torch.distributed.launch to start distributed training. Is there any way, using the Kubeflow operators, to run this kind of distributed training on a k8s cluster?
PS: I have already looked for help in training-operator and pytorch-operator, but couldn't find an obvious solution.
Thanks in advance~ any hints would be helpful to me.
@ThomaswellY If you want to run torchrun, you should open an issue in the training-operator repo. If you want to run distributed PyTorch training with mpirun, we can answer your questions in this repo.
@alculquicondor @tenzen-y
I was looking into how to modify the original script, which uses torch.distributed.launch to start training, so that training is started with mpirun under mpi-operator instead.
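One common approach (a sketch, not mpi-operator's official recipe) is to keep the training script's `env://` initialization and add a small shim that maps the environment variables Open MPI exports for each `mpirun`-launched rank (`OMPI_COMM_WORLD_RANK` etc.) onto the `RANK`/`WORLD_SIZE`/`LOCAL_RANK` names that `torch.distributed` expects. The `init_from_mpi_env` helper below is hypothetical; `MASTER_ADDR`/`MASTER_PORT` must still be supplied separately (e.g. pointing at the rank-0 pod).

```python
import os

def init_from_mpi_env():
    """Translate Open MPI's per-rank environment variables into the
    names torch.distributed's env:// init method reads.

    Assumes the process was launched with mpirun (Open MPI); falls back
    to single-process defaults otherwise.
    """
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
    world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))

    # torch.distributed.init_process_group(init_method="env://") reads these.
    os.environ.setdefault("RANK", str(rank))
    os.environ.setdefault("WORLD_SIZE", str(world_size))
    os.environ.setdefault("LOCAL_RANK", str(local_rank))

    # MASTER_ADDR / MASTER_PORT must also be set (e.g. by the operator or
    # the mpirun command line) before calling, for example:
    #   torch.distributed.init_process_group("nccl", init_method="env://")
    return rank, world_size, local_rank
```

With this shim imported at the top of the training script, the script can be launched as plain `mpirun -np N python train.py ...` without `torch.distributed.launch`.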