Pytorchjob dist-mnist no training logs #1601
Do you have GPUs in your cluster? Can you try the Gloo backend?
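A minimal sketch of the suggested fix: pick the distributed backend based on CUDA availability, so CPU-only clusters fall back to Gloo instead of hanging with NCCL. The rendezvous environment variables below are illustrative single-process defaults; in a real PyTorchJob the operator injects `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` for you.

```python
import os
import torch
import torch.distributed as dist

# NCCL requires CUDA-capable GPUs; on CPU-only nodes it can hang during
# initialization or at the first model.cuda() call. Gloo runs on CPU.
backend = "nccl" if torch.cuda.is_available() else "gloo"

# Single-process rendezvous so this snippet runs standalone.
# (Assumed defaults for illustration; a PyTorchJob sets these itself.)
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend=backend, rank=0, world_size=1)

print("initialized with backend:", dist.get_backend())
dist.destroy_process_group()
```

The same selection logic can be dropped into `mnist.py` in place of a hard-coded `nccl` backend string.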
Thank you very much for your reply. The problem has been solved: it was always getting stuck in model.cuda(). It was a version problem in my base image. 😊
Thanks.
I have run into the same problem. Could you tell me what happened and how you solved it?
Hello, everyone.
As a novice, I have run into a seemingly simple problem. When running examples/pytorch/mnist/mnist.py, there is no log output after the data download finishes, yet the master and worker pods stay in Running status. When I debug locally, the training progress is displayed normally. This question may seem a bit stupid; I would appreciate any pointers. Thank you very much.
All output:
Using CUDA
Using distributed PyTorch with nccl backend
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Processing...
Done!