Pytorchjob dist-mnist no training logs #1601
Do you have GPUs in your cluster? Can you try the Gloo backend?
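A minimal sketch of the suggested fix: pick the distributed backend based on CUDA availability, so CPU-only clusters fall back to Gloo instead of hanging with NCCL. The rendezvous environment variables below are illustrative single-process defaults; in a real PyTorchJob the operator injects `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` for you.

```python
import os
import torch
import torch.distributed as dist

# NCCL requires CUDA-capable GPUs; on CPU-only nodes it can hang during
# initialization or at the first model.cuda() call. Gloo runs on CPU.
backend = "nccl" if torch.cuda.is_available() else "gloo"

# Single-process rendezvous so this snippet runs standalone.
# (Assumed defaults for illustration; a PyTorchJob sets these itself.)
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend=backend, rank=0, world_size=1)

print("initialized with backend:", dist.get_backend())
dist.destroy_process_group()
```

The same selection logic can be dropped into `mnist.py` in place of a hard-coded `nccl` backend string.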
Thank you very much for your reply. The problem has been solved: it was always getting stuck in model.cuda(). It was a version problem in my base image. 😊
Thanks.
I have run into the same problem. Could you tell me what happened and how you solved it?
Hello, everyone.
As a novice, I have run into a seemingly simple problem. When running examples/pytorch/mnist/mnist.py, there is no log output after the data download finishes, yet the master and worker pods stay in Running status. When I debug locally, the training progress is displayed normally. This question may seem a bit stupid; I would appreciate any pointers. Thank you very much.
All output:
Using CUDA
Using distributed PyTorch with nccl backend
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Processing...
Done!