Error in training on MNIST #5
Comments
It contains the metrics over time. I'm not sure why it would do that. 🤔 Check that the folders exist: /cluster/51/dichang/datasets/mcvd and /cluster/51/dichang/datasets/mcvd/smmnist_cat/log.
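A quick way to run the folder check suggested above is a few lines of Python. This is only a sketch: the helper name missing_dirs is hypothetical, and the two paths are copied from the comment, so adjust them to your own setup.

```python
from pathlib import Path


def missing_dirs(paths):
    """Return the subset of paths that are not existing directories."""
    return [str(p) for p in map(Path, paths) if not p.is_dir()]


if __name__ == "__main__":
    # Paths taken from the comment above; replace them with your own.
    expected = [
        "/cluster/51/dichang/datasets/mcvd",
        "/cluster/51/dichang/datasets/mcvd/smmnist_cat/log",
    ]
    for path in missing_dirs(expected):
        print(f"Missing directory: {path}")
```

If nothing is printed, both folders exist and the problem is elsewhere.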
I agree, I don't think it has to do with the metrics. Check that the data folder exists, and check your ninja installation. Maybe install ninja at the end.
Could you please tell me the PyTorch and ninja versions you're using for training? Thanks. On my side it doesn't work with torch==1.11.0 and ninja==1.10.2.3, but when I use torch on CPU, it works.
Same issue!
I'm using ninja==1.10.2.3, torch==1.10.0 on my local machine with CPU, and torch==1.11.0 with GPUs. In both cases, training works.
Could you please tell us the CUDA version and the type of GPUs you are using?
I'm using CUDA==11.3 and torch==1.11.0; the GPU is an NVIDIA RTX 3090 Ti.
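For anyone else hitting this, the versions requested in this thread can be collected in one go. The sketch below is an assumption about what is useful to report, not something from the repository: the helper names are hypothetical, it uses only the standard library to read package versions, and it imports torch only if it is installed.

```python
import shutil
import subprocess
from importlib import metadata


def package_version(name):
    """Return the installed version of a pip package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def ninja_binary_version():
    """Return the output of `ninja --version`, or None if ninja is not on PATH."""
    exe = shutil.which("ninja")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    return result.stdout.strip() or None


def environment_report():
    """Collect the versions asked about in this thread into a dict."""
    report = {
        "torch": package_version("torch"),
        "ninja (pip)": package_version("ninja"),
        "ninja (binary)": ninja_binary_version(),
    }
    try:
        import torch  # optional; only needed for the CUDA and GPU details

        report["cuda (torch build)"] = torch.version.cuda
        if torch.cuda.is_available():
            report["gpu"] = torch.cuda.get_device_name(0)
    except ImportError:
        pass
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

A value of None for "ninja (binary)" means the ninja executable is not on PATH, even if the pip package is installed.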
It seems like other people have had similar issues and they propose some solutions, see: mapillary/inplace_abn#104 and mapillary/inplace_abn#106 (comment). I really don't know what to do with ninja or even what it does. 😞 I hope that some of these proposed solutions can work for you. If you find a solution to this problem, let us know and we can mention it in the README.
Hi!
When I was training on MNIST with the command:
CUDA_VISIBLE_DEVICES=0 python main.py --config configs/smmnist_DDPM_big5.yml --data_path /cluster/51/dichang/datasets/mcvd --exp smmnist_cat --ni
I received the following error:
smmnist_cat/logs/meters.pkl does not exist! Returning.
ERROR - main.py - 2022-06-16 21:39:49,313 - Traceback (most recent call last):
File "/rhome/dichang/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
subprocess.run(
File "/rhome/dichang/anaconda3/envs/vid/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
I checked the class NCSNRunner and load_meters(), and it seems to be trying to load from "meters_pkl = os.path.join(self.args.log_path, 'meters.pkl')". What is meters.pkl here, and how can I solve the error?
Thanks!
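For readers following along, the "does not exist! Returning." line looks like a notice rather than the failure: the traceback itself comes from _run_ninja_build, which points at the ninja build step. Below is a minimal sketch of the load-if-present pattern that message suggests. It is not the repository's code; it assumes meters.pkl is an ordinary pickle file and takes the log folder as a plain argument.

```python
import os
import pickle


def load_meters(log_path):
    """Load saved meters from log_path/meters.pkl, or return None if absent."""
    meters_pkl = os.path.join(log_path, "meters.pkl")
    if not os.path.exists(meters_pkl):
        # On a fresh run there is nothing to resume, so this is only a notice.
        print(f"{meters_pkl} does not exist! Returning.")
        return None
    with open(meters_pkl, "rb") as f:
        return pickle.load(f)
```

Under that assumption, a first run with an empty log folder would print the notice and carry on.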