Error in training on MNIST #5

Open
Boese0601 opened this issue Jun 16, 2022 · 8 comments

Comments

@Boese0601

Boese0601 commented Jun 16, 2022

Hi!

When I trained on MNIST with the following command:
CUDA_VISIBLE_DEVICES=0 python main.py --config configs/smmnist_DDPM_big5.yml --data_path /cluster/51/dichang/datasets/mcvd --exp smmnist_cat --ni

I received the following error:
smmnist_cat/logs/meters.pkl does not exist! Returning.
ERROR - main.py - 2022-06-16 21:39:49,313 - Traceback (most recent call last):
  File "/rhome/dichang/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
    subprocess.run(
  File "/rhome/dichang/anaconda3/envs/vid/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

I checked the NCSNRunner class and load_meters(); it seems to be trying to load from "meters_pkl = os.path.join(self.args.log_path, 'meters.pkl')". What is meters.pkl here, and how can I solve the error?

Thanks!

@AlexiaJM
Collaborator

It contains the metrics over time. I'm not sure why it would do that. 🤔
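For illustration, here is a minimal sketch of how a loader like this could behave on a fresh run. It is not the repository's actual `load_meters()`; the function signature, the return value, and the use of `pickle` are assumptions. Only the path construction and the printed message come from this thread. The point is that a missing meters.pkl on a first run would only produce the "does not exist! Returning." notice, so the ninja traceback that follows it is most likely a separate problem.

```python
import os
import pickle


def load_meters(log_path):
    """Hypothetical loader: return saved meters, or None if none exist yet."""
    meters_pkl = os.path.join(log_path, "meters.pkl")
    if not os.path.exists(meters_pkl):
        # Expected on a first run: there are no saved metrics to resume from.
        print(f"{meters_pkl} does not exist! Returning.")
        return None
    with open(meters_pkl, "rb") as f:
        return pickle.load(f)
```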

Check that the folders exist: /cluster/51/dichang/datasets/mcvd and /cluster/51/dichang/datasets/mcvd/smmnist_cat/log.
Make sure that the ninja package is installed properly.
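Both checks can be scripted. The sketch below is only a suggestion: `check_setup` is a hypothetical helper, not part of this repository. It reports whether each folder exists and whether a `ninja` executable is on the PATH, using `ninja --version` to confirm that it actually runs. PyTorch also ships `torch.utils.cpp_extension.is_ninja_available()`, which performs a similar check from inside the same environment.

```python
import os
import shutil
import subprocess


def check_setup(folders):
    """Return (folder -> exists, path to ninja or None, ninja version or None)."""
    folder_report = {folder: os.path.isdir(folder) for folder in folders}

    ninja_path = shutil.which("ninja")  # None if ninja is not on the PATH
    ninja_version = None
    if ninja_path is not None:
        result = subprocess.run(
            [ninja_path, "--version"], capture_output=True, text=True
        )
        if result.returncode == 0:
            ninja_version = result.stdout.strip()

    return folder_report, ninja_path, ninja_version


if __name__ == "__main__":
    # Paths taken from this thread; replace them with your own.
    folders, ninja_path, ninja_version = check_setup([
        "/cluster/51/dichang/datasets/mcvd",
        "/cluster/51/dichang/datasets/mcvd/smmnist_cat/log",
    ])
    for folder, exists in folders.items():
        print(f"{folder}: {'found' if exists else 'MISSING'}")
    print(f"ninja: {ninja_path or 'not found'} (version: {ninja_version})")
```

Run it with the same Python environment used for training, since a ninja installed in a different environment will not be found by PyTorch.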

@voletiv
Owner

voletiv commented Jun 17, 2022

I agree, I don't think it has to do with the metrics. Check the data folder exists, and check your ninja installation. Maybe install ninja at the end.

@Boese0601
Author

Boese0601 commented Jun 18, 2022

Could you please tell me the PyTorch and ninja versions you're using for training? Thanks.

On my side, it doesn't work with torch==1.11.0 and ninja==1.10.2.3.

But when I use torch on CPU, it works.

@dhruv-nathawani

Same issue!

@voletiv
Owner

voletiv commented Jul 14, 2022

I'm using ninja==1.10.2.3, torch==1.10.0 on my local machine with CPU, and torch==1.11.0 with GPUs. In both cases, training works.

@dhruv-nathawani

Could you please tell us the CUDA version and the type of GPUs you are using?

@1094724913

> Could you please tell us the CUDA version and the type of GPUs you are using?

I'm using CUDA==11.3 and torch==1.11.0, and the GPU is an NVIDIA RTX 3090 Ti.
I encountered the same issue while training. What should I do? Thank you.
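The version details requested in this thread can be gathered in one go. The snippet below is a small sketch; `environment_report` is a hypothetical helper name, while the `torch` attributes it reads (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.get_device_name()`) are standard PyTorch. Note that `torch.version.cuda` is the CUDA version PyTorch was built against, which may differ from the system toolkit that compiles the extension.

```python
import platform


def environment_report():
    """Collect Python, PyTorch, CUDA, and GPU details for a bug report."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA version PyTorch was built with; None for CPU-only builds.
    info["torch_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```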

@AlexiaJM
Collaborator

AlexiaJM commented Aug 4, 2023

It seems like other people have had similar issues and they propose some solutions, see: mapillary/inplace_abn#104 and mapillary/inplace_abn#106 (comment).

I really don't know what to do with ninja or even what it does. 😞 I hope that some of these proposed solutions can work for you. If you find a solution to this problem, let us know and we can mention it in the README.
