Error in training on MNIST #5

Boese0601 · 2022-06-16T19:46:42Z

Hi!

When training on MNIST with the following command:
CUDA_VISIBLE_DEVICES=0 python main.py --config configs/smmnist_DDPM_big5.yml --data_path /cluster/51/dichang/datasets/mcvd --exp smmnist_cat --ni

I received the following error:
smmnist_cat/logs/meters.pkl does not exist! Returning.
ERROR - main.py - 2022-06-16 21:39:49,313 - Traceback (most recent call last):
  File "/rhome/dichang/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
    subprocess.run(
  File "/rhome/dichang/anaconda3/envs/vid/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

I checked the NCSNRunner class and load_meters(); it seems to be trying to load from "meters_pkl = os.path.join(self.args.log_path, 'meters.pkl')". What is meters.pkl here, and how can I solve the error?

Thanks!


AlexiaJM · 2022-06-17T17:11:24Z

It contains the metrics over time. I'm not sure why it would do that. 🤔

Check that the folders exist: /cluster/51/dichang/datasets/mcvd and /cluster/51/dichang/datasets/mcvd/smmnist_cat/log.
Make sure that the ninja package is installed properly.
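Both checks can be scripted. The following is a minimal sketch, not part of the repository: `check_setup` is a hypothetical helper name, and it only confirms that the folders exist and that a `ninja` executable on the PATH answers `--version`. It does not prove that the CUDA extension build will succeed.

```python
import os
import shutil
import subprocess


def check_setup(folders):
    """Report which folders exist and whether a ninja executable can be run.

    Hypothetical helper for illustration; not part of the repository.
    """
    report = {"folders": {f: os.path.isdir(f) for f in folders}}

    # shutil.which returns None when no ninja executable is on the PATH.
    ninja_path = shutil.which("ninja")
    report["ninja_path"] = ninja_path
    report["ninja_version"] = None

    if ninja_path is not None:
        try:
            result = subprocess.run(
                [ninja_path, "--version"], capture_output=True, text=True
            )
            if result.returncode == 0:
                report["ninja_version"] = result.stdout.strip()
        except OSError:
            # The file exists but could not be executed; leave version as None.
            pass
    return report


if __name__ == "__main__":
    # Paths taken from this thread; adjust them to your own setup.
    print(check_setup([
        "/cluster/51/dichang/datasets/mcvd",
        "/cluster/51/dichang/datasets/mcvd/smmnist_cat/log",
    ]))
```

A `False` for either folder or a `None` for `ninja_version` would be the first thing to fix before looking further.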

voletiv · 2022-06-17T17:48:28Z

I agree, I don't think it has to do with the metrics. Check that the data folder exists, and check your ninja installation. Maybe try installing ninja last.
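To illustrate why the meters.pkl message is probably unrelated to the crash, here is a minimal sketch of a load-or-skip pattern consistent with the path and message quoted above. It is an assumption about the general shape of such a loader, not the repository's actual `load_meters()`; the name `load_meters_sketch` is hypothetical.

```python
import os
import pickle


def load_meters_sketch(log_path):
    """Load saved metrics if present; otherwise print a notice and return None.

    Hypothetical illustration only; not the repository's load_meters().
    """
    meters_pkl = os.path.join(log_path, "meters.pkl")
    if not os.path.exists(meters_pkl):
        # On a fresh run no metrics have been saved yet, so there is nothing to load.
        print(f"{meters_pkl} does not exist! Returning.")
        return None
    with open(meters_pkl, "rb") as f:
        return pickle.load(f)
```

Under this reading, the message only means that no metrics file has been written yet, and the actual failure is the ninja build error in the traceback.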

Boese0601 · 2022-06-18T11:31:49Z

Could you please tell me the pytorch and ninja version you're using for training? Thanks.

On my side, it doesn't work with torch==1.11.0 and ninja==1.10.2.3.

But when I use torch on CPU, it works.

dhruv-nathawani · 2022-07-14T23:44:12Z

Same issue!

voletiv · 2022-07-14T23:56:29Z

I'm using ninja==1.10.2.3, torch==1.10.0 on my local machine with CPU, and torch==1.11.0 with GPUs. In both cases, training works.

dhruv-nathawani · 2022-07-15T00:59:43Z

Could you please tell us the CUDA version and the type of GPUs you are using?
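For anyone reporting their setup here, the details can be gathered with a short script. This is a minimal sketch: `package_version` and `collect_environment` are hypothetical helper names, and the torch-specific fields are only filled in when torch is importable.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a pip package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def collect_environment():
    """Collect Python, torch, ninja, CUDA, and GPU details for a bug report."""
    info = {
        "python": platform.python_version(),
        "torch": package_version("torch"),
        "ninja": package_version("ninja"),
        "cuda_build": None,  # CUDA version torch was built against
        "gpus": [],
    }
    try:
        import torch
    except ImportError:
        return info

    info["cuda_build"] = torch.version.cuda
    if torch.cuda.is_available():
        info["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Note that `torch.version.cuda` reports the CUDA version torch was built with, which can differ from the toolkit version that `nvcc` uses when compiling extensions.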

1094724913 · 2023-08-04T13:51:28Z

Could you please tell us the CUDA version and the type of GPUs you are using?

I'm using CUDA==11.3 and torch==1.11.0, and the GPU is an NVIDIA RTX 3090 Ti.
I encountered the same issue while training. What should I do? Thank you.

AlexiaJM · 2023-08-04T14:07:40Z

It seems like other people have had similar issues and they propose some solutions, see: mapillary/inplace_abn#104 and mapillary/inplace_abn#106 (comment).

I really don't know what to do with ninja or even what it does. 😞 I hope that some of these proposed solutions can work for you. If you find a solution to this problem, let us know and we can mention it in the README.
