
Memory allocation failed #54

Open
ppphhhleo opened this issue May 24, 2022 · 16 comments


@ppphhhleo

I tried to train with 2 GPUs using Docker, but after one epoch a memory allocation error occurs. I am not sure what to check or what might be wrong.
[screenshot of the error output]

@TomonobuTsujikawa
Contributor

Thank you for opening this issue. Let me confirm some basics first.

  1. Which GPU do you use? It would be helpful if you could provide the output of nvidia-smi.
  2. Are there any logs such as "openmpi library is not found" before training starts?
  3. Do you get the same error message even with a smaller batch size (see the example below)? Does the error always appear on the 2nd epoch?
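For item 3, a quick way to try a smaller batch is the training script's own flag, as used in the commands later in this thread; the device id and output path here are only placeholders:

python main.py -c cudnn -d 0 --output_path log/debug --batch_size 2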

@15755841658

[screenshot, 2022-06-09 11:19:39]
Hello, why is g_loss_con always 0.0000 (0.0000)?


@TomonobuTsujikawa
Contributor

Thank you for reporting. Please let us check it.

@TomonobuTsujikawa
Contributor

I have a GPU with 11 GB of memory, so I tried to run NVCNet in that environment.
At first I couldn't run the model due to a memory allocation error, so I had to reduce batch_size to 2.
After that, training started correctly, but g_loss_con is 0 as you pointed out.

Now I'm looking into g_loss_con.

@15755841658

@TomonobuTsujikawa
Please, I want to train on multiple GPUs, but I run into this problem:
(tts_nnabla) twu@durian:/qwork4/twu/off_nvcnet$ mpirun -n 2 python main.py -c cudnn -d 0,2 --output_path log_new/baseline --batch_size 8
2022-08-24 16:29:11,963 [nnabla][INFO]: Initializing CPU extension...
2022-08-24 16:29:11,971 [nnabla][INFO]: Initializing CPU extension...
2022-08-24 16:29:12,607 [nnabla][INFO]: Initializing CUDA extension...
2022-08-24 16:29:12,607 [nnabla][INFO]: Initializing CUDA extension...
2022-08-24 16:29:25,542 [nnabla][INFO]: Initializing cuDNN extension...
value error in query
/home/gitlab-runner/builds/LRsSYq-B/0/nnabla/builders/all/nnabla/include/nbla/function_registry.hpp:70
Failed it != items_.end(): Any of [cudnn:float, cuda:float, cpu:float] could not be found in []

No communicator found. Running with a single process. If you run this with MPI processes, all processes will perform totally same.
2022-08-24 16:29:25,558 [nnabla][INFO]: Initializing cuDNN extension...
value error in query
/home/gitlab-runner/builds/LRsSYq-B/0/nnabla/builders/all/nnabla/include/nbla/function_registry.hpp:70
Failed it != items_.end(): Any of [cudnn:float, cuda:float, cpu:float] could not be found in []

No communicator found. Running with a single process. If you run this with MPI processes, all processes will perform totally same.
2022-08-24 16:29:26,010 [nnabla][INFO]: Training data with 103 speakers.
2022-08-24 16:29:26,011 [nnabla][INFO]: DataSource with shuffle(True)
2022-08-24 16:29:26,015 [nnabla][INFO]: Training data with 103 speakers.
2022-08-24 16:29:26,016 [nnabla][INFO]: DataSource with shuffle(True)
2022-08-24 16:29:26,025 [nnabla][INFO]: Using DataIterator
2022-08-24 16:29:26,030 [nnabla][INFO]: Using DataIterator
Running epoch=1 lr=0.00010
Failed to allocate. Freeing memory cache and retrying.
Failed to allocate. Freeing memory cache and retrying.
Failed to allocate again.
Error during forward propagation:
RandintCuda
MulScalarCuda
AddScalarCuda
Mul2Cuda
RandCuda
Mul2Cuda
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
ArangeCuda
ReshapeCuda
StackCuda
GatherNdCuda
Constant
SigmoidCrossEntropyCuda
MeanCudaCudnn
AddScalarCuda
AveragePoolingCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
ArangeCuda
ReshapeCuda
StackCuda
GatherNdCuda
Constant
SigmoidCrossEntropyCuda
MeanCudaCudnn
Add2CudaCudnn
AveragePoolingCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
ArangeCuda
ReshapeCuda
StackCuda
GatherNdCuda
Constant
SigmoidCrossEntropyCuda
MeanCudaCudnn
Add2CudaCudnn
RandintCuda
MulScalarCuda
AddScalarCuda
Mul2Cuda
RandCuda
Mul2Cuda
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
GELUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
GELUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
GELUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
GELUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
GELUCuda
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
GELUCuda
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
PowScalarCuda
AddScalarCuda
SumCuda
PowScalarCuda
Div2Cuda
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
GELUCuda
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
GELUCuda
WeightNormalizationCuda
DeconvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
RandintCuda
MulScalarCuda
AddScalarCuda
Mul2Cuda
RandCuda
Mul2Cuda
PadCuda
ConvolutionCudaCudnn
PowScalarCuda
ConvolutionCudaCudnn
PowScalarCuda
Add2CudaCudnn
PowScalarCuda
BatchMatmulCuda
MulScalarCuda
AddScalarCuda
LogCuda
Callback
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
MulScalarCuda
ExpCuda
RandnCuda
Mul2Cuda
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
GELUCuda
WeightNormalizationCuda
DeconvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
GELUCuda
WeightNormalizationCuda
DeconvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
GELUCuda
WeightNormalizationCuda
DeconvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn <-- ERROR
Traceback (most recent call last):
File "main.py", line 99, in
run(args)
File "main.py", line 70, in run
Trainer(gen, gen_optim, dis, dis_optim, dataloader, rng, hp).run()
File "/qwork4/twu/off_nvcnet/train.py", line 156, in run
self.train_on_batch(i)
File "/qwork4/twu/off_nvcnet/train.py", line 185, in train_on_batch
p['d_loss'].forward()
File "_variable.pyx", line 582, in nnabla._variable.Variable.forward
RuntimeError: memory error in alloc
/home/gitlab-runner/builds/LRsSYq-B/0/nnabla/builders/all/nnabla/src/nbla/memory/memory.cpp:39
Failed this->alloc_impl(): N4nbla10CudaMemoryE allocation failed.

I also tested the environment with python -c "import nnabla_ext.cuda, nnabla_ext.cudnn":

[screenshot, 2022-08-24 16:35:46]

@TomonobuTsujikawa
Contributor

Please provide the result of the following command:

pip list | grep -e pip -e nnabla

You can import nnabla correctly in a single-GPU environment, so I think this is a setup issue for multiple GPUs.

@15755841658

@TomonobuTsujikawa
The result:
[screenshot of the pip list output]

@TomonobuTsujikawa
Contributor

Hmm, it seems to be OK.

Do you still get the same error if you do the following?

pip uninstall nnabla nnabla-ext-cuda110-nccl2-mpi3-1-6
pip install nnabla nnabla-ext-cuda110-nccl2-mpi3-1-6
mpirun -n 2 python main.py -c cudnn -d 0,1 --output_path log_new/baseline --batch_size 8

I will also check.

@15755841658

@TomonobuTsujikawa
It still fails with the same error:
(tts_nnabla) twu@durian:/qwork4/twu/nvcnet_offi$ mpirun -n 2 python main.py -c cudnn -d 0,1 --output_path log_new/baseline --batch_size 8
2022-08-29 17:52:27,939 [nnabla][INFO]: Initializing CPU extension...
2022-08-29 17:52:27,939 [nnabla][INFO]: Initializing CPU extension...
2022-08-29 17:52:30,726 [nnabla][INFO]: Initializing CUDA extension...
2022-08-29 17:52:30,727 [nnabla][INFO]: Initializing CUDA extension...
/qwork4/twu/miniconda/envs/tts_nnabla/bin/../lib/libmpi.so: undefined symbol: ompi_mpi_op_no_op
/qwork4/twu/miniconda/envs/tts_nnabla/bin/../lib/libmpi.so: undefined symbol: ompi_mpi_op_no_op
2022-08-29 17:52:43,731 [nnabla][INFO]: Initializing cuDNN extension...
2022-08-29 17:52:44,080 [nnabla][INFO]: Training data with 103 speakers.
2022-08-29 17:52:44,081 [nnabla][INFO]: DataSource with shuffle(True)
2022-08-29 17:52:44,100 [nnabla][INFO]: Using DataIterator
2022-08-29 17:52:44,716 [nnabla][INFO]: Initializing cuDNN extension...
2022-08-29 17:52:45,076 [nnabla][INFO]: Training data with 103 speakers.
2022-08-29 17:52:45,076 [nnabla][INFO]: DataSource with shuffle(True)
2022-08-29 17:52:45,103 [nnabla][INFO]: Using DataIterator
value error in query
/home/gitlab-runner/builds/LRsSYq-B/0/nnabla/builders/all/nnabla/include/nbla/function_registry.hpp:70
Failed it != items_.end(): Any of [cudnn:float, cuda:float, cpu:float] could not be found in []

No communicator found. Running with a single process. If you run this with MPI processes, all processes will perform totally same.
Running epoch=1 lr=0.00010
Failed to allocate. Freeing memory cache and retrying.
Failed to allocate. Freeing memory cache and retrying.
Failed to allocate again.
Error during forward propagation:
RandintCuda
MulScalarCuda
AddScalarCuda
Mul2Cuda
RandCuda
Mul2Cuda
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
ArangeCuda
ReshapeCuda
StackCuda
GatherNdCuda
Constant
SigmoidCrossEntropyCuda
MeanCudaCudnn
AddScalarCuda
AveragePoolingCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
ArangeCuda
ReshapeCuda
StackCuda
GatherNdCuda
Constant
SigmoidCrossEntropyCuda
MeanCudaCudnn
Add2CudaCudnn
AveragePoolingCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
ArangeCuda
ReshapeCuda
StackCuda
GatherNdCuda
Constant
SigmoidCrossEntropyCuda
MeanCudaCudnn
Add2CudaCudnn
RandintCuda
MulScalarCuda
AddScalarCuda
Mul2Cuda
RandCuda
Mul2Cuda
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
GELUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
GELUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
GELUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
GELUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
GELUCuda
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
GELUCuda
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
PowScalarCuda
AddScalarCuda
SumCuda
PowScalarCuda
Div2Cuda
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
GELUCuda
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
GELUCuda
WeightNormalizationCuda
DeconvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
RandintCuda
MulScalarCuda
AddScalarCuda
Mul2Cuda
RandCuda
Mul2Cuda
PadCuda
ConvolutionCudaCudnn
PowScalarCuda
ConvolutionCudaCudnn
PowScalarCuda
Add2CudaCudnn
PowScalarCuda
BatchMatmulCuda
MulScalarCuda
AddScalarCuda
LogCuda
Callback
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
AveragePoolingCudaCudnn
LeakyReLUCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
MulScalarCuda
ExpCuda
RandnCuda
Mul2Cuda
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
GELUCuda
WeightNormalizationCuda
DeconvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
GELUCuda
WeightNormalizationCuda
DeconvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
GELUCuda
WeightNormalizationCuda
DeconvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
PadCuda
WeightNormalizationCuda
ConvolutionCudaCudnn
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn
SliceCuda
TanhCudaCudnn
SliceCuda
SigmoidCudaCudnn
Mul2Cuda
WeightNormalizationCuda
ConvolutionCudaCudnn
Add2CudaCudnn <-- ERROR
Traceback (most recent call last):
File "main.py", line 99, in
run(args)
File "main.py", line 70, in run
Trainer(gen, gen_optim, dis, dis_optim, dataloader, rng, hp).run()
File "/qwork4/twu/nvcnet_offi/train.py", line 156, in run
self.train_on_batch(i)
File "/qwork4/twu/nvcnet_offi/train.py", line 185, in train_on_batch
p['d_loss'].forward()
File "_variable.pyx", line 582, in nnabla._variable.Variable.forward
RuntimeError: memory error in alloc
/home/gitlab-runner/builds/LRsSYq-B/0/nnabla/builders/all/nnabla/src/nbla/memory/memory.cpp:39
Failed this->alloc_impl(): N4nbla10CudaMemoryE allocation failed.

@TomonobuTsujikawa
Contributor

@15755841658
Thank you for testing.

I set up several environments today to try to reproduce this error, but I could not reproduce it.
Could you provide a bit more information about your environment?
If you are running nvcnet in Docker, please show me the Dockerfile.
The output below will be quite large, so I would appreciate it if you could attach it as a compressed log.

cat /etc/os-release
dpkg -l | grep ^ii
conda --version
conda list
pip --version
pip list
nvidia-smi
set | grep -e LD_LIBRARY -e LD_PRELOAD
find /usr -name libmpi.so\*

I think this is the minimum command to check whether the issue has been resolved:

mpirun -n 2 python -c "import nnabla_ext.cudnn; from nnabla.ext_utils import get_extension_context; import nnabla.communicators as C; ctx = get_extension_context('cudnn', device_id='0'); C.MultiProcessDataParallelCommunicator(ctx)"
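For readability, here is the same check unrolled into a short script. This is only a sketch: the comm.init() call and the rank/local_rank device binding follow the usual nnabla multi-GPU pattern and go slightly beyond what the one-liner exercises.

import nnabla as nn
import nnabla.communicators as C
from nnabla.ext_utils import get_extension_context

# create the cuDNN context and the multi-process communicator
ctx = get_extension_context("cudnn", device_id="0")
comm = C.MultiProcessDataParallelCommunicator(ctx)
comm.init()  # a broken MPI/NCCL setup typically fails by this point
# bind each MPI rank to its own GPU
ctx.device_id = str(comm.local_rank)
nn.set_default_context(ctx)
print("rank", comm.rank, "using GPU", ctx.device_id)

If "No communicator found" is printed instead, every rank falls back to single-process mode and may allocate on the same GPU, which would match the allocation failures above.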

@15755841658

OK! I will test.
But running the minimum command gives:
[screenshot of the failing minimum command]

@TomonobuTsujikawa
Contributor

Yes, your environment has an issue, so the minimum command fails.
Please provide the information I listed above.

@15755841658

@TomonobuTsujikawa
OK, thanks. I have emailed you; please see the attachment for the results of those commands.
Please tell me what the underlying problem is.
Thank you very much.

@TomonobuTsujikawa
Contributor

TomonobuTsujikawa commented Aug 30, 2022

@15755841658

I checked your environment information; here is the list of problems that need to be solved.

  • Ubuntu 16: official support is Ubuntu 18 and later, because many packages on Ubuntu 16 are quite old.
  • OpenMPI 1: OpenMPI v1 is not supported. I recommend using OpenMPI v3 for now (you can still use OpenMPI v2).
  • pip: if you use a conda environment, pip must be conda's pip, otherwise Python package management will conflict. Your pip seems to be conda-based. I'm sorry.

I cannot find NVIDIA driver/CUDA/cuDNN packages in your dpkg list; did you install them manually?
Also, there seems to be a newer MPI under /usr/local, but you cannot use it due to permission denied.

Hmm, if you cannot upgrade the OS, I think it is better to use a Docker container.
Here is an example:

docker pull nnabla/nnabla-ext-cuda-multi-gpu:py37-cuda110-mpi3.1.6-v1.29.0
docker run --rm -it -u $(id -u):$(id -g) --gpus all nnabla/nnabla-ext-cuda-multi-gpu:py37-cuda110-mpi3.1.6-v1.29.0

mpirun -n 2 python3 -c "import nnabla_ext.cudnn; from nnabla.ext_utils import get_extension_context; import nnabla.communicators as C; ctx = get_extension_context('cudnn', device_id='0'); C.MultiProcessDataParallelCommunicator(ctx)"
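If your nvcnet checkout and dataset live on the host, you would typically also mount them into the container; the host paths below are placeholders, not taken from this thread:

docker run --rm -it -u $(id -u):$(id -g) --gpus all -v /path/to/nvcnet:/work -w /work nnabla/nnabla-ext-cuda-multi-gpu:py37-cuda110-mpi3.1.6-v1.29.0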

If you cannot install Docker, you need to build OpenMPI yourself.
This shows how to build OpenMPI, though some steps might differ because the OS versions are different:
https://github.com/sony/nnabla-ext-cuda/blob/v1.29.0/docker/release/Dockerfile.cuda-mpi#L54-L86

Also, please refer to the nnabla install page: https://nnabla.org/install/
It lists the components to install and how to install them.

@gl8-mt

gl8-mt commented Nov 10, 2022

I had the same trouble when I tried to set up another code repo. Environment:

  • numpy==1.22.4
  • docker with cuda 11.6
  • os: ubuntu-18.04

After I installed numpy>=1.23.0, the problem was fixed. However, some warnings showed up, such as:

...
2022-11-10 11:54:56,668 [nnabla][INFO]: Initializing CUDA extension...
<frozen importlib._bootstrap>:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 232 from PyObject
...
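For anyone hitting the same warning, the fix described here amounts to upgrading numpy inside the same environment (the version bound is the one quoted above):

pip install "numpy>=1.23.0"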

Hope this helps.
