For example, an environment.yaml file for Conda, if possible.
There seem to be some issues with my environment that prevent training from starting properly.
My other code trains normally in this environment, and the testing process runs fine, but training fails with errors like the following:
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 111, in forward
out = self.encoder(out, xs)
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 56, in forward
x = self.layers[i](x)
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 244, in forward
r, _ = self.attn(inputs)
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 85, in forward
q = torch.matmul(attn, v)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
or
packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 246, in forward
r = self.ffn(r)
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 183, in forward
x2 = x * torch.sigmoid(w)
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
or
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 111, in forward
out = self.encoder(out, xs)
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 56, in forward
x = self.layers[i](x)
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 246, in forward
r = self.ffn(r)
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 178, in forward
x = F.gelu(x)
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
or
File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/functional.py", line 2438, in batch_norm
return torch.batch_norm(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
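Since these CUDA errors are reported asynchronously, the stack traces above may not point at the op that actually failed. A minimal debugging sketch is to force synchronous kernel launches; the variable must be set before torch initializes CUDA, e.g. at the very top of the training script:

```python
import os

# Force synchronous CUDA kernel launches so a failing kernel raises
# immediately, with a stack trace that points at the real op.
# This must run before `import torch` triggers CUDA initialization.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Training is noticeably slower with this set, so it is only for pinning down the failing op, not for regular runs.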
This should be unrelated to memory and batch size, since I still encounter the issue even with the smallest model.
Could you kindly share your environment configuration? It might also be related to the torch and cuDNN versions.
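To compare the two environments, a small script like this (a sketch; the helper name and field names are my own) prints the versions that usually matter for cuBLAS/cuDNN execution failures:

```python
import sys

def env_report():
    """Collect the versions most relevant to CUDA/cuDNN execution errors."""
    info = {"python": sys.version.split()[0]}
    try:
        import torch
        info["torch"] = torch.__version__
        info["cuda_build"] = torch.version.cuda          # CUDA version torch was built against
        info["cudnn"] = torch.backends.cudnn.version()   # e.g. 8200 for cuDNN 8.2.x
        info["gpus"] = torch.cuda.device_count()
    except ImportError:
        info["torch"] = None
    return info

if __name__ == "__main__":
    for key, value in env_report().items():
        print(f"{key:10s}: {value}")
```

Running it in both the working and the failing environment and diffing the output narrows down whether this is a version mismatch.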
Well, I found that if I set batch_size to 4 and use 4 GPUs, it runs, but CUDA memory usage is only about 1/6 of capacity; if I increase batch_size to 8 or more, the error is reported again.
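For context on why batch size interacts with the GPU count here: nn.DataParallel scatters the input batch along dim 0 into roughly ceil-sized chunks, one per replica, so small or non-divisible batch sizes leave some replicas with tiny (or no) shards. A torch-free sketch of that split (a hypothetical helper of my own, mirroring torch.chunk's behavior):

```python
def replica_batch_sizes(batch_size, num_gpus):
    """Per-replica batch sizes produced by DataParallel's scatter,
    which chunks the input along dim 0 into ceil(B / N)-sized pieces;
    when B is not a multiple of N the last shard is smaller, and when
    B < N some replicas receive no data at all."""
    chunk = -(-batch_size // num_gpus)  # ceil division
    sizes = []
    remaining = batch_size
    while remaining > 0 and len(sizes) < num_gpus:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes
```

For instance, batch_size=8 on 8 GPUs puts a single sample on every replica, which is also the degenerate case for BatchNorm statistics.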
My environment: 8x RTX 2080 Ti.
Or it may stem from the use of sync_batchnorm.
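If sync_batchnorm is the culprit: torch's SyncBatchNorm is designed for DistributedDataParallel, not nn.DataParallel, so one thing worth trying is converting the model's BN layers and launching under DDP instead. A minimal conversion sketch (the toy Sequential is just a stand-in for the real denoiser):

```python
import torch.nn as nn

# Stand-in model with a BatchNorm layer, mimicking the real network.
model = nn.Sequential(nn.Conv3d(1, 8, 3), nn.BatchNorm3d(8), nn.ReLU())

# Replace every BatchNorm*d with SyncBatchNorm; the converted model
# must then be wrapped in DistributedDataParallel (with an initialized
# process group) rather than nn.DataParallel.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```

Under plain nn.DataParallel the SyncBatchNorm layers have no process group to synchronize over, so the conversion only helps together with a DDP launch.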