Related to DLRM/PyTorch.
Describe the bug
Changed the embedding size to 64 (default: 128).
Changed the last bottom MLP layer size to 64 (default: 128).
This caused a crash, as shown below.
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/workspace/dlrm/dlrm/scripts/main.py", line 519, in <module>
app.run(main)
File "/opt/conda/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/opt/conda/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/workspace/dlrm/dlrm/scripts/main.py", line 264, in main
train(model, loss_fn, optimizer, data_loader_train, data_loader_test, scaled_lr)
File "/workspace/dlrm/dlrm/scripts/main.py", line 361, in train
loss.backward()
File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 184, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 123, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at ../aten/src/ATen/cuda/CublasHandlePool.cpp:8 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7ff5f440a82b in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x327d0c2 (0x7ff4bbe1c0c2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: at::cuda::getCurrentCUDABlasHandle() + 0xb82 (0x7ff4bbe1d9d2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x326945f (0x7ff4bbe0845f in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::addmm_out_cuda_impl(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar, c10::Scalar) + 0x78e (0x7ff4bacef5ee in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::mm_cuda(at::Tensor const&, at::Tensor const&) + 0x15b (0x7ff4bacf04bb in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x3293808 (0x7ff4bbe32808 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x330f734 (0x7ff4bbeae734 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x2ba029b (0x7ff537b0b29b in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x7a8224 (0x7ff535713224 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #10: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const + 0xc5 (0x7ff5c6f346e5 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x28fe447 (0x7ff537869447 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::generated::AddmmBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x155 (0x7ff5378aeca5 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x2ee2f75 (0x7ff537e4df75 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1808 (0x7ff537e48f68 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x551 (0x7ff537e49e01 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0xa3 (0x7ff537e3f863 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x50 (0x7ff5c7236b20 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #18: <unknown function> + 0xbd6df (0x7ff5f4af76df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #19: <unknown function> + 0x76db (0x7ff5fffcf6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #20: clone + 0x3f (0x7ff5ffcf888f in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7ff5f440a82b in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xc10 (0x7ff5f41a5500 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7ff5f43f2c9d in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x59f1e2 (0x7ff5c724b1e2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #16: __libc_start_main + 0xe7 (0x7ff5ffbf8b97 in /lib/x86_64-linux-gnu/libc.so.6)
Fatal Python error: Aborted
Thread 0x00007ff59fda0700 (most recent call first):
Thread 0x00007ff56b58b700 (most recent call first):
Current thread 0x00007ff6003fc740 (most recent call first):
Aborted
To Reproduce
Use the following command-line flags:
--embedding_dim 64 --bottom_mlp_sizes 512,256,64
Expected behavior
Training should not crash with these sizes.
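The expectation seems reasonable because the two changed sizes are consistent with each other: in DLRM's dot-based feature interaction, the bottom MLP output must have the same width as each embedding vector, and both were set to 64 here. Below is a minimal PyTorch sketch of that shape constraint, not the repository's implementation; the batch size, the count of 26 categorical features, and all variable names are assumptions for illustration.

```python
import torch

# Hypothetical sizes for illustration only; d matches --embedding_dim
# and the last entry of --bottom_mlp_sizes (both 64 in this report).
batch, num_categorical, d = 32, 26, 64

bottom_mlp_out = torch.randn(batch, d)               # dense features after the bottom MLP
embeddings = torch.randn(batch, num_categorical, d)  # one d-wide embedding per categorical feature

# Dot interaction: stack dense and sparse features, then take pairwise dot products.
# This only lines up when the bottom MLP output width equals the embedding width.
features = torch.cat([bottom_mlp_out.unsqueeze(1), embeddings], dim=1)  # (batch, 27, 64)
interactions = torch.bmm(features, features.transpose(1, 2))            # (batch, 27, 27)
print(interactions.shape)  # torch.Size([32, 27, 27])
```

With --embedding_dim 64 and --bottom_mlp_sizes 512,256,64 the shapes above are consistent, so the crash does not appear to come from a user-side size mismatch.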
Environment
Please provide at least:
- Container version (e.g. pytorch:20.06-py3):
- GPUs in the system (e.g. 1x Tesla V100 32GB):
- CUDA driver version (e.g. 418.67):